Drop first observation if second observation is before a certain date

Stijn Martens

Join Date: Oct 2017
Posts: 2

Drop first observation if second observation is before a certain date

03 Oct 2017, 07:25

Hello everyone,

I am working with Stata to finish my thesis. I want to use a Repeat Sales Regression which is a methodology commonly used in the real estate literature. Before I can start my analysis I am preparing my dataset. I am curious about a phenomen that happened in September 2008. Therefore, I want to look at properties which are sold for the first time before September 2008, and then look at properties which are sold after September 2008.

The point I am struggling with is that some properties are sold more than once before September 2008, which makes only the last sale before September 2008 the most useful one. The first two observations can be dropped.

I can give you an example of what my dataset looks like. Addressid is the property id that is sold, which can be multiple times over the period. Salesdate is the actual date the property was sold. And Salesnr is the number of times the property is sold.

Addressid	Salesdate (yyyy-mm-dd)	Salesnr
001	20020308	1
001	20120509	2
002	20010201	1
002	20050518	2
002	20110315	3
003	20030711	1
003	20060801	2
003	20090108	3
004	20050423	1
004	20091106	2
004	20120829	3

For addressid 001, no further action is warranted because one sale is before September 2008, and one after. However, both addressid 002 and 003 have two sales before September 2008. Now I want to drop the first observation because this one is not useful for my analysis, but I can't drop salesnr == 1, because then I would also drop the right value for addressid == 001. Furthermore, addressid 004 has multiple sales after September 2008 but these are still relevant for my analysis.

I tried using the following function, but I cannot add weights:

by addressid: drop addressid[_n==1] if Salesdate[_n==2] < 20080900

I really hope you guys can help me. I've tried a couple of things but they don't seem to work.

Kind regards,
Stijn Martens

Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35809

03 Oct 2017, 08:07

That is some way away from legal drop syntax as you can drop variables or observations but not both at the same time. The fact that you're mixing syntaxes fantastically (what Stata guesses are weights are mixes of true-or-false expressions and subscripts, based perhaps on other software you know) is just what bites first.

I see what you're doing with your dates as long integers but at some point that won't help you, such as if you want the difference between two daily dates when 20080831 and 20080901 don't differ by 1 but by 70! So, I advise using Stata's own system for daily dates.

With some engineering to take your helpful data example (helpful, not as helpful as using dataex (SSC) such as we request) this shows some technique:

Code:

clear 
input str3 Addressid str8 sSalesdate Salesnr
001    20020308    1
001    20120509    2
002    20010201    1
002    20050518    2
002    20110315    3
003    20030711    1
003    20060801    2
003    20090108    3
004    20050423    1
004    20091106    2
004    20120829    3
end 

gen Salesdate = daily(sSalesdate, "YMD") 
format Salesdate %td 

gen tokeep = Salesdate >= mdy(9, 1, 2008) 
bysort Addressid (Salesdate) : replace tokeep = 1 if tokeep[_n+1] 
list, sepby(Addressid) 

     +----------------------------------------------------+
     | Addres~d   sSales~e   Salesnr   Salesdate   tokeep |
     |----------------------------------------------------|
  1. |      001   20020308         1   08mar2002        1 |
  2. |      001   20120509         2   09may2012        1 |
     |----------------------------------------------------|
  3. |      002   20010201         1   01feb2001        0 |
  4. |      002   20050518         2   18may2005        1 |
  5. |      002   20110315         3   15mar2011        1 |
     |----------------------------------------------------|
  6. |      003   20030711         1   11jul2003        0 |
  7. |      003   20060801         2   01aug2006        1 |
  8. |      003   20090108         3   08jan2009        1 |
     |----------------------------------------------------|
  9. |      004   20050423         1   23apr2005        1 |
 10. |      004   20091106         2   06nov2009        1 |
 11. |      004   20120829         3   29aug2012        1 |
     +----------------------------------------------------+

keep if tokeep 
drop tokeep 
list, sepby(Addressid) 

     +-------------------------------------------+
     | Addres~d   sSales~e   Salesnr   Salesdate |
     |-------------------------------------------|
  1. |      001   20020308         1   08mar2002 |
  2. |      001   20120509         2   09may2012 |
     |-------------------------------------------|
  3. |      002   20050518         2   18may2005 |
  4. |      002   20110315         3   15mar2011 |
     |-------------------------------------------|
  5. |      003   20060801         2   01aug2006 |
  6. |      003   20090108         3   08jan2009 |
     |-------------------------------------------|
  7. |      004   20050423         1   23apr2005 |
  8. |      004   20091106         2   06nov2009 |
  9. |      004   20120829         3   29aug2012 |
     +-------------------------------------------+

Comment

Stijn Martens

Join Date: Oct 2017

Posts: 2
#3

03 Oct 2017, 10:32

Awesome, thank you so much Mr. Cox!
Comment

Announcement

Drop first observation if second observation is before a certain date

Comment

Comment