Generating a new row for specific year

Stephen Ch

Join Date: Apr 2022
Posts: 67

Generating a new row for specific year

22 Feb 2024, 12:28

Hi all, I have the following panel data set with -ID YEAR CITY STATE- variables.

For the purpose of the study, I need do the following:

1. Identify -ID- that does not have -YEAR- 2004.
2. Create an artificial row for this -ID- with -ID YEAR CITY STATE- that copies from whichever is the most recent entry.
3. If there is -ID- with 2004 year entry, skip.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long id double year str195 city str8 state
1004 2000 "Wood Dale" "IL"
1004 2001 "Wood Dale" "IL"
1004 2002 "Wood Dale" "IL"
1004 2008 "Wood Dale" "IL"
1004 2009 "Wood Dale" "IL"
1004 2010 "Wood Dale" "IL"
1005 2007 "Beaconsfield" "MA"
1005 2008 "Beaconsfield" "MA"
1005 2009 "Beaconsfield" "MA"
1005 2010 "Beaconsfield" "MA"
end

I tried the following construction but I am not sure how to create the row I want:

Code:

bysort id (year): gen obs = _n
local varid id
* I think I need a counter variable like ...local i = 1

foreach y of local varid{
    * if not have 2004
    if max(year) <2004 | min(year) >2004 {
       * what command should I have here to insert a row with year = 2004 and to keep values in -city- -state-?
    }
}

Any advice/help would be greatly appreciated!

Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 29796

22 Feb 2024, 12:56

Code:

isid id year, sort
by id (year), sort: egen has_2004 = max(year == 2004)
by id (year): gen expander = cond(!has_2004 & _n == 1, 2, 1)
expand expander
by id (expander year), sort: replace year = 2004 if _n == _N & !has_2004

isid id year, sort
ds id year has_2004, not
foreach v of varlist `r(varlist)' {
    by id (year): replace `v' = `v'[_n-1] if year == 2004 & !has_2004
}
drop expander

NOTE: Code assumes, and verifies, that id and year uniquely identify observations in the data.

Last edited by Clyde Schechter; 22 Feb 2024, 13:02.

Comment

Ken Chui

Join Date: Aug 2014
Posts: 1054

22 Feb 2024, 13:04

Code:

clear
input long id double year str20 city str8 state
1004 2000 "Wood Dale" "IL"
1004 2001 "Wood Dale" "IL"
1004 2002 "Wood Dale" "IL"
1004 2008 "Wood Dale" "IL"
1004 2009 "Wood Dale" "IL"
1004 2010 "Wood Dale" "IL"
1005 2007 "Beaconsfield" "MA"
1005 2008 "Beaconsfield" "MA"
1005 2009 "Beaconsfield" "MA"
1005 2010 "Beaconsfield" "MA"
1006 2003 "Sometown" "VA"
1006 2004 "Sometown" "VA"
1006 2005 "Sometown" "VA"
1006 2016 "Sometown" "VA"
end
* Create some values
set seed 1744237
gen x = runiform()

preserve
* 2004 indicator
gen has_2004 = (year == 2004)
* Select states with no 2004
collapse (max) year has_2004, by(id city state)
drop if has_2004 == 1
replace year = 2004
save temp01, replace
restore

* Append empty 2004 rows
append using temp01
* Copy values
bysort state city id (year): replace x = x[_n-1] if year == 2004 & has_2004 == 0

Outcome:

Code:

     +----------------------------------------------------------+
     |   id   year           city   state          x   has_2004 |
     |----------------------------------------------------------|
  1. | 1004   2000      Wood Dale      IL   .0248728          . |
  2. | 1004   2001      Wood Dale      IL   .5002839          . |
  3. | 1004   2002      Wood Dale      IL   .7049134          . |
  4. | 1004   2004      Wood Dale      IL   .7049134          0 |
  5. | 1004   2008      Wood Dale      IL   .0677949          . |
  6. | 1004   2009      Wood Dale      IL   .9536643          . |
  7. | 1004   2010      Wood Dale      IL   .1169867          . |
     |----------------------------------------------------------|
  8. | 1005   2004   Beaconsfield      MA          .          0 |
  9. | 1005   2007   Beaconsfield      MA   .0696377          . |
 10. | 1005   2008   Beaconsfield      MA    .972551          . |
 11. | 1005   2009   Beaconsfield      MA    .945122          . |
 12. | 1005   2010   Beaconsfield      MA   .1402545          . |
     |----------------------------------------------------------|
 13. | 1006   2003       Sometown      VA   .4325173          . |
 14. | 1006   2004       Sometown      VA   .3057961          . |
 15. | 1006   2005       Sometown      VA   .9123544          . |
 16. | 1006   2016       Sometown      VA   .4709794          . |
     +----------------------------------------------------------+

Crossed with #2. The solution is #2 is less convoluted.

Comment

Stephen Ch

Join Date: Apr 2022

Posts: 67
#4

22 Feb 2024, 15:24

Thanks to both Clyde and Ken for help!

A quick question to Ken -- is there a way to restrict this application of "inserting a row" to those IDs that are between years of 2000 and 2010?

In other words, if there is an ID that do not have values in between (inclusive) 2000 and 2010, I don't want the operation to take place.
Comment
Ken Chui

Join Date: Aug 2014

Posts: 1054
#5

22 Feb 2024, 15:50

Originally posted by Stephen Ch View Post

Thanks to both Clyde and Ken for help!

A quick question to Ken -- is there a way to restrict this application of "inserting a row" to those IDs that are between years of 2000 and 2010?

In other words, if there is an ID that do not have values in between (inclusive) 2000 and 2010, I don't want the operation to take place.

Just want to clarify one thing first. When you said "that copies from whichever is the most recent entry", the most recent entry also includes future data? (e.g. copy 2005's data into 2004 if 2004 is missing?)
Comment
Stephen Ch

Join Date: Apr 2022

Posts: 67
#6

22 Feb 2024, 16:27

Thanks for asking the clarification question.

I want to restrict this operation for the IDs that have entries for years between 2000 and 2010.

For example, if ID 1004 has entries 1998, 2012, then I don't want to create a 2004 entry with id, city, and state.

If ID 1005 has 1998, 2003, and 2012, then I do want to create a 2004 entry with id, city, and state from the 2012 city state data (e.g., it is possible city and state entries are different from 1998 or 2003).

Originally posted by Ken Chui View Post

Just want to clarify one thing first. When you said "that copies from whichever is the most recent entry", the most recent entry also includes future data? (e.g. copy 2005's data into 2004 if 2004 is missing?)
Comment
Stephen Ch

Join Date: Apr 2022

Posts: 67
#7

22 Feb 2024, 16:32

Ken, also when you use -collapse- I am losing a lot of variables from the original data set; is there a way to preserve them?

Originally posted by Ken Chui View Post

Just want to clarify one thing first. When you said "that copies from whichever is the most recent entry", the most recent entry also includes future data? (e.g. copy 2005's data into 2004 if 2004 is missing?)
Comment
Ken Chui

Join Date: Aug 2014

Posts: 1054
#8

22 Feb 2024, 16:35

Originally posted by Stephen Ch View Post

Thanks for asking the clarification question.

I want to restrict this operation for the IDs that have entries for years between 2000 and 2010.

For example, if ID 1004 has entries 1998, 2012, then I don't want to create a 2004 entry with id, city, and state.

If ID 1005 has 1998, 2003, and 2012, then I do want to create a 2004 entry with id, city, and state from the 2012 city state data (e.g., it is possible city and state entries are different from 1998 or 2003).

But if an ID has 1998, 2005, and 2012, then are you planning to:
Copy the 1998 data as 2004, OR

Copy the 2005 data as 2004, OR

Not impute for 2004 at all?

Thanks!
Comment
Ken Chui

Join Date: Aug 2014

Posts: 1054
#9

22 Feb 2024, 16:37

Originally posted by Stephen Ch View Post

Ken, also when you use -collapse- I am losing a lot of variables from the original data set; is there a way to preserve them?

That's why it's wrapped between a pair of commands called "preserve" and "restore". If you run the whole set command on a do-file between these two commands, you'll see that your original data would not be affected.
Comment
Stephen Ch

Join Date: Apr 2022

Posts: 67
#10

22 Feb 2024, 16:42

Hi Ken, this is very helpful, thank you. Thanks for demystifying the observation drop concern with -collapse-!

To answer your question:

if an ID has 1998, 2005, and 2012, then since 2005 is between 2000 and 2010, I do want to create 2004 entry based on 2012 information. However, I only want to keep ID, city, and state entries in that observation granted there are other pieces of information (e.g., zip code).

Thanks!

Originally posted by Ken Chui View Post

But if an ID has 1998, 2005, and 2012, then are you planning to:
Copy the 1998 data as 2004, OR

Copy the 2005 data as 2004, OR

Not impute for 2004 at all?

Thanks!
Comment
Stephen Ch

Join Date: Apr 2022

Posts: 67
#11

22 Feb 2024, 16:50

Ken, there is also a possibility that the ID 1004 can be in two different locations at different or same times (just wanted to clarify)

Originally posted by Ken Chui View Post

But if an ID has 1998, 2005, and 2012, then are you planning to:
Copy the 1998 data as 2004, OR

Copy the 2005 data as 2004, OR

Not impute for 2004 at all?

Thanks!
Comment
Stephen Ch

Join Date: Apr 2022

Posts: 67
#12

23 Feb 2024, 13:53

Hi Clyde,

Thanks for this clean suggested code.

Is there a way to restrict this operation for the id-year observations that are between 2000 and 2010?

Consider two examples:

id 1002 that appears in 1999 2012
id 1003 that appears in 1998 2005 2006 2019

I would not like to apply your code to id 1002, because I see no observation between (inclusive) 2000 and 2010.

However, for id 1003, I have 2005 and 2006 observations. Hence, I would like to repliace the city and state entry from max(year) (i.e., most recent) of id 1003, which is 2019 entries, and make the artificial 2004 observation.

Originally posted by Clyde Schechter View Post

Code:

isid id year, sort by id (year), sort: egen has_2004 = max(year == 2004) by id (year): gen expander = cond(!has_2004 & _n == 1, 2, 1) expand expander by id (expander year), sort: replace year = 2004 if _n == _N & !has_2004 isid id year, sort ds id year has_2004, not foreach v of varlist `r(varlist)' { by id (year): replace `v' = `v'[_n-1] if year == 2004 & !has_2004 } drop expander

NOTE: Code assumes, and verifies, that id and year uniquely identify observations in the data.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#13

23 Feb 2024, 14:17

I believe the following will do what you want. It is not really tested because your -dataex- example data has no id's that would be excluded by this new condition. But I think I have it right.

Code:

by id (year), sort: egen has_2004 = max(year == 2004) by id (year), sort: egen has_in_range_years = max(inrange(year, 2000, 2010)) by id (year): gen expander = cond(has_in_range_years &!has_2004 & _n == 1, 2, 1) expand expander by id (expander year), sort: replace year = 2004 if _n == _N /// & has_in_range_years & !has_2004 isid id year, sort ds id year has_2004, not foreach v of varlist `r(varlist)' { by id (year): replace `v' = `v'[_n-1] if year == 2004 & !has_2004 } drop expander

Also, note that this code replaces the variables in the newly created 2004 record with the data from the preceding year, as you originally specified in #1. I see that you have changed that to another criterion which, to be honest I did not follow. Hopefully you can fix up that part of it yourself. If you need help doing that, please explain the new source of the replacement data in clear and consistent terms, and provide some examples.
Comment
Stephen Ch

Join Date: Apr 2022

Posts: 67
#14

23 Feb 2024, 14:45

Hi Clyde,

Thanks once again! Your code works but it does not transfer the "city" and "state" information to the newly created entry for 2004.
Is there a way to do this?

Originally posted by Clyde Schechter View Post

I believe the following will do what you want. It is not really tested because your -dataex- example data has no id's that would be excluded by this new condition. But I think I have it right.

Code:

by id (year), sort: egen has_2004 = max(year == 2004) by id (year), sort: egen has_in_range_years = max(inrange(year, 2000, 2010)) by id (year): gen expander = cond(has_in_range_years &!has_2004 & _n == 1, 2, 1) expand expander by id (expander year), sort: replace year = 2004 if _n == _N /// & has_in_range_years & !has_2004 isid id year, sort ds id year has_2004, not foreach v of varlist `r(varlist)' { by id (year): replace `v' = `v'[_n-1] if year == 2004 & !has_2004 } drop expander

Also, note that this code replaces the variables in the newly created 2004 record with the data from the preceding year, as you originally specified in #1. I see that you have changed that to another criterion which, to be honest I did not follow. Hopefully you can fix up that part of it yourself. If you need help doing that, please explain the new source of the replacement data in clear and consistent terms, and provide some examples.
Comment

Stephen Ch

Join Date: Apr 2022
Posts: 67

#15

23 Feb 2024, 15:23

Ken, I am trying both your way and Clyde's suggestion.

Is there a way to restrict your code for years between 2000 and 2010.

To respond to your earlier question:

Consider two examples:

id 1002 that appears in 1999 2012
id 1003 that appears in 1998 2005 2006 2019

I would not like to apply your code to id 1002, because I see no observation between (inclusive) 2000 and 2010.

However, for id 1003, I have 2005 and 2006 observations. Hence, I would like to repliace the city and state entry from max(year) (i.e., most recent) of id 1003, which is 2019 entries, and make the artificial 2004 observation.

Thanks!

Originally posted by Ken Chui View Post

Code:

clear
input long id double year str20 city str8 state
1004 2000 "Wood Dale" "IL"
1004 2001 "Wood Dale" "IL"
1004 2002 "Wood Dale" "IL"
1004 2008 "Wood Dale" "IL"
1004 2009 "Wood Dale" "IL"
1004 2010 "Wood Dale" "IL"
1005 2007 "Beaconsfield" "MA"
1005 2008 "Beaconsfield" "MA"
1005 2009 "Beaconsfield" "MA"
1005 2010 "Beaconsfield" "MA"
1006 2003 "Sometown" "VA"
1006 2004 "Sometown" "VA"
1006 2005 "Sometown" "VA"
1006 2016 "Sometown" "VA"
end
* Create some values
set seed 1744237
gen x = runiform()

preserve
* 2004 indicator
gen has_2004 = (year == 2004)
* Select states with no 2004
collapse (max) year has_2004, by(id city state)
drop if has_2004 == 1
replace year = 2004
save temp01, replace
restore

* Append empty 2004 rows
append using temp01
* Copy values
bysort state city id (year): replace x = x[_n-1] if year == 2004 & has_2004 == 0

Outcome:

Code:

+----------------------------------------------------------+
| id year city state x has_2004 |
|----------------------------------------------------------|
1. | 1004 2000 Wood Dale IL .0248728 . |
2. | 1004 2001 Wood Dale IL .5002839 . |
3. | 1004 2002 Wood Dale IL .7049134 . |
4. | 1004 2004 Wood Dale IL .7049134 0 |
5. | 1004 2008 Wood Dale IL .0677949 . |
6. | 1004 2009 Wood Dale IL .9536643 . |
7. | 1004 2010 Wood Dale IL .1169867 . |
|----------------------------------------------------------|
8. | 1005 2004 Beaconsfield MA . 0 |
9. | 1005 2007 Beaconsfield MA .0696377 . |
10. | 1005 2008 Beaconsfield MA .972551 . |
11. | 1005 2009 Beaconsfield MA .945122 . |
12. | 1005 2010 Beaconsfield MA .1402545 . |
|----------------------------------------------------------|
13. | 1006 2003 Sometown VA .4325173 . |
14. | 1006 2004 Sometown VA .3057961 . |
15. | 1006 2005 Sometown VA .9123544 . |
16. | 1006 2016 Sometown VA .4709794 . |
+----------------------------------------------------------+

Crossed with #2. The solution is #2 is less convoluted.

Announcement

Generating a new row for specific year

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment