Panel Data: Creating new observations

Carlos Teigimiz

Join Date: May 2016

Posts: 26
#1

Panel Data: Creating new observations

26 May 2016, 17:19

Dear Statafriends,

I got a tricky issue. The thing is, I have to create new observations, not variables.
Obviously, creating new observations in Stata is neither comfortable nor does it make sense in most occassions, but this time it is necessary.

I already found this helpful thread, but my issue is more complex: http://www.stata.com/statalist/archi.../msg00875.html

My Data
I have panel data, identified by company_id and month_id.

uniq_company: unique value, one for each company_id

no_company: total number of distinct company_ids

You can find a small subset below.
My Dataset

Code:

clear input double company_id float(month_id uniq_company no_company) 10001 337 10001 11306 10001 338 . 11306 10001 339 . 11306 10001 340 . 11306 10001 341 . 11306 10001 342 . 11306 10001 343 . 11306 10001 344 . 11306 10001 345 . 11306 10001 346 . 11306 10001 347 . 11306 10001 348 . 11306 10001 349 . 11306 10002 414 10002 11306 10002 415 . 11306 10002 416 . 11306 10002 417 . 11306 10002 418 . 11306 10002 419 . 11306 10002 420 . 11306 10002 421 . 11306 10002 422 . 11306 10002 423 . 11306 10002 424 . 11306 10002 425 . 11306 10002 426 . 11306 10002 427 . 11306 10002 428 . 11306 10002 429 . 11306 10003 426 10003 11306 10003 427 . 11306 10003 428 . 11306 10003 429 . 11306 10003 430 . 11306 10003 431 . 11306 end format %tm month_id

What do I have to do?
Create 2 new observations, before the first month_id of each company_id.
It does not matter what happens with uniq_company no_company, they are only support-variables,
but I have to create two new (previous) month_id s for each company.

Code:

clear input double company_id float(month_id uniq_company no_company) 10001 335 . 11306 10001 336 . 11306 10001 337 10001 11306 10001 338 . 11306 10001 339 . 11306 . . . . 10002 412 . 11306 10002 413 . 11306 10002 414 10002 11306 10002 415 . 11306 10002 416 . 11306 end format %tm month_id

Why do I need this?
The intuition is, that I use this list as a criterion which is merged with other datasets and defines whether to discard a specific company_id in a specific month.
Unfortunately, the current, non-adjusted data would lag everything by 2 months, which implies that my list would tell the other dataset to exclude some months that would be necessary, creating a bias in my subsequently used data.

My Approach up to now
First, created the necessary number of empty cells for at least 1 month [set obs `=_N+no_company'] (how can I double that number?).
Then I tried to fill the empty cells with company_id and month_id less 1 month. I tried every way I could think of, this seemed to be the most promising, but it just does not work.

Code:

sort company_id month_id generate uniq_company = . by company_id: replace uniq_company = company_id if _n==1 egen no_company = count(uniq_company) set obs `=_N+no_company' by company_id month_id: replace company_id = uniq_company[_n==1] if missing(company_id) by company_id month_id: replace month_id = month_id[_n==1] - 1 if missing(month_id)

Any suggestions would be highly welcome!

Thank you,
Carlos

Last edited by Carlos Teigimiz; 26 May 2016, 17:21.
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30101

26 May 2016, 17:55

I think it's easier to create them separately and then append them to the original data:

Code:

clear
input double company_id float(month_id uniq_company no_company)
10001 337 10001 11306
10001 338     . 11306
10001 339     . 11306
10001 340     . 11306
10001 341     . 11306
10001 342     . 11306
10001 343     . 11306
10001 344     . 11306
10001 345     . 11306
10001 346     . 11306
10001 347     . 11306
10001 348     . 11306
10001 349     . 11306
10002 414 10002 11306
10002 415     . 11306
10002 416     . 11306
10002 417     . 11306
10002 418     . 11306
10002 419     . 11306
10002 420     . 11306
10002 421     . 11306
10002 422     . 11306
10002 423     . 11306
10002 424     . 11306
10002 425     . 11306
10002 426     . 11306
10002 427     . 11306
10002 428     . 11306
10002 429     . 11306
10003 426 10003 11306
10003 427     . 11306
10003 428     . 11306
10003 429     . 11306
10003 430     . 11306
10003 431     . 11306
end
format %tm month_id

// SET ASIDE THE ORIGINAL DATA
tempfile original
save `original'

// KEEP ONLY FIRST MONTH FOR EACH COMPANY ID
by company_id (month_id), sort: keep if _n == 1
// GENERATE TWO ADDITIONAL OBSERVATIONS FOR EACH
expand 3
// POPULATE month_id WITH THE TWO PRECEDING MONTHS
by company_id (month_id), sort: replace month_id = month_id[_n-1]-1 if _n > 1

// PUT IT ALL TOGETHER AND SORT IT INTO ORDER
append using `original'
sort company_id month_id

And thank you for using -dataex- and code blocks; it makes working with your information so much easier!

Comment

Carlos Teigimiz

Join Date: May 2016

Posts: 26
#3

26 May 2016, 18:19

Wow, awesome code.
Thanks so much Clyde!

One edit: if anybody else also needs this. Do not forget - duplicates drop - to get rid of double initial values

Last edited by Carlos Teigimiz; 26 May 2016, 18:46.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35698

27 May 2016, 00:59

Here's another way to do it in place:

Code:

clear
input double company_id float(month_id uniq_company no_company)
10001 337 10001 11306
10001 338     . 11306
10001 339     . 11306
10002 414 10002 11306
10002 415     . 11306
10002 416     . 11306
end
format %tm month_id

bysort company_id (month_id) : gen first = _n == 1
expand 3 if first
bysort company_id (month_id) : replace month_id = month_id - 2 if _n == 1
by company_id : replace month_id = month_id - 1 if _n == 2
 
list, sepby(company_id)

     +---------------------------------------------------+
     | compan~d   month_id   uniq_c~y   no_com~y   first |
     |---------------------------------------------------|
  1. |    10001    1987m12      10001      11306       1 |
  2. |    10001     1988m1      10001      11306       1 |
  3. |    10001     1988m2      10001      11306       1 |
  4. |    10001     1988m3          .      11306       0 |
  5. |    10001     1988m4          .      11306       0 |
     |---------------------------------------------------|
  6. |    10002     1994m5      10002      11306       1 |
  7. |    10002     1994m6      10002      11306       1 |
  8. |    10002     1994m7      10002      11306       1 |
  9. |    10002     1994m8          .      11306       0 |
 10. |    10002     1994m9          .      11306       0 |
     +---------------------------------------------------+

Last edited by Nick Cox; 27 May 2016, 01:20.

Comment

Jiong Yao

Join Date: Oct 2019

Posts: 25
#5

15 Oct 2019, 01:27

[QUOTE=Nick Cox;n1342828]Here's another way to do it in place:

[CODE]
clear
input double company_id float(month_id uniq_company no_company)
10001 337 10001 11306
10001 338 . 11306
10001 339 . 11306
10002 414 10002 11306
10002 415 . 11306
10002 416 . 11306
end
format %tm month_id

bysort company_id (month_id) : gen first = _n == 1
expand 3 if first
bysort company_id (month_id) : replace month_id = month_id - 2 if _n == 1
by company_id : replace month_id = month_id - 1 if _n == 2

list, sepby(company_id)

+---------------------------------------------------+
| compan~d month_id uniq_c~y no_com~y first |
|---------------------------------------------------|
1. | 10001 1987m12 10001 11306 1 |
2. | 10001 1988m1 10001 11306 1 |
3. | 10001 1988m2 10001 11306 1 |
4. | 10001 1988m3 . 11306 0 |
5. | 10001 1988m4 . 11306 0 |
|---------------------------------------------------|
6. | 10002 1994m5 10002 11306 1 |
7. | 10002 1994m6 10002 11306 1 |
8. | 10002 1994m7 10002 11306 1 |
9. | 10002 1994m8 . 11306 0 |
10. | 10002 1994m9 . 11306 0 |
+---------------------------------------------------+

ccode_1 ccode_2 year_1
2 20 2002
2 20 2003
2 20 2004
2 20 2005
2 20 2006
2 20 2007

Hi Nick...me again. my question is how to expand three months afterwards? actually, in may case, My data ends at 2007. i want to expand the observations to 2010. the first line of the command should be this: bysort ccode_1 ccode_2 (year_1): gen last = _n == _N. However, i cannot figure out the second line of command. thanks in advance!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#6

15 Oct 2019, 03:45

Sorry, but to me #5 is a real mess. The data you give at the start don't seem to have any bearing on the final question. Three months? three years? Too much of a a puzzle. If someone else can follow what you want, well and good. Otherwise I suggest you clean up, post again and get

Code:

delimiters right. Use the # key in the editor to get mark-up automatically.
Comment
Jiong Yao

Join Date: Oct 2019

Posts: 25
#7

16 Oct 2019, 10:33

Originally posted by Nick Cox View Post

Sorry, but to me #5 is a real mess. The data you give at the start don't seem to have any bearing on the final question. Three months? three years? Too much of a a puzzle. If someone else can follow what you want, well and good. Otherwise I suggest you clean up, post again and get

Code:

delimiters right. Use the # key in the editor to get mark-up automatically.

Hi Nick. i figured out this later on, just too tired to post it.
Here is my code on how to expand observations from 2007 to 2008. my data ends in 2007.
#
ccode1 ccode2 year

2 20 2002

2 20 2003

2 20 2004

2 20 2005

2 20 2006

2 20 2007

"ccode" stand for country code and 2 is for US and 20 is for Canand.
below is my code
******
bysort ccode1 ccode2 (year): gen last = _n == _N
bysort ccode1 ccode2 (year): replace year = year +3 if _n == _N
bysort ccode1 ccode2 (year): replace year = year +3 if _n == _N-1
bysort ccode1 ccode2 (year): replace year = year +3 if _n == _N-2
drop if year < 2008
drop last
save "/Users/yaojiong/Desktop/dyadic mid1.dta", replace
clear
****
basically, this helps me to generate observations from 2008-10.
then, just append with the original data.
do you think it is right? Thanks in advance.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#8

16 Oct 2019, 12:25

Thanks for the extra detail but I am still in a fog about what your problem is and why your code is a solution.

Data examples are best provided with dataex: see https://www.statalist.org/forums/help#stata

Here is some technique for adding years 2008, 2009, 2010 to a dataset with year 2002 to 2007.

Code:

clear input ccode1 ccode2 year 2 20 2002 2 20 2003 2 20 2004 2 20 2005 2 20 2006 2 20 2007 end bysort ccode1 ccode2 (year) : gen toexpand = _n == _N expand 4 if toexpand bysort ccode1 ccode2 (year) : replace year = year[1] + _n - 1 drop toexpand list, sepby(ccode*) +------------------------+ | ccode1 ccode2 year | |------------------------| 1. | 2 20 2002 | 2. | 2 20 2003 | 3. | 2 20 2004 | 4. | 2 20 2005 | 5. | 2 20 2006 | 6. | 2 20 2007 | 7. | 2 20 2008 | 8. | 2 20 2009 | 9. | 2 20 2010 | +------------------------+

1. That said, there is no obvious point in bulking out a dataset with identifiers and a time variable unless you have other variables to add with non-missing values.

2. Further, in your own code you drop data for before 2008, so why do that? If you just want a dataset with 3 years for each panel, there will be code to do that directly. But you don't want that, so see my code instead. Yet comment #1 still applies.
Comment
Jiong Yao

Join Date: Oct 2019

Posts: 25
#9

16 Oct 2019, 12:56

Originally posted by Nick Cox View Post

Thanks for the extra detail but I am still in a fog about what your problem is and why your code is a solution.

Data examples are best provided with dataex: see https://www.statalist.org/forums/help#stata

Here is some technique for adding years 2008, 2009, 2010 to a dataset with year 2002 to 2007.

Code:

clear input ccode1 ccode2 year 2 20 2002 2 20 2003 2 20 2004 2 20 2005 2 20 2006 2 20 2007 end bysort ccode1 ccode2 (year) : gen toexpand = _n == _N expand 4 if toexpand bysort ccode1 ccode2 (year) : replace year = year[1] + _n - 1 drop toexpand list, sepby(ccode*) +------------------------+ | ccode1 ccode2 year | |------------------------| 1. | 2 20 2002 | 2. | 2 20 2003 | 3. | 2 20 2004 | 4. | 2 20 2005 | 5. | 2 20 2006 | 6. | 2 20 2007 | 7. | 2 20 2008 | 8. | 2 20 2009 | 9. | 2 20 2010 | +------------------------+

1. That said, there is no obvious point in bulking out a dataset with identifiers and a time variable unless you have other variables to add with non-missing values.

2. Further, in your own code you drop data for before 2008, so why do that? If you just want a dataset with 3 years for each panel, there will be code to do that directly. But you don't want that, so see my code instead. Yet comment #1 still applies.

Thanks Nick. You code works perfectly fine. yes, i want to add other variables to the dataset. one more question, could you explain the logic of the third line of the command, especially " replace year = year[1] + _n - 1"? Thanks again for your help!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#10

16 Oct 2019, 13:43

See any explanation of the by: prefix command such as my paper in Stata Journal 2(1) from 2002.

With that prefix year[1] is the first value in each group and _n runs 1 upwards within each group. So the code gives a sequence of integers as result.
1 like
Comment

Guest

#11

20 Jul 2021, 06:45

Originally posted by Nick Cox View Post

Here's another way to do it in place:

Code:

clear
input double company_id float(month_id uniq_company no_company)
10001 337 10001 11306
10001 338 . 11306
10001 339 . 11306
10002 414 10002 11306
10002 415 . 11306
10002 416 . 11306
end
format %tm month_id

bysort company_id (month_id) : gen first = _n == 1
expand 3 if first
bysort company_id (month_id) : replace month_id = month_id - 2 if _n == 1
by company_id : replace month_id = month_id - 1 if _n == 2

list, sepby(company_id)

+---------------------------------------------------+
| compan~d month_id uniq_c~y no_com~y first |
|---------------------------------------------------|
1. | 10001 1987m12 10001 11306 1 |
2. | 10001 1988m1 10001 11306 1 |
3. | 10001 1988m2 10001 11306 1 |
4. | 10001 1988m3 . 11306 0 |
5. | 10001 1988m4 . 11306 0 |
|---------------------------------------------------|
6. | 10002 1994m5 10002 11306 1 |
7. | 10002 1994m6 10002 11306 1 |
8. | 10002 1994m7 10002 11306 1 |
9. | 10002 1994m8 . 11306 0 |
10. | 10002 1994m9 . 11306 0 |
+---------------------------------------------------+

Really helpful, thanks. My question would be if it is somehow possible to grasp the last observation of a group, not the first. Lets say the observations of different groups (maybe countries) start at different years. With _n == 1 and
year_id + 1 (for example) we would not always end at the end of the groups observations. But if we could say _n = "last observation" one could easily add a further year (maybe to predict one year ahead). Is this somehow possible ?

Last edited by sladmin; 11 Apr 2022, 07:23. Reason: anonymize original poster

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35698
#12

21 Jul 2021, 00:49

This variation presumably starts with

Code:

bysort company_id (month_id) : gen last = _n == _N

See

Code:

help _variables

for basic information and nearby sections of [U] for elaboration.
Comment
Guest
#13

21 Jul 2021, 08:11

Originally posted by Nick Cox View Post

This variation presumably starts with

Code:

bysort company_id (month_id) : gen last = _n == _N

See

Code:

help _variables

for basic information and nearby sections of [U] for elaboration.

Code:

egen id = group(Country) xtset id Year, yearly sort id Year bysort id (Year) : gen first = _n == _N if Year > 2018 expand 2 if first == 1 bysort id (Year) : replace Year = Year +1 if _n == _N replace LMR = . if Year > 2019 gen lag_LMR = l.LMR

Thank you that was really helpful. So this is my line of code but if i want to execute the last command "gen lag_LMR = l.LMR" Stata says "not sorted". Everything looks sorted and also sorting again isnt helping. Could you tell me why ? Thanks in advance.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#14

21 Jul 2021, 08:31

I think Stata is working on the principle that you have explicitly sorted your data since the last xtset. In principle you could (also) have changed the data so that xtset no longer holds I guess that if you xtset again, the calculation will then go through.
Comment
Guest
#15

21 Jul 2021, 09:55

Originally posted by Nick Cox View Post

I think Stata is working on the principle that you have explicitly sorted your data since the last xtset. In principle you could (also) have changed the data so that xtset no longer holds I guess that if you xtset again, the calculation will then go through.

It worked. Thank you very much.
Comment

ccode1	ccode2	year
2	20	2002
2	20	2003
2	20	2004
2	20	2005
2	20	2006
2	20	2007

Announcement