Create ID for panel dataset

Olivier Hoya

Join Date: Feb 2016
Posts: 12

Create ID for panel dataset

10 Apr 2017, 04:48

Hi everyone,

I’m having trouble creating a unique ID for each product in a data set. The data set is structured as follows:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte cluster int year str5 id str4 id2 int price int quant
1 2007 "D27AF" "A12"   12 700
1 2009 "D27AF" "A12"   14 724
1 2011 "D27AF" "A12"    8 824
1 2013 "D27AF" "A12"    6 900
1 2015 "D27AF" "A12"    4 1024
1 2007 "A12AB" "B78"   75 84
1 2009 "A12AB" "B78"   24 88
1 2011 "A12AB" "B78"  132 120
1 2013 "A12AB" "B78"   75 400
1 2015 "A74ZF" "B78"   12 280
9 2004 "112"   "BD18"  89 45
9 2006 "112"   "BD64"  89 78
9 2008 "112"   "BD18"  89 800
9 2010 "112"   "BD18"  67 824
9 2012 "112"   "BD32"  50 825
9 2014 "112"   "BD18"  34 890
9 2016 "112"   "BD18"  23 1000
end

Each line corresponds to product X in year Y. A product can belong to one of 13 clusters (variable: cluster). For ease of illustration, here I just picked two products from cluster 1, and one from cluster 9.

The prices of products in each cluster were collected at regular intervals, but in different years. Data were collected every two years between 2007 and 2015 for cluster 1, and every two years between 2004 and 2016 for cluster 9. (The years differ for some of the other clusters, which is not shown above.)

For many of the products, a unique identifier (variable: id) was created which remained constant over time (e.g., D27AF for the first product above). The difficulty, however, is that id did not remain constant over time for all products (e.g., for the second product above, the id changed from A12AB to A74ZF in 2015). For those products, another identifier (variable: id2) was created which remained constant over time (e.g., B78 for the second product).

There are three possible scenarios: (1) id is constant over time, but not id2 (product 3 above); (2) id2 is constant over time, but not id (product 1 above); or (3) both are constant over time (product 2 above).

So what I need to do is to create a new variable (variable: id3) to track the same product over time. In other words, I need to an identifier based on either id or id2 (or both) remaining constant over time within a given cluster. Again, the observation years varied from cluster to cluster. So the new variable should look like id3 below:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte cluster int year str5 id str4 id2 int price int quant byte id3
1 2007 "D27AF" "A12"   12 700 1
1 2009 "D27AF" "A12"   14 724 1
1 2011 "D27AF" "A12"    8 824 1
1 2013 "D27AF" "A12"    6 900 1
1 2015 "D27AF" "A12"    4 1024 1
1 2007 "A12AB" "B78"   75 84 2
1 2009 "A12AB" "B78"   24 88 2
1 2011 "A12AB" "B78"  132 120 2
1 2013 "A12AB" "B78"   75 400 2
1 2015 "A74ZF" "B78"   12 280 2
9 2004 "112"   "BD18"  89 45 3
9 2006 "112"   "BD64"  89 78 3
9 2008 "112"   "BD18"  89 800 3
9 2010 "112"   "BD18"  67 824 3
9 2012 "112"   "BD32"  50 825 3
9 2014 "112"   "BD18"  34 890 3
9 2016 "112"   "BD18"  23 1000 3
end

Here is the code I have so far, just based on tracking "id" over time, not id2 (Note: the first step with generating and replacing the variable check is to keep only those products for which we have pricing data in each year, i.e., 5 years for cluster 1, 7 for clusters 2, 3, and 4, etc.). This code correctly calculates Laspeyres indices based on products with the same id over time. I now want to extend this code to include products where id did not remain constant, but id2 did (since if both id and id2 remained constant, this would be captured by the code below):

Code:

* identify total number of observations for each medicine
sort cluster id year
by cluster id: generate total = _N
sort cluster id2 year
by cluster id2: generate total1 = _N

* identify complete series (drop those with missing years)
gen check = 1 if cluster==1 & total==5
replace check = 1 if cluster == 2 & (total == 7 | total1 == 7)
replace check = 1 if cluster == 3 & (total == 7 | total1 == 7)
replace check = 1 if cluster == 4 & (total == 7 | total1 == 7)
replace check = 1 if cluster == 6 & (total == 4 | total1 == 4)
replace check = 1 if cluster == 7 & (total == 5 | total1 == 5)
replace check = 1 if cluster == 8 & (total == 6 | total1 == 6)
replace check = 1 if cluster == 9 & (total == 7 | total1 == 7)
replace check = 1 if cluster == 10 & (total == 3 | total1 == 3)
replace check = 1 if cluster == 11 & (total == 3 | total1 == 3)
replace check = 1 if cluster == 12 & (total == 4 | total1 == 4)
replace check = 1 if cluster == 13 & (total == 4 | total1 == 4)

drop if check==.

* create panel id
by cluster id: generate count = _n

* set the time variable
egen new_id = group(id)
tsset new_id count

* calculate Laspeyres values
gen byte baseyear = 1 if count == 1
by new_id (baseyear), sort: gen lasp_input = price * quant[1]
by cluster count, sort: egen sum_lasp = sum(lasp_input)

* calculate Laspeyres indices
collapse (mean) sum_lasp baseyear, by(cluster count year)
sort cluster count
by cluster: gen lasp = sum_lasp / sum_lasp[1]

* plot indices
line lasp year, by(cluster) xlabel(2003(3)2016)

Please let me know if you need any other information or something explained more clearly.

Thank you very much.

Last edited by Olivier Hoya; 10 Apr 2017, 05:42.

Tags: None

Daniel Bela

Join Date: Apr 2014
Posts: 246

10 Apr 2017, 05:53

Hi Olivier,

the following code works for me:

Code:

// create data
clear
input byte cluster int year str5 id str4 id2 int price int quant
1 2007 "D27AF" "A12"   12 700
1 2009 "D27AF" "A12"   14 724
1 2011 "D27AF" "A12"    8 824
1 2013 "D27AF" "A12"    6 900
1 2015 "D27AF" "A12"    4 1024
1 2007 "A12AB" "B78"   75 84
1 2009 "A12AB" "B78"   24 88
1 2011 "A12AB" "B78"  132 120
1 2013 "A12AB" "B78"   75 400
1 2015 "A74ZF" "B78"   12 280
9 2004 "112"   "BD18"  89 45
9 2006 "112"   "BD64"  89 78
9 2008 "112"   "BD18"  89 800
9 2010 "112"   "BD18"  67 824
9 2012 "112"   "BD32"  50 825
9 2014 "112"   "BD18"  34 890
9 2016 "112"   "BD18"  23 1000
end

// generate a temporary identifier
egen temp_id=group(cluster id id2)

// correct the temporary identifier if observation is in the same cluster, and only one of id or id2 matches those of the observation above
sort cluster id id2 year
replace temp_id=temp_id[_n-1] if (cluster==cluster[_n-1] & (id==id[_n-1] | id2==id2[_n-1]))

// generate id3, just to fill in gaps
egen id3=group(temp_id)
drop temp_id

Hopefully, it will also work for you, or did I miss any point of the question?

Regards
Bela

Comment

Olivier Hoya

Join Date: Feb 2016

Posts: 12
#3

10 Apr 2017, 06:25

Hi Bela,

Many thanks for taking the time to help. Unfortunately, I'm not getting the intended result. I've visually inspected the new ID variable and quickly spot odd results. For example, there are a number of unique products (i.e., only one product with a particular id3), even though the minimum number of observations per product should be the number of years within a cluster.

I will try to diagnose the problem now, but I wanted to let you know in case you know what might be causing this.

Thanks again for your input.

Last edited by Olivier Hoya; 10 Apr 2017, 06:30.
Comment

Jorrit Gosens

Join Date: Jan 2015
Posts: 1019

10 Apr 2017, 07:47

Somethign like this, replaces id and id2 with the last value recorded for each:

Code:

gen idnew = id
bys id2 (year): replace idnew = id[_N]

gen id2new = id2
bys id (year): replace id2new = id2[_N]

gen id3 = idnew + "/" + id2new

Code:

. list

     +----------------------------------------------------------------------------+
     | cluster   year      id    id2   price   quant   idnew   id2new         id3 |
     |----------------------------------------------------------------------------|
  1. |       9   2004     112   BD18      89      45     112     BD18    112/BD18 |
  2. |       9   2006     112   BD64      89      78     112     BD18    112/BD18 |
  3. |       9   2008     112   BD18      89     800     112     BD18    112/BD18 |
  4. |       9   2010     112   BD18      67     824     112     BD18    112/BD18 |
  5. |       9   2012     112   BD32      50     825     112     BD18    112/BD18 |
     |----------------------------------------------------------------------------|
  6. |       9   2014     112   BD18      34     890     112     BD18    112/BD18 |
  7. |       9   2016     112   BD18      23    1000     112     BD18    112/BD18 |
  8. |       1   2007   A12AB    B78      75      84   A74ZF      B78   A74ZF/B78 |
  9. |       1   2009   A12AB    B78      24      88   A74ZF      B78   A74ZF/B78 |
 10. |       1   2011   A12AB    B78     132     120   A74ZF      B78   A74ZF/B78 |
     |----------------------------------------------------------------------------|
 11. |       1   2013   A12AB    B78      75     400   A74ZF      B78   A74ZF/B78 |
 12. |       1   2015   A74ZF    B78      12     280   A74ZF      B78   A74ZF/B78 |
 13. |       1   2007   D27AF    A12      12     700   D27AF      A12   D27AF/A12 |
 14. |       1   2009   D27AF    A12      14     724   D27AF      A12   D27AF/A12 |
 15. |       1   2011   D27AF    A12       8     824   D27AF      A12   D27AF/A12 |
     |----------------------------------------------------------------------------|
 16. |       1   2013   D27AF    A12       6     900   D27AF      A12   D27AF/A12 |
 17. |       1   2015   D27AF    A12       4    1024   D27AF      A12   D27AF/A12 |
     +----------------------------------------------------------------------------+

Comment

Daniel Bela

Join Date: Apr 2014

Posts: 246
#5

10 Apr 2017, 10:00

Originally posted by Olivier Hoya View Post

Many thanks for taking the time to help. Unfortunately, I'm not getting the intended result. I've visually inspected the new ID variable and quickly spot odd results. For example, there are a number of unique products (i.e., only one product with a particular id3), even though the minimum number of observations per product should be the number of years within a cluster.

I'm not sure I understand the problem; can you provide a new data example of observations where my code produces odd results? Unless Jorrit Gosens' straightforward solution does not already fix the problem, that is.

Regards
Bela
Comment

Daniel Bela

Join Date: Apr 2014
Posts: 246

10 Apr 2017, 12:43

After reviewing my code, I think it is possible that a sorting issue may arise when using it; I assumed that

Code:

sort cluster id id2 year

always yields a result that leaves all observations that belong to a product in one joint block. If this is not the case for your whole dataset, the -replace- will not work as expected. You may want to skip the -sort- statement (if your data is already sorted correctly, as your example suggests).

You could also try to generate the id's out of the data structure itself. If every -cluster- has a fixed (and reliably present) number of years per product, and data is already sorted correctly, something like this should also work (note that it will mess things up as soon as there are single years missing for a product):

Code:

// create data
clear
input byte cluster int year str5 id str4 id2 int price int quant
1 2007 "D27AF" "A12"   12 700
1 2009 "D27AF" "A12"   14 724
1 2011 "D27AF" "A12"    8 824
1 2013 "D27AF" "A12"    6 900
1 2015 "D27AF" "A12"    4 1024
1 2007 "A12AB" "B78"   75 84
1 2009 "A12AB" "B78"   24 88
1 2011 "A12AB" "B78"  132 120
1 2013 "A12AB" "B78"   75 400
1 2015 "A74ZF" "B78"   12 280
9 2004 "112"   "BD18"  89 45
9 2006 "112"   "BD64"  89 78
9 2008 "112"   "BD18"  89 800
9 2010 "112"   "BD18"  67 824
9 2012 "112"   "BD32"  50 825
9 2014 "112"   "BD18"  34 890
9 2016 "112"   "BD18"  23 1000
end

// generate obs number inside cluster
bysort cluster : generate clustobs=_n

// generate common number of waves per cluster
generate factor=.
quietly : levelsof cluster , local(clusters)
foreach clust of local clusters {
    quietly : levelsof year if cluster==`clust' , local(years)
    replace factor=wordcount(`"`years'"') if cluster==`clust'
}

// calculate individual begin and end of range that holds each product
generate rangebegin=cond(clustobs<=factor,1,factor*(ceil(clustobs/factor))-factor)
generate rangeend=cond(clustobs<=factor,factor,factor*(ceil(clustobs/factor)))

// generate id3, drop temporary variables
egen id3=group(cluster rangebegin rangeend)
drop clustobs factor rangebegin rangeend

But maybe this approach is a little to complicated, when Jorrit already presented a much simpler solution.

Regards
Bela

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

10 Apr 2017, 13:52

I think that this is something that can be done with group_id from SSC:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte cluster int year str5 id str4 id2 int price int quant
1 2007 "D27AF" "A12"   12 700
1 2009 "D27AF" "A12"   14 724
1 2011 "D27AF" "A12"    8 824
1 2013 "D27AF" "A12"    6 900
1 2015 "D27AF" "A12"    4 1024
1 2007 "A12AB" "B78"   75 84
1 2009 "A12AB" "B78"   24 88
1 2011 "A12AB" "B78"  132 120
1 2013 "A12AB" "B78"   75 400
1 2015 "A74ZF" "B78"   12 280
9 2004 "112"   "BD18"  89 45
9 2006 "112"   "BD64"  89 78
9 2008 "112"   "BD18"  89 800
9 2010 "112"   "BD18"  67 824
9 2012 "112"   "BD32"  50 825
9 2014 "112"   "BD18"  34 890
9 2016 "112"   "BD18"  23 1000
end

* generate a new identifier based on groups of -cluster id-
egen id3 = group(cluster id)

* further group id3 if they match within groups of -cluster id2-
group_id id3, matchby(cluster id2)

sort cluster id3 year
list, sepby(id3)

Comment

Daniel Bela

Join Date: Apr 2014

Posts: 246
#8

11 Apr 2017, 02:11

Originally posted by Robert Picard View Post

I think that this is something that can be done with group_id from SSC [...]

Wow, Robert. I never stumpled upon -group_id- before, albeit I could have needed it earlier. Thanks for the hint!

Regards
Bela
Comment
Olivier Hoya

Join Date: Feb 2016

Posts: 12
#9

11 Apr 2017, 02:51

Thank you very much for all the help.

Jorrit's solution worked great, as did Bela's once I dropped the sort line. I haven't tried your solution Robert, since I don't have the group_id command installed (and don't need it for the time being).

Thanks again.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#10

11 Apr 2017, 09:00

I was at the airport yesterday and about to board so I didn't read the whole thread and just went by the first two data examples and thought that group_id could be useful and wrote a quick example.

To expand further, grouping identifiers with problems like this is rather tricky and it's possible that you incorrectly grouped identifiers without noticing. You should at least confirm that when grouped by id the value of id3 does not change. This could happen if the bridge variable id2 is used more than once to group the same product. Here's a simplified problem where all but the last observation are for the same product. The bridge variable links two or more observations with different id codes and this happens more than once. As you can see, I adapted Jorrit's solution to this problem to show that is will not work in this case.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str5 id str2 bridge
"AAAAA" "X1"
"BBBBB" "X1"
"BBBBB" "X2"
"CCCCC" "X2"
"CCCCC" "X2"
"CCCCC" "X3"
"DDDDD" "X3"
"EEEEE" "X4"
end

egen newid = group(id)
group_id newid, matchby(bridge)

* adapt Jorrit's solution to this problem
gen idnew = id
bys bridge (id): replace idnew = id[_N]
gen id2new = bridge
bys id (bridge): replace id2new = bridge[_N]
gen id3 = idnew + "/" + id2new

* test that the grouping went as planned
bysort id (id3): gen bad = id3[1] != id3[_N]

sort newid id
list, sepby(newid)

and the results are:

Code:

. list, sepby(newid)

     +----------------------------------------------------------+
     |    id   bridge   newid   idnew   id2new        id3   bad |
     |----------------------------------------------------------|
  1. | AAAAA       X1       1   BBBBB       X1   BBBBB/X1     0 |
  2. | BBBBB       X1       1   BBBBB       X2   BBBBB/X2     1 |
  3. | BBBBB       X2       1   CCCCC       X2   CCCCC/X2     1 |
  4. | CCCCC       X2       1   CCCCC       X3   CCCCC/X3     1 |
  5. | CCCCC       X2       1   CCCCC       X3   CCCCC/X3     1 |
  6. | CCCCC       X3       1   DDDDD       X3   DDDDD/X3     1 |
  7. | DDDDD       X3       1   DDDDD       X3   DDDDD/X3     0 |
     |----------------------------------------------------------|
  8. | EEEEE       X4       5   EEEEE       X4   EEEEE/X4     0 |
     +----------------------------------------------------------+

Comment

Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#11

11 Apr 2017, 09:20

It's a fair point and good to warn Olivier about it.
Do note that the solution with -group_id- would also run into trouble here. It's just a different problem.
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#12

11 Apr 2017, 09:31

I don't understand, how would group_id run into trouble here?
Comment

Jorrit Gosens

Join Date: Jan 2015
Posts: 1019

#13

11 Apr 2017, 10:10

Because the problem described in #10 supposes overlaps between the two id variables that the original problem description did not have.
Not sure if thats very clearly described, but with an example, starting from the original data example, and replacing one value of id2 so that is is used for more than one product:

Code:

replace id2="A12" in 6
egen id3 = group(cluster id)
group_id id3, matchby(cluster id2)
sort cluster id3 year
list, sepby(id3)

results in:

Code:

     +-----------------------------------------------------+
     | cluster   year      id    id2   price   quant   id3 |
     |-----------------------------------------------------|
  1. |       1   2007   A12AB    A12      75      84     1 |
  2. |       1   2007   D27AF    A12      12     700     1 |
  3. |       1   2009   D27AF    A12      14     724     1 |
  4. |       1   2009   A12AB    B78      24      88     1 |
  5. |       1   2011   A12AB    B78     132     120     1 |
  6. |       1   2011   D27AF    A12       8     824     1 |
  7. |       1   2013   A12AB    B78      75     400     1 |
  8. |       1   2013   D27AF    A12       6     900     1 |
  9. |       1   2015   A74ZF    B78      12     280     1 |
 10. |       1   2015   D27AF    A12       4    1024     1 |
     |-----------------------------------------------------|
 11. |       9   2004     112   BD18      89      45     4 |
 12. |       9   2006     112   BD64      89      78     4 |
 13. |       9   2008     112   BD18      89     800     4 |
 14. |       9   2010     112   BD18      67     824     4 |
 15. |       9   2012     112   BD32      50     825     4 |
 16. |       9   2014     112   BD18      34     890     4 |
 17. |       9   2016     112   BD18      23    1000     4 |
     +-----------------------------------------------------+

.

edit: so it would be grouping together two groups, A12 and B78. Confusion may come from the fact that, as i understand it, cluster is not a unique product identifier.

Last edited by Jorrit Gosens; 11 Apr 2017, 10:14.

Comment

Robert Picard

Join Date: Mar 2014

Posts: 1536
#14

11 Apr 2017, 10:34

You are ignoring the fact that the original post implied that id2 can be used to conclude that the product is the same if id2 codes match. So the first two obs listed have the same value for A12 and so A12AB D27AF are the same product. If this is not the true then there's a problem with the identifiers that cannot be resolved by any approach since they are inconsistently grouped in id and id2.

I see no confusion about cluster as both cluster and id2 are grouped together for the purposes of grouping identifiers. When both cluster and id2 are the same, the product is considered the same, and all id codes inherit the same id3 code.

What's tricky is that you have to group identifiers by iterating until they stop changing.
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#15

11 Apr 2017, 10:57

The reason I A12 now identifies two groups is because I made a slight change to the data to make the point about both methods failing when either id value is used fro several products. In teh first line:

Code:

replace id2="A12" in 6

Wasn't really needed maybe. The example in #10 shows -newid- also groups together 4 different values of id into one group. So it looks like group_id would have similar difficulties with the more complicated problem in #10
Comment

Announcement

Create ID for panel dataset

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment