Hi everyone,
I’m having trouble creating a unique ID for each product in a data set. The data set is structured as follows:
Each line corresponds to product X in year Y. A product can belong to one of 13 clusters (variable: cluster). For ease of illustration, here I just picked two products from cluster 1, and one from cluster 9.
The prices of products in each cluster were collected at regular intervals, but in different years. Data were collected every two years between 2007 and 2015 for cluster 1, and every two years between 2004 and 2016 for cluster 9. (The years differ for some of the other clusters, which is not shown above.)
For many of the products, a unique identifier (variable: id) was created which remained constant over time (e.g., D27AF for the first product above). The difficulty, however, is that id did not remain constant over time for all products (e.g., for the second product above, the id changed from A12AB to A74ZF in 2015). For those products, another identifier (variable: id2) was created which remained constant over time (e.g., B78 for the second product).
There are three possible scenarios: (1) id is constant over time, but not id2 (product 3 above); (2) id2 is constant over time, but not id (product 1 above); or (3) both are constant over time (product 2 above).
So what I need to do is to create a new variable (variable: id3) to track the same product over time. In other words, I need to an identifier based on either id or id2 (or both) remaining constant over time within a given cluster. Again, the observation years varied from cluster to cluster. So the new variable should look like id3 below:
Here is the code I have so far, just based on tracking "id" over time, not id2 (Note: the first step with generating and replacing the variable check is to keep only those products for which we have pricing data in each year, i.e., 5 years for cluster 1, 7 for clusters 2, 3, and 4, etc.). This code correctly calculates Laspeyres indices based on products with the same id over time. I now want to extend this code to include products where id did not remain constant, but id2 did (since if both id and id2 remained constant, this would be captured by the code below):
Please let me know if you need any other information or something explained more clearly.
Thank you very much.
I’m having trouble creating a unique ID for each product in a data set. The data set is structured as follows:
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input byte cluster int year str5 id str4 id2 int price int quant 1 2007 "D27AF" "A12" 12 700 1 2009 "D27AF" "A12" 14 724 1 2011 "D27AF" "A12" 8 824 1 2013 "D27AF" "A12" 6 900 1 2015 "D27AF" "A12" 4 1024 1 2007 "A12AB" "B78" 75 84 1 2009 "A12AB" "B78" 24 88 1 2011 "A12AB" "B78" 132 120 1 2013 "A12AB" "B78" 75 400 1 2015 "A74ZF" "B78" 12 280 9 2004 "112" "BD18" 89 45 9 2006 "112" "BD64" 89 78 9 2008 "112" "BD18" 89 800 9 2010 "112" "BD18" 67 824 9 2012 "112" "BD32" 50 825 9 2014 "112" "BD18" 34 890 9 2016 "112" "BD18" 23 1000 end
The prices of products in each cluster were collected at regular intervals, but in different years. Data were collected every two years between 2007 and 2015 for cluster 1, and every two years between 2004 and 2016 for cluster 9. (The years differ for some of the other clusters, which is not shown above.)
For many of the products, a unique identifier (variable: id) was created which remained constant over time (e.g., D27AF for the first product above). The difficulty, however, is that id did not remain constant over time for all products (e.g., for the second product above, the id changed from A12AB to A74ZF in 2015). For those products, another identifier (variable: id2) was created which remained constant over time (e.g., B78 for the second product).
There are three possible scenarios: (1) id is constant over time, but not id2 (product 3 above); (2) id2 is constant over time, but not id (product 1 above); or (3) both are constant over time (product 2 above).
So what I need to do is to create a new variable (variable: id3) to track the same product over time. In other words, I need to an identifier based on either id or id2 (or both) remaining constant over time within a given cluster. Again, the observation years varied from cluster to cluster. So the new variable should look like id3 below:
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input byte cluster int year str5 id str4 id2 int price int quant byte id3 1 2007 "D27AF" "A12" 12 700 1 1 2009 "D27AF" "A12" 14 724 1 1 2011 "D27AF" "A12" 8 824 1 1 2013 "D27AF" "A12" 6 900 1 1 2015 "D27AF" "A12" 4 1024 1 1 2007 "A12AB" "B78" 75 84 2 1 2009 "A12AB" "B78" 24 88 2 1 2011 "A12AB" "B78" 132 120 2 1 2013 "A12AB" "B78" 75 400 2 1 2015 "A74ZF" "B78" 12 280 2 9 2004 "112" "BD18" 89 45 3 9 2006 "112" "BD64" 89 78 3 9 2008 "112" "BD18" 89 800 3 9 2010 "112" "BD18" 67 824 3 9 2012 "112" "BD32" 50 825 3 9 2014 "112" "BD18" 34 890 3 9 2016 "112" "BD18" 23 1000 3 end
Code:
* identify total number of observations for each medicine sort cluster id year by cluster id: generate total = _N sort cluster id2 year by cluster id2: generate total1 = _N * identify complete series (drop those with missing years) gen check = 1 if cluster==1 & total==5 replace check = 1 if cluster == 2 & (total == 7 | total1 == 7) replace check = 1 if cluster == 3 & (total == 7 | total1 == 7) replace check = 1 if cluster == 4 & (total == 7 | total1 == 7) replace check = 1 if cluster == 6 & (total == 4 | total1 == 4) replace check = 1 if cluster == 7 & (total == 5 | total1 == 5) replace check = 1 if cluster == 8 & (total == 6 | total1 == 6) replace check = 1 if cluster == 9 & (total == 7 | total1 == 7) replace check = 1 if cluster == 10 & (total == 3 | total1 == 3) replace check = 1 if cluster == 11 & (total == 3 | total1 == 3) replace check = 1 if cluster == 12 & (total == 4 | total1 == 4) replace check = 1 if cluster == 13 & (total == 4 | total1 == 4) drop if check==. * create panel id by cluster id: generate count = _n * set the time variable egen new_id = group(id) tsset new_id count * calculate Laspeyres values gen byte baseyear = 1 if count == 1 by new_id (baseyear), sort: gen lasp_input = price * quant[1] by cluster count, sort: egen sum_lasp = sum(lasp_input) * calculate Laspeyres indices collapse (mean) sum_lasp baseyear, by(cluster count year) sort cluster count by cluster: gen lasp = sum_lasp / sum_lasp[1] * plot indices line lasp year, by(cluster) xlabel(2003(3)2016)
Thank you very much.
Comment