Hello,
I am trying to generate variables but since some of the commands use - collapse -, I am using preserve and restore at the same time.
However, I see that if I use preserve and restore, generated variables are lost after.
Below is the sample of my dataset (I had too many variables so some of them are dropped here):
And my commands are as follows:
As you may see above, I used quite often preserve and restore to compute share of missing values of each variables, but then I couldn't export all the output in the excel file (share of missing values - that are computed).
Maybe the above code is not efficient at all and have many mistakes.. I am a bit beginner and these commands could make you confused (since I used too much gl/loc.. sorry I just maybe have to leave as it is for now)
Could someone pleaseeee help me with this issue?
I am a bit in hurry but I have no clue of making this efficiently..
Thank you
I am trying to generate variables but since some of the commands use - collapse -, I am using preserve and restore at the same time.
However, I see that if I use preserve and restore, generated variables are lost after.
Below is the sample of my dataset (I had too many variables so some of them are dropped here):
Code:
* Example generated by -dataex-. For more info, type help dataex clear input long ID int year float ind_2 int L long(K_tan K_intan) int birthyr long labour_cost double gross_output 1 2011 49 3 17045 0 2010 8146 186417 1 2012 49 10 62261 0 2010 59612 783986 1 2013 49 12 607367 0 2010 38053 827543 1 2014 49 12 1158161 0 2010 41192 1047041 1 2015 49 19 1059901 0 2010 80598 1681563 1 2016 49 20 1665408 0 2010 142738 2129586 1 2017 49 38 2726397 0 2010 225986 3683801 1 2018 49 50 185213 0 2010 306521 1196452 1 2019 49 58 4671962 0 2010 352238 7132984 1 2020 49 73 328650 0 2010 447460 1217299 1 2021 49 61 0 153 2010 563207 5128763 2 2002 45 3 0 0 2000 14995 21235 2 2003 45 2 0 0 2000 2772 6077 2 2010 45 3 0 0 2000 5472 51969 2 2011 49 8 629 0 2000 33387 1147442 2 2012 49 3 441 0 2000 14899 330445 3 2012 47 1 0 0 2010 99 109 3 2013 47 1 0 0 2010 117 1806 3 2014 47 1 0 0 2010 581 1899 3 2015 47 1 794 25 2010 119 2195 3 2016 47 1 180 0 2010 977 962 3 2020 47 1 0 0 2010 189 478 4 2014 49 1 0 0 2014 3051 32310 4 2015 49 1 0 0 2014 76 22862 4 2016 49 1 0 0 2014 7530 24850 4 2017 49 1 0 0 2014 7710 14800 4 2018 49 1 0 0 2014 970 8370 5 2005 74 8 9949 2274 2005 20575 90212 5 2006 74 15 20227 2910 2005 74110 216460 5 2007 74 17 16116 2462 2005 99327 300687 5 2008 74 21 36979 1508 2005 122379 346879 5 2009 74 14 30855 1300 2005 88978 230137 5 2010 74 12 17847 490 2005 51632 180377 5 2011 74 11 25261 41 2005 71498 222823 5 2012 74 9 29349 2 2005 53191 174089 5 2013 74 10 22771 2 2005 55979 185864 5 2014 74 10 30818 2 2005 65695 209759 5 2015 74 7 21419 0 2005 61093 165402 5 2016 74 9 16497 0 2005 66471 207433 5 2017 74 10 23627 0 2005 72950 226992 5 2018 74 9 24914 0 2005 80243 222916 5 2019 74 9 34455 0 2005 86144 259266 5 2020 74 9 45173 0 2005 72859 262643 5 2021 74 9 68867 0 2005 80536 256532 6 2015 68 1 0 0 2015 105 0 6 2016 68 2 1425349 0 2015 2767 862687 6 2017 68 3 1905736 0 2015 3428 2113314 6 2018 68 4 1152312 0 2015 3698 1749990 6 2019 68 2 47 0 2015 3360 160634 6 2020 68 1 1 0 2015 2910 301451 6 2021 68 1 1 0 2015 5605 0 7 2001 46 2 0 0 2000 1101 2339 7 2002 46 2 0 0 2000 1468 1785 8 2007 45 2 34299 0 1996 13117 201808 8 2008 45 4 28108 0 1996 17734 140613 8 2009 45 5 20827 0 1996 20849 87805 8 2010 45 9 130760 0 1996 41803 150710 8 2011 45 17 127824 0 1996 74873 215586 8 2012 45 17 131904 0 1996 94513 354845 8 2013 45 23 136737 0 1996 164641 410049 8 2014 45 25 201024 0 1996 235548 561338 8 2015 45 29 182862 0 1996 354721 842975 8 2016 45 26 716969 0 1996 419840 1119406 8 2017 45 32 790568 0 1996 561354 961521 8 2018 45 35 752523 3605 1996 639588 941013 8 2019 45 36 749347 2403 1996 685477 1099951 8 2020 45 31 621511 1202 1996 650221 1647524 8 2021 45 30 540758 0 1996 734375 1674707 9 2011 22 5 0 0 2005 40572 46956 9 2012 25 31 0 10299 2005 291612 369526 9 2013 62 57 20557 0 2005 866119 1058617 9 2014 62 78 53913 0 2005 2152651 2456436 9 2015 62 108 133875 0 2005 3418292 3881834 9 2016 62 142 247571 0 2005 5165333 6549122 9 2017 62 243 297798 0 2005 8939130 11768150 9 2018 62 350 770733 0 2005 12275976 16810625 9 2019 62 356 583773 0 2005 15286289 18394524 9 2020 62 417 678427 0 2005 18677638 24021346 9 2021 62 459 808029 0 2005 30797789 37185523 11 2011 22 2 5178 0 2011 4842 55665 11 2012 22 3 5700 0 2011 17622 89856 11 2013 22 4 6897 0 2011 24119 115440 11 2014 22 6 52975 1762 2011 38371 235647 11 2015 22 8 162773 990 2011 58896 245016 11 2016 22 8 119182 1035 2011 90268 421182 11 2017 22 6 75751 1164 2011 91232 394008 11 2018 22 6 90856 7940 2011 86930 395675 11 2019 22 4 45934 10373 2011 54458 633982 11 2020 22 6 46266 13766 2011 121345 598356 11 2021 22 7 30226 9469 2011 137234 672991 22 2010 47 1 0 0 2010 241 538 22 2011 47 1 0 0 2010 522 354 22 2012 47 2 0 0 2010 4044 2550 33 2000 46 18 265527 0 1996 35783 19730 33 2001 46 22 282713 2301 1996 43497 92707 33 2002 46 25 438715 5903 1996 48014 597168 33 2003 46 27 388077 27406 1996 70145 398267 33 2004 46 26 502560 29300 1996 64270 7013794 33 2005 46 34 1792054 15224 1996 106997 5401058 33 2006 46 31 686742 3170 1996 128305 -139586 end
Code:
** Variables gl id ID gl yr year gl ind ind_2digit gl ind_old gl ind_new $ind gl birth birth gl L L_empl gl pers L_perseng gl LC labour_cost gl GO gross_output gl VA value_added gl II intermed gl Ktan K_tan gl Kintan K_intan gl Itan invest_tan gl Iintan invest_intan gl BG business_group gl vars_L L_empl labour_cost wage gl vars_K capital_tan capital_intan gl vars_I invest_tan invest_intan gl vars_output gross_output value_added gen wage = $LC / $L gen K_L = $Ktan / $L gen GO_L = $GO / $L gen VA_L = $VA / $L gl vars_ratio_L K_L GO_L VA_L gen LC_GO = $LC / $GO gen LC_VA = $LC / $VA gl vars_ratio_LC LC_GO LC_VA * Generating tables - Share of missing obs. gl vars_to_check ${L} /// ${LC} /// wage /// ${Ktan} /// ${Kintan} /// ${Itan} /// ${Iintan} /// ${GO} /// ${VA} /// VA_L ** Birth year missing // 1. By firm-level preserve egen missing_f_birth = min(missing(${birth})), by(${id}) bys ${id}: keep if _n == 1 egen nb_missing_birth = total(missing_f_birth) gen s_m_f_birth = nb_missing_birth / _N drop missing_f_birth nb_missing_birth restore // 2. By observation-level count if missing(${birth}) local miss = r(N) count local total_nb = r(N) gen s_m_ob_birth = `miss'/`total_nb' ** Computing share of missing values // 1. Observation-level foreach var of global vars_to_check{ bys ${yr}: egen miss_ob_`var' = total(missing(`var')) gen s_m_ob_`var' = miss_ob_`var' / _N format s_m_ob_`var' %9.2f } // 2. Firm-level // 2-1. Always missing (==1 across ID) foreach var of global vars_to_check{ preserve egen f_miss_`var' = max(missing(`var')), by(${id}) bys ${id}: keep if _n == 1 egen nb_f_miss_`var' = total(f_miss_`var') gen s_m_f_`var' = nb_f_miss_`var' / _N restore } // 2-2. Above 50% missing (==1 across ID) foreach var of global vars_to_check{ preserve bys ${id}: gen halfobs = _N / 2 // Count nb of missing obs of variable by ID egen miss_`var' = total(missing(`var')), by(${id}) * == 1 if nb of missing obs is larger than 50% of its nb of observations bys ${id}: gen missing_`var' = (miss_`var' >= halfobs) * drop duplicates bys ${id}: keep if _n == 1 * count total nb of firms with missing variable egen nb_missing_`var' = total(missing_`var') * Compute share gen s_m_f_v2_`var' = nb_missing_`var' / _N restore } ** Industry code missing preserve collapse (count) ${ind_new} ${ind_old} ${id}, by(${yr}) local ind_all ${ind_new} ${ind_old} foreach var of local ind_all { capture confirm variable `var' if !_rc { bysort ${yr}: gen s_m_`var' = 1 - (`var' != .) / ${id} } } restore ** Exporting share of missing values preserve keep ${yr} s_m_* // s_m_f_ & s_m_ob_ ds s_m_* local share `r(varlist)' foreach var of local share { local new = substr("`var'", 5, .) ren `var' `new' } bys ${yr}: keep if _n == 1 export excel using "missing_values.xlsx", firstrow(variables) replace restore
As you may see above, I used quite often preserve and restore to compute share of missing values of each variables, but then I couldn't export all the output in the excel file (share of missing values - that are computed).
Maybe the above code is not efficient at all and have many mistakes.. I am a bit beginner and these commands could make you confused (since I used too much gl/loc.. sorry I just maybe have to leave as it is for now)
Could someone pleaseeee help me with this issue?
I am a bit in hurry but I have no clue of making this efficiently..
Thank you
Comment