Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generate variables with preserve and restore

    Hello,

    I am trying to generate variables but since some of the commands use - collapse -, I am using preserve and restore at the same time.
    However, I see that if I use preserve and restore, generated variables are lost after.
    Below is the sample of my dataset (I had too many variables so some of them are dropped here):

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long ID int year float ind_2 int L long(K_tan K_intan) int birthyr long labour_cost double gross_output
    1 2011 49   3   17045     0 2010     8146   186417
    1 2012 49  10   62261     0 2010    59612   783986
    1 2013 49  12  607367     0 2010    38053   827543
    1 2014 49  12 1158161     0 2010    41192  1047041
    1 2015 49  19 1059901     0 2010    80598  1681563
    1 2016 49  20 1665408     0 2010   142738  2129586
    1 2017 49  38 2726397     0 2010   225986  3683801
    1 2018 49  50  185213     0 2010   306521  1196452
    1 2019 49  58 4671962     0 2010   352238  7132984
    1 2020 49  73  328650     0 2010   447460  1217299
    1 2021 49  61       0   153 2010   563207  5128763
    2 2002 45   3       0     0 2000    14995    21235
    2 2003 45   2       0     0 2000     2772     6077
    2 2010 45   3       0     0 2000     5472    51969
    2 2011 49   8     629     0 2000    33387  1147442
    2 2012 49   3     441     0 2000    14899   330445
    3 2012 47   1       0     0 2010       99      109
    3 2013 47   1       0     0 2010      117     1806
    3 2014 47   1       0     0 2010      581     1899
    3 2015 47   1     794    25 2010      119     2195
    3 2016 47   1     180     0 2010      977      962
    3 2020 47   1       0     0 2010      189      478
    4 2014 49   1       0     0 2014     3051    32310
    4 2015 49   1       0     0 2014       76    22862
    4 2016 49   1       0     0 2014     7530    24850
    4 2017 49   1       0     0 2014     7710    14800
    4 2018 49   1       0     0 2014      970     8370
    5 2005 74   8    9949  2274 2005    20575    90212
    5 2006 74  15   20227  2910 2005    74110   216460
    5  2007 74  17   16116  2462 2005    99327   300687
    5 2008 74  21   36979  1508 2005   122379   346879
    5 2009 74  14   30855  1300 2005    88978   230137
    5 2010 74  12   17847   490 2005    51632   180377
    5 2011 74  11   25261    41 2005    71498   222823
    5 2012 74   9   29349     2 2005    53191   174089
    5 2013 74  10   22771     2 2005    55979   185864
    5 2014 74  10   30818     2 2005    65695   209759
    5 2015 74   7   21419     0 2005    61093   165402
    5 2016 74   9   16497     0 2005    66471   207433
    5 2017 74  10   23627     0 2005    72950   226992
    5 2018 74   9   24914     0 2005    80243   222916
    5 2019 74   9   34455     0 2005    86144   259266
    5 2020 74   9   45173     0 2005    72859   262643
    5 2021 74   9   68867     0 2005    80536   256532
    6 2015 68   1       0     0 2015      105        0
    6 2016 68   2 1425349     0 2015     2767   862687
    6 2017 68   3 1905736     0 2015     3428  2113314
    6 2018 68   4 1152312     0 2015     3698  1749990
    6 2019 68   2      47     0 2015     3360   160634
    6 2020 68   1       1     0 2015     2910   301451
    6 2021 68   1       1     0 2015     5605        0
    7 2001 46   2       0     0 2000     1101     2339
    7 2002 46   2       0     0 2000     1468     1785
    8 2007 45   2   34299     0 1996    13117   201808
    8 2008 45   4   28108     0 1996    17734   140613
    8 2009 45   5   20827     0 1996    20849    87805
    8 2010 45   9  130760     0 1996    41803   150710
    8 2011 45  17  127824     0 1996    74873   215586
    8 2012 45  17  131904     0 1996    94513   354845
    8 2013 45  23  136737     0 1996   164641   410049
    8 2014 45  25  201024     0 1996   235548   561338
    8 2015 45  29  182862     0 1996   354721   842975
    8 2016 45  26  716969     0 1996   419840  1119406
    8 2017 45  32  790568     0 1996   561354   961521
    8 2018 45  35  752523  3605 1996   639588   941013
    8 2019 45  36  749347  2403 1996   685477  1099951
    8 2020 45  31  621511  1202 1996   650221  1647524
    8 2021 45  30  540758     0 1996   734375  1674707
    9 2011 22   5       0     0 2005    40572    46956
    9 2012 25  31       0 10299 2005   291612   369526
    9 2013 62  57   20557     0 2005   866119  1058617
    9 2014 62  78   53913     0 2005  2152651  2456436
    9 2015 62 108  133875     0 2005  3418292  3881834
    9 2016 62 142  247571     0 2005  5165333  6549122
    9 2017 62 243  297798     0 2005  8939130 11768150
    9 2018 62 350  770733     0 2005 12275976 16810625
    9 2019 62 356  583773     0 2005 15286289 18394524
    9 2020 62 417  678427     0 2005 18677638 24021346
    9 2021 62 459  808029     0 2005 30797789 37185523
    11 2011 22   2    5178     0 2011     4842    55665
    11 2012 22   3    5700     0 2011    17622    89856
    11 2013 22   4    6897     0 2011    24119   115440
    11 2014 22   6   52975  1762 2011    38371   235647
    11 2015 22   8  162773   990 2011    58896   245016
    11 2016 22   8  119182  1035 2011    90268   421182
    11 2017 22   6   75751  1164 2011    91232   394008
    11 2018 22   6   90856  7940 2011    86930   395675
    11 2019 22   4   45934 10373 2011    54458   633982
    11 2020 22   6   46266 13766 2011   121345   598356
    11 2021 22   7   30226  9469 2011   137234   672991
    22 2010 47   1       0     0 2010      241      538
    22 2011 47   1       0     0 2010      522      354
    22 2012 47   2       0     0 2010     4044     2550
    33 2000 46  18  265527     0 1996    35783    19730
    33 2001 46  22  282713  2301 1996    43497    92707
    33 2002 46  25  438715  5903 1996    48014   597168
    33 2003 46  27  388077 27406 1996    70145   398267
    33 2004 46  26  502560 29300 1996    64270  7013794
    33 2005 46  34 1792054 15224 1996   106997  5401058
    33 2006 46  31  686742  3170 1996   128305  -139586
    end
    And my commands are as follows:
    Code:
    ** Variables
    gl id                ID
    gl yr                year
    gl ind                ind_2digit
    gl ind_old            
    gl ind_new            $ind
    gl birth            birth
    gl L                 L_empl
    gl pers                L_perseng
    gl LC                labour_cost
    gl GO                 gross_output
    gl VA                value_added
    gl II                intermed
    gl Ktan                K_tan
    gl Kintan            K_intan
    gl Itan                invest_tan
    gl Iintan            invest_intan
    gl BG                business_group
    gl vars_L             L_empl labour_cost wage
    gl vars_K             capital_tan capital_intan
    gl vars_I             invest_tan invest_intan
    gl vars_output         gross_output value_added
    
    
    gen wage = $LC / $L
    gen K_L = $Ktan / $L
    gen GO_L = $GO / $L
    gen VA_L = $VA / $L
    
    gl vars_ratio_L  K_L GO_L VA_L 
    
    gen LC_GO = $LC / $GO
    gen LC_VA = $LC / $VA
    
    gl vars_ratio_LC LC_GO LC_VA
    
    * Generating tables - Share of missing obs.
    gl vars_to_check      ${L}             ///
                         ${LC}             ///
                         wage             ///
                         ${Ktan}         ///
                         ${Kintan}         ///
                         ${Itan}         ///
                         ${Iintan}       ///
                         ${GO}             ///
                         ${VA}            ///
                         VA_L             
                         
                         
    ** Birth year missing        
        // 1. By firm-level    
    preserve
    
    egen missing_f_birth = min(missing(${birth})), by(${id})    
    bys ${id}: keep if _n == 1
    egen nb_missing_birth = total(missing_f_birth)
    gen s_m_f_birth = nb_missing_birth / _N
    drop missing_f_birth nb_missing_birth
    
    restore
    
        // 2. By observation-level 
    count if missing(${birth})
    local miss = r(N)
    count
    local total_nb = r(N)
    gen s_m_ob_birth = `miss'/`total_nb'
    
    
    
    ** Computing share of missing values
    
        // 1. Observation-level
    foreach var of global vars_to_check{
        bys ${yr}: egen miss_ob_`var' = total(missing(`var'))
        gen s_m_ob_`var' = miss_ob_`var' / _N
        format s_m_ob_`var' %9.2f
    }
    
        // 2. Firm-level 
    
            // 2-1. Always missing (==1 across ID)
    foreach var of global vars_to_check{
        preserve
        
        egen f_miss_`var' = max(missing(`var')), by(${id})
        bys ${id}: keep if _n == 1
        egen nb_f_miss_`var' = total(f_miss_`var')
        gen s_m_f_`var' = nb_f_miss_`var' / _N
        
        restore
    }
        
            // 2-2. Above 50% missing (==1 across ID)
    foreach var of global vars_to_check{
        preserve
        
        bys ${id}: gen halfobs = _N / 2
        // Count nb of missing obs of variable by ID
        egen miss_`var' = total(missing(`var')), by(${id})
        * == 1 if nb of missing obs is larger than 50% of its nb of observations
        bys ${id}: gen missing_`var' = (miss_`var' >= halfobs)
        * drop duplicates
        bys ${id}: keep if _n == 1
        * count total nb of firms with missing variable
        egen nb_missing_`var' = total(missing_`var')
        * Compute share
        gen s_m_f_v2_`var' = nb_missing_`var' / _N
        
        restore
    }
        
        
        
    
    
    ** Industry code missing
    preserve
    
    collapse (count) ${ind_new} ${ind_old} ${id}, by(${yr}) 
    local ind_all ${ind_new} ${ind_old} 
    
    foreach var of local ind_all {
        capture confirm variable `var'
        if !_rc {
            bysort ${yr}: gen s_m_`var' = 1 - (`var' != .) / ${id}
        }
    }
    
    restore
    
    ** Exporting share of missing values
    preserve
    
    keep ${yr} s_m_*    // s_m_f_ & s_m_ob_
    ds s_m_*
    local share `r(varlist)'
    foreach var of local share {
        local new = substr("`var'", 5, .) 
        ren `var' `new'
    }
    bys ${yr}: keep if _n == 1 
    export excel using "missing_values.xlsx", firstrow(variables) replace
    
    restore


    As you may see above, I used quite often preserve and restore to compute share of missing values of each variables, but then I couldn't export all the output in the excel file (share of missing values - that are computed).
    Maybe the above code is not efficient at all and have many mistakes.. I am a bit beginner and these commands could make you confused (since I used too much gl/loc.. sorry I just maybe have to leave as it is for now)

    Could someone pleaseeee help me with this issue?
    I am a bit in hurry but I have no clue of making this efficiently..

    Thank you

  • #2
    can smn help me with this..

    Comment


    • #3
      Either you answered your own question, or no-one can understand what the question is.

      New variables created between preserve and restore will not be added to the original dataset. The whole point of restore is to go back to the dataset as it was at the preserve command.

      So, if you want to keep the work you did between preserve and restore it must be saved or exported in some other way. You may be trying to do that.

      Comment


      • #4
        I don't think you need to preserve and restore at all, to compute the shares of missing values.

        You just need to use the tag function of the egen command, and you'll be able to do whatever you need.

        For instance, section 2-1 of your code can be replaced by:

        Code:
        foreach var of global vars_to_check{  
            egen f_miss_`var' = max(missing(`var')), by(${id})
            egen byte tagged = tag(${id})
            egen nb_f_miss_`var' = total(f_miss_`var' & tagged)
            egen num_firms = total(tagged)
            gen s_m_f_`var' = nb_f_miss_`var' / num_firms
            drop f_miss_`var' nb_f_miss_`var' num_firms tagged
        }
        Incidentally, your comment seems to indicate that you want to compute the fraction of firms for which the variable is always missing. If that is the case, then you should replace max() with min() in the first call to egen above (and correspondingly, in your code as well). With max(), you are computing the fraction of firms for which it is missing at least one observation.

        Furthermore, I notice that you are creating variables that are essentially constants. This is a tremendous waste of memory, because you are storing a vector of observations (probably tens or hundreds of thousands long), all of which take on the same value. I would instead recommend storing just a local macro with the requisite number. The code above could then be changed to:
        Code:
        foreach var of global vars_to_check{  
            egen f_miss_`var' = max(missing(`var')), by(${id})
            egen byte tagged = tag(${id})
            count if tagged & f_miss_`var'
            local nb_f_miss_`var' = r(N)
            count if tagged
            local num_firms = r(N)
            local s_m_f_`var' = `nb_f_miss_`var'' / `num_firms'
            drop f_miss_`var' tagged
        }

        Comment


        • #5
          Thanks so much!! I got it

          Comment

          Working...
          X