Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Run a regression for each country-year-industry cross-section

    Hello everyone,

    It has been a while since I posted here, and I miss it

    I am currently working with a panel dataset that spans 15 years and includes firms from different countries, operating in different industries.

    I wish to run a regression for each group of firms that operate in the same country, within the same industry, and in a given year. Then I need to store the residuals obtained from each regression (I placed a condition that the regression should have a minimum of 10 observations).

    Before inserting the code of the loop that I wrote, I shall define a few variables:
    Y: dependent variable
    X: independent variable
    CountryCode: a string variable that stores the code of the corresponding country
    IndustryClass: takes the values 0-11, each number referring to an industry
    Year: takes the values 1991-2015
    Residuals: the variable in which I store the residuals estimated from the regressions

    Therefore, I wrote the following code (loop):

    Code:
    gen Residuals=.
    
    levelsof CountryCode, local(country)
    
    foreach c of local country {
          levelsof Year if CountryCode==`c', local(years)
          foreach y of local years {
               levelsof IndustryClass if CountryCode==`c' & Year==`y', local(industry)
               foreach i of local industry {
                    quietly count if !missing(Y, X)
                    if r(N) >= 10 {
                        capture regress Y X if CountryCode==`c' & Year==`y' &  IndustryClass==`i' 
                        predict uhat if e(sample), residuals
                        replace Residuals=uhat if e(sample)
                        drop uhat
                 }
             }
         }
    }

    The code I wrote does not return an error; however, it does nothing. That is, there are no regressions running following the execution of the code and thus the variable Residuals has no values.

    I truly appreciate your help in finding the wrong piece(s) in the code above.

    Thank you very much indeed.

    Mostafa



  • #2
    I don't see anything obviously wrong with the code. The use of -capture- in front of your regression is an unsafe practice, but it is not causing the problem you describe. My best guess is that there just aren't any combinations of country, year, and industry that have 10 or more observations with both X and Y non-missing. You can check that as follows:

    Code:
    by CountryCode IndustryClass Year, sort: egen available = total(!missing(X, Y))
    summ available
    If my hunch is correct, the maximum value of available in the -summ- output will be < 10.

    If you find that there are some groups for which the value of available is 10 or more, then please post back with example data, using the -dataex- command, to enable troubleshooting.

    That said, you can shorten and simplify this code considerably as:

    Code:
    capture program drop one_regression
    program define one_regression
        if _N >= 10 {
            regress Y X
            predict Residuals
        }
        exit
    end
    
    runby one_regression, by(CountryCode IndustryClass Year) status
    -runby- is written by Robert Picard and me, and is available from SSC.

    Added: The approach I wrote in the second code block does not fully enforce the requirement for at least 10 observations in the regression. Here is a correction:

    Code:
    capture program drop one_regression
    program define one_regression
        count if !missing(X, Y)
        if r(N) >= 10 {
            regress Y X
            predict Residuals
        }
        exit
    end
    
    runby one_regression, by(CountryCde IndustryClass Year) status
    Last edited by Clyde Schechter; 19 Feb 2022, 11:26.

    Comment


    • #3
      Hello Clyde,

      Thank you for your detailed reply.

      I read through the documentation of the -runby- command, and I really liked how it simplifies looping. However, the code that you have kindly provided did not create the variable "Residuals" although it returned an output showing the successful numbers of groups and observations. I will try applying a few changes to make it work as it makes things simpler.

      Regarding my case, I checked the number of observations within the country-year-industry groups, and the average is around 96 observations, ranging from 0 to 846. Accordingly, to facilitate troubleshooting, I copy and paste below part of the dataset using the -dataex- command:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str4 CountryCode float(Year IndustryClass Y X)
      "USA" 1988  9           .          .
      "USA" 1989  9   .06223206          .
      "USA" 1990  9   .05128117          .
      "USA" 1991  9   .04331039          .
      "USA" 1992  9    .0446404   .1232989
      "USA" 1993  9 .0013825138 -.05303686
      "USA" 1994  9   .04418078  .26600385
      "USA" 1995  9   .04576015 -.08783531
      "USA" 1996  9    .0657833  .52008635
      "USA" 1997  9  .065050505   .4882592
      "USA" 1998  9  .063185364   .3372959
      "USA" 1999  9    .0568946 -.26286253
      "USA" 2000  9   .06502338 -.19160226
      "USA" 2001  9   .04971404 -.23382543
      "USA" 2002  9  -.15628795  .14737502
      "USA" 2003  9 -.034037974  -.6939088
      "USA" 2004  9  .024447165          .
      "USA" 2005  9   .06012164          .
      "USA" 2006  9    .0672745          .
      "USA" 2007  9   .06735224          .
      "USA" 2008  9   .06177252          .
      "USA" 2009  9   .10781982          .
      "USA" 2010  9   .07807629          .
      "USA" 2011  9   .09402896          .
      "USA" 2012  9  .064509116          .
      "USA" 2013  9   .11333438          .
      "USA" 2014  9   .09227814          .
      "USA" 2015  9  -.05669359          .
      "USA" 1988  7           .          .
      "USA" 1989  7  -.15924017          .
      "USA" 1990  7  -.26331362          .
      "USA" 1991  7   -.3427602          .
      "USA" 1992  7   -.4025051  1.2222226
      "USA" 1993  7  -.16086596       2.54
      "USA" 1994  7  -.08814144  -.4915254
      "USA" 1988  6           .          .
      "USA" 1989  6   .09628886          .
      "USA" 1990  6    .1019243          .
      "USA" 1991  6   .10199297          .
      "USA" 1992  6  .067342624  .33796296
      "USA" 1993  6  .064123236  1.0069207
      "USA" 1994  6   .03751309   .2827584
      "USA" 1995  6   .04931698   .6666668
      "USA" 1996  6   .03064003    .464516
      "USA" 1997  6   .02609169     .30837
      "USA" 1998  6   .03381578  -.4023569
      "USA" 1999  6   .03075237   .6704223
      "USA" 2000  6   .13784003   3.418212
      "USA" 2001  6  -.06216614  -.8932519
      "USA" 2002  6   -.4142427  -.7070939
      "USA" 2003  6  -.08341127   .9734375
      "USA" 2004  6  .016640162          .
      "USA" 2005  6   .05831079          .
      "USA" 2006  6  .035371114          .
      "USA" 2007  6    .0644053          .
      "USA" 2008  6 -.019252984          .
      "USA" 2009  6   -.4951701          .
      "USA" 2010  6   .09582396          .
      "USA" 1988  6           .          .
      "USA" 1989  6  -.05259182          .
      "USA" 1990  6   -.2873082          .
      "USA" 1991  6   .08427544          .
      "USA" 1992  6           .         .4
      "USA" 1993  6  -.06618594 -.14285718
      "USA" 1994  6  .066498056   .4166668
      "USA" 1995  6   .05605729   .2205883
      "USA" 1988 10           .          .
      "USA" 1989 10   .07660044          .
      "USA" 1990 10   .06915995          .
      "USA" 1991 10   .01782257          .
      "USA" 1992 10   .02723884  .13714415
      "USA" 1993 10  .015502454 -.37008825
      "USA" 1994 10 -.005645466  .26943624
      "USA" 1995 10   .04302809  .27162102
      "USA" 1996 10  -.02025105  -.4081486
      "USA" 1997 10   .05468829   .8140792
      "USA" 1998 10   .04392335   .5857811
      "USA" 1999 10    .0388092 -.10637635
      "USA" 2000 10   .06095763 .071036465
      "USA" 2001 10 -.020218894  -.2927635
      "USA" 2002 10  -.07982095 -.43170065
      "USA" 2003 10   .03173443  .51298976
      "USA" 2004 10   -.3009293          .
      "USA" 2005 10   .06944719          .
      "USA" 2006 10   .03891128          .
      "USA" 2007 10 -.013075175          .
      "USA" 1988  3           .          .
      "USA" 1989  3   .06720754          .
      "USA" 1990  3   .04159546          .
      "USA" 1991  3    .1230861          .
      "USA" 1992  3   .07177333  -.1703802
      "USA" 1993  3   .14682074  1.4597417
      "USA" 1994  3  .068893716  .05396337
      "USA" 1995  3    .1024064  .14122626
      "USA" 1996  3   .11078658  .03368893
      "USA" 1997  3   .07985014  .14257754
      "USA" 1998  3   .06397835 -.12669948
      "USA" 1999  3   .10768937 .035178777
      "USA" 2000  3   .05099358 -.04518054
      "USA" 1988  0           .          .
      end

      Please let me know what you think/suggest.

      Thank you

      Comment


      • #4
        Well, I, too, get no calculation of the Residuals variable in your example data set. BUT:

        Code:
        . by CountryCode Year IndustryClass, sort: egen available = total(!missing(Y, X))
        
        . tab available
        
          available |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  0 |         48       48.00       48.00
                  1 |         46       46.00       94.00
                  2 |          6        6.00      100.00
        ------------+-----------------------------------
              Total |        100      100.00
        In other words, your example data has no combinations of country, year, and industry that contain 10 or more observations that are missing neither X nor Y. In fact, the most we get is 2 such observations in a group. So if this is representative of your data set as a whole, the deficiencies of the data set are the source of the problem. If there really are some country, year, and industry combinations that provide 10 or more observations with non-missing X and Y in your data set, find a few examples of those, and post back using those as your example data, and I will try to troubleshoot.

        Comment


        • #5
          Hi Clyde,

          I used the -dataex- command and it generated a random sample with 100 rows. Looking at the example data, I realized that it does not serve the purpose. I assure you that the there are hundreds of cross-sections with more than 10 observations. I even tried my code while replacing the 10 with 2, and still the code returned nothing.

          Now I will include a representative example of the data that has groups with sufficient observations:

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str4 CountryCode float(Year IndustryClass Y X)
          GBR    2002    2    .1159241    .2404826
          GBR    2002    2    .0519025    -.2404375
          GBR    2002    2    .0392666    -.4054651
          GBR    2002    2    .0644536    -.4519841
          GBR    2002    2    -.2139062    -.5906689
          GBR    2002    2    -.4994894    .4771613
          GBR    2002    2    -.6719396    -.3418402
          GBR    2002    2    .0420839    .4264729
          GBR    2002    2    .1169699    .629283
          GBR    2002    2    -.142264    -.0064658
          GBR    2002    2    .0339347    .0320483
          GBR    2002    2    .045648    .1288884
          GBR    2002    2    -.8332654    -1.120697
          GBR    2002    2    .0411922    .0414339
          GBR    2003    7    -.5091179    .2352562
          GBR    2003    7    .0069749    .2092371
          GBR    2003    7    -1.714969    -.4220859
          GBR    2003    7    -.9049655    -.1355457
          GBR    2003    7    -.2168745    -.7691326
          GBR    2003    7    .0747815    .651668
          GBR    2003    7    -.0018053    -.0599427
          GBR    2003    7    -.0839027    -1.634755
          GBR    2003    7    -.4681859    -1.523496
          GBR    2003    7    -.0473541    -.5599882
          GBR    2003    7    -.1110586    -.1241607
          GBR    2003    7    -.7824202    .2076395
          GBR    2003    7    .1709079    .6418538
          GBR    2003    7    -.3310359    -.9954275
          GBR    2003    7        -.0512938
          GBR    2003    7    -.3153055    .6618941
          GBR    2003    7    -.3187593    .6061358
          GBR    2003    7    .0159786    .0653822
          GBR    2003    7    .0368299    .2638562
          GBR    2003    7    .0377828    .4004664
          GBR    2003    9    .160566    .1914866
          GBR    2003    9    .0939567    -.1817136
          GBR    2003    9    .0426574    -.1493743
          GBR    2003    9    -.1403818    -.7770102
          GBR    2003    9    .0502837    -.5445192
          GBR    2003    9    .1838627    .4687566
          GBR    2003    9    -.623952    -.8704808
          GBR    2003    9        -.879477
          GBR    2003    9    .0185681    .3711399
          GBR    2003    9    .0802329    .1162626
          GBR    2003    9    .2632489    .6294841
          GBR    2003    9    .0055529    -.3821578
          GBR    2003    9    -.9116699    -1.540446
          GBR    2003    9    -.3420514    -.6931477
          GBR    2003    9        .116932
          GBR    2003    9    .0876811    .5596157
          GBR    2003    9    .1464253    .3963278
          GBR    2003    9    -.8771539    .5279653
          GBR    2003    9    .1880069    1.153007
          GBR    2003    9    .1282009    .3332643
          GBR    2003    9    -.0171097    .0142263
          USA    1997    3    .1285255    .5760407
          USA    1997    3    .0030957    -.0449664
          USA    1997    3    .0959533    .4503875
          USA    1997    3    .1062075    .860582
          USA    1997    3    .0825755    .174888
          USA    1997    3    .0707693    .1024471
          USA    1997    3    .0365976    .2435817
          USA    1997    3    .094494    .1770252
          USA    1997    3    -.2396185    .0749262
          USA    1997    3    .142941    .4338761
          USA    1997    3    .0189316    .051666
          USA    1997    3    -.031686    -.1938636
          USA    1997    3    .0707811    .2049058
          USA    1997    3    .0627474    -.4729725
          USA    1997    3    .0715723    2.65812
          USA    1997    3    .0437849    .6153453
          USA    1997    3    .1522825    .9699839
          USA    1997    3    .0789445    -.1212121
          USA    1997    3    .1059512    .6952382
          USA    1997    3    .1003311    .7615972
          USA    1997    3    -.0130689    .201256
          USA    1997    3    .0050605    .0999999
          USA    1997    3    .0720351    .6392083
          USA    1997    3    .0996295    .352968
          USA    1997    3    .0275869    -.2014408
          USA    1997    3    .0764069    .5151699
          USA    1997    3    .0729549    -.0313138
          USA    1997    3        .2999996
          USA    2001    6    .0686334    -.0622515
          USA    2001    6    .0126916    -.5956841
          USA    2001    6    -.0169416    -.5131766
          USA    2001    6    -.1509802    -.1319149
          USA    2001    6    .0094969    -.5140107
          USA    2001    6    .0239321    -.6478873
          USA    2001    6    .0182741    -.604
          USA    2001    6    .0085094    -.6801432
          USA    2001    6    -.0169373    -.2489796
          USA    2001    6    -.006119    -.7280717
          USA    2001    6    -.1173245    -.3411765
          USA    2001    6    -1.455618    -.48
          USA    2001    6    -.5506724    -.0106871
          USA    2001    6    .0273133    -.3322834
          USA    2001    6    -.0083304    -.1625345
          USA    2001    6    -.0168248    -.5876511
          USA    2001    6    .077697    2.834483
          USA    2001    6    -.7135432    -.419398
          USA    2001    6    .0512424    -.1119438
          USA    2001    6    .0059874    .6770645
          end
          I hope this helps.

          Thanks again.

          Comment


          • #6
            Well, with this data, the code runs and produces results on my setup. I cannot reproduce the problem you are having:

            Code:
            . capture program drop one_regression
            
            . program define one_regression
              1.     count if !missing(X, Y)
              2.     if r(N) >= 10 {
              3.         regress Y X
              4.         predict Residuals
              5.     }
              6.     exit
              7. end
            
            .
            . runby one_regression, by(CountryCode IndustryClass Year) status
            
              elapsed ----------- by-groups ----------    ------- observations ------       time
                 time      count     errors    no-data        processed         saved  remaining
            ------------------------------------------------------------------------------------
             00:00:00          5          0          0               99            99   00:00:00
            
            --------------------------------------
            Number of by-groups    =             5
            by-groups with errors  =             0
            by-groups with no data =             0
            Observations processed =            99
            Observations saved     =            99
            --------------------------------------
            
            .
            . list in 1/10, noobs clean
            
                Countr~e   Year   Indust~s           Y           X   Residuals  
                     GBR   2002          2    .1159241    .2404826   -.0217023  
                     GBR   2002          2    .0519025   -.2404375   -.1808962  
                     GBR   2002          2    .0392666   -.4054651   -.2355236  
                     GBR   2002          2    .0644536   -.4519841   -.2509223  
                     GBR   2002          2   -.2139062   -.5906689   -.2968297  
                     GBR   2002          2   -.4994894    .4771613     .056643  
                     GBR   2002          2   -.6719396   -.3418402   -.2144625  
                     GBR   2002          2    .0420839    .4264729    .0398642  
                     GBR   2002          2    .1169699     .629283    .1069983  
                     GBR   2002          2    -.142264   -.0064658    -.103447  
            
            .
            So I don't know what to tell you. My first suggestion is to try adding the -verbose- option to the -runby- command so you can see what is happening inside program one_regression. Perhaps you are getting some error messages that would be helpful to see. A second thought is to run -update all- to make sure your Stata installation is up to date and correctly synchronized between the executable and ado files.

            Comment


            • #7
              Not a current issue, but

              Code:
              predict Residuals
              should be more like

              Code:
              predict Residuals, res 
              for the name of the variable and its nature to match. The code in #1 is correct on this score.

              Comment


              • #8
                Yes, Nick is absolutely right. Just -predict Residuals- will calculate the predicted values, not the residuals. Not only is that not what was asked for, but it would be a misleading variable name. Sorry for my error.

                Comment


                • #9
                  Hello again Clyde and Nick,

                  Thank you for the informative discussion. Eventually the code provided by Clyde has worked and even my initial code has worked after I encoded the string variable CountryCode.

                  Both codes yield identical summary statistics of the residuals, yet I believe that Clyde's command is more efficient in terms of time (and length of text).

                  Comment


                  • #10
                    I believe that Clyde's command is more efficient in terms of time (and length of text).
                    Indeed, that is true. In fact, the -runby- command was developed precisely because looping over the values of a variable with -foreach- requires a very time-consuming -if variable == `value'- clause that made it slow in large data sets. -runby- eliminates this problem and thereby providing a substantial speed-up. As a side benefit, when there is more than one -by()- variable, -runby- eliminates the need to code against non-existent combinations of values. So it simplifies and shortens the code as well.

                    Comment


                    • #11
                      Hellow Clyde,

                      I hope that you are doing well.

                      I am using the -runby- command that you have developed, and the code (mentioned above in #6) is saving only processed observations and dropping the rest of the observations.

                      Below is part of the output I am receiving:

                      Code:
                              
                      Number of by-groups    =    4,643
                      by-groups with errors    =    433
                      by-groups with no data    =    0
                      Observations processed    =    452,402
                      Observations saved    =    401,534
                      I am wondering if there is an option that allows me to keep all observations, regardless whether they were processed or not.

                      Thank you very much.
                      Mostafa

                      Comment


                      • #12
                        No, there is no option to keep the data from the by-groups with errors. But you can -merge- your -runby- results back with the original data to achieve that.

                        Comment

                        Working...
                        X