Run a regression for each country-year-industry cross-section

Mostafa Harakeh

Join Date: Jul 2014

Posts: 91
#1

Run a regression for each country-year-industry cross-section

19 Feb 2022, 04:58

Hello everyone,

It has been a while since I posted here, and I miss it

I am currently working with a panel dataset that spans 15 years and includes firms from different countries, operating in different industries.

I wish to run a regression for each group of firms that operate in the same country, within the same industry, and in a given year. Then I need to store the residuals obtained from each regression (I placed a condition that the regression should have a minimum of 10 observations).

Before inserting the code of the loop that I wrote, I shall define a few variables:
Y: dependent variable
X: independent variable
CountryCode: a string variable that stores the code of the corresponding country
IndustryClass: takes the values 0-11, each number referring to an industry
Year: takes the values 1991-2015
Residuals: the variable in which I store the residuals estimated from the regressions

Therefore, I wrote the following code (loop):

Code:

gen Residuals=. levelsof CountryCode, local(country) foreach c of local country { levelsof Year if CountryCode==`c', local(years) foreach y of local years { levelsof IndustryClass if CountryCode==`c' & Year==`y', local(industry) foreach i of local industry { quietly count if !missing(Y, X) if r(N) >= 10 { capture regress Y X if CountryCode==`c' & Year==`y' & IndustryClass==`i' predict uhat if e(sample), residuals replace Residuals=uhat if e(sample) drop uhat } } } }

The code I wrote does not return an error; however, it does nothing. That is, there are no regressions running following the execution of the code and thus the variable Residuals has no values.

I truly appreciate your help in finding the wrong piece(s) in the code above.

Thank you very much indeed.

Mostafa
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29958
#2

19 Feb 2022, 10:32

I don't see anything obviously wrong with the code. The use of -capture- in front of your regression is an unsafe practice, but it is not causing the problem you describe. My best guess is that there just aren't any combinations of country, year, and industry that have 10 or more observations with both X and Y non-missing. You can check that as follows:

Code:

by CountryCode IndustryClass Year, sort: egen available = total(!missing(X, Y)) summ available

If my hunch is correct, the maximum value of available in the -summ- output will be < 10.

If you find that there are some groups for which the value of available is 10 or more, then please post back with example data, using the -dataex- command, to enable troubleshooting.

That said, you can shorten and simplify this code considerably as:

Code:

capture program drop one_regression program define one_regression if _N >= 10 { regress Y X predict Residuals } exit end runby one_regression, by(CountryCode IndustryClass Year) status

-runby- is written by Robert Picard and me, and is available from SSC.

Added: The approach I wrote in the second code block does not fully enforce the requirement for at least 10 observations in the regression. Here is a correction:

Code:

capture program drop one_regression program define one_regression count if !missing(X, Y) if r(N) >= 10 { regress Y X predict Residuals } exit end runby one_regression, by(CountryCde IndustryClass Year) status

Last edited by Clyde Schechter; 19 Feb 2022, 11:26.
Comment

Mostafa Harakeh

Join Date: Jul 2014
Posts: 91

19 Feb 2022, 14:52

Hello Clyde,

Thank you for your detailed reply.

I read through the documentation of the -runby- command, and I really liked how it simplifies looping. However, the code that you have kindly provided did not create the variable "Residuals" although it returned an output showing the successful numbers of groups and observations. I will try applying a few changes to make it work as it makes things simpler.

Regarding my case, I checked the number of observations within the country-year-industry groups, and the average is around 96 observations, ranging from 0 to 846. Accordingly, to facilitate troubleshooting, I copy and paste below part of the dataset using the -dataex- command:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str4 CountryCode float(Year IndustryClass Y X)
"USA" 1988  9           .          .
"USA" 1989  9   .06223206          .
"USA" 1990  9   .05128117          .
"USA" 1991  9   .04331039          .
"USA" 1992  9    .0446404   .1232989
"USA" 1993  9 .0013825138 -.05303686
"USA" 1994  9   .04418078  .26600385
"USA" 1995  9   .04576015 -.08783531
"USA" 1996  9    .0657833  .52008635
"USA" 1997  9  .065050505   .4882592
"USA" 1998  9  .063185364   .3372959
"USA" 1999  9    .0568946 -.26286253
"USA" 2000  9   .06502338 -.19160226
"USA" 2001  9   .04971404 -.23382543
"USA" 2002  9  -.15628795  .14737502
"USA" 2003  9 -.034037974  -.6939088
"USA" 2004  9  .024447165          .
"USA" 2005  9   .06012164          .
"USA" 2006  9    .0672745          .
"USA" 2007  9   .06735224          .
"USA" 2008  9   .06177252          .
"USA" 2009  9   .10781982          .
"USA" 2010  9   .07807629          .
"USA" 2011  9   .09402896          .
"USA" 2012  9  .064509116          .
"USA" 2013  9   .11333438          .
"USA" 2014  9   .09227814          .
"USA" 2015  9  -.05669359          .
"USA" 1988  7           .          .
"USA" 1989  7  -.15924017          .
"USA" 1990  7  -.26331362          .
"USA" 1991  7   -.3427602          .
"USA" 1992  7   -.4025051  1.2222226
"USA" 1993  7  -.16086596       2.54
"USA" 1994  7  -.08814144  -.4915254
"USA" 1988  6           .          .
"USA" 1989  6   .09628886          .
"USA" 1990  6    .1019243          .
"USA" 1991  6   .10199297          .
"USA" 1992  6  .067342624  .33796296
"USA" 1993  6  .064123236  1.0069207
"USA" 1994  6   .03751309   .2827584
"USA" 1995  6   .04931698   .6666668
"USA" 1996  6   .03064003    .464516
"USA" 1997  6   .02609169     .30837
"USA" 1998  6   .03381578  -.4023569
"USA" 1999  6   .03075237   .6704223
"USA" 2000  6   .13784003   3.418212
"USA" 2001  6  -.06216614  -.8932519
"USA" 2002  6   -.4142427  -.7070939
"USA" 2003  6  -.08341127   .9734375
"USA" 2004  6  .016640162          .
"USA" 2005  6   .05831079          .
"USA" 2006  6  .035371114          .
"USA" 2007  6    .0644053          .
"USA" 2008  6 -.019252984          .
"USA" 2009  6   -.4951701          .
"USA" 2010  6   .09582396          .
"USA" 1988  6           .          .
"USA" 1989  6  -.05259182          .
"USA" 1990  6   -.2873082          .
"USA" 1991  6   .08427544          .
"USA" 1992  6           .         .4
"USA" 1993  6  -.06618594 -.14285718
"USA" 1994  6  .066498056   .4166668
"USA" 1995  6   .05605729   .2205883
"USA" 1988 10           .          .
"USA" 1989 10   .07660044          .
"USA" 1990 10   .06915995          .
"USA" 1991 10   .01782257          .
"USA" 1992 10   .02723884  .13714415
"USA" 1993 10  .015502454 -.37008825
"USA" 1994 10 -.005645466  .26943624
"USA" 1995 10   .04302809  .27162102
"USA" 1996 10  -.02025105  -.4081486
"USA" 1997 10   .05468829   .8140792
"USA" 1998 10   .04392335   .5857811
"USA" 1999 10    .0388092 -.10637635
"USA" 2000 10   .06095763 .071036465
"USA" 2001 10 -.020218894  -.2927635
"USA" 2002 10  -.07982095 -.43170065
"USA" 2003 10   .03173443  .51298976
"USA" 2004 10   -.3009293          .
"USA" 2005 10   .06944719          .
"USA" 2006 10   .03891128          .
"USA" 2007 10 -.013075175          .
"USA" 1988  3           .          .
"USA" 1989  3   .06720754          .
"USA" 1990  3   .04159546          .
"USA" 1991  3    .1230861          .
"USA" 1992  3   .07177333  -.1703802
"USA" 1993  3   .14682074  1.4597417
"USA" 1994  3  .068893716  .05396337
"USA" 1995  3    .1024064  .14122626
"USA" 1996  3   .11078658  .03368893
"USA" 1997  3   .07985014  .14257754
"USA" 1998  3   .06397835 -.12669948
"USA" 1999  3   .10768937 .035178777
"USA" 2000  3   .05099358 -.04518054
"USA" 1988  0           .          .
end

Please let me know what you think/suggest.

Thank you

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29958
#4

19 Feb 2022, 15:58

Well, I, too, get no calculation of the Residuals variable in your example data set. BUT:

Code:

. by CountryCode Year IndustryClass, sort: egen available = total(!missing(Y, X)) . tab available available | Freq. Percent Cum. ------------+----------------------------------- 0 | 48 48.00 48.00 1 | 46 46.00 94.00 2 | 6 6.00 100.00 ------------+----------------------------------- Total | 100 100.00

In other words, your example data has no combinations of country, year, and industry that contain 10 or more observations that are missing neither X nor Y. In fact, the most we get is 2 such observations in a group. So if this is representative of your data set as a whole, the deficiencies of the data set are the source of the problem. If there really are some country, year, and industry combinations that provide 10 or more observations with non-missing X and Y in your data set, find a few examples of those, and post back using those as your example data, and I will try to troubleshoot.
Comment

Mostafa Harakeh

Join Date: Jul 2014
Posts: 91

19 Feb 2022, 17:24

Hi Clyde,

I used the -dataex- command and it generated a random sample with 100 rows. Looking at the example data, I realized that it does not serve the purpose. I assure you that the there are hundreds of cross-sections with more than 10 observations. I even tried my code while replacing the 10 with 2, and still the code returned nothing.

Now I will include a representative example of the data that has groups with sufficient observations:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str4 CountryCode float(Year IndustryClass Y X)
GBR    2002    2    .1159241    .2404826
GBR    2002    2    .0519025    -.2404375
GBR    2002    2    .0392666    -.4054651
GBR    2002    2    .0644536    -.4519841
GBR    2002    2    -.2139062    -.5906689
GBR    2002    2    -.4994894    .4771613
GBR    2002    2    -.6719396    -.3418402
GBR    2002    2    .0420839    .4264729
GBR    2002    2    .1169699    .629283
GBR    2002    2    -.142264    -.0064658
GBR    2002    2    .0339347    .0320483
GBR    2002    2    .045648    .1288884
GBR    2002    2    -.8332654    -1.120697
GBR    2002    2    .0411922    .0414339
GBR    2003    7    -.5091179    .2352562
GBR    2003    7    .0069749    .2092371
GBR    2003    7    -1.714969    -.4220859
GBR    2003    7    -.9049655    -.1355457
GBR    2003    7    -.2168745    -.7691326
GBR    2003    7    .0747815    .651668
GBR    2003    7    -.0018053    -.0599427
GBR    2003    7    -.0839027    -1.634755
GBR    2003    7    -.4681859    -1.523496
GBR    2003    7    -.0473541    -.5599882
GBR    2003    7    -.1110586    -.1241607
GBR    2003    7    -.7824202    .2076395
GBR    2003    7    .1709079    .6418538
GBR    2003    7    -.3310359    -.9954275
GBR    2003    7        -.0512938
GBR    2003    7    -.3153055    .6618941
GBR    2003    7    -.3187593    .6061358
GBR    2003    7    .0159786    .0653822
GBR    2003    7    .0368299    .2638562
GBR    2003    7    .0377828    .4004664
GBR    2003    9    .160566    .1914866
GBR    2003    9    .0939567    -.1817136
GBR    2003    9    .0426574    -.1493743
GBR    2003    9    -.1403818    -.7770102
GBR    2003    9    .0502837    -.5445192
GBR    2003    9    .1838627    .4687566
GBR    2003    9    -.623952    -.8704808
GBR    2003    9        -.879477
GBR    2003    9    .0185681    .3711399
GBR    2003    9    .0802329    .1162626
GBR    2003    9    .2632489    .6294841
GBR    2003    9    .0055529    -.3821578
GBR    2003    9    -.9116699    -1.540446
GBR    2003    9    -.3420514    -.6931477
GBR    2003    9        .116932
GBR    2003    9    .0876811    .5596157
GBR    2003    9    .1464253    .3963278
GBR    2003    9    -.8771539    .5279653
GBR    2003    9    .1880069    1.153007
GBR    2003    9    .1282009    .3332643
GBR    2003    9    -.0171097    .0142263
USA    1997    3    .1285255    .5760407
USA    1997    3    .0030957    -.0449664
USA    1997    3    .0959533    .4503875
USA    1997    3    .1062075    .860582
USA    1997    3    .0825755    .174888
USA    1997    3    .0707693    .1024471
USA    1997    3    .0365976    .2435817
USA    1997    3    .094494    .1770252
USA    1997    3    -.2396185    .0749262
USA    1997    3    .142941    .4338761
USA    1997    3    .0189316    .051666
USA    1997    3    -.031686    -.1938636
USA    1997    3    .0707811    .2049058
USA    1997    3    .0627474    -.4729725
USA    1997    3    .0715723    2.65812
USA    1997    3    .0437849    .6153453
USA    1997    3    .1522825    .9699839
USA    1997    3    .0789445    -.1212121
USA    1997    3    .1059512    .6952382
USA    1997    3    .1003311    .7615972
USA    1997    3    -.0130689    .201256
USA    1997    3    .0050605    .0999999
USA    1997    3    .0720351    .6392083
USA    1997    3    .0996295    .352968
USA    1997    3    .0275869    -.2014408
USA    1997    3    .0764069    .5151699
USA    1997    3    .0729549    -.0313138
USA    1997    3        .2999996
USA    2001    6    .0686334    -.0622515
USA    2001    6    .0126916    -.5956841
USA    2001    6    -.0169416    -.5131766
USA    2001    6    -.1509802    -.1319149
USA    2001    6    .0094969    -.5140107
USA    2001    6    .0239321    -.6478873
USA    2001    6    .0182741    -.604
USA    2001    6    .0085094    -.6801432
USA    2001    6    -.0169373    -.2489796
USA    2001    6    -.006119    -.7280717
USA    2001    6    -.1173245    -.3411765
USA    2001    6    -1.455618    -.48
USA    2001    6    -.5506724    -.0106871
USA    2001    6    .0273133    -.3322834
USA    2001    6    -.0083304    -.1625345
USA    2001    6    -.0168248    -.5876511
USA    2001    6    .077697    2.834483
USA    2001    6    -.7135432    -.419398
USA    2001    6    .0512424    -.1119438
USA    2001    6    .0059874    .6770645
end

I hope this helps.

Thanks again.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29958

19 Feb 2022, 19:18

Well, with this data, the code runs and produces results on my setup. I cannot reproduce the problem you are having:

Code:

. capture program drop one_regression

. program define one_regression
  1.     count if !missing(X, Y)
  2.     if r(N) >= 10 {
  3.         regress Y X
  4.         predict Residuals
  5.     }
  6.     exit
  7. end

.
. runby one_regression, by(CountryCode IndustryClass Year) status

  elapsed ----------- by-groups ----------    ------- observations ------       time
     time      count     errors    no-data        processed         saved  remaining
------------------------------------------------------------------------------------
 00:00:00          5          0          0               99            99   00:00:00

--------------------------------------
Number of by-groups    =             5
by-groups with errors  =             0
by-groups with no data =             0
Observations processed =            99
Observations saved     =            99
--------------------------------------

.
. list in 1/10, noobs clean

    Countr~e   Year   Indust~s           Y           X   Residuals  
         GBR   2002          2    .1159241    .2404826   -.0217023  
         GBR   2002          2    .0519025   -.2404375   -.1808962  
         GBR   2002          2    .0392666   -.4054651   -.2355236  
         GBR   2002          2    .0644536   -.4519841   -.2509223  
         GBR   2002          2   -.2139062   -.5906689   -.2968297  
         GBR   2002          2   -.4994894    .4771613     .056643  
         GBR   2002          2   -.6719396   -.3418402   -.2144625  
         GBR   2002          2    .0420839    .4264729    .0398642  
         GBR   2002          2    .1169699     .629283    .1069983  
         GBR   2002          2    -.142264   -.0064658    -.103447  

.

So I don't know what to tell you. My first suggestion is to try adding the -verbose- option to the -runby- command so you can see what is happening inside program one_regression. Perhaps you are getting some error messages that would be helpful to see. A second thought is to run -update all- to make sure your Stata installation is up to date and correctly synchronized between the executable and ado files.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35436
#7

20 Feb 2022, 05:14

Not a current issue, but

Code:

predict Residuals

should be more like

Code:

predict Residuals, res

for the name of the variable and its nature to match. The code in #1 is correct on this score.
1 like
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29958
#8

20 Feb 2022, 09:57

Yes, Nick is absolutely right. Just -predict Residuals- will calculate the predicted values, not the residuals. Not only is that not what was asked for, but it would be a misleading variable name. Sorry for my error.
Comment
Mostafa Harakeh

Join Date: Jul 2014

Posts: 91
#9

22 Feb 2022, 04:14

Hello again Clyde and Nick,

Thank you for the informative discussion. Eventually the code provided by Clyde has worked and even my initial code has worked after I encoded the string variable CountryCode.

Both codes yield identical summary statistics of the residuals, yet I believe that Clyde's command is more efficient in terms of time (and length of text).
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29958
#10

22 Feb 2022, 10:33

I believe that Clyde's command is more efficient in terms of time (and length of text).

Indeed, that is true. In fact, the -runby- command was developed precisely because looping over the values of a variable with -foreach- requires a very time-consuming -if variable == `value'- clause that made it slow in large data sets. -runby- eliminates this problem and thereby providing a substantial speed-up. As a side benefit, when there is more than one -by()- variable, -runby- eliminates the need to code against non-existent combinations of values. So it simplifies and shortens the code as well.
Comment
Mostafa Harakeh

Join Date: Jul 2014

Posts: 91
#11

12 Jul 2023, 09:21

Hellow Clyde,

I hope that you are doing well.

I am using the -runby- command that you have developed, and the code (mentioned above in #6) is saving only processed observations and dropping the rest of the observations.

Below is part of the output I am receiving:

Code:

Number of by-groups = 4,643 by-groups with errors = 433 by-groups with no data = 0 Observations processed = 452,402 Observations saved = 401,534

I am wondering if there is an option that allows me to keep all observations, regardless whether they were processed or not.

Thank you very much.
Mostafa
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29958
#12

12 Jul 2023, 09:32

No, there is no option to keep the data from the by-groups with errors. But you can -merge- your -runby- results back with the original data to achieve that.
1 like
Comment

Announcement