SVY and cluster standard errors in a logit regression.

Ibai Ostolozaga Falcon

Join Date: May 2021
Posts: 36


09 May 2023, 02:07

Dear Statalist,

I have just started working with the svy prefix in Stata, and I have the following question. I am using the World Bank Enterprise Surveys database. It consists of many different companies surveyed in different countries and in different years. For example, in 2017 the survey was run in Argentina on X companies, in 2020 in Brazil, and so on. Below is a brief example of my data.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str26 country double(idstd wt strata)
"Argentina2006"              101979            141.544   1
"Argentina2017"              622549   20.0660400390625   4
"Azerbaijan2013"             529493 2.3714756965637207   1
"Bangladesh2007"             409000 1.6200000047683716  64
"Bangladesh2013"             532301 2.3065169977684166  29
"Bangladesh2013"             532456 1.1503794754784544  87
"Belarus2013"                527481  72.60031127929688  13
"Belarus2018"                650346                  5  60
"Bosnia and Herzegovina2019" 657780 20.233741760253906  20
"Bulgaria2007"               431085   5.15843880398182  61
"Bulgaria2007"               431795     13.08437190049 120
"Cambodia2013"               560556  2.632734775543213  24
"Cameroon2009"               462568  4.585154250709431   4
"Cameroon2016"               607791 1.7187775373458862  44
"Colombia2006"               110241              3.068  29
"Czech Republic2019"         678862 20.339733123779297  69
"Egypt2013"                  572806 11.473684310913086  75
"Egypt2013"                  572909                  1 142
"Egypt2016"                  613615   4.44444465637207 199
"Egypt2016"                  613746 16.952381134033203  81
"ElSalvador2016"             606070 2.8306150436401367   5
"Estonia2009"                439225 26.147592544555664  20
"Estonia2019"                671484 1.9218201637268066  21
"Ethiopia2015"               590415 3.7965028285980225  40
"France2021"                 724373 13.333333015441895 260
"Georgia2013"                529204  25.72206687927246  42
"Germany2021"                718636  57.46154022216797 140
"Germany2021"                718638  47.83333206176758  35
"Germany2021"                719627  74.83333587646484 420
"Ghana2013"                  557959 1.8677911008506343  88
"Grenada2010"                504793 1.3333333730697632   2
"Honduras2010"               500129                  1  14
"India2014"                  564739    1.7461017370224 747
"India2014"                  568059  2.488464832305908 335
"Indonesia2009"              467986  4.599999904632568 165
"Indonesia2015"              591661  1449.884521484375 117
"Indonesia2015"              591937  3.441725730895996 253
"Ireland2020"                717026  70.20689392089844  17
"Italy2019"                  659099  35.42856979370117  82
"Jordan2013"                 546286               14.5  18
"Jordan2019"                 662264  1.119837999343872  13
"Kazakhstan2019"             663996  3.459075927734375 109
"Kenya2007"                  426449   11.8100004196167  24
"Kenya2013"                  538802  7.996896743774414 102
"Kenya2018"                  629473 2.0451242923736572 217
"Latvia2019"                 668922      28.4072265625  21
"Madagascar2013"             558363   8.47048239996911  87
"Madagascar2013"             558599                  1  76
"Mali2010"                   486755 2.2758071422576904  15
"Mexico2006"                 125482              5.761  31
"Myanmar2014"                548499  16.72741338266788   3
"Netherlands2020"            714629  91.66666412353516  35
"Nigeria2007"                427780  6.940000057220459  30
"Nigeria2007"                428346  5.730000019073486  41
"Nigeria2014"                587926 1.0334123373031616  60
"Nigeria2014"                589135  4.526476860046387  16
"Pakistan2013"               581526 1.1056060791015625  16
"Pakistan2013"               581865  36.79707717895508  42
"Peru2010"                   492490 1.1947886943817139  16
"Philippines2015"            600865  85.79537200927734  87
"Poland2019"                 674995   4485.93212890625  59
"Russia2019"                 657198    9.2282133102417 222
"Serbia2009"                 440483   22.5316104888916  10
"Solomon Islands2015"        600027 1.4635134935379028   2
"Southsudan2014"             577125 2.6941452026367188  10
"Spain2021"                  725639  16.61111068725586  37
"SriLanka2011"               511960  6.950787544250488  83
"SriLanka2011"               511990  7.621916770935059  91
"SriLanka2011"               512017  23.72999382019043  22
"Sudan2014"                  580559 2.3765053749084473  29
"Tunisia2020"                711609                  1 100
"Türkiye2013"               555435    551.13818359375  81
"Türkiye2013"               555535  3.327641010284424 139
"Uganda2006"                  97468 3.6500000953674316  58
"Ukraine2013"                534719  44.36898422241211  52
"Ukraine2019"                677228 17.214284896850586  53
"Ukraine2019"                677439 3.4285714626312256 209
"Uruguay2010"                493899   9.15807056427002   1
"West Bank And Gaza2013"     528757 1.4929810762405396  29
"Zimbabwe2011"               513928               3.31  18
end

With all this on hand, first I set my survey structure like this:

Code:

 
 svyset idstd [pweight=wt], strata(strata) singleunit(scaled)

After that, I want to run a logit regression with standard errors clustered at the country level, because there might be correlation within a country. So I type this code:

Code:

 
 svy: logit collateral n_outcome age lnemployees i.ownership, vce(cluster country)

However, Stata tells me: option vce() of logit is not allowed with the svy prefix. So, given the design of the survey and therefore the svyset command, am I already accounting for standard errors clustered at the country level, so that adding vce(cluster ...) makes no sense, or do I have to specify it with another command?

Thank you in advance!


Andrew Musau

Join Date: Oct 2014
Posts: 10298

09 May 2023, 05:42

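If you have complex survey data, use the svy prefix and do not worry about clustering, as svy handles it through the sampling unit declared in svyset. In the absence of stratification, clustering on the PSU variable with pweights is equivalent to svy: the coefficients are identical and the standard errors differ only by a small finite-sample adjustment.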

Code:

webuse nhanes2f, clear
regress zinc age age2 weight female black orace rural [pweight=finalwgt], cluster(location)
svyset location [pweight=finalwgt]
svy: regress zinc age age2 weight female black orace rural

Res.:

Code:

. regress zinc age age2 weight female black orace rural [pweight=finalwgt], cluster(location)
(sum of wgt is 104,176,071)

Linear regression                               Number of obs     =      9,189
                                                F(7, 61)          =      85.88
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0698
                                                Root MSE          =     14.218

                              (Std. err. adjusted for 62 clusters in location)
------------------------------------------------------------------------------
             |               Robust
        zinc | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.1701161   .0851811    -2.00   0.050    -.3404463    .0002142
        age2 |   .0008744   .0009049     0.97   0.338    -.0009351    .0026839
      weight |   .0535225    .012396     4.32   0.000     .0287352    .0783098
      female |  -6.134161   .4475023   -13.71   0.000    -7.028997   -5.239324
       black |  -2.881813   .8938939    -3.22   0.002    -4.669264   -1.094361
       orace |  -4.118051   1.117154    -3.69   0.000    -6.351939   -1.884163
       rural |  -.5386327   .6620974    -0.81   0.419    -1.862578    .7853128
       _cons |   92.47495   2.045053    45.22   0.000     88.38561    96.56429
------------------------------------------------------------------------------

. 
. svyset location [pweight=finalwgt]

Sampling weights: finalwgt
             VCE: linearized
     Single unit: missing
        Strata 1: <one>
 Sampling unit 1: location
           FPC 1: <zero>

. 
. svy: regress zinc age age2 weight female black orace rural
(running regress on estimation sample)

Survey: Linear regression

Number of strata =  1                            Number of obs   =       9,189
Number of PSUs   = 62                            Population size = 104,176,071
                                                 Design df       =          61
                                                 F(7, 55)        =       77.49
                                                 Prob > F        =      0.0000
                                                 R-squared       =      0.0698

------------------------------------------------------------------------------
             |             Linearized
        zinc | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.1701161   .0851487    -2.00   0.050    -.3403814    .0001493
        age2 |   .0008744   .0009046     0.97   0.338    -.0009344    .0026832
      weight |   .0535225   .0123912     4.32   0.000     .0287447    .0783003
      female |  -6.134161   .4473318   -13.71   0.000    -7.028656   -5.239665
       black |  -2.881813   .8935533    -3.23   0.002    -4.668583   -1.095042
       orace |  -4.118051   1.116729    -3.69   0.000    -6.351088   -1.885014
       rural |  -.5386327   .6618451    -0.81   0.419    -1.862074    .7848084
       _cons |   92.47495   2.044274    45.24   0.000     88.38717    96.56273
------------------------------------------------------------------------------

.
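The small gap between the two sets of standard errors above is consistent with a finite-sample factor. With vce(cluster), regress scales the cluster-robust variance by (N-1)/(N-k) in addition to M/(M-1), whereas the linearized svy estimator with a single stratum uses only M/(M-1). A quick check, taking N = 9,189 and k = 8 from the output (the scalar name adj is only illustrative):

```stata
* N = 9,189 observations and k = 8 coefficients (7 regressors + _cons),
* both read off the regression output above
scalar adj = sqrt((9189 - 1)/(9189 - 8))

* Rescale the linearized (svy) standard errors for age and female
display %9.7f .0851487*scalar(adj)
display %9.7f .4473318*scalar(adj)
```

The rescaled values should land very close to the .0851811 and .4475023 reported by regress with cluster(location), up to rounding in the displayed output.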


Ibai Ostolozaga Falcon

Join Date: May 2021

Posts: 36
#3

09 May 2023, 09:02

Thanks so much, Andrew! Your answer is what I was looking for. However, as you can see in my first post, I have a variable called strata that records the strata of my survey. In the example you posted there is another variable, stratid, which records the stratum ID. So why not include stratid in the svyset command? That is,

Code:

svyset location [pweight=finalwgt], strata(stratid)

Furthermore, in my case, when I use svy, I have to specify the singleunit() option because otherwise the standard errors are missing.

Thank you again!
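Missing standard errors under svy usually point to strata that contain a single sampling unit. A minimal sketch of how this could be checked, using a made-up five-firm dataset (the variable names id and n_psu and all values are hypothetical, not taken from the Enterprise Survey):

```stata
* Toy data, invented for illustration: stratum 2 holds only one firm
clear
input id wt strata
1 2.5 1
2 1.0 1
3 3.0 2
4 1.5 3
5 2.0 3
end

svyset id [pweight=wt], strata(strata)

* List only the strata that have a single sampling unit
svydescribe, single

* The same check by hand: count sampling units per stratum
bysort strata: generate n_psu = _N
count if n_psu == 1
```

With such strata present, the default singleunit(missing) reports missing standard errors, which is why an alternative such as singleunit(scaled), singleunit(centered), or singleunit(certainty) has to be chosen in svyset.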

Last edited by Ibai Ostolozaga Falcon; 09 May 2023, 09:23.
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10298
#4

09 May 2023, 14:03

#2 simply illustrates the equivalence, which holds in the absence of stratification [hence leaving out the option strata()]. As I said, if your survey settings are correct, proceed.
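For comparison, a sketch of the stratified version on the same dataset, assuming that nhanes2f carries the NHANES II design variables psuid and stratid. Once strata() is declared, the design degrees of freedom become the number of PSUs minus the number of strata, and there is no pweight plus cluster() command that reproduces these standard errors:

```stata
webuse nhanes2f, clear

* Declare PSUs within strata (assumed design variables: psuid, stratid)
svyset psuid [pweight=finalwgt], strata(stratid)
svy: regress zinc age age2 weight female black orace rural

* Design df = number of PSUs - number of strata
display "strata = " e(N_strata) ", PSUs = " e(N_psu) ", design df = " e(df_r)
```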
Ibai Ostolozaga Falcon

Join Date: May 2021

Posts: 36
#5

10 May 2023, 01:38

I understand, but in my case there is stratification, so I am not sure what I should do. As I showed in the first post, I have the variable strata, and the weights (variable wt) are based on these strata. So I specified the survey design as

Code:

svyset idstd [pweight=wt], strata(strata) singleunit(scaled)

where idstd is the ID of the company. So if I try the following to get standard errors clustered at the country level,

Code:

logit collateral n_outcome age lnemployees i.ownership [pweight=wt], vce(cluster country)

and

Code:

svyset country [pweight=wt], strata(strata) singleunit(scaled)
svy: logit collateral n_outcome age lnemployees i.ownership

I do not get the same standard errors. From your answer, I think this is because the two approaches are not equivalent under stratification, but I do not know how to proceed when I have stratification.

Thank you!
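One thing that may be worth checking in pooled data like this: in the example from the first post, the same strata codes (for instance 1 and 29) appear under several country-year surveys. If those codes are only unique within a survey, which is an assumption here and not something the thread confirms, strata(strata) would merge unrelated strata across countries. A hedged sketch of building a survey-specific stratum identifier (stratum_id is a hypothetical name; the rows are a subset of the posted example):

```stata
* A few rows from the posted example where strata codes repeat across surveys
clear
input str26 country double(idstd wt strata)
"Argentina2006"  101979            141.544  1
"Azerbaijan2013" 529493 2.3714756965637207  1
"Uruguay2010"    493899   9.15807056427002  1
"Bangladesh2013" 532301 2.3065169977684166 29
"Colombia2006"   110241              3.068 29
end

* ASSUMPTION: strata codes are unique only within each country-year survey
egen long stratum_id = group(country strata)

svyset idstd [pweight=wt], strata(stratum_id) singleunit(scaled)
```

Here the two raw strata codes become five distinct strata, one per country-year and code combination. Whether this is the right design for the full dataset depends on the survey documentation.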
Ibai Ostolozaga Falcon

Join Date: May 2021

Posts: 36
#6

10 May 2023, 07:51

Originally posted by Andrew Musau

#2 simply illustrates the equivalence, which holds in the absence of stratification [hence leaving out the option strata()]. As I said, if your survey settings are correct, proceed.

Thanks for the answer.
Josh Wimpey

Join Date: Mar 2020

Posts: 10
#7

23 Jun 2023, 17:56

Linearized variance estimation (VCE: linearized) is the default when using svy commands.

You can request a different variance estimator, but it is specified before the colon in your command line rather than as an option after it.

https://www.stata.com/manuals13/svysvy.pdf
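As a rough illustration of where the variance estimator goes, here is a sketch on the nhanes2f data used earlier in the thread, assuming it carries the design variables psuid and stratid. The binary outcome lowzinc and its cutoff of 80 are made up for this example. Note that cluster is not one of the svy variance types; clustering enters through the sampling unit given to svyset.

```stata
webuse nhanes2f, clear
svyset psuid [pweight=finalwgt], strata(stratid)

* lowzinc is a hypothetical outcome created only for illustration
generate byte lowzinc = zinc < 80 if !missing(zinc)

* Default: linearized variance estimation
svy: logit lowzinc age female

* A different estimator is named before the colon, not in a vce() option after it
svy jackknife: logit lowzinc age female
display "`e(vce)'"
```

The coefficients are the same in both fits; only the standard errors change with the variance estimator.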