
  • SVY and cluster standard errors in a logit regression.

    Dear Statalist,

    I have just started to work with the svy commands in Stata, and I have the following question. I am using the World Bank Enterprise Survey database. It consists of many different companies surveyed in different countries and in different years. For example, in 2017 the survey was run in Argentina on X companies, in 2020 in Brazil, and so on. Here is a brief example of my data.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str26 country double(idstd wt strata)
    "Argentina2006"              101979            141.544   1
    "Argentina2017"              622549   20.0660400390625   4
    "Azerbaijan2013"             529493 2.3714756965637207   1
    "Bangladesh2007"             409000 1.6200000047683716  64
    "Bangladesh2013"             532301 2.3065169977684166  29
    "Bangladesh2013"             532456 1.1503794754784544  87
    "Belarus2013"                527481  72.60031127929688  13
    "Belarus2018"                650346                  5  60
    "Bosnia and Herzegovina2019" 657780 20.233741760253906  20
    "Bulgaria2007"               431085   5.15843880398182  61
    "Bulgaria2007"               431795     13.08437190049 120
    "Cambodia2013"               560556  2.632734775543213  24
    "Cameroon2009"               462568  4.585154250709431   4
    "Cameroon2016"               607791 1.7187775373458862  44
    "Colombia2006"               110241              3.068  29
    "Czech Republic2019"         678862 20.339733123779297  69
    "Egypt2013"                  572806 11.473684310913086  75
    "Egypt2013"                  572909                  1 142
    "Egypt2016"                  613615   4.44444465637207 199
    "Egypt2016"                  613746 16.952381134033203  81
    "ElSalvador2016"             606070 2.8306150436401367   5
    "Estonia2009"                439225 26.147592544555664  20
    "Estonia2019"                671484 1.9218201637268066  21
    "Ethiopia2015"               590415 3.7965028285980225  40
    "France2021"                 724373 13.333333015441895 260
    "Georgia2013"                529204  25.72206687927246  42
    "Germany2021"                718636  57.46154022216797 140
    "Germany2021"                718638  47.83333206176758  35
    "Germany2021"                719627  74.83333587646484 420
    "Ghana2013"                  557959 1.8677911008506343  88
    "Grenada2010"                504793 1.3333333730697632   2
    "Honduras2010"               500129                  1  14
    "India2014"                  564739    1.7461017370224 747
    "India2014"                  568059  2.488464832305908 335
    "Indonesia2009"              467986  4.599999904632568 165
    "Indonesia2015"              591661  1449.884521484375 117
    "Indonesia2015"              591937  3.441725730895996 253
    "Ireland2020"                717026  70.20689392089844  17
    "Italy2019"                  659099  35.42856979370117  82
    "Jordan2013"                 546286               14.5  18
    "Jordan2019"                 662264  1.119837999343872  13
    "Kazakhstan2019"             663996  3.459075927734375 109
    "Kenya2007"                  426449   11.8100004196167  24
    "Kenya2013"                  538802  7.996896743774414 102
    "Kenya2018"                  629473 2.0451242923736572 217
    "Latvia2019"                 668922      28.4072265625  21
    "Madagascar2013"             558363   8.47048239996911  87
    "Madagascar2013"             558599                  1  76
    "Mali2010"                   486755 2.2758071422576904  15
    "Mexico2006"                 125482              5.761  31
    "Myanmar2014"                548499  16.72741338266788   3
    "Netherlands2020"            714629  91.66666412353516  35
    "Nigeria2007"                427780  6.940000057220459  30
    "Nigeria2007"                428346  5.730000019073486  41
    "Nigeria2014"                587926 1.0334123373031616  60
    "Nigeria2014"                589135  4.526476860046387  16
    "Pakistan2013"               581526 1.1056060791015625  16
    "Pakistan2013"               581865  36.79707717895508  42
    "Peru2010"                   492490 1.1947886943817139  16
    "Philippines2015"            600865  85.79537200927734  87
    "Poland2019"                 674995   4485.93212890625  59
    "Russia2019"                 657198    9.2282133102417 222
    "Serbia2009"                 440483   22.5316104888916  10
    "Solomon Islands2015"        600027 1.4635134935379028   2
    "Southsudan2014"             577125 2.6941452026367188  10
    "Spain2021"                  725639  16.61111068725586  37
    "SriLanka2011"               511960  6.950787544250488  83
    "SriLanka2011"               511990  7.621916770935059  91
    "SriLanka2011"               512017  23.72999382019043  22
    "Sudan2014"                  580559 2.3765053749084473  29
    "Tunisia2020"                711609                  1 100
    "Türkiye2013"               555435    551.13818359375  81
    "Türkiye2013"               555535  3.327641010284424 139
    "Uganda2006"                  97468 3.6500000953674316  58
    "Ukraine2013"                534719  44.36898422241211  52
    "Ukraine2019"                677228 17.214284896850586  53
    "Ukraine2019"                677439 3.4285714626312256 209
    "Uruguay2010"                493899   9.15807056427002   1
    "West Bank And Gaza2013"     528757 1.4929810762405396  29
    "Zimbabwe2011"               513928               3.31  18
    end

    With all this on hand, first I set my survey structure like this:

    Code:
     
     svyset idstd [pweight=wt], strata(strata) singleunit(scaled)
    After that, I want to run a logit regression with standard errors clustered at the country level, because there might be correlation within a country. So I type this code:

    Code:
     
     svy: logit collateral n_outcome age lnemployees i.ownership, vce(cluster country)
    However, Stata tells me: option vce() of logit is not allowed with the svy prefix. So, given the survey design declared in svyset, am I already getting standard errors clustered at the country level, so that adding vce(cluster ...) makes no sense, or do I have to specify the clustering with another command?

    Thank you in advance!

  • #2
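    If you have complex survey data, use the svy prefix and do not worry about clustering, as svy handles it through the sampling units declared in svyset. In the absence of stratification, clustering on the PSU variable with pweights is essentially equivalent to svy: the coefficients below are identical and the standard errors differ only slightly.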

    Code:
    webuse nhanes2f, clear
    regress zinc age age2 weight female black orace rural [pweight=finalwgt], cluster(location)
    svyset location [pweight=finalwgt]
    svy: regress zinc age age2 weight female black orace rural
    Results:

    Code:
    . regress zinc age age2 weight female black orace rural [pweight=finalwgt], cluster(location)
    (sum of wgt is 104,176,071)
    
    Linear regression                               Number of obs     =      9,189
                                                    F(7, 61)          =      85.88
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.0698
                                                    Root MSE          =     14.218
    
                                  (Std. err. adjusted for 62 clusters in location)
    ------------------------------------------------------------------------------
                 |               Robust
            zinc | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
             age |  -.1701161   .0851811    -2.00   0.050    -.3404463    .0002142
            age2 |   .0008744   .0009049     0.97   0.338    -.0009351    .0026839
          weight |   .0535225    .012396     4.32   0.000     .0287352    .0783098
          female |  -6.134161   .4475023   -13.71   0.000    -7.028997   -5.239324
           black |  -2.881813   .8938939    -3.22   0.002    -4.669264   -1.094361
           orace |  -4.118051   1.117154    -3.69   0.000    -6.351939   -1.884163
           rural |  -.5386327   .6620974    -0.81   0.419    -1.862578    .7853128
           _cons |   92.47495   2.045053    45.22   0.000     88.38561    96.56429
    ------------------------------------------------------------------------------
    
    . 
    . svyset location [pweight=finalwgt]
    
    Sampling weights: finalwgt
                 VCE: linearized
         Single unit: missing
            Strata 1: <one>
     Sampling unit 1: location
               FPC 1: <zero>
    
    . 
    . svy: regress zinc age age2 weight female black orace rural
    (running regress on estimation sample)
    
    Survey: Linear regression
    
    Number of strata =  1                            Number of obs   =       9,189
    Number of PSUs   = 62                            Population size = 104,176,071
                                                     Design df       =          61
                                                     F(7, 55)        =       77.49
                                                     Prob > F        =      0.0000
                                                     R-squared       =      0.0698
    
    ------------------------------------------------------------------------------
                 |             Linearized
            zinc | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
             age |  -.1701161   .0851487    -2.00   0.050    -.3403814    .0001493
            age2 |   .0008744   .0009046     0.97   0.338    -.0009344    .0026832
          weight |   .0535225   .0123912     4.32   0.000     .0287447    .0783003
          female |  -6.134161   .4473318   -13.71   0.000    -7.028656   -5.239665
           black |  -2.881813   .8935533    -3.23   0.002    -4.668583   -1.095042
           orace |  -4.118051   1.116729    -3.69   0.000    -6.351088   -1.885014
           rural |  -.5386327   .6618451    -0.81   0.419    -1.862074    .7848084
           _cons |   92.47495   2.044274    45.24   0.000     88.38717    96.56273
    ------------------------------------------------------------------------------
    
    .



    • #3
      Thank you so much, Andrew! Your answer is what I was looking for. However, as you can see in my first post, I have a variable called strata which records the strata of my survey. In the example that you posted there is another variable called stratid which records the stratum id. So why not include stratid in the svyset command? That is,

      Code:
       
       svyset location [pweight=finalwgt], strata(stratid)
      Furthermore, in my case, when I use svy I have to specify the singleunit() option, because otherwise the standard errors are missing.

      Thank you again!
      Last edited by Ibai Ostolozaga Falcon; 09 May 2023, 09:23.



      • #4
        #2 simply illustrates the equivalence, which holds in the absence of stratification [hence leaving out the option strata()]. As I said, if your survey settings are correct, proceed.
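
        A minimal sketch of what changes once strata enter, assuming the design variables stratid and psuid in the same nhanes2f data. The variable psu_uid is a name made up for this illustration; it is needed only because psuid is numbered within each stratum. Both commands should give the same coefficients, but the standard errors are no longer expected to match, since the plain cluster-robust estimator does not use the strata.

        Code:
        webuse nhanes2f, clear
        * psuid repeats across strata, so build a unique cluster id (hypothetical name)
        egen psu_uid = group(stratid psuid)
        * weighted regression with cluster-robust SEs; strata are ignored here
        regress zinc age age2 weight female black orace rural [pweight=finalwgt], vce(cluster psu_uid)
        * the same model with the stratified design declared
        svyset psuid [pweight=finalwgt], strata(stratid)
        svy: regress zinc age age2 weight female black orace rural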



        • #5
          I understand, but in my case there is stratification, so I am not sure what I should do. As I showed in the first post, I have the variable strata, and the weights (variable wt) are based on these strata. So I declared the survey design as


          Code:
           svyset idstd [pweight=wt], strata(strata) singleunit(scaled)
          where idstd is the company id. To get standard errors clustered at the country level, I then tried

          Code:
           logit collateral n_outcome age lnemployees i.ownership [pweight=wt], vce(cluster country)
          and

          Code:
           svyset country [pweight=wt], strata(strata) singleunit(scaled)
           svy: logit collateral n_outcome age lnemployees i.ownership
          I do not get the same standard errors. From your answer, I think this is because the two approaches are not equivalent once stratification is used, but I do not know how to proceed when I have stratification.

          Thank you!



          • #6
            Originally posted by Andrew Musau View Post
            #2 simply illustrates the equivalence, which holds in the absence of stratification [hence leaving out the option strata()]. As I said, if your survey settings are correct, proceed.
            Thanks for the answer.



            • #7
              Linearized is the default VCE when using svy commands.

              You can change the VCE method, but it has to be given before the colon in your command line (or in svyset), not as an option after the estimation command.

              https://www.stata.com/manuals13/svysvy.pdf
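
              As a sketch of that syntax, assuming the nhanes2f example data from #2 with its design variables stratid and psuid, and picking the jackknife purely as an illustrative alternative to linearization:

              Code:
              webuse nhanes2f, clear
              svyset psuid [pweight=finalwgt], strata(stratid)
              * variance method named between svy and the colon
              svy jackknife: regress zinc age female
              * or stored as the default method in svyset
              svyset psuid [pweight=finalwgt], strata(stratid) vce(jackknife)
              svy: regress zinc age female

              Either way, nothing about the variance estimator goes after the comma of the estimation command itself, which is why vce(cluster country) was rejected in #1.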

