  • Svysetting pooled GSS data using year as stratum variable

    A student is pooling several years of GSS (General Social Survey) data. She sent me the following question (which I suppose might apply equally well to many situations where you have successive cross-sections of data).

    Do you have an opinion on treating Year as a stratum variable with pooled data? I ask because Donald Treiman recommends this in his book Quantitative Data Analysis. He writes that "it is reasonable to treat Year as the stratum variable because the surveys from each year are independent, and Year is a fixed variable." His code to set up pooled GSS data is: svyset sampcode [pweight=weight], strata(year). This code is similar to the UCLA code you sent me, with the addition of year (see http://www.ats.ucla.edu/stat/stata/f...setups.htm#GSS). I haven't seen this approach recommended before, and am not sure if this is the best route to take. My current analysis uses svyset without year as a stratum variable. Instead, I include dummies for Year in my models.
    On the one hand, the advice to treat year as a stratum variable sounds reasonable; on the other hand I don't remember seeing similar advice anywhere else. I have a suspicion it won't matter much either way, but I wonder if there is any consensus or controversy over whether or not to do this.
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    StataNow Version: 19.5 MP (2 processor)

    WWW: https://www3.nd.edu/~rwilliam

  • #2
    I don't see why one would want to treat year as a stratum variable. That certainly is not how NORC's sampling people think of the design. For many years it has not been clear exactly how to deal with the GSS survey design in programs like Stata that allow one to specify survey design variables. In particular, until recently the GSS documentation did not even specify a variable containing information on strata. The most recent version of Appendix A to the GSS codebook does appear to contain the information needed to specify the full survey design; I have quoted it below (see http://gss.norc.org/Documents/codebook/A.pdf). The variable VSTRAT referred to below does appear in the current version of the GSS codebook (http://gss.norc.org/documents/codebook/GSS_Codebook.pdf). However, if you have older versions of the data, I am not sure which sample design variables are available.


    Here is sample Stata code to analyze the variable ANALYSISVAR within a GSSDATAFILE with the weight variable WTVAR
    (either WTSSALL or WTSSNR):
    use GSSDATAFILE.dta, clear
    svyset vpsu [weight=WTVAR], strata (vstrat)
    svy: proportion ANALYSISVAR // point estimates and design adjusted s.e.'s
    svy: tabulate ANALYSISVAR, deff //deff
    tab ANALYSISVAR [weight=round(WTVAR,1.0)] // Weighted frequency
    Richard T. Campbell
    Emeritus Professor of Biostatistics and Sociology
    University of Illinois at Chicago


    • #3
      The variable VSTRAT does appear in GSS_Codebook, Dick; it's on page 12.
      Steve Samuels
      Statistical Consulting

      Stata 14.2


      • #4
        I think there is a misunderstanding here. First, I said that until recently a stratum variable was not available. It is now, as can be seen in the current version of the codebook, for which I provided a link and from which I provided an example. However, it was not available until relatively recently, as shown in the quote below from a discussion of the GSS found on UCLA's IDRE website. One can verify this by looking at the 2010 GSS codebook, which is available on the ICPSR website. The site for the UCLA discussion is:

        http://www.ats.ucla.edu/stat/stata/f...setups.htm#GSS


        GSS (General Social Survey)

        The GSS data and documentation can be found here. There are datasets from 1972 to 2010.
        The 2010 data are used for this example. Please note that although the sampling design includes stratification, the stratification variable was not released in the dataset.
        NOTE: The difference in estimated population sizes between Stata and SAS has to do with the 996 missing cases on the variable wwwhr.
        Stata

        svyset sampcode [pw=wtssnr]

              pweight: wtssnr
                  VCE: linearized
          Single unit: missing
             Strata 1: <one>
        Richard T. Campbell
        Emeritus Professor of Biostatistics and Sociology
        University of Illinois at Chicago


        • #5
          I did misunderstand, Dick. I apologize.
          Steve Samuels
          Statistical Consulting

          Stata 14.2


          • #6
            Richard, I've never analyzed GSS. Apparently GSS has switched to a rotating panel design (https://ropercenter.cornell.edu/general-social-survey/), but I see nothing about this in the Codebook for the combined data 1972-2014 (http://gss.norc.org/documents/codebook/GSS_Codebook.pdf). Appendix A of the codebook shows many changes over the years, including changes to the sampling frame and target population.

            Multiple year surveys: When multi-year surveys draw an independent sample in each year, I've always treated the years as strata. Or, rather, I've created "superstrata" that group year and year-specific strata. I haven't found any guidance about what to do for GSS. However, the fact that the stratification changed in many years suggests that the super-stratum approach may be best.

            Note also this interesting article about weighting for multi-year analysis.

            Chu, Adam, J Michael Brick, and Graham Kalton. 1999. Weights for combining surveys across time or space. Bulletin of the International Statistical Institute, Contributed Papers 2, 103-104.
            http://www.tilastokeskus.fi/isi99/pr...o/kalt0185.pdf
            Steve Samuels
            Statistical Consulting

            Stata 14.2


            • #7
              The GSS dates back to 1972, at which time things were, relative to what we do now, much simpler. The first couple of years used quota sampling. It was intended to be a basic data set for undergraduate education and a basis for "social indicators" research, and it was deliberately kept simple, e.g. "self weighting." The codebook and documentation were also kept relatively simple. Of course, technically competent people understood that there was a design effect, but for the purpose of undergraduate education it was ignored, in part because software to handle the complex survey design was not available until much later in the game. Over time, the design has become increasingly complex, to the point where each biennial survey now contains a panel and a repeated cross-sectional component. As software for analysis of complex survey designs became widely available, and as the survey came to be used for much more than teaching purposes, investigators began to push for more sample design information, and NORC has responded.

              A major rationale for the GSS has been the investigation of time trends, for example in attitudes toward abortion, capital punishment, gun control and many other variables. Thus YEAR has been seen as a major variable of interest and not merely as a stratification variable. There are few surveys that have been replicated over a period of more than 45 years. See Marsden, PV (ed.) 2012. Social Trends in American Life: Findings from the General Social Survey Since 1972. Princeton University Press, for many examples of time trends. In this regard, readers might find the following of some interest: Greg J. Duncan and Graham Kalton. 1987. Issues of Design and Analysis of Surveys across Time. International Statistical Review / Revue Internationale de Statistique, Vol. 55, No. 1 (Apr.), pp. 97-117.
              Richard T. Campbell
              Emeritus Professor of Biostatistics and Sociology
              University of Illinois at Chicago


              • #8
                Interesting discussion. To be perfectly honest, I didn't even realize until recently that GSS had pweights. Perhaps that reflects the fact that I mostly saw it being used in textbooks that featured SPSS.

                If I follow Steve, he suggests using something like

                egen yearvstrat = group(year vstrat)
                svyset vpsu [weight=WTVAR], strata (yearvstrat)

                Does that sound right? Or do I misunderstand?
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology
                StataNow Version: 19.5 MP (2 processor)

                WWW: https://www3.nd.edu/~rwilliam


                • #9
                  Dick Campbell: Thanks for the very enlightening history of GSS.

                  Richard Williams: Your code is my best guess for svyset. The code might not be strictly correct for recent rotating panel designs, but for analysis of trends should be conservative.

                  A personal recollection of my initial exposure to this question: Years ago, a student analyzed five years of data from the US Behavioral Risk Factor Surveillance System (BRFSS). I was new to the topic and asked her to write to the BRFSS statisticians. Their advice: treat year as the PSU!
                  Steve Samuels
                  Statistical Consulting

                  Stata 14.2


                  • #10
                    It turns out that all this discussion is a bit moot. The GSS strata are distinct to the year of survey, or at least they have different numbers. Here is a table from the combined GSS 1975-2014. For reasons that don't matter now I eliminated the 1975 survey even though I could have used it. If you look at the minimum value of vstrat and the max you will see that there is no overlap from year to year. There are a few strata with a single psu and I have accounted for that in the svyset command.

                    Code:
                    . table year, c(min vstrat max vstrat)
                    
                    ------------------------------------
                    GSS YEAR  |
                    FOR THIS  |
                    RESPONDEN |
                    T         | min(vstrat)  max(vstrat)
                    ----------+-------------------------
                         1975 |        7001         7050
                         1976 |        7101         7150
                         1977 |        7201         7250
                         1978 |        7301         7350
                         1980 |        7401         7450
                         1982 |        7501         7598
                         1983 |        7601         8031
                         1984 |        8101         8162
                         1985 |        8201         8254
                         1986 |        8301         8354
                         1987 |        8401         8456
                         1988 |        1001         1066
                         1989 |        1067         1131
                         1990 |        1132         1195
                         1991 |        1196         1259
                         1993 |        1260         1345
                         1994 |        1346         1451
                         1996 |        1458         2457
                         1998 |        1540         1643
                         2000 |        1644         1745
                         2002 |        1746         1847
                         2004 |        1848         1956
                         2006 |        1957         2105
                         2008 |        2106         2239
                         2010 |        2240         2373
                         2012 |        3001         3066
                         2014 |        3101         3166
                    ------------------------------------
                    
                    .
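                    The min/max table is only an indirect check: the 1996 row as printed (1458 to 2457) spans the ranges listed for several later years, so ranges alone cannot show that no individual code is reused. A more direct check might look like the sketch below. It assumes the pooled file is in memory with the variables year and vstrat, and it re-sorts the data.

                    ```stata
                    * Sketch only: assumes year and vstrat exist in the pooled GSS file in memory.
                    * Within each vstrat code, sort by year and require that the first and
                    * last year are the same, i.e. the code occurs in exactly one survey year.
                    * Observations with missing vstrat are skipped.
                    bysort vstrat (year): assert year[1] == year[_N] if !missing(vstrat)
                    ```

                    If the assertion passes silently, every stratum code belongs to a single survey year.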
                    The obvious upshot of this is that you get exactly the same standard errors if you use the design variable vstrat as the stratification variable or you create a new stratification variable to include year. In the example below I include the centered value of year in my models, but you get the same result if year is not included.

                    Code:
                    * c_year (centered year) and yearvstrat (year-by-vstrat groups)
                    * are assumed to have been created beforehand

                    *no design variables set
                    logit gunlaw c.c_year##c.c_year female educ, or
                    estimates store nosvyset
                    
                    *use provided design variables
                    svyset vpsu [pw=wtssall], strata(vstrat) singleunit(certainty)
                    svy: logit gunlaw c.c_year##c.c_year female educ, or
                    estimates store standard
                    
                    *add year to strata definition
                    svyset vpsu [pw=wtssall], strata(yearvstrat) singleunit(certainty)
                    svy: logit gunlaw c.c_year##c.c_year female educ, or
                    estimates store plus_year
                    estout nosvyset standard plus_year, cells(b se t)
                    Here is the table of results.

                    Code:
                    . estout nosvyset standard plus_year, cells(b se t)
                    
                    ---------------------------------------------------
                                     nosvyset     standard    plus_year
                                       b/se/t       b/se/t       b/se/t
                    ---------------------------------------------------
                    gunlaw                                             
                    c_year           .0037118     .0050951     .0050951
                                     .0011201      .001404      .001404
                                     3.313815      3.62908      3.62908
                    c.c_year#c~r    -.0010987    -.0011302    -.0011302
                                     .0001001     .0001265     .0001265
                                    -10.97142    -8.934293    -8.934293
                    female           .7714604     .7392479     .7392479
                                     .0263181       .02931       .02931
                                     29.31296     25.22167     25.22167
                    educ             .0361226     .0320655     .0320655
                                     .0041668     .0046143     .0046143
                                     8.669083     6.949182     6.949182
                    _cons             .517544     .5978114     .5978114
                                     .0590795     .0668919     .0668919
                                     8.760135     8.936974     8.936974
                    ---------------------------------------------------
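
                    The commands above rely on two derived variables, c_year and yearvstrat, whose construction is not shown in this post. One plausible way to build them, and to list the strata that have a single PSU, is sketched here. Centering at the sample mean is an assumption (the post does not say which centering constant was used), and the egen line simply follows the suggestion in #8.

                    ```stata
                    * Sketch only: the centering constant is an assumption, not taken from the post.
                    summarize year, meanonly
                    generate c_year = year - r(mean)        // year centered at its sample mean

                    * One code per year-by-stratum cell, as suggested earlier in the thread.
                    egen yearvstrat = group(year vstrat)

                    * Declare the design, then list only the strata with a single sampling unit.
                    svyset vpsu [pw=wtssall], strata(vstrat)
                    svydescribe, single
                    ```

                    The strata that svydescribe lists are the ones affected by the singleunit(certainty) option in the svyset commands above.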
                    Richard T. Campbell
                    Emeritus Professor of Biostatistics and Sociology
                    University of Illinois at Chicago


                    • #11
                      I realized later that I should have excluded cases in the panel portion of the design to get correct estimates of standard errors (or accounted for the repeat measures) but for demonstration purposes this will do.
                      Richard T. Campbell
                      Emeritus Professor of Biostatistics and Sociology
                      University of Illinois at Chicago


                      • #12
                        Fantastic. Thanks Dick. So it appears that when using GSS with pooled years or even only one year, you should stratify on vstrat. In fairness to Treiman, who I quoted in the first post, it doesn't sound like vstrat was included with the GSS at the time he was writing. It sounds like the older years were retrofitted with vstrat, since you were able to run frequencies on it?
                        -------------------------------------------
                        Richard Williams, Notre Dame Dept of Sociology
                        StataNow Version: 19.5 MP (2 processor)

                        WWW: https://www3.nd.edu/~rwilliam


                        • #13
                          Yes, it was retrofitted. Of course, the design information has always been there, but NORC/GSS chose not to make it available until very recently.
                          Richard T. Campbell
                          Emeritus Professor of Biostatistics and Sociology
                          University of Illinois at Chicago


                          • #14
                            Thanks for an interesting and useful thread here. One minor addition: the reinterviews of persons in the panels are not included in the cumulative data file so you don't need to block, drop, reweight, or otherwise think about them.
