Svysetting pooled GSS data using year as stratum variable

Richard Williams

Join Date: Apr 2014

Posts: 4987
#1

30 Jun 2016, 08:56

A student is pooling several years of GSS (General Social Survey) data. She sent me the following question (which I suppose might apply equally well to many situations where you have successive cross-sections of data).

Do you have an opinion on treating Year as a stratum variable with pooled data? I ask because Donald Treiman recommends this in his book Quantitative Data Analysis. He writes that "it is reasonable to treat Year as the stratum variable because the surveys from each year are independent, and Year is a fixed variable." His code to set up pooled GSS data is: svyset sampcode [pweight=weight], strata(year). This code is similar to the UCLA code you sent me, with the addition of year (see http://www.ats.ucla.edu/stat/stata/f...setups.htm#GSS). I haven't seen this approach recommended before, and am not sure if this is the best route to take. My current analysis uses svyset without year as a stratum variable. Instead, I include dummies for Year in my models.

On the one hand, the advice to treat year as a stratum variable sounds reasonable; on the other hand I don't remember seeing similar advice anywhere else. I have a suspicion it won't matter much either way, but I wonder if there is any consensus or controversy over whether or not to do this.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Dick Campbell

Join Date: Apr 2014

Posts: 279
#2

30 Jun 2016, 11:33

I don't see why one would want to treat year as a stratum variable. That certainly is not how NORC's sampling people think of the design. For many years it has not been clear exactly how to deal with the GSS survey design in programs like Stata that allow one to specify survey design variables. In particular, until recently the GSS documentation did not even specify a variable containing information on strata. The most recent version of Appendix A to the GSS codebook does appear to contain the information needed to specify the full survey design; see http://gss.norc.org/Documents/codebook/A.pdf. I have quoted it below. The variable VSTRAT referred to below does appear in the current version of the GSS codebook (http://gss.norc.org/documents/codebook/GSS_Codebook.pdf). However, if you have older versions of the data, I am not sure which sample design variables are available.

Here is sample Stata code to analyze the variable ANALYSISVAR within a GSSDATAFILE with the weight variable WTVAR (either WTSSALL or WTSSNR):
use GSSDATAFILE.dta, clear
svyset vpsu [weight=WTVAR], strata (vstrat)
svy: proportion ANALYSISVAR // point estimates and design adjusted s.e.'s
svy: tabulate ANALYSISVAR, deff //deff
tab ANALYSISVAR [weight=round(WTVAR,1.0)] // Weighted frequency

Richard T. Campbell
Emeritus Professor of Biostatistics and Sociology
University of Illinois at Chicago
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#3

30 Jun 2016, 15:20

The variable VSTRAT does appear in GSS_Codebook, Dick; it's on page 12.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Dick Campbell

Join Date: Apr 2014

Posts: 279
#4

30 Jun 2016, 16:14

I think there is a misunderstanding here. First, I said that until recently a stratum variable was not available. It is now, as can be seen in the current version of the codebook, for which I provided a link and from which I provided an example. However, it was not available until relatively recently, as shown in the quote below from a discussion of the GSS found on UCLA's IDRE website. One can verify this by looking at the 2010 GSS codebook, which is available on the ICPSR website. The site for the UCLA discussion is:

http://www.ats.ucla.edu/stat/stata/f...setups.htm#GSS

GSS (General Social Survey)

The GSS data and documentation can be found here. There are datasets from 1972 to 2010.
The 2010 data are used for this example. Please note that although the sampling design includes stratification, the stratification variable was not released in the dataset.
NOTE: The difference in estimated population sizes between Stata and SAS has to do with the 996 missing cases on the variable wwwhr.
Stata

svyset sampcode [pw= wtssnr]

      pweight: wtssnr
          VCE: linearized
  Single unit: missing
     Strata 1: <one>

Richard T. Campbell
Emeritus Professor of Biostatistics and Sociology
University of Illinois at Chicago
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#5

30 Jun 2016, 18:07

I did misunderstand, Dick. I apologize.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#6

30 Jun 2016, 18:33

Richard, I've never analyzed GSS. Apparently GSS has switched to a rotating panel design (https://ropercenter.cornell.edu/general-social-survey/), but I see nothing about this in the Codebook for the combined data 1972-2014 (http://gss.norc.org/documents/codebook/GSS_Codebook.pdf). Appendix A of the codebook shows many changes over the years, including changes to the sampling frame and target population.

Multiple year surveys: When multi-year surveys draw independent samples in each year, I've always treated the years as strata. Or, rather, I've created "superstrata" that grouped year and year-specific strata. I haven't found any guidance about what to do for GSS. However, the fact that the stratification changed in many years suggests that the super-stratum approach may be best.

Note also this interesting article about weighting for multi-year analysis.

Chu, Adam, J Michael Brick, and Graham Kalton. 1999. Weights for combining surveys across time or space. Bulletin of the International Statistical Institute, Contributed Papers 2, 103-104.
http://www.tilastokeskus.fi/isi99/pr...o/kalt0185.pdf

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Dick Campbell

Join Date: Apr 2014

Posts: 279
#7

30 Jun 2016, 21:07

The GSS dates back to 1972, at which time things were, relative to what we do now, much simpler. The first couple of years used quota sampling. It was intended to be a basic data set for undergraduate education and a basis for "social indicators" research, and it was deliberately kept simple, e.g., "self weighting." The codebook and documentation were also kept relatively simple. Of course, technically competent people understood that there was a design effect, but for the purpose of undergraduate education it was ignored, in part because software to handle the complex survey design was not available until much later in the game. Over time, the design has become increasingly complex, to the point where each biennial survey now contains a panel and a repeated cross-sectional component. As software for analysis of complex survey designs became widely available, and as the survey came to be used for much more than teaching purposes, investigators began to push for more sample design information, and NORC has responded.

A major rationale for the GSS has been the investigation of time trends, for example in attitudes toward abortion, capital punishment, gun control, and many other variables. Thus YEAR has been seen as a major variable of interest and not merely as a stratification variable. There are few surveys that have been replicated over a period of more than 45 years. See Marsden, PV (ed.) 2012. Social Trends in American Life: Findings from the General Social Survey Since 1972. Princeton University Press, for many examples of time trends. In this regard, readers might find Greg J. Duncan and Graham Kalton, "Issues of Design and Analysis of Surveys across Time," International Statistical Review / Revue Internationale de Statistique, Vol. 55, No. 1 (Apr. 1987), pp. 97-117, of some interest.

Richard T. Campbell
Emeritus Professor of Biostatistics and Sociology
University of Illinois at Chicago
Richard Williams

Join Date: Apr 2014

Posts: 4987
#8

01 Jul 2016, 06:19

Interesting discussion. To be perfectly honest, I didn't even realize until recently that GSS had pweights. Perhaps that reflects the fact that I mostly saw it being used in textbooks that featured SPSS.

If I follow Steve, he suggests using something like

egen yearvstrat = group(year vstrat)
svyset vpsu [weight=WTVAR], strata (yearvstrat)

Does that sound right? Or do I misunderstand?

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#9

01 Jul 2016, 13:35

Dick Campbell: Thanks for the very enlightening history of GSS.

Richard Williams: Your code is my best guess for svyset. The code might not be strictly correct for recent rotating panel designs, but for analysis of trends it should be conservative.

A personal recollection of my initial exposure to this question: Years ago, a student analyzed five years of data from the US Behavioral Risk Factor Surveillance System (BRFSS). I was new to the topic and asked her to write to the BRFSS statisticians. Their advice: treat year as the PSU!

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2

Dick Campbell

Join Date: Apr 2014
Posts: 279

#10

03 Jul 2016, 14:32

It turns out that all this discussion is a bit moot. The GSS strata are distinct to the year of survey, or at least they have different numbers. Here is a table from the combined GSS 1975-2014. For reasons that don't matter now I eliminated the 1975 survey even though I could have used it. If you look at the minimum value of vstrat and the max you will see that there is no overlap from year to year. There are a few strata with a single psu and I have accounted for that in the svyset command.

Code:

. table year, c(min vstrat max vstrat)

------------------------------------
GSS YEAR  |
FOR THIS  |
RESPONDEN |
T         | min(vstrat)  max(vstrat)
----------+-------------------------
     1975 |        7001         7050
     1976 |        7101         7150
     1977 |        7201         7250
     1978 |        7301         7350
     1980 |        7401         7450
     1982 |        7501         7598
     1983 |        7601         8031
     1984 |        8101         8162
     1985 |        8201         8254
     1986 |        8301         8354
     1987 |        8401         8456
     1988 |        1001         1066
     1989 |        1067         1131
     1990 |        1132         1195
     1991 |        1196         1259
     1993 |        1260         1345
     1994 |        1346         1451
     1996 |        1458         2457
     1998 |        1540         1643
     2000 |        1644         1745
     2002 |        1746         1847
     2004 |        1848         1956
     2006 |        1957         2105
     2008 |        2106         2239
     2010 |        2240         2373
     2012 |        3001         3066
     2014 |        3101         3166
------------------------------------

.
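
A min/max table can hide interleaved codes, so a more direct check is to count the number of years in which each stratum code occurs. The following is only a minimal sketch, assuming the cumulative GSS file is in memory with the variables year and vstrat; n_years is a hypothetical helper name, not a GSS variable.

```stata
* Assumption: cumulative GSS file in memory, with variables year and vstrat.
preserve
keep year vstrat
drop if missing(year, vstrat)

* Keep one row per distinct (year, vstrat) pair.
duplicates drop

* n_years (hypothetical name) = number of survey years sharing a stratum code.
bysort vstrat: generate n_years = _N

* A count of zero means no stratum code appears in more than one year.
count if n_years > 1
list year vstrat if n_years > 1, sepby(vstrat)
restore
```

If the count is zero, every vstrat code belongs to exactly one year, so a variable built with egen ... group(year vstrat) defines the same strata as vstrat alone.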

The obvious upshot of this is that you get exactly the same standard errors whether you use the design variable vstrat as the stratification variable or create a new stratification variable that includes year. In the example below I include the centered value of year in my models, but you get the same result if year is not included.

Code:

* c_year is the centered value of year, created beforehand

* no design variables set
logit gunlaw c.c_year##c.c_year female educ, or
estimates store nosvyset

* use provided design variables
svyset vpsu [pw=wtssall], strata(vstrat) singleunit(certainty)
svy: logit gunlaw c.c_year##c.c_year female educ, or
estimates store standard

* add year to strata definition
* (skip the egen line if yearvstrat already exists, as in #8)
egen yearvstrat = group(year vstrat)
svyset vpsu [pw=wtssall], strata(yearvstrat) singleunit(certainty)
svy: logit gunlaw c.c_year##c.c_year female educ, or
estimates store plus_year

* estout is a community-contributed command (ssc install estout)
estout nosvyset standard plus_year, cells(b se t)

Here is the table of results.

Code:

. estout nosvyset standard plus_year, cells(b se t)

---------------------------------------------------
                 nosvyset     standard    plus_year
                   b/se/t       b/se/t       b/se/t
---------------------------------------------------
gunlaw                                             
c_year           .0037118     .0050951     .0050951
                 .0011201      .001404      .001404
                 3.313815      3.62908      3.62908
c.c_year#c~r    -.0010987    -.0011302    -.0011302
                 .0001001     .0001265     .0001265
                -10.97142    -8.934293    -8.934293
female           .7714604     .7392479     .7392479
                 .0263181       .02931       .02931
                 29.31296     25.22167     25.22167
educ             .0361226     .0320655     .0320655
                 .0041668     .0046143     .0046143
                 8.669083     6.949182     6.949182
_cons             .517544     .5978114     .5978114
                 .0590795     .0668919     .0668919
                 8.760135     8.936974     8.936974
---------------------------------------------------
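
For readers who want to see which strata the singleunit(certainty) option applies to, svydescribe can list them. This is a minimal sketch, assuming the same design variables as in the comparison above (vpsu, vstrat, wtssall) and the same model variables (gunlaw, c_year, female, educ).

```stata
* Declare the design as in the comparison above.
svyset vpsu [pw=wtssall], strata(vstrat) singleunit(certainty)

* List only the strata that contain a single sampling unit.
svydescribe, single

* Same listing, counting only sampling units that have complete data
* on the model variables.
svydescribe gunlaw c_year female educ, single
```

With singleunit(certainty), such strata are treated as certainty units and contribute nothing to the variance estimate.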

Richard T. Campbell
Emeritus Professor of Biostatistics and Sociology
University of Illinois at Chicago


Dick Campbell

Join Date: Apr 2014

Posts: 279
#11

03 Jul 2016, 20:18

I realized later that I should have excluded cases in the panel portion of the design to get correct estimates of standard errors (or accounted for the repeat measures) but for demonstration purposes this will do.

Richard T. Campbell
Emeritus Professor of Biostatistics and Sociology
University of Illinois at Chicago
Richard Williams

Join Date: Apr 2014

Posts: 4987
#12

04 Jul 2016, 12:21

Fantastic. Thanks Dick. So it appears that when using GSS with pooled years or even only one year, you should stratify on vstrat. In fairness to Treiman, who I quoted in the first post, it doesn't sound like vstrat was included with the GSS at the time he was writing. It sounds like the older years were retrofitted with vstrat, since you were able to run frequencies on it?

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Dick Campbell

Join Date: Apr 2014

Posts: 279
#13

04 Jul 2016, 13:59

Yes, it was retrofitted. Of course, the design information has always been there, but NORC/GSS chose not to make it available until very recently.

Richard T. Campbell
Emeritus Professor of Biostatistics and Sociology
University of Illinois at Chicago
Mike Hout

Join Date: May 2016

Posts: 1
#14

01 Mar 2018, 09:19

Thanks for an interesting and useful thread here. One minor addition: the reinterviews of persons in the panels are not included in the cumulative data file so you don't need to block, drop, reweight, or otherwise think about them.