Struggling to create a percent change variable that is organized by multiple subgroupings

Kian Williams

Join Date: Mar 2022

Posts: 8
#1


17 Mar 2022, 16:44

Hi all! Apologies for the possibly beginner question; I've been searching this forum all afternoon and have not been able to figure out how to do this.
I'm currently working with a survey dataset with two waves (wave 1 and wave 2). Each observation is a particular individual belonging to a particular social group, in a particular state, in wave 1 or 2 of the survey. I'm looking to do two things. First, for each wave I want to find the number of individuals in state x and group i who are employed by the public sector (a binary variable in my dataset), as a percent of the total number of individuals belonging to that social group in that state. Then, I'd like to find the percent change between waves for each (state, group) pair.

So if 25% of individuals (observations) in state 1 who are part of group 3 are employed in the public sector in wave 1, and 30% of individuals in state 1 who are part of group 3 are employed in the public sector in wave 2, then I would want this variable to be equal to ((30-25)/25)*100=10%.

I would appreciate all the help I can get. Thanks so much!
Clyde Schechter

Join Date: Apr 2014

Posts: 29799
#2

17 Mar 2022, 17:37

The -proportion- command will give you the proportions (which you can mentally multiply by 100 to get percentages). For the relative changes you can follow that with -nlcom-. The -proportion- command accepts the -svy:- prefix, so you can apply the survey design settings effortlessly.

If you would like more specific advice, you should post back with example data, using the -dataex- command to do so. Please choose your example so that it includes several groups in both waves 1 and 2. Also, show the output of -svyset- so that we can apply the same survey design settings as you are using.

If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.

By the way, ((30-25)/25)*100 = 20%, not 10%.
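
As an illustration of the -proportion- plus -nlcom- approach, here is a minimal sketch on made-up data (not the poster's survey): 1 of 4 observations has govjob == 1 in wave 1 and 3 of 10 in wave 2, i.e. 25% and 30%. The variable names and counts are assumptions for the example, no survey design is applied, and the _b[] names assume the factor-variable notation that -proportion- uses in Stata 16 and later.

```stata
clear
input byte(survey govjob) int n
1 1 1
1 0 3
2 1 3
2 0 7
end
* Turn the counts into one row per individual
expand n

* Proportion with govjob == 0 and govjob == 1 in each wave
proportion govjob, over(survey)

* Relative change in the share with govjob == 1 from wave 1 to wave 2, in percent
nlcom (100*(_b[1.govjob@2.survey] - _b[1.govjob@1.survey]) / _b[1.govjob@1.survey])
```

The point estimate from -nlcom- should be 20, matching the arithmetic above; -nlcom- also reports a delta-method standard error and confidence interval.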
Kian Williams

Join Date: Mar 2022

Posts: 8
#3

18 Mar 2022, 09:36

Hi Clyde. Thanks very much for the response. I am still a bit confused about how best to implement -proportion- in my data. This is a condensed version of what the data look like through -dataex-:

input int(survey state group) float govjob
1 1 1 0
1 1 1 0
1 1 2 1
1 1 3 0
1 2 1 0
1 2 1 1
1 2 2 1
1 2 2 1
1 2 2 1
1 2 3 0
1 3 1 1
1 3 1 0
1 3 2 1
1 3 2 0
1 3 2 0
1 3 3 0
1 4 1 0
1 4 2 0
1 4 3 1
1 4 3 0
1 5 1 0
1 5 2 0
1 5 2 1
2 1 1 0
2 1 1 0
2 1 2 1
2 2 1 0
2 2 1 0
2 2 2 1
2 2 3 0
2 2 3 0
2 3 1 1
2 3 1 1
2 3 2 0
2 3 2 0
2 4 1 0
2 4 2 0
2 4 3 1
2 5 1 1
2 5 2 0
2 5 2 0
2 5 3 1
2 5 3 0

I just need a variable that tells me the percent change in the percent of people in a particular group in a particular state that have a government job, between survey wave 1 and 2.
As in, taking the number of people in a group in a state as a percentage of the total number of people in that group in that state, and determining how this changes between waves. Thanks!
Clyde Schechter

Join Date: Apr 2014

Posts: 29799
#4

18 Mar 2022, 13:02

Code:

* Collect the distinct group and state values
levelsof group, local(groups)
levelsof state, local(states)
foreach s of local states {
    display "***OVERALL GROUP PROPORTIONS WITHIN STATE `s' BY SURVEY***"
    * Share of each group within state `s', separately by survey wave
    proportion group if state == `s', over(survey)
    foreach g of local groups {
        display "Percent Change from Survey 1 to Survey 2 for Group `g' in State `s'"
        * The wave 1 proportion is the denominator, so it must be nonzero
        if _b[`g'.group#1.survey] != 0 {
            nlcom (100*(_b[`g'.group#2.survey]-_b[`g'.group#1.survey]) ///
                /_b[`g'.group#1.survey])
        }
        else {
            display "Survey 1 Population for Group `g' in State `s' is 0--Percent Change Undefined"
        }
    }
    display _newline(5)
}

is, I believe, what you want.

Note: You did not provide the output of -svyset-. You have described your data as two waves of a survey. Most surveys use complex sampling designs, including sampling weights, perhaps multiple levels of sampling units, and perhaps stratification. If your data were collected using such a complex design, analyzing the data without accounting for that will produce incorrect results. Ignoring sampling weights, for example, can lead to severely biased estimates of the proportions. And ignoring multiple levels of sampling or stratification may lead to incorrect standard errors (and consequently confidence intervals, test statistics, and p-values).

If your survey did not in fact use a complex design, then you are fine with the above code. But if it did, you need to consult the documentation that the surveyors provide with the survey to get this design information and -svyset- your data set accordingly before you do any analysis with it. Nowadays, most professionally curated surveys will actually include in their documentation the specific -svyset- commands needed to use their data with Stata (and corresponding commands for other major statistical software packages). With older surveys you should at least be able to find a verbal description of the sampling design and the names of the variables in the data set that can be used to account for the design. The analyses must then have the -svy:- prefix, and selecting out single states requires using the -subpop()- option of the -svy:- prefix rather than an -if- condition.
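
To make the last point concrete, here is a minimal sketch of restricting a -svy:- analysis to one state with -subpop()- rather than -if-. The data are made up, and the variable names psu and wt are hypothetical stand-ins for whatever design variables the survey documentation specifies; the real -svyset- command may also need strata or further sampling stages.

```stata
clear
input int(psu state survey) byte govjob float wt
1 1 1 1 1.5
1 1 1 0 1.5
1 1 2 1 1.5
2 1 1 0 2
2 1 2 1 2
2 1 2 0 2
3 2 1 0 1
3 2 2 1 1
4 2 1 1 2.5
4 2 2 0 2.5
end

* Declare the (hypothetical) design: PSU identifier and sampling weight
svyset psu [pweight = wt]

* Estimate for state 1 only; observations from other states stay in the
* estimation sample so that the design-based variance is computed correctly
svy, subpop(if state == 1): proportion govjob, over(survey)
```

With -subpop()-, all four PSUs remain part of the design even though only the six state 1 observations contribute to the estimates; an -if- condition would instead drop the other states before the design is taken into account.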
Kian Williams

Join Date: Mar 2022

Posts: 8
#5

18 Mar 2022, 13:37

Thanks so much for the help, it's greatly appreciated. Apologies as I've never worked with survey data before and am learning the ropes now. Thanks for the patience.

I had not -svyset- my data and am looking into doing so. The dataset is weighted through pweights (sampling weights) that I think I would need to include in the -svyset- command. This thread: https://www.statalist.org/forums/for...01-predictions uses the same dataset that I am currently working with, and villages and urban blocks seem to be the primary sampling units. However, the household is my unit of observation (in that I'm looking at how households reacted to state-level policy changes).

I am running the above code, but Stata gives me the following error: [2.group#1.survey] not found
I imagine this might stem from a lack of observations for that particular group in that particular state and survey wave?
One thing to note is that I might have been a bit unclear in my final sentence: I'm not only looking for the percentage of a group in a state in a survey wave, but more specifically the percentage of a group in a particular state that has a public sector job (binary variable), and how that percentage changes for that state between waves, for each of the groups (categorical var). So the variables generated (one for each group) will have the same value for all observations in any given state in each of the two waves.
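
For reference, one way to build a variable of this kind while ignoring weights and the survey design entirely is sketched below on a subset of the example data from #3 (state 3, groups 1 and 2). The names pct_w1, pct_w2 and pct_change are made up for the illustration. With a complex survey design the percentages would need to be weighted, so this is only a rough starting point.

```stata
clear
input int(survey state group) float govjob
1 3 1 1
1 3 1 0
1 3 2 1
1 3 2 0
1 3 2 0
2 3 1 1
2 3 1 1
2 3 2 0
2 3 2 0
end

* Percent with a government job in each wave, within each (state, group) pair
bysort state group: egen pct_w1 = mean(cond(survey == 1, 100*govjob, .))
bysort state group: egen pct_w2 = mean(cond(survey == 2, 100*govjob, .))

* Relative change from wave 1 to wave 2, in percent; left missing when the
* wave 1 percentage is zero or either wave has no observations
gen pct_change = 100*(pct_w2 - pct_w1)/pct_w1 if pct_w1 > 0 & !missing(pct_w1, pct_w2)

list, noobs sepby(state group)
```

Every observation in a given (state, group) pair carries the same pct_change. In this subset, group 1 goes from 50% to 100% (a change of 100) and group 2 from 33.3% to 0% (a change of -100).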

Last edited by Kian Williams; 18 Mar 2022, 13:46.
Kian Williams

Join Date: Mar 2022

Posts: 8
#6

18 Mar 2022, 14:01

Given this information, would we set the PSU to the village level (even though my unit of observation is the household)? And would the strata be the states?

svyset PSUID [pweight = weights], strata(state)

... where PSUID is the village identifier.

Clyde Schechter

Join Date: Apr 2014

Posts: 29799
#7

18 Mar 2022, 19:47

Re #5: Sorry I misunderstood what you were asking. Here is code that will do what you want. I have written it for data that has been -svyset-.

Code:

by state group, sort: egen has_1 = max(survey == 1)
by state group: egen has_2 = max(survey == 2)
gen byte usable = has_1 & has_2
svy, subpop(if usable):proportion govjob, over(state group survey) coefl
matrix M = r(table)

frame create results int(state group) float(pct_w1 lb_w1 ub_w1 ///
    pct_w2 lb_w2 ub_w2 rel_diff lb_rel_diff ub_rel_diff)
levelsof state, local(states)
foreach s of local states {
    levelsof group if state == `s' & usable, local(groups)
    foreach g of local groups {
        local topost (`s') (`g')
        forvalues w = 1/2 {
            local topost `topost' (M["b", `"1.govjob@`s'.state#`g'.group#`w'.survey"']) ///
                (M["ll", `"1.govjob@`s'.state#`g'.group#`w'.survey"']) ///
                (M["ul", `"1.govjob@`s'.state#`g'.group#`w'.survey"'])
        }
        if _b[1.govjob@`s'.state#`g'.group#1.survey] != 0 {     
            nlcom (_b[1.govjob@`s'.state#`g'.group#2.survey] - _b[1.govjob@`s'.state#`g'.group#1.survey]) ///
            / _b[1.govjob@`s'.state#`g'.group#1.survey]
            matrix b = r(b)
            matrix V = r(V)
            local topost `topost' (b[1,1]) (b[1,1] + invnormal(0.025)*sqrt(V[1,1])) ///
                (b[1,1] + invnormal(0.975)*sqrt(V[1,1]))
        }
        else {
            local topost `topost' (.) (.) (.)
        }
        frame post results `topost'
    }
}

frame change results
foreach v of varlist pct_* lb_* ub_* rel_diff {
    replace `v' = `v' * 100
    format `v' %3.2f
}
list, noobs clean sepby(state group) ds

It also includes code that deals with non-existent combinations of group and state, or combinations of group and state for which only wave 1 or wave 2 observations, but not both, are present. It also deals with the possibility that the percent with government jobs in wave 1 is 0, so that the percent difference is undefined. Finally, since the output could be really voluminous and difficult to read through, I have also posted the results in a second frame in a more readable fashion. This last improvement requires that you be running version 16 or later.
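
The handling of incomplete (state, group) pairs can be seen in isolation with a small sketch that uses only the rows for states 1 and 5 from the example data in #3. The first three commands are the flag-building lines from the code above; the tag variable and the -list- call are additions for display only.

```stata
clear
input int(survey state group) float govjob
1 1 1 0
1 1 1 0
1 1 2 1
1 1 3 0
1 5 1 0
1 5 2 0
1 5 2 1
2 1 1 0
2 1 1 0
2 1 2 1
2 5 1 1
2 5 2 0
2 5 2 0
2 5 3 1
2 5 3 0
end

* Flag (state, group) pairs observed in wave 1, in wave 2, and in both
by state group, sort: egen has_1 = max(survey == 1)
by state group: egen has_2 = max(survey == 2)
gen byte usable = has_1 & has_2

* Show one row per (state, group) pair
egen byte tag = tag(state group)
list state group has_1 has_2 usable if tag, noobs
```

Here group 3 appears only in wave 1 in state 1 and only in wave 2 in state 5, so both pairs get usable == 0 and are excluded from the subpopulation.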

Re #6: Unlike Steve Samuels I am neither familiar with this data set, nor am I an expert in survey statistics. But I'm confident you won't go wrong following his advice in that thread.


Kian Williams

Join Date: Mar 2022

Posts: 8
#8

19 Mar 2022, 08:48

This is very helpful, thank you. I am getting an error and was wondering whether you are familiar with it. It occurs on the following line:

svy, subpop(if usable):proportion govjob, over(state group survey) coefl
(running proportion on estimation sample)
too many categories
Kian Williams

Join Date: Mar 2022

Posts: 8
#9

19 Mar 2022, 08:51

There are 33 states in the dataset and 6 groups. However, I am only interested in the percent change in government employment for two of the groups (those labeled 4 and 5). Is there an upper bound on the number of categories that can be assigned as subpops?
Clyde Schechter

Join Date: Apr 2014

Posts: 29799
#10

19 Mar 2022, 12:14

I don't really know--I've never encountered it, but I've never really done anything quite like this. It may be coming from -over()- rather than -subpop()-. Anyway, if you are only interested in groups 4 and 5, then you can replace -gen byte usable = has_1 & has_2- with -gen byte usable = has_1 & has_2 & inlist(group, 4, 5)- to reduce the load.
