Help with creating new Datasets in Stata

Jesse Bilger

Join Date: Mar 2022

Posts: 5
#1

Help with creating new Datasets in Stata

08 Mar 2022, 11:29

Hi, I am working on an undergraduate thesis paper, and need some help creating a new dataset in stata!

What I want is to create a new dataset, where the columns are values calculated from a pre-existing dataset which I have loaded into Stata. For example, the columns I want in my dataset look something like this;

| Year | Occupation | assoc_educ | bach_educ | hs_educ | N | Mean Wage |

My new dataset compresses rows from the original dataset; where rows previously were individuals, in the new dataset I group all the rows which share an education level and occupation. I also want to take the mean wage for these groups, where the individual's wages are already existing in the original dataset and a variable (n) which is the number of people in the group. For example, for a given row, I calculate the N and Mean Wage like this;

sum incwage if assoc_educ == 1 & occ == 3255

gen assoc_nurse_n = r(N)
gen assoc_nurse_mean = r(mean)

So I have these two values stored under variable names, I would like to enter these in the columns as such;

| Year | Occupation | assoc_educ | bach_educ | hs_educ | N | Mean Wage |
| 2018 | 3255 | 1 | 0 | 0 | 12526 | 49042.785 |

Variables like year, occupation and the dummies _educ are unchanged from the original dataset.

Thank you for any help, this is probably super basic, but I am new to Stata and Statalist forum. If you have any questions, ask away, if I made any forum faux pas, please let me know.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#2

08 Mar 2022, 11:53

Your problem statement is a bit unclear. If you want each observation (row) in the result to represent groups which share an education level and occupation, it seems implausible that there could be a year variable (column) unless there is something unusual about your data that makes all people with the same education and occupation also be in the same year. Similarly, if they all share the same education level, why would you have three variables that appear to be different education levels? Wouldn't you want a single variable that takes on values like associate degree, bachelor degree, high school diploma? It is even harder to figure this out because you did not provide example data. Since we don't know where you're starting from, it's hard to give clear directions to get you where you want to go.

Speaking in general terms, it sounds as if the solution to your problem will come from using the -collapse- command, perhaps with some tweaking of the data before or after. If you are not familiar with -collapse-, run -help collapse- to read the help file. If this doesn't enable you to solve your problem, please post back with 1) a clearer explanation, and 2) example data prepared with the -dataex- command.

If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
1 like
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2396
#3

08 Mar 2022, 12:48

I share Clyde's uncertainty about what you really want. Continuing in a similar direction: I wonder if you want to create a data set in which observations (rows) represent groups of individuals *only* as a way to be able to obtain/report summary statistics for these groups, and not because you want to treat such groups as the unit of analysis. If that's right, using -collapse- would likely be unnecessary, and I'd guess that your idea might be based in thinking based on how you might do something in a spreadsheet rather than a data analysis program. If I'm right, you might only need to use one of Stata's table commands. So, in responding to Clyde's request for clarification, speaking to this though might also help you get a better answer.
1 like
Comment

Jesse Bilger

Join Date: Mar 2022
Posts: 5

08 Mar 2022, 13:41

My dataset comes from the current population survey, the year variable is the year in which the given row was recorded. Due to tech limitations on my end, I can only work year by year as any year has around 3 million observations. Thus, I have around ~20 or so datasets. Ideally, I would work through each dataset with a .do file that calculates the relevant stats and appends them to the custom dataset.

In the CPS dataset, education level is a categorical variable like you would want, however, I am working towards running linear regression on this dataset so I have them set up as dummy variables. Switching between the categorical data and dummy variables is trivial. I'll include both the education (educd) and dummies in my append, however, the education variable uses a numerical code, the relevant ones are 63,64,65 (various high school or equivalent diplomas), 81 (associates) and 101 (bachelors).

The occ variable, like education, is a numerical code, it can be cross referenced here; https://usa.ipums.org/usa/volii/occ2018.shtml though I doubt this is of any particular relevance.

I am running Stata BE 17.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int(year occ) long incwage int educd float(hs_educ assoc_educ bach_educ)
2018 4720      0  65 1 0 0
2018 4720   1600  65 1 0 0
2018 3605  10000  64 1 0 0
2018    0      0  71 0 0 0
2018 9640      0  30 0 0 0
2018    0      0  63 1 0 0
2018    0      0  71 0 0 0
2018 7220      0  30 0 0 0
2018 7330  24000  63 1 0 0
2018 2722      0  65 1 0 0
2018    0      0  71 0 0 0
2018    0      0  40 0 0 0
2018    0      0  71 0 0 0
2018    0      0  64 1 0 0
2018 3870  29000  71 0 0 0
2018    0      0  63 1 0 0
2018 8140      0  50 0 0 0
2018 9620  24100  63 1 0 0
2018 5400   4500  71 0 0 0
2018 5840  30000  81 0 1 0
2018    0      0  71 0 0 0
2018    0      0  63 1 0 0
2018    0      0  63 1 0 0
2018    0 999999  17 0 0 0
2018    0      0  63 1 0 0
2018 4220   5700  65 1 0 0
2018 4600    960  65 1 0 0
2018    0      0  63 1 0 0
2018 4220    480  64 1 0 0
2018 4020  17000  65 1 0 0
2018 4760  13000  71 0 0 0
2018    0      0  65 1 0 0
2018    0      0  40 0 0 0
2018 2300   4800  71 0 0 0
2018    0      0  63 1 0 0
2018    0      0  26 0 0 0
2018 4160   3000  65 1 0 0
2018 9620      0  50 0 0 0
2018  710   5400 101 0 0 1
2018    0      0  40 0 0 0
2018 4720   6100  63 1 0 0
2018    0      0  26 0 0 0
2018    0      0  63 1 0 0
2018    0      0  64 1 0 0
2018    0      0  30 0 0 0
2018    0 999999  25 0 0 0
2018 9620   1900  30 0 0 0
2018    0      0  63 1 0 0
2018 9620   5500  64 1 0 0
2018 7750   3000  65 1 0 0
2018    0      0 101 0 0 1
2018    0      0  63 1 0 0
2018 8800  11700  71 0 0 0
2018    0      0  63 1 0 0
2018    0      0  71 0 0 0
2018 5840  30000  81 0 1 0
2018    0      0  63 1 0 0
2018    0      0  61 0 0 0
2018    0      0  65 1 0 0
2018 9620      0  64 1 0 0
2018 4720      0 101 0 0 1
2018 2910  24700  40 0 0 0
2018    0      0  65 1 0 0
2018 9640    400  63 1 0 0
2018 3870  20000  65 1 0 0
2018 4220    220  71 0 0 0
2018 5260      0  81 0 1 0
2018    0      0  40 0 0 0
2018 9620  21000  64 1 0 0
2018    0      0  63 1 0 0
2018 4020      0  63 1 0 0
2018 5400   4500  71 0 0 0
2018    0      0  30 0 0 0
2018 5560   1500  71 0 0 0
2018    0      0  71 0 0 0
2018    0      0  17 0 0 0
2018 7150      0 101 0 0 1
2018    0      0  63 1 0 0
2018 4020      0  63 1 0 0
2018 9620   6000  40 0 0 0
2018  230  10000  71 0 0 0
2018 4220      0  64 1 0 0
2018 4760   4300  71 0 0 0
2018    0      0  63 1 0 0
2018    0      0  65 1 0 0
2018 4540   5000  71 0 0 0
2018    0      0 101 0 0 1
2018 4110      0  71 0 0 0
2018 4110   3000  63 1 0 0
2018 4030   1500  40 0 0 0
2018    0      0  65 1 0 0
2018    0      0  40 0 0 0
2018    0      0 101 0 0 1
2018    0      0 101 0 0 1
2018    0      0  64 1 0 0
2018 3870  20000  63 1 0 0
2018    0      0  63 1 0 0
2018    0      0  40 0 0 0
2018    0      0  50 0 0 0
2018 4110  35000  64 1 0 0
end
label values year YEAR
label def YEAR 2018 "2018", modify
label values educd EDUCD
label def EDUCD 17 "grade 4", modify
label def EDUCD 25 "grade 7", modify
label def EDUCD 26 "grade 8", modify
label def EDUCD 30 "grade 9", modify
label def EDUCD 40 "grade 10", modify
label def EDUCD 50 "grade 11", modify
label def EDUCD 61 "12th grade, no diploma", modify
label def EDUCD 63 "regular high school diploma", modify
label def EDUCD 64 "ged or alternative credential", modify
label def EDUCD 65 "some college, but less than 1 year", modify
label def EDUCD 71 "1 or more years of college credit, no degree", modify
label def EDUCD 81 "associate's degree, type not specified", modify
label def EDUCD 101 "bachelor's degree", modify

The collapse code worked, I collapsed using the following commands;

gen index = 1
collapse (mean) incwage (count) index, by(hs_educ assoc_educ bach_educ occ year)

I use create and count index because I want to collapse by occupation and count the amount of people in the given occupation, collapse does not like targeting and collapsing by the same variable, so I just added a variable that is 1 for all entries and had it count that instead, I can rename/manipulate it post mortem.

So thank you for help with that.

My next problem is that now that I create the custom datasets to my hearts desire, I now need to merge them. To elaborate, I have multiple datasets, that observe the same variables for income/wage, education occupation etc for different years. For example this dataset include survey results from 2017 and 2019;

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int(year occ) long incwage int educd float(assoc_educ hs_educ bach_educ)
2017    0      0  25 0 0 0
2017  350  38500 101 0 0 1
2017 6260  72000  64 0 1 0
2017    0      0  63 0 1 0
2017    0      0  40 0 0 0
2017  230  18000  63 0 1 0
2017    0      0  50 0 0 0
2017    0      0  40 0 0 0
2017    0 999999  14 0 0 0
2017    0 999999   2 0 0 0
2017 8140  65000  63 0 1 0
2017  136  57000  81 1 0 0
2017    0 999999   1 0 0 0
2017 4620  15000  71 0 0 0
2017    0      0  71 0 0 0
2017 4120   1200  71 0 0 0
2017 3930  30000 101 0 0 1
2017  120 103000 101 0 0 1
2017    0      0  63 0 1 0
2017 2430  43500 114 0 0 0
2017    0      0  63 0 1 0
2017 5120      0  71 0 0 0
2017 1410 160000 101 0 0 1
2017 9130  25700 114 0 0 0
2017    0      0  71 0 0 0
2017    0      0  50 0 0 0
2017    0 999999  30 0 0 0
2017    0 999999  14 0 0 0
2017  430 100000 101 0 0 1
2017 4710 344000 101 0 0 1
2017 4710 120000 101 0 0 1
2017 4850  50000 101 0 0 1
2017    0      0  71 0 0 0
2017 4820  20100  64 0 1 0
2017 7200      0  63 0 1 0
2017    0 999999  25 0 0 0
2017    0 999999  17 0 0 0
2017    0 999999  14 0 0 0
2017 4500      0  71 0 0 0
2017 4965    200  71 0 0 0
2017    0      0  63 0 1 0
2017 3535  38000 101 0 0 1
2017    0      0  81 1 0 0
2017 7260  28000  40 0 0 0
2017  910  19500  81 1 0 0
2017 5510   5600  50 0 0 0
2017    0      0  50 0 0 0
2017    0 999999  22 0 0 0
2017    0      0  63 0 1 0
2017    0      0  63 0 1 0
2017    0      0  63 0 1 0
2017 6440  29700  63 0 1 0
2017 4840  56000 101 0 0 1
2017 9130  24700  63 0 1 0
2017 2320  53000 114 0 0 0
2017 2320  51000 114 0 0 0
2017    0      0  50 0 0 0
2017    0 999999  23 0 0 0
2017 2200 344000 116 0 0 0
2017 2340  10000  65 0 1 0
2017    0 999999  25 0 0 0
2017    0 999999  23 0 0 0
2017    0 999999  17 0 0 0
2017    0      0  50 0 0 0
2017    0      0 101 0 0 1
2017 3060 344000 115 0 0 0
2017    0 999999  15 0 0 0
2017    0 999999   2 0 0 0
2017 4510  15000  81 1 0 0
2017 4760  30000  81 1 0 0
2017 9620   1500  71 0 0 0
2017  800  58000 114 0 0 0
2017 2310      0 114 0 0 0
2017 5220  25000  71 0 0 0
2017 8965  29500  71 0 0 0
2017    0      0  26 0 0 0
2017 3740  65000  65 0 1 0
2017 3600  65000 101 0 0 1
2017    0 999999  30 0 0 0
2017    0 999999  25 0 0 0
2017  800  42000 101 0 0 1
2017 7360  58000  65 0 1 0
2017    0 999999  16 0 0 0
2017    0 999999  14 0 0 0
2017 1020 114000 114 0 0 0
2017  430 118000  71 0 0 0
2017 1020  60000 101 0 0 1
2017 4760 100000 101 0 0 1
2017    0 999999  11 0 0 0
2017    0 999999   1 0 0 0
2017 4120  32000 101 0 0 1
2017    0      0  81 1 0 0
2017 3930  75000  61 0 0 0
2017    0      0  61 0 0 0
2017 8965      0  65 0 1 0
2017    0      0  64 0 1 0
2017 7200  15000  71 0 0 0
2017 5700  38000  65 0 1 0
2017 9130   4000  50 0 0 0
2017    0      0  65 0 1 0
end
label values year YEAR
label def YEAR 2017 "2017", modify
label values educd EDUCD
label def EDUCD 1 "n/a", modify
label def EDUCD 2 "no schooling completed", modify
label def EDUCD 11 "nursery school, preschool", modify
label def EDUCD 14 "grade 1", modify
label def EDUCD 15 "grade 2", modify
label def EDUCD 16 "grade 3", modify
label def EDUCD 17 "grade 4", modify
label def EDUCD 22 "grade 5", modify
label def EDUCD 23 "grade 6", modify
label def EDUCD 25 "grade 7", modify
label def EDUCD 26 "grade 8", modify
label def EDUCD 30 "grade 9", modify
label def EDUCD 40 "grade 10", modify
label def EDUCD 50 "grade 11", modify
label def EDUCD 61 "12th grade, no diploma", modify
label def EDUCD 63 "regular high school diploma", modify
label def EDUCD 64 "ged or alternative credential", modify
label def EDUCD 65 "some college, but less than 1 year", modify
label def EDUCD 71 "1 or more years of college credit, no degree", modify
label def EDUCD 81 "associate's degree, type not specified", modify
label def EDUCD 101 "bachelor's degree", modify
label def EDUCD 114 "master's degree", modify
label def EDUCD 115 "professional degree beyond a bachelor's degree", modify
label def EDUCD 116 "doctoral degree", modify

I run the same collapse command as before and get the same format as with my dataset from 2018. What is the easiest way for me to merge these two collapsed datasets? Note that I want to add the rows from the 2018 dataset to the rows of the 2019 and 2017 dataset and that the columns have the same names.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#5

08 Mar 2022, 14:08

I am working towards running linear regression on this dataset so I have them set up as dummy variables.

Unless you are using an ancient version of Stata, that is not true. Just use i.educd in the list of variables in your regression command and Stata will create (virtual) indicator variables for the levels of education on the fly, without cluttering up your data set with unnecessary, redundant information. What I think your approach has primarily accomplished is collapsing the many, many categories of educd into a smaller, more manageable coarser-grained classification system, which is fine. Still, I would have done that with:

Code:

recode educd (1/61 = 0 "No degree") /// (63/71 = 1 "H.S. Education") /// (81 = 2 "Associate's Degree") /// (101/max = 3 "Bachelor's Degree"), gen(education)

and used the new education variable.

collapse does not like targeting and collapsing by the same variable, so I just added a variable that is 1 for all entries and had it count that instead,

Also not true, even with ancient versions of Stata. You just need to make your intentions clear to Stata:

Code:

collapse (count) N = incwage (mean) mean_wage = incwage, by(education year)

will do it. Of course, there is nothing wrong with creating a variable index = 1 and doing what you did. But it is not necessary.

Be that as it may, the combining of data sets you refer to is accomplished with the -append- command, not with -merge-.

Code:

clear append using 2017_data 2018_data 2019_data

Replace the italicized material here with the actual names you have used for those data sets. You can append as many as you wish.
Comment

Jesse Bilger

Join Date: Mar 2022
Posts: 5

08 Mar 2022, 20:11

Okay, so with your help, I have managed to create the merged dataset of my dreams, I have one last question, which is probably the most difficult. But, if you can solve this, I may add statlist.org to the dedications of my thesis. So I have this dataset;

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int(year occ education) double meanwage long N int Occupation
2003  800 0 32917.857142857145    28 0
2003  800 1  27329.23588039867   301 0
2003  800 2 29622.027027027027  1036 0
2003  800 3  46300.15148609779  5215 0
2003  800 4  55996.32958801498  1602 0
2003 1010 0           41563.75    24 1
2003 1010 1  44294.69026548672   113 1
2003 1010 2  49831.22424242424   330 1
2003 1010 3  56347.17302452316  1468 1
2003 1010 4 61304.469273743016   537 1
2003 2300 0  9319.801980198019   101 2
2003 2300 1 10752.274881516587   211 2
2003 2300 2 13326.242236024846   322 2
2003 2300 3 18543.014623172105   889 2
2003 2300 4 26678.389513108614   267 2
2003 3130 0              44470    30 3
2003 3130 1 27295.454545454544   132 3
2003 3130 2  38220.24906654953  4553 3
2003 3130 3 40949.679370840895  4959 3
2003 3130 4 44844.477699530515  1704 3
2003 3640 0 16084.745762711864    59 5
2003 3640 1 17164.137931034482   174 5
2003 3640 2 18461.629213483146   178 5
2003 3640 3  20207.82608695652    92 5
2003 3640 4              33945    34 5
2003 6200 0  27805.21689497717   876 4
2003 6200 1 41557.788944723616   398 4
2003 6200 2   42954.4393442623   305 4
2003 6200 3  43974.65238095238   420 4
2003 6200 4  43741.48148148148   108 4
2004  800 0  29631.81818181818    22 0
2004  800 1 27517.275747508305   301 0
2004  800 2  30483.11623246493   998 0
2004  800 3 48877.439094571266  5213 0
2004  800 4  60994.62120343839  1745 0
2004 1010 0  22643.46153846154    26 1
2004 1010 1 43745.544554455446   101 1
2004 1010 2  51036.23693379791   287 1
2004 1010 3  59005.83451202263  1414 1
2004 1010 4  65470.52631578947   532 1
2004 2300 0  7730.731707317073    82 2
2004 2300 1 10562.303664921466   191 2
2004 2300 2  14402.10843373494   332 2
2004 2300 3 18159.078341013825   868 2
2004 2300 4 26568.178694158076   291 2
2004 3130 0  50689.65517241379    29 3
2004 3130 1  27268.20809248555   173 3
2004 3130 2  39571.98277765511  4529 3
2004 3130 3  43078.46368159204  5025 3
2004 3130 4   48169.2299044407  1779 3
2004 3640 0 14054.545454545454    66 5
2004 3640 1  17731.01265822785   158 5
2004 3640 2 19453.804347826088   184 5
2004 3640 3  16585.60975609756    82 5
2004 3640 4  24662.19512195122    41 5
2004 6200 0 31281.369284876906   853 4
2004 6200 1 41185.042735042734   468 4
2004 6200 2  41956.28571428572   350 4
2004 6200 3  46663.83495145631   412 4
2004 6200 4  36257.30337078652    89 4
2005  800 0  41038.46153846154    26 0
2005  800 1 28764.237288135595   649 0
2005  800 2  31467.52736116741  2467 0
2005  800 3  50580.68668844101 12553 0
2005  800 4  65994.00762449368  4197 0
2005 1010 0  25929.25925925926    54 1
2005 1010 1   50671.4979757085   247 1
2005 1010 2  50901.94029850746   670 1
2005 1010 3  58258.13324750563  3107 1
2005 1010 4 65138.864197530864  1215 1
2005 2300 0   9043.31797235023   217 2
2005 2300 1 11341.583333333334   480 2
2005 2300 2 14719.479166666666   864 2
2005 2300 3 18990.354575946334  2087 2
2005 2300 4 27543.374233128834   652 2
2005 3130 0  42154.28571428572    35 3
2005 3130 1 31394.150943396227   318 3
2005 3130 2 41027.919131279246 11327 3
2005 3130 3  45353.37881958979 11945 3
2005 3130 4 51372.877358490565  4240 3
2005 3640 0 13776.148148148148   135 5
2005 3640 1 19670.335917312663   387 5
2005 3640 2  19603.79474940334   419 5
2005 3640 3 23095.652173913044   161 5
2005 3640 4 25014.024390243903    82 5
2005 6200 0 27721.183541377715  2163 4
2005 6200 1 42877.179487179485  1170 4
2005 6200 2  45262.90868094701   887 4
2005 6200 3  53826.61912751678  1192 4
2005 6200 4  56779.81884057971   276 4
2006  800 0              44992    25 0
2006  800 1 28681.489361702126   611 0
2006  800 2 32355.401695599514  2477 0
2006  800 3 53400.342674264306 12811 0
2006  800 4    69796.243535496  4254 0
2006 1010 0  35306.89655172414    58 1
2006 1010 1  46941.22448979592   245 1
2006 1010 2  54303.53130016052   623 1
2006 1010 3  60504.53551912568  2928 1
2006 1010 4  69246.75824175825  1092 1
end
label values year YEAR
label def YEAR 2003 "2003", modify
label def YEAR 2004 "2004", modify
label def YEAR 2005 "2005", modify
label def YEAR 2006 "2006", modify
label values education education
label def education 0 "No degree", modify
label def education 1 "H.S. Education", modify
label def education 2 "Associate's Degree", modify
label def education 3 "Bachelor's Degree", modify
label def education 4 "Beyond Bachelors", modify
label values Occupation Occupation
label def Occupation 0 "Accountants", modify
label def Occupation 1 "Computer programmers", modify
label def Occupation 2 "Kindergarten and Below Teachers", modify
label def Occupation 3 "Nurses", modify
label def Occupation 4 "Mining and Construction supervisors", modify
label def Occupation 5 "Dental Assistants", modify

In the previous dataset, the rows were individuals observing their education, occupation, and income. In this new dataset, the rows observe the mean wage and the number of people in that occupation for a given year, occupation, and education level.

As an example, one row observes the mean wage of all surveyed computer programmers in 2003 whose highest education level was an associate's degree as well as the number of people who fit into this category (associate degree-holding computer programmers in 2003).

What I want to do with this, is create these two variables, most likely in a new dataset;

Click image for larger version

Name: Screenshot from 2022-03-08 17-56-18.png
Views: 2
Size: 5.3 KB
ID: 1653589

Where W_{B,i,t} would be the mean wage for Bachelor degree holders, for a given job, i, in a given year, t. N follows suit, but is for the number of workers in the given year, degree and occupation.

In this new dataset, a row would record the year, the occupation, and these ratios between Bachelor and Associate degree holders for the given year and occupation.

I know this is most likely some "collapse" command, where I collapse on occupation and year, however, I am struggling in figuring out how to make it so bachelors go in the numerator of the ratios, and associates go in the denominator.

Thank you for all and any help!

Edit: I should note that the wage data is held in the same column, regardless of education level. I am debating on breaking this up, such that one column call it "assoc_wage" has the wage values only if the row is for associates degrees and is otherwise empty, and another column for "bach_wages" with a similar pattern. Would this simplify this problem?

Last edited by Jesse Bilger; 08 Mar 2022, 20:17.

Comment

Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#7

08 Mar 2022, 20:20

You do seek the collapse command, but before we get there, I'm sorry, could you please explain the exact variables you'd like to create, please? I have an idea on how to do this, but before I recommend any code I wanna be specific about what you want.
Comment
Jesse Bilger

Join Date: Mar 2022

Posts: 5
#8

08 Mar 2022, 21:26

Sure, so I started this thread working with a lot of CPS survey results. In this, my rows were individuals surveyed, measuring their income, highest level of degree obtained and occupation.

With some help, I managed to create the dataset from my previous comment. That dataset has rows that collapse the original dataset on year, occupation and education. As a result, in the dataset I posted most recently, the rows are the mean wages for a specific occupation at a specific education level in a year. For example, the first line in the dataset reads that for the 2003 CPS survey, 28 people who did not graduate high school said their job was "Accountant" and on average they earned just shy of $33k a year.

To help with the ratios, I'll construct one for you I pulled the 3rd and 4th line from the dataset, as well as the relevant variable labels.

Code:

input int(year occ education) double meanwage long N int Occupation 2003 800 2 29622.027027027027 1036 0 2003 800 3 46300.15148609779 5215 0 label def education 2 "Associate's Degree", modify label def education 3 "Bachelor's Degree", modify label def Occupation 0 "Accountants", modify

In this case, W_{B,i,t} = 46300.15, the mean wage for accountants with bachelor's degrees in 2003. Likewise W_{A,i,t} = 29622.03, the mean wage for accountants with associates degrees in 2003.

N_{B,i,t} = 5215, that in the CPS survey from 2003, there were 5,215 accountants with bachelors degrees. And N_{A,i,t} = 1036, that in 2003, there were 1036 accountants with associates degrees in the CPS survey.

These numbers being put into the ratios come from the "meanwage" and "N" columns.

What I want to do is to calculate these ratios between Bachelor's and associates, for every occupation in each year.

If you are curious, I am attempting to estimate the rate of substitution between workers with bachelors and workers with associates, this is just a rough proof of concept type analysis, in the final draft I would be running a wage estimate from OLS, and not just the mean from a survey.

Thanks for helping
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#9

08 Mar 2022, 22:07

I know this is most likely some "collapse" command, where I collapse on occupation and year, however, I am struggling in figuring out how to make it so bachelors go in the numerator of the ratios, and associates go in the denominator.

Actually, -collapse- is not helpful for this because it does not calculate ratios or apply logarithms or other functions.

Code:

keep if inlist(education, 2, 3) // Bachelor's and Associates Degrees Only by Occupation year (education): gen w_ratio = meanwage[2]/meanwage[1] by Occupation year (education): gen n_ratio = N[2]/N[1] by Occupation year (education): keep if _n == 1 drop education meanwage N foreach v of varlist *_ratio { gen ln_`v' = ln(`v') }

Note: If for some combination of occupation and year, you don't actually have an observation for bachelor's and another for associates, then the variables you seek to calculate are non-existent for that combination and this code will generate missing values for those results. Remember also that if meanwage or N is 0 for either group, you will also get missing values, as you can't divide by 0 (associates = 0) nor can you take the logarithm of 0 (bachelors = 0).

I should note that the wage data is held in the same column, regardless of education level. I am debating on breaking this up, such that one column call it "assoc_wage" has the wage values only if the row is for associates degrees and is otherwise empty, and another column for "bach_wages" with a similar pattern. Would this simplify this problem?

No, no, no, no, no! It would actually make it much harder. The natural organization for data in Stata is exactly what you have: the long layout. Creating separate variables to present information on subsets of the data is rarely helpful in Stata and very often gums up the works. People who use spreadsheets a lot are used to that way of doing things--but Stata is not a spreadsheet, and instincts formed from using spreadsheets often just lead to trouble in Stata.
Comment
Jesse Bilger

Join Date: Mar 2022

Posts: 5
#10

09 Mar 2022, 22:21

Beautiful, your code worked perfectly, thank you for the help!
Comment

Announcement