individual tax data grouped by household question

Tiana Stussie

Join Date: Apr 2022

Posts: 4
#1

individual tax data grouped by household question

04 Apr 2022, 02:12

Hi all, my name is Tiana Stussie. I am currently completing my undergraduate research project. For my research project, I am focusing on the differences of the EITC between childless workers and households with children. For my sample, I am pulling data from 2019 Revised Supplemental Poverty Measure 2019 and I am using Stata 17.

Currently I am trying to find the average EITC received by single mother households by number of children (average EITC for households with 0,1,2,3 kids) as well as the average federal tax liability for single mother households. I am having difficulty because the EITC and federal tax liability are grouped by spm_id but are repeated for each individual observation. I am trying to find a way to not have these duplicates in my calculations for the average. I have included the variables I have been using below and the stata commands that I have tried.

Commands:
//Below are the commands that I used to create my sample. I wanted to focus on households that had a single parent who was female. So I chose the family type to be female reference person (4), the age of head of the household to be 18, and the marital status to be everything that excludes married.

gen single_pop = (spm_famtype==4 & spm_hage>=18 & spm_hmaritalstatus>3)
replace single_pop =. if spm_famtype==.
replace single_pop =. if spm_hage==.
replace single_pop =. if spm_hmaritalstatus==.
drop if single_pop == 0

//Below are the commands that I used to try to find the averages.
sort spm_id
by spm_id: egen unit_eitc = total(spm_eitc)
by spm_id: egen unit_eitc_avg = total(spm_eitc/_N)
by spm_id: egen unit_fedtax = total(spm_fedtaxbc)
by egen unit_fedtax_average = total(spm_fedtaxbc/_N)

Code:

* Example generated by -dataex-. For more info, type help dataex clear input double spm_id float(unit_eitc unit_eitc_avg) double spm_eitc float single_pop double(spm_hmaritalstatus spm_hage spm_famtype) 18001 0 0 0 1 5 65 4 18001 0 0 0 1 5 65 4 82001 0 0 0 1 7 72 4 82001 0 0 0 1 7 72 4 328001 0 0 0 1 7 37 4 328001 0 0 0 1 7 37 4 390001 0 0 0 1 5 49 4 390001 0 0 0 1 5 49 4 390001 0 0 0 1 5 49 4 393001 0 0 0 1 7 41 4 393001 0 0 0 1 7 41 4 411001 0 0 0 1 4 80 4 411001 0 0 0 1 4 80 4 630001 1058 529 529 1 7 45 4 630001 1058 529 529 1 7 45 4 690001 0 0 0 1 7 42 4 690001 0 0 0 1 7 42 4 764001 1916 958 958 1 7 19 4 764001 1916 958 958 1 7 19 4 840001 0 0 0 1 5 61 4 840001 0 0 0 1 5 61 4 937001 7296 1824 1824 1 6 35 4 937001 7296 1824 1824 1 6 35 4 937001 7296 1824 1824 1 6 35 4 937001 7296 1824 1824 1 6 35 4

For example for spm_id 63001, the eitc amount is reported twice but I only want one of the observations to be used in finding the average.

This is my first time posting on statalist, so please let know if there is any additional information I can provide!

2019 Revised Supplemental Poverty Measure Research File

https://www.census.gov

These research files enable researchers to replicate the SPM estimates describe in SEHSD Working Paper #2021-17.
Tags: data, panel, panel data, Suggestion, survey
Andrew Musau

Join Date: Oct 2014

Posts: 9958
#2

04 Apr 2022, 04:15

You can drop perfect duplicates using

Code:

duplicates drop *, force

In case these are not perfect duplicates and you need to retain all observations in the dataset, see the -tag()- function of egen.

Code:

help egen

//Below are the commands that I used to try to find the averages.
sort spm_id
by spm_id: egen unit_eitc = total(spm_eitc)
by spm_id: egen unit_eitc_avg = total(spm_eitc/_N)
by spm_id: egen unit_fedtax = total(spm_fedtaxbc)
by egen unit_fedtax_average = total(spm_fedtaxbc/_N)

Look at the -mean()- function of egen for calculating averages.
Comment
Tiana Stussie

Join Date: Apr 2022

Posts: 4
#3

04 Apr 2022, 09:25

Hi Andrew! Thank you for your help! I used the commands that you provided, but I do not think I am getting the right averages. I think the issue I am running into now is I need to find the average of a variable that satisfies multiple conditions. For example, I want the sample to be single mother households which I believe would require the head of the household (spm_head) to be 1 (this indicates that the individual is the head of the household for that spm_id) and female(spm_famtype=female reference person), the number of adults in the household to be 1(spm_numadults=1), the the age of the individual to be greater than 18, and the eitc amount to be greater than 0 to indicate that the individual is an eitc recipient. I think I am messing up in making this variable. I was wondering if you would by any chance know if there would be a simpler way to create this variable? I have included my commands, output, and variables below for your reference

Commands:
gen single_pop = (spm_famtype==4 & spm_hage>=18 & spm_hmaritalstatus>3 & spm_head==1)
replace single_pop=. if spm_famtype==.
replace single_pop=. if spm_hage==.
replace single_pop=. if spm_hmaritalstatus==.
replace single_pop=. if spm_head==.

**I did the following commands to find the average eitc received and federal tax liability for single-mother households depending on number of children**
drop if single_pop==0
egen eitc_index = tag(spm_eitc) if spm_eitc>0 & single_pop==1
egen fedtaxbc_index= tag(spm_fedtaxbc) if spm_eitc>0 & single_pop==1
sort spm_id, stable
by spm_id: egen avg_eitc = mean(spm_eitc) if spm_eitc>0 & single_pop==1 & eitc_index==1
by spm_id: egen avg_fedtaxbc = mean(spm_fedtaxbc) if spm_eitc>0 & single_pop==1 & fedtaxbc_index==1

Data Dictionary:
spm_id = SPM Unit identification number
spm_numadults = SPM unit's number of adults
spm_fedtaxbc = SPM unit's Federal tax before refundable tax credits
spm_eitc = SPM unit's Federal Earned Income Tax Credit
spm_hage = head of spm unit's age
spm_hmaritalstatus=head of spm unit's marital status

spm_famtype= spm unit's family type
Married couple family

Cohabiting partner

Male reference person

Female reference person

unrelated individuals

spm_head=indicator for head of spm resource unit

Code:

* Example generated by -dataex-. For more info, type help dataex clear input double spm_id float single_pop byte eitc_index float(avg_fedtaxbc avg_eitc) double(spm_eitc spm_fedtaxbc spm_numkids) 18001 1 0 . . 0 0 0 82001 1 0 . . 0 3304 0 328001 1 0 . . 0 0 0 390001 1 0 . . 0 3608 2 393001 1 0 . . 0 8760 1 411001 1 0 . . 0 6644 0 630001 1 1 0 529 529 0 0 690001 1 0 . . 0 6876 1 764001 1 1 1233 958 958 1233 0 840001 1 0 . . 0 16565 0 937001 1 1 . 1824 1824 0 3 964001 1 0 . . 0 0 0 971001 1 0 . . 0 3304 0 996001 1 0 . . 0 90824 2 1033001 1 0 . . 0 1342 0 1073001 1 0 . . 0 0 0 1077001 1 0 . . 0 2241 1 1111001 1 0 . . 0 12579 0 1125001 1 0 . . 0 7357 0 1152001 1 0 . . 0 43081 0 1201001 1 0 . . 0 20430 1 1301001 1 0 . . 0 16406 1 1342001 1 0 . . 0 990 1 1465001 1 0 . . 0 6907 0 1490001 1 0 . . 0 7582 1 1552001 1 0 . . 0 3926 0 1553001 1 0 . . 0 12810 0 1746001 1 1 . 990 990 0 2 1772001 1 0 . . 0 0 2 1790001 1 0 . . 0 3142 0 1794001 1 0 . . 0 4452 3 1795001 1 1 1903 1409 1409 1903 0 1918001 1 0 . . 0 780 0 1961001 1 0 . . 0 0 0 2023001 1 0 . . 0 0 1 2065001 1 0 . . 0 811 1 2091001 1 0 . . 0 2243 1 2142001 1 0 . . 0 280 0 2195001 1 0 . . 0 0 2 2253001 1 0 . . 0 493 2 2270001 1 1 . 632 632 0 1 2351001 1 0 . . 0 954 1 2516001 1 0 . . 0 280 2 2558001 1 0 . . 0 2121 1 2743001 1 0 . . 0 0 0 2945001 1 1 . 1032 1032 0 1 2962001 1 1 1766 104 104 1766 1 2987001 1 0 . . 0 3502 0 3075001 1 1 . 5299 5299 0 3 3090001 1 1 575 76 76 575 0 3093001 1 0 . . 0 5085 0 3273001 1 1 63636 1833 1833 63636 2

Thanks again!
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 9958
#4

04 Apr 2022, 09:59

Originally posted by Tiana Stussie View Post

For example, I want the sample to be single mother households which I believe would require the head of the household (spm_head) to be 1 (this indicates that the individual is the head of the household for that spm_id) and female(spm_famtype=female reference person), the number of adults in the household to be 1(spm_numadults=1), the the age of the individual to be greater than 18, and the eitc amount to be greater than 0 to indicate that the individual is an eitc recipient. I think I am messing up in making this variable. I was wondering if you would by any chance know if there would be a simpler way to create this variable? I have included my commands, output, and variables below for your reference

From your description:

Code:

gen sample = spm_famtype==4 & spm_numadults==1 & age>18 & !missing(age) & spm_eitc>0 & !missing(spm_eitc) bys spm_id: egen tokeep = max(sample) keep if tokeep

Your data example does not include all relevant variables, so if this does not give you what you want, copy and paste the result of

Code:

dataex spm_id spm_famtype spm_numadults age spm_eitc nkids

where "age" and "nkids" are the variables indicating the age of the household member and number of kids in the household. Now, single mother households are by definition households with one mother. Therefore, the eitc value is the average value for that household, unless these are also paid out to other family members. If the former and you want to calculate the the eitc value per kid in the household:

Code:

egen wanted=max(cond(spm_famtype==4 & nkids>0, spm_eitc/ nkids, .))

Last edited by Andrew Musau; 04 Apr 2022, 10:09.
Comment
Tiana Stussie

Join Date: Apr 2022

Posts: 4
#5

04 Apr 2022, 11:48

Hi Andrew,
Thanks for your continued assistance! I greatly appreciate it!

I tried the code you provided. However the new 'sample' and 'keep' variables did not calculate the average eitc for single mother households by number of children. I provided more context below to better clarify my question. But please let me know if it doesn't makes sense and I can try to better explain my question.

Data Source - The 2019 Revised Supplemental Research File
here you can find the data dictionary and .dta file that I am using

This data set reports the poverty threshold on the individual level and groups individuals into "units" by the 'spm_id' identifier. It factors in government programs and consumer expenditures. So it provides an estimate of each "units" EITC (should they qualify) and federal tax liability. However this estimate for the whole unit is duplicated for each individual person. I believe this duplication is causing issues in my calculations, since the EITC is based on a family unit and not the individual

I am trying to answer the questions of: on average, what is the EITC credit received by single-parent households with 0 children, 1 child, 2 children, and 3 children? and on average, what is the federal tax liability of single-parent households with 0 children, 1 child, 2 children, and 3 children? (I believe that my research would be more robust if I found the average of single-mother households as that would match better with existing EITC literature, but my sample does not necessarily have to be that specific). I have pasted a simplified version of the data below:

spm_id spm_numkids spm_eitc spm_fedtaxbc

001 2 1000 800

001 0 1000 800

001 0 1000 800

002 1 500 400

002 0 500 400

003 0 300 500

003 0 300 500

Where spm_id 001 would refer to a household with 1 parent and 2 kids, spm_id 002 would refer to a household with 1 parent and 1 kid, spm_id 003 would refer to a household with 2 cohabiting or married individuals with 0 kids. So for spm_id 001 the household receives 1000 dollars in eitc and pays 800 dollars in federal taxes. So I want to find the average eitc and average federaltax received by households with 1 parent and nkids. So for example, in my study I would not be interested in spm_id 003 since the household has 2 adults.

I have also listed a few variables below that I think may be useful (full list can be found in data dictionary)
- spm_id (SPM unit identification number)
- spm_numper (SPM unit's number of persons)
- spm_numadults (SPM unit's number of adults)
- spm_numkids (SPM unit's number of children)
- spm_fedtax (SPM unit's federal tax after federal tax credits)
-spm_fedtaxbc (SPM unit's federal tax before tax credits)
- spm_hage (Head of SPM unit's age)
- spm_wcohabit (SPM unit has a cohabiting couple - 1 if has cohabiting couple, 0 if no)
- spm_hmaritalstatus (head of spm unit's marital status)
- spm_famtype (spm unit's family type)
- spm_head (indicator for head of spm resource unit)

I have also included the result from the dataex code you provided:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input double(spm_id spm_famtype spm_numadults spm_hage spm_eitc spm_numkids) 937001 4 1 35 1824 3 937001 4 1 35 1824 3 937001 4 1 35 1824 3 937001 4 1 35 1824 3 1746001 4 1 44 990 2 1746001 4 1 44 990 2 1746001 4 1 44 990 2 2270001 4 1 48 632 1 2270001 4 1 48 632 1 3075001 4 1 38 5299 3 3075001 4 1 38 5299 3 3075001 4 1 38 5299 3 3075001 4 1 38 5299 3

Again, I cannot express enough gratitude for your assistance with this question!

2019 Revised Supplemental Poverty Measure Research File

https://www.census.gov

These research files enable researchers to replicate the SPM estimates describe in SEHSD Working Paper #2021-17.

Last edited by Tiana Stussie; 04 Apr 2022, 12:07.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9958

04 Apr 2022, 12:27

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input double(spm_id spm_famtype spm_numadults spm_hage spm_eitc spm_numkids)
  937001 4 1 35 1824 3
  937001 4 1 35 1824 3
  937001 4 1 35 1824 3
  937001 4 1 35 1824 3
 1746001 4 1 44  990 2
 1746001 4 1 44  990 2
 1746001 4 1 44  990 2
 2270001 4 1 48  632 1
 2270001 4 1 48  632 1
 3075001 4 1 38 5299 3
 3075001 4 1 38 5299 3
 3075001 4 1 38 5299 3
 3075001 4 1 38 5299 3
end

gen sample = spm_famtype==4 & spm_numadults==1 & spm_hage>18 & !missing(spm_hage) & spm_eitc>0 & !missing(spm_eitc)
egen tag= tag(spm_id)
forval i=0/4{
    bys spm_numkids: egen avg_eitc_`i'= mean(cond(tag & spm_numkids==`i', spm_eitc, .))
}
sort spm_numkids spm_id
l spm_id spm_numkids spm_eitc tag avg_eitc*, sepby(spm_numkids)

Same procedure for the federal tax liability. For households with 3 kids, the average for the 2 households is obtained as (5299+1824)/2 = 3561.5.

Res.:

Code:

. l spm_id spm_numkids spm_eitc tag avg_eitc*, sepby(spm_numkids)

     +--------------------------------------------------------------------------------------------+
     |  spm_id   spm_n~ds   spm_eitc   tag   avg_ei~0   avg_ei~1   avg_ei~2   avg_ei~3   avg_ei~4 |
     |--------------------------------------------------------------------------------------------|
  1. | 2270001          1        632     0          .        632          .          .          . |
  2. | 2270001          1        632     1          .        632          .          .          . |
     |--------------------------------------------------------------------------------------------|
  3. | 1746001          2        990     1          .          .        990          .          . |
  4. | 1746001          2        990     0          .          .        990          .          . |
  5. | 1746001          2        990     0          .          .        990          .          . |
     |--------------------------------------------------------------------------------------------|
  6. |  937001          3       1824     0          .          .          .     3561.5          . |
  7. |  937001          3       1824     0          .          .          .     3561.5          . |
  8. |  937001          3       1824     1          .          .          .     3561.5          . |
  9. |  937001          3       1824     0          .          .          .     3561.5          . |
 10. | 3075001          3       5299     0          .          .          .     3561.5          . |
 11. | 3075001          3       5299     1          .          .          .     3561.5          . |
 12. | 3075001          3       5299     0          .          .          .     3561.5          . |
 13. | 3075001          3       5299     0          .          .          .     3561.5          . |
     +--------------------------------------------------------------------------------------------+

Better and more directly:

Code:

bys spm_numkids: egen avg_eitc= mean(cond(tag, spm_eitc, .))
tabstat avg_eitc, by(spm_numkid)

Res.:

Code:

. tabstat avg_eitc, by(spm_numkid)

Summary for variables: avg_eitc
     by categories of: spm_numkids 

spm_numkids |      mean
------------+----------
          1 |       632
          2 |       990
          3 |    3561.5
------------+----------
      Total |  2517.385
-----------------------

Last edited by Andrew Musau; 04 Apr 2022, 12:33.

Comment

Tiana Stussie

Join Date: Apr 2022

Posts: 4
#7

04 Apr 2022, 22:10

Hi Andrew,
I apologize for the delayed response and I appreciate all of your help! I put the command you provided into stata but the results for average eitc were significantly lower then expected and the results for federal tax were significantly higher than expected. There might just be something that I am missing in the data set though, so I can just go over the dataset again.

Best,
Tiana Stussie
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9958

05 Apr 2022, 06:07

The mean is sensitive to the number of observations in a group. If you have few observations in some groups, you may get unexpected results. Here is a way to count them:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input double(spm_id spm_famtype spm_numadults spm_hage spm_eitc spm_numkids)
 937001 4 1 35 1824 3
 937001 4 1 35 1824 3
 937001 4 1 35 1824 3
 937001 4 1 35 1824 3
1746001 4 1 44  990 2
1746001 4 1 44  990 2
1746001 4 1 44  990 2
2270001 4 1 48  632 1
2270001 4 1 48  632 1
3075001 4 1 38 5299 3
3075001 4 1 38 5299 3
3075001 4 1 38 5299 3
3075001 4 1 38 5299 3
end

gen sample = spm_famtype==4 & spm_numadults==1 & spm_hage>18 & !missing(spm_hage) & spm_eitc>0 & !missing(spm_eitc)
egen tag= tag(spm_id)
bys spm_numkids: egen avg_eitc= mean(cond(tag, spm_eitc, .))
bys spm_numkids: egen count= total(tag)
tabstat avg_eitc count, by(spm_numkid)

Res.:

Code:

. tabstat avg_eitc count, by(spm_numkid)

Summary statistics: mean
  by categories of: spm_numkids

spm_numkids |  avg_eitc     count
------------+--------------------
          1 |       632         1
          2 |       990         1
          3 |    3561.5         2
------------+--------------------
      Total |  2517.385  1.615385
---------------------------------

I do not think that the issue is with the calculation of the mean, but do post a data sample if you find this to be the case. If there are too few observations in the tails, you can combine categories, e.g., "greater than or equal to 4 kids" instead of "4 kids, 5 kids, ...".

spm_id	spm_numkids	spm_eitc	spm_fedtaxbc
001	2	1000	800
001	0	1000	800
001	0	1000	800
002	1	500	400
002	0	500	400
003	0	300	500
003	0	300	500

Announcement