Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • individual tax data grouped by household question


    Hi all, my name is Tiana Stussie. I am currently completing my undergraduate research project. For my research project, I am focusing on the differences of the EITC between childless workers and households with children. For my sample, I am pulling data from 2019 Revised Supplemental Poverty Measure 2019 and I am using Stata 17.

    Currently I am trying to find the average EITC received by single mother households by number of children (average EITC for households with 0,1,2,3 kids) as well as the average federal tax liability for single mother households. I am having difficulty because the EITC and federal tax liability are grouped by spm_id but are repeated for each individual observation. I am trying to find a way to not have these duplicates in my calculations for the average. I have included the variables I have been using below and the stata commands that I have tried.

    Commands:
    //Below are the commands that I used to create my sample. I wanted to focus on households that had a single parent who was female. So I chose the family type to be female reference person (4), the age of head of the household to be 18, and the marital status to be everything that excludes married.

    gen single_pop = (spm_famtype==4 & spm_hage>=18 & spm_hmaritalstatus>3)
    replace single_pop =. if spm_famtype==.
    replace single_pop =. if spm_hage==.
    replace single_pop =. if spm_hmaritalstatus==.
    drop if single_pop == 0

    //Below are the commands that I used to try to find the averages.
    sort spm_id
    by spm_id: egen unit_eitc = total(spm_eitc)
    by spm_id: egen unit_eitc_avg = total(spm_eitc/_N)
    by spm_id: egen unit_fedtax = total(spm_fedtaxbc)
    by egen unit_fedtax_average = total(spm_fedtaxbc/_N)



    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double spm_id float(unit_eitc unit_eitc_avg) double spm_eitc float single_pop double(spm_hmaritalstatus spm_hage spm_famtype)
      18001    0    0    0 1 5 65 4
      18001    0    0    0 1 5 65 4
      82001    0    0    0 1 7 72 4
      82001    0    0    0 1 7 72 4
     328001    0    0    0 1 7 37 4
     328001    0    0    0 1 7 37 4
     390001    0    0    0 1 5 49 4
     390001    0    0    0 1 5 49 4
     390001    0    0    0 1 5 49 4
     393001    0    0    0 1 7 41 4
     393001    0    0    0 1 7 41 4
     411001    0    0    0 1 4 80 4
     411001    0    0    0 1 4 80 4
     630001 1058  529  529 1 7 45 4
     630001 1058  529  529 1 7 45 4
     690001    0    0    0 1 7 42 4
     690001    0    0    0 1 7 42 4
     764001 1916  958  958 1 7 19 4
     764001 1916  958  958 1 7 19 4
     840001    0    0    0 1 5 61 4
     840001    0    0    0 1 5 61 4
     937001 7296 1824 1824 1 6 35 4
     937001 7296 1824 1824 1 6 35 4
     937001 7296 1824 1824 1 6 35 4
     937001 7296 1824 1824 1 6 35 4
    For example for spm_id 63001, the eitc amount is reported twice but I only want one of the observations to be used in finding the average.

    This is my first time posting on statalist, so please let know if there is any additional information I can provide!
    These research files enable researchers to replicate the SPM estimates describe in SEHSD Working Paper #2021-17.

  • #2
    You can drop perfect duplicates using

    Code:
    duplicates drop *, force
    In case these are not perfect duplicates and you need to retain all observations in the dataset, see the -tag()- function of egen.

    Code:
    help egen
    //Below are the commands that I used to try to find the averages.
    sort spm_id
    by spm_id: egen unit_eitc = total(spm_eitc)
    by spm_id: egen unit_eitc_avg = total(spm_eitc/_N)
    by spm_id: egen unit_fedtax = total(spm_fedtaxbc)
    by egen unit_fedtax_average = total(spm_fedtaxbc/_N)
    Look at the -mean()- function of egen for calculating averages.

    Comment


    • #3
      Hi Andrew! Thank you for your help! I used the commands that you provided, but I do not think I am getting the right averages. I think the issue I am running into now is I need to find the average of a variable that satisfies multiple conditions. For example, I want the sample to be single mother households which I believe would require the head of the household (spm_head) to be 1 (this indicates that the individual is the head of the household for that spm_id) and female(spm_famtype=female reference person), the number of adults in the household to be 1(spm_numadults=1), the the age of the individual to be greater than 18, and the eitc amount to be greater than 0 to indicate that the individual is an eitc recipient. I think I am messing up in making this variable. I was wondering if you would by any chance know if there would be a simpler way to create this variable? I have included my commands, output, and variables below for your reference


      Commands:
      gen single_pop = (spm_famtype==4 & spm_hage>=18 & spm_hmaritalstatus>3 & spm_head==1)
      replace single_pop=. if spm_famtype==.
      replace single_pop=. if spm_hage==.
      replace single_pop=. if spm_hmaritalstatus==.
      replace single_pop=. if spm_head==.

      **I did the following commands to find the average eitc received and federal tax liability for single-mother households depending on number of children**
      drop if single_pop==0
      egen eitc_index = tag(spm_eitc) if spm_eitc>0 & single_pop==1
      egen fedtaxbc_index= tag(spm_fedtaxbc) if spm_eitc>0 & single_pop==1
      sort spm_id, stable
      by spm_id: egen avg_eitc = mean(spm_eitc) if spm_eitc>0 & single_pop==1 & eitc_index==1
      by spm_id: egen avg_fedtaxbc = mean(spm_fedtaxbc) if spm_eitc>0 & single_pop==1 & fedtaxbc_index==1

      Data Dictionary:
      spm_id = SPM Unit identification number
      spm_numadults = SPM unit's number of adults
      spm_fedtaxbc = SPM unit's Federal tax before refundable tax credits
      spm_eitc = SPM unit's Federal Earned Income Tax Credit
      spm_hage = head of spm unit's age
      spm_hmaritalstatus=head of spm unit's marital status

      spm_famtype= spm unit's family type
      1. Married couple family
      2. Cohabiting partner
      3. Male reference person
      4. Female reference person
      5. unrelated individuals
      spm_head=indicator for head of spm resource unit



      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input double spm_id float single_pop byte eitc_index float(avg_fedtaxbc avg_eitc) double(spm_eitc spm_fedtaxbc spm_numkids)
        18001 1 0     .    .    0     0 0
        82001 1 0     .    .    0  3304 0
       328001 1 0     .    .    0     0 0
       390001 1 0     .    .    0  3608 2
       393001 1 0     .    .    0  8760 1
       411001 1 0     .    .    0  6644 0
       630001 1 1     0  529  529     0 0
       690001 1 0     .    .    0  6876 1
       764001 1 1  1233  958  958  1233 0
       840001 1 0     .    .    0 16565 0
       937001 1 1     . 1824 1824     0 3
       964001 1 0     .    .    0     0 0
       971001 1 0     .    .    0  3304 0
       996001 1 0     .    .    0 90824 2
      1033001 1 0     .    .    0  1342 0
      1073001 1 0     .    .    0     0 0
      1077001 1 0     .    .    0  2241 1
      1111001 1 0     .    .    0 12579 0
      1125001 1 0     .    .    0  7357 0
      1152001 1 0     .    .    0 43081 0
      1201001 1 0     .    .    0 20430 1
      1301001 1 0     .    .    0 16406 1
      1342001 1 0     .    .    0   990 1
      1465001 1 0     .    .    0  6907 0
      1490001 1 0     .    .    0  7582 1
      1552001 1 0     .    .    0  3926 0
      1553001 1 0     .    .    0 12810 0
      1746001 1 1     .  990  990     0 2
      1772001 1 0     .    .    0     0 2
      1790001 1 0     .    .    0  3142 0
      1794001 1 0     .    .    0  4452 3
      1795001 1 1  1903 1409 1409  1903 0
      1918001 1 0     .    .    0   780 0
      1961001 1 0     .    .    0     0 0
      2023001 1 0     .    .    0     0 1
      2065001 1 0     .    .    0   811 1
      2091001 1 0     .    .    0  2243 1
      2142001 1 0     .    .    0   280 0
      2195001 1 0     .    .    0     0 2
      2253001 1 0     .    .    0   493 2
      2270001 1 1     .  632  632     0 1
      2351001 1 0     .    .    0   954 1
      2516001 1 0     .    .    0   280 2
      2558001 1 0     .    .    0  2121 1
      2743001 1 0     .    .    0     0 0
      2945001 1 1     . 1032 1032     0 1
      2962001 1 1  1766  104  104  1766 1
      2987001 1 0     .    .    0  3502 0
      3075001 1 1     . 5299 5299     0 3
      3090001 1 1   575   76   76   575 0
      3093001 1 0     .    .    0  5085 0
      3273001 1 1 63636 1833 1833 63636 2

      Thanks again!

      Comment


      • #4
        Originally posted by Tiana Stussie View Post
        For example, I want the sample to be single mother households which I believe would require the head of the household (spm_head) to be 1 (this indicates that the individual is the head of the household for that spm_id) and female(spm_famtype=female reference person), the number of adults in the household to be 1(spm_numadults=1), the the age of the individual to be greater than 18, and the eitc amount to be greater than 0 to indicate that the individual is an eitc recipient. I think I am messing up in making this variable. I was wondering if you would by any chance know if there would be a simpler way to create this variable? I have included my commands, output, and variables below for your reference
        From your description:

        Code:
        gen sample = spm_famtype==4 & spm_numadults==1 & age>18 & !missing(age) & spm_eitc>0 &  !missing(spm_eitc)
        bys spm_id: egen tokeep = max(sample)
        keep if tokeep
        Your data example does not include all relevant variables, so if this does not give you what you want, copy and paste the result of

        Code:
        dataex spm_id spm_famtype spm_numadults age spm_eitc nkids

        where "age" and "nkids" are the variables indicating the age of the household member and number of kids in the household. Now, single mother households are by definition households with one mother. Therefore, the eitc value is the average value for that household, unless these are also paid out to other family members. If the former and you want to calculate the the eitc value per kid in the household:

        Code:
        egen wanted=max(cond(spm_famtype==4 & nkids>0, spm_eitc/ nkids, .))
        Last edited by Andrew Musau; 04 Apr 2022, 10:09.

        Comment


        • #5
          Hi Andrew,
          Thanks for your continued assistance! I greatly appreciate it!

          I tried the code you provided. However the new 'sample' and 'keep' variables did not calculate the average eitc for single mother households by number of children. I provided more context below to better clarify my question. But please let me know if it doesn't makes sense and I can try to better explain my question.

          Data Source - The 2019 Revised Supplemental Research File
          • here you can find the data dictionary and .dta file that I am using
          This data set reports the poverty threshold on the individual level and groups individuals into "units" by the 'spm_id' identifier. It factors in government programs and consumer expenditures. So it provides an estimate of each "units" EITC (should they qualify) and federal tax liability. However this estimate for the whole unit is duplicated for each individual person. I believe this duplication is causing issues in my calculations, since the EITC is based on a family unit and not the individual

          I am trying to answer the questions of: on average, what is the EITC credit received by single-parent households with 0 children, 1 child, 2 children, and 3 children? and on average, what is the federal tax liability of single-parent households with 0 children, 1 child, 2 children, and 3 children? (I believe that my research would be more robust if I found the average of single-mother households as that would match better with existing EITC literature, but my sample does not necessarily have to be that specific). I have pasted a simplified version of the data below:
          spm_id spm_numkids spm_eitc spm_fedtaxbc
          001 2 1000 800
          001 0 1000 800
          001 0 1000 800
          002 1 500 400
          002 0 500 400
          003 0 300 500
          003 0 300 500
          Where spm_id 001 would refer to a household with 1 parent and 2 kids, spm_id 002 would refer to a household with 1 parent and 1 kid, spm_id 003 would refer to a household with 2 cohabiting or married individuals with 0 kids. So for spm_id 001 the household receives 1000 dollars in eitc and pays 800 dollars in federal taxes. So I want to find the average eitc and average federaltax received by households with 1 parent and nkids. So for example, in my study I would not be interested in spm_id 003 since the household has 2 adults.



          I have also listed a few variables below that I think may be useful (full list can be found in data dictionary)
          - spm_id (SPM unit identification number)
          - spm_numper (SPM unit's number of persons)
          - spm_numadults (SPM unit's number of adults)
          - spm_numkids (SPM unit's number of children)
          - spm_fedtax (SPM unit's federal tax after federal tax credits)
          -spm_fedtaxbc (SPM unit's federal tax before tax credits)
          - spm_hage (Head of SPM unit's age)
          - spm_wcohabit (SPM unit has a cohabiting couple - 1 if has cohabiting couple, 0 if no)
          - spm_hmaritalstatus (head of spm unit's marital status)
          - spm_famtype (spm unit's family type)
          - spm_head (indicator for head of spm resource unit)


          I have also included the result from the dataex code you provided:
          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input double(spm_id spm_famtype spm_numadults spm_hage spm_eitc spm_numkids)
            937001 4 1 35 1824 3
            937001 4 1 35 1824 3
            937001 4 1 35 1824 3
            937001 4 1 35 1824 3
           1746001 4 1 44  990 2
           1746001 4 1 44  990 2
           1746001 4 1 44  990 2
           2270001 4 1 48  632 1
           2270001 4 1 48  632 1
           3075001 4 1 38 5299 3
           3075001 4 1 38 5299 3
           3075001 4 1 38 5299 3
           3075001 4 1 38 5299 3
          Again, I cannot express enough gratitude for your assistance with this question!
          These research files enable researchers to replicate the SPM estimates describe in SEHSD Working Paper #2021-17.
          Last edited by Tiana Stussie; 04 Apr 2022, 12:07.

          Comment


          • #6
            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input double(spm_id spm_famtype spm_numadults spm_hage spm_eitc spm_numkids)
              937001 4 1 35 1824 3
              937001 4 1 35 1824 3
              937001 4 1 35 1824 3
              937001 4 1 35 1824 3
             1746001 4 1 44  990 2
             1746001 4 1 44  990 2
             1746001 4 1 44  990 2
             2270001 4 1 48  632 1
             2270001 4 1 48  632 1
             3075001 4 1 38 5299 3
             3075001 4 1 38 5299 3
             3075001 4 1 38 5299 3
             3075001 4 1 38 5299 3
            end
            
            gen sample = spm_famtype==4 & spm_numadults==1 & spm_hage>18 & !missing(spm_hage) & spm_eitc>0 & !missing(spm_eitc)
            egen tag= tag(spm_id)
            forval i=0/4{
                bys spm_numkids: egen avg_eitc_`i'= mean(cond(tag & spm_numkids==`i', spm_eitc, .))
            }
            sort spm_numkids spm_id
            l spm_id spm_numkids spm_eitc tag avg_eitc*, sepby(spm_numkids)
            Same procedure for the federal tax liability. For households with 3 kids, the average for the 2 households is obtained as (5299+1824)/2 = 3561.5.

            Res.:

            Code:
            . l spm_id spm_numkids spm_eitc tag avg_eitc*, sepby(spm_numkids)
            
                 +--------------------------------------------------------------------------------------------+
                 |  spm_id   spm_n~ds   spm_eitc   tag   avg_ei~0   avg_ei~1   avg_ei~2   avg_ei~3   avg_ei~4 |
                 |--------------------------------------------------------------------------------------------|
              1. | 2270001          1        632     0          .        632          .          .          . |
              2. | 2270001          1        632     1          .        632          .          .          . |
                 |--------------------------------------------------------------------------------------------|
              3. | 1746001          2        990     1          .          .        990          .          . |
              4. | 1746001          2        990     0          .          .        990          .          . |
              5. | 1746001          2        990     0          .          .        990          .          . |
                 |--------------------------------------------------------------------------------------------|
              6. |  937001          3       1824     0          .          .          .     3561.5          . |
              7. |  937001          3       1824     0          .          .          .     3561.5          . |
              8. |  937001          3       1824     1          .          .          .     3561.5          . |
              9. |  937001          3       1824     0          .          .          .     3561.5          . |
             10. | 3075001          3       5299     0          .          .          .     3561.5          . |
             11. | 3075001          3       5299     1          .          .          .     3561.5          . |
             12. | 3075001          3       5299     0          .          .          .     3561.5          . |
             13. | 3075001          3       5299     0          .          .          .     3561.5          . |
                 +--------------------------------------------------------------------------------------------+
            Better and more directly:

            Code:
            bys spm_numkids: egen avg_eitc= mean(cond(tag, spm_eitc, .))
            tabstat avg_eitc, by(spm_numkid)
            Res.:

            Code:
            . tabstat avg_eitc, by(spm_numkid)
            
            Summary for variables: avg_eitc
                 by categories of: spm_numkids 
            
            spm_numkids |      mean
            ------------+----------
                      1 |       632
                      2 |       990
                      3 |    3561.5
            ------------+----------
                  Total |  2517.385
            -----------------------
            Last edited by Andrew Musau; 04 Apr 2022, 12:33.

            Comment


            • #7
              Hi Andrew,
              I apologize for the delayed response and I appreciate all of your help! I put the command you provided into stata but the results for average eitc were significantly lower then expected and the results for federal tax were significantly higher than expected. There might just be something that I am missing in the data set though, so I can just go over the dataset again.

              Best,
              Tiana Stussie

              Comment


              • #8
                The mean is sensitive to the number of observations in a group. If you have few observations in some groups, you may get unexpected results. Here is a way to count them:

                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input double(spm_id spm_famtype spm_numadults spm_hage spm_eitc spm_numkids)
                 937001 4 1 35 1824 3
                 937001 4 1 35 1824 3
                 937001 4 1 35 1824 3
                 937001 4 1 35 1824 3
                1746001 4 1 44  990 2
                1746001 4 1 44  990 2
                1746001 4 1 44  990 2
                2270001 4 1 48  632 1
                2270001 4 1 48  632 1
                3075001 4 1 38 5299 3
                3075001 4 1 38 5299 3
                3075001 4 1 38 5299 3
                3075001 4 1 38 5299 3
                end
                
                gen sample = spm_famtype==4 & spm_numadults==1 & spm_hage>18 & !missing(spm_hage) & spm_eitc>0 & !missing(spm_eitc)
                egen tag= tag(spm_id)
                bys spm_numkids: egen avg_eitc= mean(cond(tag, spm_eitc, .))
                bys spm_numkids: egen count= total(tag)
                tabstat avg_eitc count, by(spm_numkid)
                Res.:

                Code:
                . tabstat avg_eitc count, by(spm_numkid)
                
                Summary statistics: mean
                  by categories of: spm_numkids
                
                spm_numkids |  avg_eitc     count
                ------------+--------------------
                          1 |       632         1
                          2 |       990         1
                          3 |    3561.5         2
                ------------+--------------------
                      Total |  2517.385  1.615385
                ---------------------------------

                I do not think that the issue is with the calculation of the mean, but do post a data sample if you find this to be the case. If there are too few observations in the tails, you can combine categories, e.g., "greater than or equal to 4 kids" instead of "4 kids, 5 kids, ...".

                Comment

                Working...
                X