Calculating attrition rate

Rose Simmons

Join Date: Feb 2017
Posts: 114

Calculating attrition rate

06 Apr 2017, 14:48

Hi,

I have an unbalanced panel dataset (N=2976, T=13), using survey responses.
My dependent variable is the household's ability to save (saving=1 if able to save, 0 otherwise).
hhid is the Household's unique identifier, and the data is yearly.

Code:

. xtset hhid year
       panel variable:  hhid (unbalanced)
        time variable:  year, 2004 to 2016, but with gaps
                delta:  1 unit

.
. xtdes

    hhid:  6, 21, ..., 89972                                 n =       2976
    year:  2004, 2005, ..., 2016                             T =         13
           Delta(year) = 1 unit
           Span(year)  = 13 periods
           (hhid*year uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         1       1       1         3         7      13      13

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+---------------
      224      7.53    7.53 |  ...........11
      206      6.92   14.45 |  1111111111111
      166      5.58   20.03 |  ............1
      145      4.87   24.90 |  1............
       94      3.16   28.06 |  ..........111
       81      2.72   30.78 |  ..........1..
       77      2.59   33.37 |  .1...........
       72      2.42   35.79 |  11...........
       62      2.08   37.87 |  ......1......
     1849     62.13  100.00 | (other patterns)
 ---------------------------+---------------
     2976    100.00         |  XXXXXXXXXXXXX

I would like to calculate the attrition rate.
Please could you help me to calculate the Frequencies in the table below?
With Frequency, I wanted to show how many of the respondents who were first interviewed in 2004 were then interviewed in 2016.
The same for 2005, 2006,...,2015.

I wanted to create a table as such:

Code:

Year of 1st Interview | Frequency | Percent | Cumulative Percent
            2004      | ---       | ---     | ---  
            2005      | ---       | ---     | ---  
            2006      | ---       | ---     | ---  
            .......   | ---       | ---     | ---  
            2015      | ---       | ---     | 100.0
          Total         ---                   100.0

Please let me know if further clarification is required.

Thanks in advance

Last edited by Rose Simmons; 06 Apr 2017, 14:53.

Tags: None

Rose Simmons

Join Date: Feb 2017

Posts: 114
#2

10 Apr 2017, 03:53

Hi,

I tried:

Code:

. gen yr2004=0 . replace yr2004=1 if year==2004 (952 real changes made) . gen yr2016=0 . . replace yr2016=1 if year==2016 (1,245 real changes made) . tab hhid if yr2004==1 & yr2016==1 no observations

I was hoping to use this to see which individuals that were surveyed in 2004 (i.e. yr2004=1) were also surveyed in 2016.

Looking at a part of the dataset, I see why the above didn't return what I had hoped:

Code:

hhid year yr2004 yr2016 6 2004 1 0 6 2005 0 0 6 2006 0 0 6 2007 0 0 6 2008 0 0 6 2009 0 0 6 2010 0 0 6 2011 0 0 6 2012 0 0 6 2013 0 0 6 2014 0 0 6 2015 0 0 6 2016 0 1

Could someone please help me to calculate the frequency?
I would like to create the frequency table in #1 first to explain my data, and then I will calculate the attrition rate afterwards - apologies if this was unclear

Last edited by Rose Simmons; 10 Apr 2017, 04:35.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#3

10 Apr 2017, 08:28

I gather the goal is to calcualte the proportion of hhid's surveyed in 2004 that were also surveyed in 2016.

Code:

by hhid, sort: egen in_2004 = max(year == 2004) by hhid: egen in_2016 = max(year == 2016) egen flag = tag(hhid) // SELECT ONE OBSERVATION PER hhid tab in_2016 if in_2004 & flag

In general, when working with panel data, when you need to create a variable that represents whether a panel unit has any observation that meets a certain condition, the code is -by panel, sort: egen ever_meets_condition = max(condition)-.

The additional complication here is that every observation of in_2004 and in_2016 will be set to the same value. When you ultimately want to tabulate, you do not want to count each hhid in proportion to the number of times they participated in the survey--you want to count each only once. That-s what the -egen, tag()- command is for: it identifies a single observation from each hhid. The -tab- command then counts only the one identified observation in each hhid.

Last edited by Clyde Schechter; 10 Apr 2017, 08:31. Reason: Correct error in code.
Comment

Rose Simmons

Join Date: Feb 2017
Posts: 114

10 Apr 2017, 09:57

Thank you Clyde Schechter for your help, particularly the -egen, tag()- command, as I was unsure of how to separate the observations by hhid.

In general, when working with panel data, when you need to create a variable that represents whether a panel unit has any observation that meets a certain condition, the code is -by panel, sort: egen ever_meets_condition = max(condition)-.

My understanding of the following code is that it creates a dummy variable in_2004 (which is 1 if there is an observation in 2004, and 0 otherwise).
Q1: What does the "max" do?

Code:

by hhid, sort: egen in_2004 = max(year == 2004)

Q2: Does the following suggest that, of the 1,245 hhid's surveyed in 2016, 977 were surveyed in 2006 (i.e. proportion 78.47%)?

Code:

 by hhid, sort: egen in_2004 = max(year == 2004)

. 
. by hhid, sort: egen in_2005 = max(year == 2005)

. 
. by hhid, sort: egen in_2006 = max(year == 2006)

. 
. by hhid: egen in_2016 = max(year == 2016)

. 
. egen flag = tag(hhid) 

. 
. tab in_2016 if in_2004 & flag

    in_2016 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        622       65.34       65.34
          1 |        330       34.66      100.00
------------+-----------------------------------
      Total |        952      100.00

. 
. tab in_2016 if in_2005 & flag

    in_2016 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        669       64.39       64.39
          1 |        370       35.61      100.00
------------+-----------------------------------
      Total |      1,039      100.00

. 
. tab in_2016 if in_2006 & flag

    in_2016 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        589       60.29       60.29
          1 |        388       39.71      100.00
------------+-----------------------------------
      Total |        977      100.00

. tab year if year==2016

    year of |
     survey |      Freq.     Percent        Cum.
------------+-----------------------------------
       2016 |      1,245      100.00      100.00
------------+-----------------------------------
      Total |      1,245      100.00

. di 977/1245
.78473896

Many thanks

Comment

Rose Simmons

Join Date: Feb 2017
Posts: 114

10 Apr 2017, 10:02

To confirm, yes, the goal is to calculate the proportion of hhid's surveyed in 2004, 2005, 2006 etc that were also surveyed in 2016.

To follow up on Q2, the following suggests there were 977 hhid's surveyed in 2006:

Code:

. tab year if year==2006

    year of |
     survey |      Freq.     Percent        Cum.
------------+-----------------------------------
       2006 |        977      100.00      100.00
------------+-----------------------------------
      Total |        977      100.00

This 977 figure is that same as:

Code:

. tab in_2016 if in_2006 & flag

    in_2016 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        589       60.29       60.29
          1 |        388       39.71      100.00
------------+-----------------------------------
      Total |        977      100.00

This suggests that all those who were surveyed in 2006 were still surveyed in 2016 - it is surprising that none seem to have dropped out

Thanks

Last edited by Rose Simmons; 10 Apr 2017, 10:05.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#6

10 Apr 2017, 10:11

Q1: What does the "max" do?

Boolean expressions in Stata are evaluated as 0 if false and 1 if true. So for each observation, the expression condition will be variously zero or one accordingly. If any of the observations satisfies the condition, there will be a 1 in that observation. Any observations for which it is false will have zero. The maximum value of the expression will therefore be 1. On the other hand, if the condition is never satisfied, the expression will always evaluate to zero, and the maximum value is therefore also 0.

Q2: Does the following suggest that, of the 1,245 hhid's surveyed in 2016, 977 were surveyed in 2006 (i.e. proportion 78.47%)?

No. If you want the number of hhid's surveyed in 2016 that were also surveyed in 2006 the code would be

Code:

tab in_2006 if in_2016 & flag

and examine the row of the output table where in_2006 is 1. There you will find the N and percent.
1 like
Comment
Rose Simmons

Join Date: Feb 2017

Posts: 114
#7

10 Apr 2017, 10:15

Q1: I understand the use of -max- now
Q2: I have corrected my mistake now

Thank you very much
Comment

Announcement

Calculating attrition rate

Comment

Comment

Comment

Comment

Comment

Comment