  • Which type of analysis fits better with this data structure?

    Hi all,

    I am trying to figure out how best to analyze the data structure shown below. Basically, I would like to find out whether a worker who acquires specific knowledge (the treatment) ends up in a better position in the labor market. As the data example shows, the treatment takes place in a given year, which differs across people. Assume that I have good control (those who do not take the course) and treatment groups. There are several observations per person, but they are not equally spaced (unlike panel data, where measurements are taken every year). Also, some observations have no end date (see the missing values).
    I know this is a very open question, but which type of analysis should be used here?

    1) Can I do xtset id_person occasion and then apply FE estimation even though the data is not equally spaced?
    xtreg Y i.treatment##c.time, fe

    2) Or should I treat the data as in a longitudinal/multilevel framework:
    mixed Y i.treatment##c.time || id_person:

    Bear in mind that the variable "time" is not in the data example; I included it here because other posts recommend it when the data have a panel structure and a DID is what you want to estimate.

    3) For assessing the causal impact of the treatment variable, a DID might be advisable; I think the empirical analysis should follow the two equations above. But I have read that the data need to be equally spaced (as in panel data). So maybe the solution is to aggregate the observations by year (build a yearly panel data set) and then apply the DID?

    4) Because some people could argue that the control group is not a good counterfactual (because of innate ability, for instance), I would like your opinion on comparing the treatment group before and after the treatment instead of comparing control and treatment groups. I think the counterfactual would be better this way, since it is the same subject, which controls for innate ability and other time-invariant characteristics. What do you think? And, if possible, is there an example you would advise me to follow to see how to do this in Stata?


    Thanks a lot for all your help and suggestions.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(id_person firm_id start_day start_month) int start_year byte(end_day end_month) int end_year byte(x1 x3 x4 x5 Treat) float occasion
    1  1 18 11 2005 18 11 2005 0 0 40 32 0  1
    1  1 16 12 2005 16 12 2005 0 0 20 32 0  2
    1  1 12  1 2006 12  1 2006 0 0 40 32 0  3
    1  1  1  2 2006  1  2 2006 0 0 35 32 0  4
    1  1 21  3 2006 21  3 2006 0 0 40 32 0  5
    1  1 17  1 2007 17  1 2007 0 0 40 32 0  6
    1  1  8  2 2007  8  2 2007 0 0 16 32 0  7
    1  1 14  2 2007 15  2 2007 0 0 40 32 0  8
    1  2 25  6 2008 24  9 2008 0 0 25 54 0  9
    1  3  2  7 2009  1 10 2009 0 1 16 33 1 10
    1  4 30  7 2009  .  .    . 0 0 40 55 1 11
    1  5 15  3 2010  .  .    . 0 0 40 32 1 12
    1  6 11  5 2010 31  8 2010 0 0 30 33 1 13
    1  7  2 11 2010 24 12 2010 0 0 40 23 1 14
    1  7 21  1 2011 23  4 2011 0 0 40 23 1 15
    1  7 26  4 2011  8  5 2011 0 0 35 33 1 16
    1  8  2  5 2011 30  9 2011 0 0 40 23 1 17
    2  9 31 12 2006  1  1 2007 0 1  5 11 0  1
    2 10 20  4 2007 22  4 2007 0 0 40 32 0  2
    2 10  5  5 2007  6  5 2007 0 0 40 80 0  3
    2 10 11  5 2007 27  5 2007 0 1 24 32 0  4
    2 11 30  4 2008 12 10 2008 0 0 40 33 0  5
    2 12 19 12 2008  .  .    . 0 1 20 32 0  6
    2 13  5  5 2009 13  9 2009 0 0 40 54 0  7
    2 14 10  5 2010 16  9 2010 0 0 20 54 0  8
    2 15  9  3 2011  8  9 2011 0 0 20 23 1  9
    3 16 28  7 2008  .  .    . 0 0 40 55 0  1
    3 17  8  3 2010  .  .    . 0 0 40 55 0  2
    3 17  1  4 2011  .  .    . 0 0 35 55 0  3
    3 18  1  1 2014  .  .    . 0 0 40 54 1  4
    3 19 14  1 2019 30  6 2019 0 1  4 55 1  5
    3 20  1  6 2019  .  .    . 1 0 10 54 1  6
    4 21  2  4 2000  5  5 2001 0 0 25 55 0  1
    4 21  4 10 2001  .  .    . 0 0 15 55 0  2
    4 21 20 11 2001  .  .    . 0 0 40 55 0  3
    4 22 12 12 2001 13 12 2001 0 1 30 33 0  4
    4 23 31  5 2002  .  .    . 1 0 40 48 0  5
    end

  • #2
    You didn't get a quick answer. You'll increase your chances of a useful answer by following the FAQ on asking questions - provide Stata code in code delimiters, readable Stata output, and sample data using dataex. You don't even tell us precisely what you ran.

    1) Can I do xtset id_person occasion and then apply FE estimation even though the data is not equally spaced?
    xtreg Y i.treatment##c.time, fe
    * You certainly can do this, although it is not nearly as neat as what we normally have with panel data. A difference-in-differences estimate basically compares the before-to-after change in the treated group with the before-to-after change in the untreated group. This doesn't really require that before and after be fixed at an explicit hour or year. It's often done with experimental data, where before and after simply mean before and after the treatment.
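A minimal sketch of the fixed-effects version on unequally spaced data, under stated assumptions: the data are simulated, and Y, ability, and treat_year are hypothetical names (only id_person, occasion, start_year, and Treat follow the example in #1). Note that xtreg, fe only uses the panel variable; the time variable in xtset matters for time-series operators such as lags, not for the within estimator. Year dummies stand in for the common time trend, which is one possible choice rather than the only one.

```stata
clear
set seed 2024
set obs 50                                   // 50 hypothetical workers
generate id_person  = _n
generate ability    = rnormal()              // time-invariant, absorbed by the FE
generate treat_year = cond(id_person <= 25, 2008 + mod(id_person, 3), .)

expand 6 + mod(id_person, 4)                 // 6 to 9 spells per worker
bysort id_person: generate occasion = _n

* Irregular spacing: each spell starts 1 or 2 years after the previous one
by id_person: generate start_year = 2003 + sum(1 + (runiform() < 0.3))

* Time-varying treatment indicator; always 0 for never-treated workers
generate byte Treat = !missing(treat_year) & start_year >= treat_year

* Simulated outcome with a true treatment effect of 2
generate Y = 10 + 2*Treat + ability + 0.3*(start_year - 2004) + rnormal()

xtset id_person occasion
xtreg Y i.Treat i.start_year, fe vce(cluster id_person)
```

The coefficient on 1.Treat is the within-person before/after change for treated workers, net of the year effects that the never-treated workers help identify.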

    2) Or should I treat the data as in a longitudinal/multilevel framework:
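    mixed Y i.treatment##c.time || id_person:
    * I think you will find that the multilevel model you are proposing is essentially the same as xtreg with random effects (the coefficient estimates coincide when xtreg is run with its mle option, since both are then maximum likelihood fits of the same random-intercept model). So there is no advantage either way, and I would do what is normally done in your literature.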
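To check the equivalence for yourself, here is a small sketch on simulated data (all variable values are made up; Y, treatment, and time are the placeholder names from the question). It fits the random-intercept model with mixed and with xtreg, mle and compares the interaction coefficient. The default xtreg, re uses GLS rather than maximum likelihood, so it would typically be close but not identical.

```stata
clear
set seed 4321
set obs 100                                  // 100 hypothetical persons
generate id_person = _n
generate u = rnormal()                       // person-level random intercept
expand 5                                     // 5 occasions per person
bysort id_person: generate time = _n
generate byte treatment = id_person <= 50
generate Y = 1 + 0.5*treatment + 0.2*time + 0.4*treatment*time + u + rnormal()

* Multilevel model with a random intercept for each person
mixed Y i.treatment##c.time || id_person:
scalar b_mixed = _b[1.treatment#c.time]

* Random-effects panel model estimated by maximum likelihood
xtset id_person
xtreg Y i.treatment##c.time, mle
scalar b_mle = _b[1.treatment#c.time]

display "mixed: " b_mixed "   xtreg, mle: " b_mle
```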


    3) For assessing the causal impact of the treatment variable, a DID might be advisable; I think the empirical analysis should follow the two equations above. But I have read that the data need to be equally spaced (as in panel data). So maybe the solution is to aggregate the observations by year (build a yearly panel data set) and then apply the DID?

    * If you really can build a complete panel data set, that is probably better than the incomplete data set you are analyzing.
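One way to build the yearly panel might look like the sketch below. It uses a few rows of id_person, start_year, x4, and Treat taken from the example in #1. Averaging x4 within a year and flagging a year as treated if any spell in it is treated are assumptions about how to aggregate; n_spells is a hypothetical helper that records how many spells went into each worker-year.

```stata
clear
input byte id_person int start_year byte(x4 Treat)
1 2005 40 0
1 2005 20 0
1 2006 40 0
1 2006 35 0
1 2006 40 0
1 2009 16 1
1 2009 40 1
2 2006  5 0
2 2007 40 0
2 2007 40 0
2 2007 24 0
2 2011 20 1
end

* One row per worker-year: mean of the spell-level covariate, treated if
* any spell in that year is treated, plus the number of spells aggregated
collapse (mean) x4 (max) Treat (count) n_spells = x4, by(id_person start_year)

* The result is a yearly panel with gaps (unbalanced), which xtset accepts
xtset id_person start_year
xtdescribe
list, sepby(id_person)
```

Keeping n_spells makes the information lost in the aggregation visible, and it could serve as a weight if that turns out to matter.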

    4) Because some people could argue that the control group is not a good counterfactual (because of innate ability, for instance), I would like your opinion on comparing the treatment group before and after the treatment instead of comparing control and treatment groups. I think the counterfactual would be better this way, since it is the same subject, which controls for innate ability and other time-invariant characteristics. What do you think? And, if possible, is there an example you would advise me to follow to see how to do this in Stata?
    *Another way to attack this is as a treatment model which provides a more sophisticated analysis of potential differences between control and treatment groups, often through propensity score matching or something similar. Such treatment models can be estimated using the treatment procedures in Stata, but can also be estimated with some of the extended regression procedures.
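As a rough illustration of the propensity-score route, the sketch below runs teffects psmatch on simulated cross-sectional data. Everything in it is hypothetical (y, treat, x1, x2, and the data-generating process), and it treats each row as an independent observation, so it does not by itself handle the repeated spells per person in #1.

```stata
clear
set seed 12345
set obs 500                                  // 500 hypothetical workers
generate x1 = rnormal()
generate x2 = rnormal()

* Treatment take-up depends on observed covariates (selection on observables)
generate byte treat = runiform() < invlogit(0.5*x1 - 0.5*x2)

* Simulated outcome with a true treatment effect of 2
generate y = 1 + 2*treat + x1 + x2 + rnormal()

* Match on the propensity score from a logit of treat on x1 and x2
* and report the average treatment effect on the treated
teffects psmatch (y) (treat x1 x2), atet
```

With real data you would want to check covariate balance and overlap after matching before interpreting the estimate.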



    • #3
      Dear Prof. Phil Bromiley,

      My apologies for the delay in answering, and thanks for all of your comments. Let me take this opportunity to ask a couple more questions regarding your comments:

      You certainly can do this, although it's not nearly as neat as we normally have with panel data
      What do you mean by this? Is it that the FE won't control for person-level time-invariant characteristics as they would in panel data? Is there any document or paper you would suggest on the differences between FE in longitudinal versus panel data structures?

      Regarding the construction of a panel data set from the longitudinal data set I show in #1: remember that my reason for doing this is to apply DID estimation and also FE (since I read that DID cannot be done if time is not equally spaced, as is the case in the data set in #1).
      I have aggregated the observations by year (it could also be done by month), as I show below. I am not sure this is a good way to go: I might end up with a hugely unbalanced panel (see xtdescribe at the bottom) and also lose information. Can you suggest any document to start with for doing this?

      *Another way to attack this is as a treatment model which provides a more sophisticated analysis of potential differences between control and treatment groups, often through propensity score matching or something similar. Such treatment models can be estimated using the treatment procedures in Stata, but can also be estimated with some of the extended regression procedures.
      I have been looking at how to do propensity score matching for a panel but did not find anything; this is why I thought about DID. Can you give me any hint?

      Thanks again for your help.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte id_person float occasion byte(firm_id start_day start_month) int start_year byte(end_day end_month) int end_year float(newx1 newx3 newx4 newtreat)
      1  2  1 16 12 2005 16 12 2005 0 0        30 0
      1  4  1  1  2 2006  1  2 2006 0 0  38.33333 0
      1  7  1  8  2 2007  8  2 2007 0 0        32 0
      1  9  2 25  6 2008 24  9 2008 0 0        25 0
      1 10  3  2  7 2009  1 10 2009 0 1        28 1
      1 13  6 11  5 2010 31  8 2010 0 0 36.666668 1
      1 16  7 26  4 2011  8  5 2011 0 0  38.33333 1
      2  1  9 31 12 2006  1  1 2007 0 1         5 0
      2  4 10 11  5 2007 27  5 2007 0 1 34.666668 0
      2  6 12 19 12 2008  .  .    . 0 1        30 0
      2  7 13  5  5 2009 13  9 2009 0 0        40 0
      2  8 14 10  5 2010 16  9 2010 0 0        20 0
      2  9 15  9  3 2011  8  9 2011 0 0        20 1
      3  1 16 28  7 2008  .  .    . 0 0        40 0
      3  2 17  8  3 2010  .  .    . 0 0        40 0
      3  3 17  1  4 2011  .  .    . 0 0        35 0
      3  4 18  1  1 2014  .  .    . 0 0        40 1
      3  5 19 14  1 2019 30  6 2019 1 1         7 1
      4  1 21  2  4 2000  5  5 2001 0 0        25 0
      4  2 21  4 10 2001  .  .    . 0 1 28.333334 0
      4  5 23 31  5 2002  .  .    . 1 0        40 0
      end
      Code:
      . xtset start_day start_year
             panel variable:  start_day (unbalanced)
              time variable:  start_year, 2000 to 2019, but with gaps
                      delta:  1 unit
      
      . xtdescribe
      
      start_day:  1, 2, ..., 31                                    n =         15
      start_year:  2000, 2001, ..., 2019                           T =         12
                 Delta(start_year) = 1 unit
                 Span(start_year)  = 20 periods
                 (start_day*start_year uniquely identifies each observation)
      
      Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                               1       1       1         1         2       3       3
      
           Freq.  Percent    Cum. |  Pattern
       ---------------------------+----------------------
              3     20.00   20.00 |  ........1...........
              2     13.33   33.33 |  ...........1........
              2     13.33   46.67 |  .......1..1.........
              1      6.67   53.33 |  ...................1
              1      6.67   60.00 |  ..........1.........
              1      6.67   66.67 |  .........1..........
              1      6.67   73.33 |  ......1....1..1.....
              1      6.67   80.00 |  .....1..............
              1      6.67   86.67 |  ..1...1.............
              2     13.33  100.00 | (other patterns)
       ---------------------------+----------------------
             15    100.00         |  XXX..XXXXXXX..X....X



      • #4
        I just realized that I built the panel data set the wrong way (the panel variable was start_day when it should have been id_person), sorry. Here is the correct one:

        Code:
        . xtset id_person start_year
               panel variable:  id_person (unbalanced)
                time variable:  start_year, 2000 to 2019, but with gaps
                        delta:  1 unit
        
        . xtdescribe
        
        id_person:  1, 2, ..., 4                                     n =          4
        start_year:  2000, 2001, ..., 2019                           T =         12
                   Delta(start_year) = 1 unit
                   Span(start_year)  = 20 periods
                   (id_person*start_year uniquely identifies each observation)
        
        Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                                 3       3       4         6         7       7       7
        
             Freq.  Percent    Cum. |  Pattern
         ---------------------------+----------------------
                1     25.00   25.00 |  ........1.11..1....1
                1     25.00   50.00 |  ......111111........
                1     25.00   75.00 |  .....1111111........
                1     25.00  100.00 |  111.................
         ---------------------------+----------------------
                4    100.00         |  XXX..XXXXXXX..X....X
