
  • Unbalanced Panel - Selection Bias due to unequal time periods only

    Hi everyone,

    I have a panel dataset that Stata reports as unbalanced, as shown here:

    Code:
    . //Setting panel variables
    . xtset household_key period
           panel variable:  household_key (unbalanced)
            time variable:  period, 1 to 34
                    delta:  1 unit
    
    .
    . xtdescribe
    
    household_key:  1, 2, ..., 2500                              n =       2500
      period:  1, 2, ..., 34                                     T =         34
               Delta(period) = 1 unit
               Span(period)  = 34 periods
               (household_key*period uniquely identifies each observation)
    
    Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                             1       1       1         5        11      19      34
    
         Freq.  Percent    Cum. |  Pattern
     ---------------------------+------------------------------------
          916     36.64   36.64 |  1.................................
          268     10.72   47.36 |  111...............................
          223      8.92   56.28 |  11111.............................
          211      8.44   64.72 |  1111111...........................
          163      6.52   71.24 |  111111111.........................
          157      6.28   77.52 |  11111111111.......................
          123      4.92   82.44 |  1111111111111.....................
           94      3.76   86.20 |  111111111111111...................
           70      2.80   89.00 |  11111111111111111.................
          275     11.00  100.00 | (other patterns)
     ---------------------------+------------------------------------
         2500    100.00         |  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    However, this "unbalancedness" results only from households i having unequal numbers of periods T_i. The only reason this occurs is that the assigned treatments vary in number and duration, and I aggregated the data into periods based on those durations. Is this still prone to any type of "selection bias"? Is there any way I can explicitly test for this?

    Just to be clear, every variable is fully observed for each household in each of its periods, with NO missing values; the "unbalancedness" comes only from the fact that treatment durations are wildly different. Each household/cross-section is observed for a total of 720 days, but some have 3 periods due to receiving 2 treatments, some have 4 periods due to receiving 3 treatments, etc. I'd really appreciate any guidance.
    Last edited by AJ Williamson; 22 Apr 2019, 10:14.

  • #2
    Unbalancedness per se is not an issue. Most of Stata's panel data estimators work perfectly well with unbalanced data. Those that don't will refuse to run with unbalanced data anyway. So if you can run your analysis at all, you don't have to worry about the fact that the panel is unbalanced.

    But what you may have to worry about is whether the differences in frequency of observation, which are due to differences in the numbers of treatments administered, are themselves a source of bias in the data. Why were different units given different numbers of treatments? For example, were more treatments given to those who exhibited less response, to try to "help them catch up"? Or perhaps the other way around: were more treatments given to those who exhibited the greatest response, to try to "get the most bang for the buck"? These are the kinds of things you have to think about. You don't say anything about what kinds of treatments are involved, what they are supposed to affect, or what outcomes you are looking at, so nothing more specific can be said.
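
    As a rough first look at that question, one could check whether the number of periods a household contributes is associated with its average outcome. This is only a sketch: `trips` is a hypothetical name for a per-period outcome (nothing in the thread names the outcome variable), while `household_key` comes from the -xtset- output above.

    ```stata
    * Assumption: trips is a hypothetical per-period outcome variable
    * Number of periods each household contributes (T_i)
    bysort household_key: generate n_periods = _N
    * Household-level average of the outcome
    bysort household_key: egen mean_trips = mean(trips)
    * Flag one observation per household so each household counts once
    egen byte one_per_hh = tag(household_key)
    * Descriptive check: is T_i related to the average outcome?
    regress mean_trips n_periods if one_per_hh
    ```

    A clear association here would not show which way the causation runs; it would only indicate that the unequal numbers of periods are not ignorable. Because the periods differ in length, a per-day rate may be a more comparable outcome than a per-period mean.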


    • #3
      Mr./Dr. Clyde Schechter,

      Thank you so much for your response. I didn't want to be too specific, as I'm relatively new to posting here.

      Basically, I have 7 treatments (Type A, Type B, Type AB, Type AC, Type ABC, Type C, Type BC) and one "control" (Type D). The treatments are promotional exposures given to consumers in a loyalty program whose purpose is to increase in-store trip frequency. Within this promotional cycle, treatments "expire" and households can revert to periods of non-exposure (control = Type D). Based on the data, the entire promotional loyalty program lasts 720 days, so I created a 720-day "timeline" for each household. In this quasi-experiment, however, we don't know much about how the treatments were distributed, other than that they occur (one is potentially endogenous, which is a separate issue that I'm comfortable mitigating) and how long they last. It's easy to intuit that those who "experience more treatments" are the most active in the promotional periods relative to those who experience few. Because the number of treatments and their durations vary so widely, the result is an "unbalanced" scenario where attrition seemingly occurs, not in the traditional sense of "panel missingness", but in the sense of unequal numbers of time periods due to unequal treatment receipt.

      I've attached a brief snapshot of what the data look like for two households (note: ignore the red "strings" for the campaigns/treatments; they've been converted to dummies for analysis).
      Min_day and max_day represent the duration of the treatments. Note the unbalancedness due to unequal numbers of periods: 17 periods for household 1 and 3 periods for household 2.


      [Attached image: Screen Shot 2019-04-22 at 9.20.02 PM.png]

      Does that make sense?


      • #4
        Well, it really depends on the nature of the outcomes you are looking at and the types of analysis you plan to do. It is still not clear to me whether the oversampling of some households and undersampling of others is actually associated with the treatments themselves, or with the outcomes. In either case, that's a problem. Depending on the analysis, you might be able to overcome it by weighting. Another approach that might work in some situations would be to reduce the data to one observation per household, with a variable for each of the 8 treatments (including control) that contains the total number of days of exposure to that treatment. (This is only workable if having 10 days of treatment A now and 5 days of treatment A later is equivalent, for your purposes, to having 15 days of treatment A at any other time.) I'm afraid there is no generic answer to your question. It really has to be thought through in the context of specific variables and specific analyses, and I think that it is also not a purely statistical issue: content knowledge about these treatments and the outcome variables is also required.
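
        The one-observation-per-household reduction could be sketched roughly as follows. The names `type_a` through `type_d` are hypothetical 0/1 treatment dummies (the thread only says the treatments were converted to dummies), and the `+ 1` assumes that `min_day` and `max_day` are both inclusive day numbers, which the thread does not confirm.

        ```stata
        * Assumption: min_day and max_day are inclusive day numbers for each period
        generate days = max_day - min_day + 1

        * Assumption: type_a ... type_d are hypothetical 0/1 treatment dummies
        foreach t in a b ab ac abc c bc d {
            generate days_`t' = days * type_`t'
        }

        * Reduce to one observation per household: total days under each treatment
        collapse (sum) days_*, by(household_key)

        * Sanity check: totals should be near 720 if the timeline is complete
        egen total_days = rowtotal(days_*)
        summarize total_days
        ```

        This discards the ordering and timing of treatments, so it only makes sense under the equivalence condition described above.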
