Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dealing with missing values in unbalanced panel dataset

    Dear members of this community,
    I have an unbalanced panel dataset that records information on revenues of firms for 7 years. Below there is a data example.
    It is fine to have an unbalanced dataset, but I would like to deal with / analyze missing values inside the period in which the firm is active. For example, firm with id=1 is fine, while firm with id=2 is observed until time 4 and then in time 7.
    My question is: how can I generate a dummy variable that identifies those firms (like firm 2)?
    Additional questions:
    1. Do you have any advice on how to treat such firms? My worry is that these firm are likely to misreport their revenues.
    2. Do you have any advice on how to impute these missing values? Or do you think it is preferrable to drop all the firms like firm 2
    Thanks a lot for your help!

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long id float time long revenues
     1 0        .
     1 1        .
     1 2        .
     1 3        .
     1 4        .
     1 5   371900
     1 6   327261
     1 7   294181
     2 0  1658122
     2 1  1790739
     2 2  2053846
     2 3  2301695
     2 4  2487270
     2 5        .
     2 6        .
     2 7  2239711
     3 0    78000
     3 1    78000
     3 2    84000
     3 3    84000
     3 4    84000
     3 5    84567
     3 6    85201
     3 7    77338
     4 0  1102690
     4 1  1100866
     4 2   888238
     4 3  1223983
     4 4    14727
     4 5        .
     4 6        .
     4 7        .
     5 0    19111
     5 1    14325
     5 2    10175
     5 3    23834
     5 4    29081
     5 5    30718
     5 6    33395
     5 7    28609
     6 0   309296
     6 1   270199
     6 2   322852
     6 3   330863
     6 4   330964
     6 5   420177
     6 6   439237
     6 7   172876
     7 0 12689348
     7 1        .
     7 2        .
     7 3        .
     7 4        .
     7 5        .
     7 6        .
     7 7        .
     8 0        .
     8 1        .
     8 2        .
     8 3   975812
     8 4   881912
     8 5   728987
     8 6   699269
     8 7    25019
     9 0        0
     9 1        0
     9 2        0
     9 3     3600
     9 4     7200
     9 5     7200
     9 6     9600
     9 7     9600
    10 0        .
    10 1        .
    10 2        .
    10 3        .
    10 4        .
    10 5        .
    10 6        .
    10 7        .
    11 0  1449842
    11 1  1335453
    11 2  1463161
    11 3  1549618
    11 4  1636581
    11 5  1629713
    11 6  1641254
    11 7   856642
    12 0        0
    12 1        0
    12 2        0
    12 3        0
    12 4        0
    12 5        0
    12 6        0
    12 7        0
    13 0  8170000
    13 1 11687000
    13 2  7607000
    13 3        .
    end




  • #2
    Code:
    isid id time, sort
    by id (time): gen run_num = sum(missing(revenues) != missing(revenues[_n-1]))
    by id (time): egen byte problem = ///
        max(missing(revenues) & !inlist(run_num, run_num[1], run_num[_N]))
    Added: I did not respond to your additional questions because good advice for those will depend on substantive knowledge of finance/economics, which is not in my wheelhouse. Hopefully, somebody else who is knowledgeable in those areas will.
    Last edited by Clyde Schechter; 31 Mar 2023, 12:19.

    Comment


    • #3
      May I ask an additional question on the topic? I am trying to impute the missing values "inside" the panel (see discussion above).
      I would like to impute only missing values with the mean between the two last available values in the series.
      To explain it better, consider firm 2. I would like to impute years 5 and 6 with the value 2363490.5 = (2487270+2239711)/2 which is the mean of years 4 and 7.
      Again, I would accept "better advices" to deal with the issue.
      Kind regards

      Comment


      • #4
        To do the averaging you propose, after using the code in #2, run this:
        Code:
        by id (run time), sort: gen prior = revenues[_n-1] if run != run[_n-1]
        by id (run time): gen subsequent = revenues[_n+1] if run != run[_n+1]
        by id run (time): gen imputed_value = 0.5*(prior[1] + subsequent[_N]) ///
            if missing(revenues)
        replace imputed_value = revenues if !missing(revenues)

        Comment

        Working...
        X