
  • Extrapolating between observations in time

    Hello all,

    I'm working on a dataset for which I'm interested in the changes over time of its percentiles (winners and losers). However, my observations are not continuous, since they are based on surveys, so I only have data for certain years.
    Now, I have worked on the data and, as in the example below, I have created 3 variables: one denoting the percentile at time t, another denoting that time t, and a third holding the value of the percentile itself.
    What I would like to do now is a "simple" interpolation/extrapolation: creating the years for which I do not have data points, and filling in their values of pct_100_ by simple linear interpolation between the closest observed years.

    I have been having trouble finding a way to create these specific values while interpolating between certain periods. In the example below, the years I have are 2003, 2008, 2013 and 2018. What would be the best way to go about this?

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float position_pct int year float pct_100_
     1 2003  7.14
     1 2008   8.2
     1 2013 12.21
     1 2018 15.33
     2 2003 12.79
     2 2008 13.11
     2 2013 19.41
     2 2018 22.84
     3 2003 16.18
     3 2008 18.83
     3 2013 25.83
     3 2018 28.29
     4 2003 21.17
     4 2008 24.91
     4 2013 32.09
     4 2018 34.29
     5 2003  26.4
     5 2008 32.09
     5 2013 39.24
     5 2018 40.71
     6 2003 32.41
     6 2008 40.55
     6 2013 45.96
     6 2018 47.67
     7 2003  40.3
     7 2008 48.79
     7 2013 53.22
     7 2018 56.02
     8 2003 49.16
     8 2008 58.21
     8 2013 61.17
     8 2018 63.43
     9 2003 64.73
     9 2008  69.6
     9 2013 71.42
     9 2018 73.68
    10 2003   100
    10 2008   100
    10 2013   100
    10 2018   100
    end

    Thank you,

  • #2
    Hi Pedro,
    probably not the prettiest, not the simplest or most elegant piece of code, but try this:
    Code:
    bysort position_pct (year): gen expobs = year[_n+1] - year    // years until the next observed year
    expand expobs

    sort position_pct year
    bysort position_pct year: gen year_upd = year + _n - 1    // update year

    clonevar pct_100_intp = pct_100_ if year == year_upd    // add interpolated pct_100_ var
    gen mispct = missing(pct_100_intp)    // dummy for counting missings
    bysort position_pct year: gen nintp = _N    // number of interpolation steps needed
    bysort position_pct year: egen minpct = min(pct_100_)    // observed pct_100_ at the start of the interval
    gen pctnext = pct_100_[_n+1] if year_upd != year[_n+1] & year < year[_n+1]    // pct_100_ at the next observed year
    bysort position_pct year: egen maxpct = max(pctnext)    // copy that value to every row of the interval

    // sort on year_upd within each interval so that _n - 1 counts the years since the observed one
    bysort position_pct year (year_upd): replace pct_100_intp = minpct + ((maxpct - minpct)/nintp) * (_n - 1) if missing(pct_100_intp)    // update interpolated var
    replace year = year_upd
    drop ex* year_upd mispct nintp minpct pctnext maxpct
    I assumed that you want a linear interpolation between consecutive observations. The number of interpolation points needed is determined automatically, so the code should also work properly if you have missing values in pct_100_ and drop those lines first. It should also work if you add observations from other years.
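
    To make the interpolation step concrete, here is a small sketch of the arithmetic behind the replace line above, using the two observed values for position_pct 1 in 2003 and 2008 from the example data. The local macro names (lo, hi, n, yr, v) are only for illustration and are not part of the code above.

    ```stata
    * Observed values for position_pct 1 (taken from the example data)
    local lo = 7.14
    local hi = 8.2
    * Number of yearly steps between the two survey years
    local n = 2008 - 2003

    * Start value plus k equal steps of size (hi - lo)/n
    forvalues k = 0/4 {
        local yr = 2003 + `k'
        local v = `lo' + (`hi' - `lo') / `n' * `k'
        display `yr' "   " %6.3f `v'
    }
    ```

    The step is (8.2 - 7.14)/5 = 0.212 per year, so the filled-in values for 2004 to 2007 should be 7.352, 7.564, 7.776 and 7.988.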

    Best of luck,
    Benno



    • #3
      I used sepscatter from SSC to get a quick view of your data. (Thanks for the example.)

      Code:
      sepscatter pct year, separate(position_pct) recast(connected) xla(2003(5)2018) legend(order(10 9 8 7 6 5 4 3 2 1))

      The data don't look like percentiles to me, but perhaps some summaries for decile bins scaled so that the highest is always 100.

      Be that as it may, you could apply ipolate separately to each group for linear interpolation or extrapolation. That won't guarantee that the correct order of the groups is preserved, because each group is interpolated separately.
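
      A minimal sketch of the ipolate route, assuming the variable names from the example in #1 and using only the first two groups to keep it short. The new variable name pct_100_ip is just an illustration, and using xtset with tsfill to create the missing years is one possible way of doing that step, not the only one.

      ```stata
      clear
      input float position_pct int year float pct_100_
      1 2003  7.14
      1 2008   8.2
      1 2013 12.21
      1 2018 15.33
      2 2003 12.79
      2 2008 13.11
      2 2013 19.41
      2 2018 22.84
      end

      * Declare the panel structure so that tsfill knows which years are missing
      xtset position_pct year

      * Add one observation for every missing year inside each group's range
      tsfill

      * Linear interpolation of pct_100_ on year, separately for each group
      bysort position_pct (year): ipolate pct_100_ year, generate(pct_100_ip)

      * Check that the groups are still in the right order within each year
      bysort year (position_pct): assert pct_100_ip >= pct_100_ip[_n-1] if _n > 1

      list if position_pct == 1 & inrange(year, 2003, 2008), noobs
      ```

      Adding the epolate option to ipolate would also fill in years outside the observed range, provided observations for those years exist in the dataset; the ordering check is most worth running in that case.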
      [Image: percentile.png]

      Last edited by Nick Cox; 22 Apr 2024, 09:43.



      • #4
        Hello Benno Schoenberger,

        Your code worked perfectly, and to be honest my attempts were not much cleaner or prettier. Thank you very much for the help! I'll try to adapt it for datasets of different sizes and with different numbers of years, but I believe it should work perfectly.
        Again, thank you.



        • #5
          Hello Nick Cox,

          These values are already the percentiles, not the raw data. They come from a sort of "index" that is scaled to 100, hence the value of 100 for the 10th group (I'm not sure this makes sense statistically, but I wanted the "max" value it takes as well).
          Thank you for the suggestions.
