  • Help with interpolation

    Hello Stata users,

    I'm using panel data to test the relationship between deforestation (the dependent variable) and certain drivers of deforestation. I have data for 4 provinces of a given country, with observations for 1984, 1987, 1990, 1991, 1995 and 1999. I also have all the independent variables for 1994 and 1996, so I was trying to interpolate the missing values.

    I used "ipolate mangrovecov year, gen(mangrovecov1)", but I get a new set of observations for the years, when I'm only looking to get the interpolations for 1994 and 1996 for each of the 4 provinces.

    Help much appreciated.

    Best Regards,

    Mike

  • #2
    I think you should interpolate within each of the 4 provinces if you want the interpolated variable to equal the observed data wherever data were observed. Then you just need to assert that the non-missing values are the same in both variables; if the assertion is false then something went wrong, but without more details (please read the FAQ) I can't say more.
    Code:
    bys province_code: ipolate mangrovecov year, g(mangrovecov1) 
    
    assert mangrovecov==mangrovecov1 if !mi(mangrovecov)



    • #3
      Oded gives good advice. It's hard to see that pooling provinces is a good idea in interpolation, although a more complicated model might allow "borrowing strength" in Tukey's terms.

      There are several possible methods other than linear interpolation, and in any case there is always a question of what scale to work on. If mangrove cover is an absolute area, I would tend to consider interpolation on a logarithmic scale followed by exponentiation; if a percent or proportion, on a logit scale.
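The logit-scale variant mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the data are invented, and coverprop, logitp, logit_ip and coverprop_ip are hypothetical names that do not appear in the thread. It uses the official ipolate command together with the built-in logit() and invlogit() functions.

```stata
* Hypothetical data: coverprop is an invented proportion strictly between 0 and 1
clear
input str1 province int year double coverprop
"A" 1990 0.40
"A" 1991 .
"A" 1992 0.30
"B" 1990 0.10
"B" 1991 .
"B" 1992 0.20
end

* Transform to the logit scale, interpolate within province, back-transform
gen double logitp = logit(coverprop)
bysort province: ipolate logitp year, gen(logit_ip)
gen double coverprop_ip = invlogit(logit_ip)

list, sepby(province)
```

Working on the logit scale keeps interpolated values inside (0, 1). It does not work for values that are exactly 0 or 1, because logit() is missing there.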

      See also http://www.statalist.org/forums/foru...-interpolation

      But be careful in using interpolated data in a model; you won't have as many degrees of freedom as the model output will show.

      From the sound of it the raw dependent variable data are small enough to be posted here, to allow substance to be added to speculation.



      • #4
        If I understand correctly, I should interpolate the missing observations for each province using the data for that particular province? If so, that is what I'd like to do, since the deforestation process was different in each province.

        I attach the info on the provinces and deforestation observations. Regarding degrees of freedom: I'm interpolating so I can have a fuller dataset, since N=24 is quite low. I was planning to run regressions with both the interpolated and the non-interpolated datasets, just for robustness.

        I appreciate the help.
        Province year mangrovecov
        Guayas 1984 119526.2
        Guayas 1987 116065.9
        Guayas 1990 110395.5
        Guayas 1991 109927.62
        Guayas 1994
        Guayas 1995 102108.5
        Guayas 1996
        Guayas 1999 104586
        El Oro 1984 24455.8
        El Oro 1987 23402.7
        El Oro 1990 21317
        El Oro 1991 20918.09
        El Oro 1994
        El Oro 1995 17697.8
        El Oro 1996
        El Oro 1999 18911
        Esmeraldas 1984 30152.6
        Esmeraldas 1987 29257.4
        Esmeraldas 1990 27891
        Esmeraldas 1991 26662.68
        Esmeraldas 1994
        Esmeraldas 1995 22965.42
        Esmeraldas 1996
        Esmeraldas 1999 23189
        Manabi 1984 7973.4
        Manabi 1987 6400.7
        Manabi 1990 5830
        Manabi 1991 4457.22
        Manabi 1994
        Manabi 1995 4038.32
        Manabi 1996
        Manabi 1999 1797
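One way to set up the robustness comparison described in this post is to flag which values were filled in by interpolation. The following is a minimal sketch on a subset of the data above. The indicator name interpolated is hypothetical, and x1 and x2 in the comments stand in for independent variables that are not shown in the thread.

```stata
clear
input str10 Province int year double mangrovecov
"Guayas" 1991 109927.62
"Guayas" 1994 .
"Guayas" 1995 102108.5
"Guayas" 1996 .
"Guayas" 1999 104586
"Manabi" 1991 4457.22
"Manabi" 1994 .
"Manabi" 1995 4038.32
"Manabi" 1996 .
"Manabi" 1999 1797
end

* Linear interpolation within each province
bysort Province: ipolate mangrovecov year, gen(mangrovecov1)

* interpolated (hypothetical name) is 1 where the value was filled in, 0 where observed
gen byte interpolated = mi(mangrovecov) & !mi(mangrovecov1)
tabulate Province interpolated

* A model could then be fitted twice, for example
*   regress mangrovecov1 x1 x2                    (all observations)
*   regress mangrovecov1 x1 x2 if !interpolated   (observed values only)
```

Keeping the indicator in the dataset also makes it easy to report how many observations in any model are interpolated rather than measured.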



        • #5
          So here is the code for the ipolate.


          Code:
          clear
          input str10 province  year  mangrovecov
          "El Oro"     1991  20918.09
          "El Oro"     1984   24455.8
          "El Oro"     1990     21317
          "El Oro"     1987   23402.7
          "El Oro"     1996         .
          "El Oro"     1999     18911
          "El Oro"     1994         .
          "El Oro"     1995   17697.8
          "Esmeraldas" 1999     23189
          "Esmeraldas" 1990     27891
          "Esmeraldas" 1987   29257.4
          "Esmeraldas" 1991  26662.68
          "Esmeraldas" 1994         .
          "Esmeraldas" 1996         .
          "Esmeraldas" 1995  22965.42
          "Esmeraldas" 1984   30152.6
          "Guayas"     1999    104586
          "Guayas"     1991 109927.62
          "Guayas"     1990  110395.5
          "Guayas"     1987  116065.9
          "Guayas"     1994         .
          "Guayas"     1996         .
          "Guayas"     1995  102108.5
          "Guayas"     1984  119526.2
          "Manabi"     1990      5830
          "Manabi"     1991   4457.22
          "Manabi"     1994         .
          "Manabi"     1987    6400.7
          "Manabi"     1996         .
          "Manabi"     1984    7973.4
          "Manabi"     1995   4038.32
          "Manabi"     1999      1797
          end
          
          bys province: ipolate mangrovecov year, g(mangrovecov1) 
          
          assert mangrovecov==mangrovecov1 if !mi(mangrovecov)



          • #6
            Thanks for the data. Ecuador!

            Here is a version using dataex to generate code for others (SSC; see FAQ Advice #12). I don't see enormous scope for varying the interpolation method but this is interpolation on logarithmic scale followed by back-transformation. The straight line segments on the graph are thus not merely cosmetic but correspond to how the data are being treated.

            I used mipolate (SSC) as mentioned in #3 even though ipolate would do the same job in this case.

            Code:
            set scheme s1color 
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str10 province int year float mangrovecov
            "Guayas"     1984  119526.2
            "Guayas"     1987  116065.9
            "Guayas"     1990  110395.5
            "Guayas"     1991 109927.62
            "Guayas"     1994         .
            "Guayas"     1995  102108.5
            "Guayas"     1996         .
            "Guayas"     1999    104586
            "El Oro"     1984   24455.8
            "El Oro"     1987   23402.7
            "El Oro"     1990     21317
            "El Oro"     1991  20918.09
            "El Oro"     1994         .
            "El Oro"     1995   17697.8
            "El Oro"     1996         .
            "El Oro"     1999     18911
            "Esmeraldas" 1984   30152.6
            "Esmeraldas" 1987   29257.4
            "Esmeraldas" 1990     27891
            "Esmeraldas" 1991  26662.68
            "Esmeraldas" 1994         .
            "Esmeraldas" 1995  22965.42
            "Esmeraldas" 1996         .
            "Esmeraldas" 1999     23189
            "Manabi"     1984    7973.4
            "Manabi"     1987    6400.7
            "Manabi"     1990      5830
            "Manabi"     1991   4457.22
            "Manabi"     1994         .
            "Manabi"     1995   4038.32
            "Manabi"     1996         .
            "Manabi"     1999      1797
            end
            
            gen logm = log(mangrove)
            * to install mipolate: 
            * ssc inst mipolate 
            mipolate logm year, by(province) gen(loglinear)
            replace loglinear = exp(loglinear)
            twoway connected loglinear mangrove year, by(province, note("") yrescale) ///
            cmissing(n n) ysc(log) ms(+ O) msize(*1.2)  yla(, ang(h)) xtitle("")
            [Figure: mangrove.png. Connected plots of loglinear and mangrovecov against year, one panel per province, logarithmic y axis.]



            • #7
              Thanks a lot, I've used it and it works, but by typing that I lose all my independent variables. Is there a way of keeping the loaded dataset?



              • #8
                Naturally in your case you should start with the entire dataset that you already have.

                The point about the input text is so that (a) I could play with your data to give an answer and (b) anyone can chip in to the discussion and/or adapt the solution for some similar problem without needing to ask you for the dataset. (Same applies, naturally, to Oded's answer.)

                Then assuming that you don't have variables logm or loglinear the point to start is the first generate statement. I'd regard exponential growth or decline as the natural first approximation here, rather than linear. But you can always compare Oded's results and mine.
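As a sketch of starting from the first generate statement with a dataset already in memory: the input block below only stands in for a loaded dataset, and x1 is a hypothetical independent variable with invented values. ipolate is used here instead of mipolate so that nothing needs installing; as noted in #6, it does the same job in this case.

```stata
* Stand-in for a dataset already in memory; x1 is hypothetical
clear
input str10 Province int year double mangrovecov double x1
"El Oro" 1991 20918.09 10
"El Oro" 1994 .        12
"El Oro" 1995 17697.8  13
"El Oro" 1996 .        15
"El Oro" 1999 18911    18
end

* With your own data, start here: nothing below clears the dataset
gen double logm = log(mangrovecov)
bysort Province: ipolate logm year, gen(loglinear)
replace loglinear = exp(loglinear)

* x1 is still present alongside the new variables
list Province year mangrovecov loglinear x1
```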





                • #9
                  Nick/Oded, thanks a lot. I agree that the log of the dependent variable is the better approximation. My next task after interpolating was to do some transformations, so the help has been extra helpful.



                  • #10
                    Sorry, after basically starting from the first generate statement, I get the error that the cmissing command doesn't exist.

                    The code used (modified to the variable names in the dataset) is:

                    gen logm = log(mangrovecov) mipolate logm year, by(Province) gen(loglinear) replace loglinear = exp(loglinear) twoway connected loglinear mangrovecov year, by(Province, note("") yrescale) /// cmissing(n n) ysc(log) ms(+ O) msize(*1.2) yla(, ang(h)) xtitle("")



                    • #11
                      Solved sorry!



                      • #12
                        The continuation comment

                        Code:
                        ///
                        must occur at the end of a command line as a signal that the next line is a continuation.

                        Please do read http://www.statalist.org/forums/help#stata to learn how to use CODE delimiters as Oded and I have done.

                        Without CODE delimiters it's utterly impossible to see where the ends of command lines occur in what you typed to Stata.
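A small illustration of the continuation rule, using the auto dataset shipped with Stata (an assumption made for the sake of a runnable example; it is not part of this thread). The `///` ends the first physical line, and the next line is read as part of the same command. This applies in do-files; `///` cannot be typed interactively in the Command window.

```stata
sysuse auto, clear

* One command spread over two physical lines: /// ends the first line
summarize price mpg ///
    if foreign == 1

* The equivalent command written on a single line
summarize price mpg if foreign == 1
```

If the line break falls anywhere other than immediately after `///`, Stata treats the second line as a new command, which is how an option such as cmissing() can end up being reported as an unrecognized command.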



                        • #13
                          I'll keep that in mind. Can I ask why you transform the variable to logarithms, interpolate, and then replace the result with exp(loglinear)? Is this how you interpolate a variable that follows an exponential growth or decline rate?



                          • #14
                            It's how I do it and I'd say it was standard. The default of mipolate is identical to the default (and only) behaviour of ipolate -- to interpolate linearly. Neither supplies an option to interpolate on logarithmic scale; I can't speak for ipolate but otherwise I know it's because a command before and a command after make it easy.
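To spell out what the log, interpolate, exponentiate sequence amounts to, here is a short derivation. The notation is introduced only for this illustration: y_0 and y_1 are the observed values at the bracketing years t_0 and t_1, and w is the fraction of the gap elapsed at year t.

```latex
w = \frac{t - t_0}{t_1 - t_0}, \qquad 0 \le w \le 1

% Linear interpolation of y itself (the default of ipolate and mipolate)
\hat{y}_{\mathrm{lin}}(t) = (1 - w)\, y_0 + w\, y_1

% Linear interpolation of \log y, then exponentiation
\log \hat{y}_{\mathrm{log}}(t) = (1 - w) \log y_0 + w \log y_1
\quad\Longrightarrow\quad
\hat{y}_{\mathrm{log}}(t) = y_0^{\,1-w}\, y_1^{\,w}
```

So interpolating on the log scale assumes a constant proportional rate of change between the two observed years, and the result is a weighted geometric mean rather than a weighted arithmetic mean. For positive values it is never above the linear interpolation. For Guayas in 1994, for example, t_0 = 1991 and t_1 = 1995, so w = 3/4.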



                            • #15
                              Thanks for the clarification. Given the data and the expected relations between the dependent and independent variables, I went with a log-log RE/FE model with clustered standard errors. I get the expected sign in both FE and RE, although with random effects I have more significance. When choosing between RE and FE, does my small N play a role? I'll probably run a Hausman test too.
