Help with interpolation

Mike Tanner

Join Date: Aug 2016

Posts: 45
#1

03 Aug 2016, 17:22

Hello Stata users,

I'm using panel data to test the relationship between deforestation (the dependent variable) and certain drivers of deforestation. I have data for 4 provinces of a given country, with observations for 1984, 1987, 1990, 1991, 1995 and 1999. I also have all the independent variables for 1994 and 1996, so I was trying to interpolate the missing values for those two years.

I used "ipolate mangrovecov year, gen(mangrovecov1)", but I get a new set of observations for all the years, when I'm only looking to get the interpolations for 1994 and 1996 for each of the 4 provinces.

Help much appreciated.

Best Regards,

Mike

Oded Mcdossi

Join Date: Jun 2014

Posts: 577
#2

04 Aug 2016, 01:10

I think you should interpolate your data within each of the 4 provinces if you want the interpolated variable to equal the observed data wherever the latter exist. Then you just need to assert that the non-missing values are the same in both variables; if the assertion is false then something went wrong, but without more details (please read the FAQ) I can't say more.

Code:

bys province_code: ipolate mangrovecov year, g(mangrovecov1)
assert mangrovecov==mangrovecov1 if !mi(mangrovecov)

Nick Cox

Join Date: Mar 2014

Posts: 35431
#3

04 Aug 2016, 01:37

Oded gives good advice. It's hard to see that pooling provinces is a good idea in interpolation, although a more complicated model might allow "borrowing strength" in Tukey's terms.

There are several possible methods other than linear interpolation, and in any case there is always a question of what scale to work on. If mangrove cover is an absolute area, I would tend to consider interpolation on a logarithmic scale followed by exponentiation; if a percent or proportion, on a logit scale.
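A minimal sketch of the logit-scale variant, for the case where cover is recorded as a proportion. The variable propcov, the province labels "A" and "B" and all the numbers are invented purely for illustration (the data in this thread are areas, not proportions); logit() and invlogit() are official Stata functions.

```stata
* Toy data: propcov is a hypothetical proportion strictly between 0 and 1
clear
input str10 province int year float propcov
"A" 1991 .40
"A" 1994 .
"A" 1995 .30
"A" 1996 .
"A" 1999 .20
"B" 1991 .10
"B" 1994 .
"B" 1995 .08
"B" 1996 .
"B" 1999 .05
end

* move to the logit scale; logit() returns missing for 0 and 1,
* so exact zeros or ones would need separate handling
gen logitp = logit(propcov)

* interpolate linearly in year on that scale, separately for each province
bysort province: ipolate logitp year, gen(logitp_i)

* back-transform; the results necessarily lie strictly between 0 and 1
gen propcov_i = invlogit(logitp_i)

list province year propcov propcov_i, sepby(province)
```

The back-transformation is what keeps interpolated proportions inside (0, 1), just as exponentiation keeps interpolated areas positive.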

See also http://www.statalist.org/forums/foru...-interpolation

But be careful in using interpolated data in a model; you won't have as many degrees of freedom as the model output will show.

From the sound of it the raw dependent variable data are small enough to be posted here, to allow substance to be added to speculation.

Mike Tanner

Join Date: Aug 2016
Posts: 45

04 Aug 2016, 06:16

If I understand correctly, I should interpolate the missing observations for each province using only the data for that province? If so, that is what I'd like to do, since the deforestation process was different in each province.

I attach the data on the provinces and deforestation observations. Regarding degrees of freedom: I'm interpolating so that I can have a fuller dataset, since N = 24 is quite low. I was planning to run regressions with both the interpolated and the non-interpolated datasets as a robustness check.

I appreciate the help.

Province	year	mangrovecov
Guayas	1984	119526.2
Guayas	1987	116065.9
Guayas	1990	110395.5
Guayas	1991	109927.62
Guayas	1994
Guayas	1995	102108.5
Guayas	1996
Guayas	1999	104586
El Oro	1984	24455.8
El Oro	1987	23402.7
El Oro	1990	21317
El Oro	1991	20918.09
El Oro	1994
El Oro	1995	17697.8
El Oro	1996
El Oro	1999	18911
Esmeraldas	1984	30152.6
Esmeraldas	1987	29257.4
Esmeraldas	1990	27891
Esmeraldas	1991	26662.68
Esmeraldas	1994
Esmeraldas	1995	22965.42
Esmeraldas	1996
Esmeraldas	1999	23189
Manabi	1984	7973.4
Manabi	1987	6400.7
Manabi	1990	5830
Manabi	1991	4457.22
Manabi	1994
Manabi	1995	4038.32
Manabi	1996
Manabi	1999	1797
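
One way to set up that robustness comparison, sketched under assumptions: flag the rows that are about to be filled in before interpolating, so that any later model can be fitted with and without them. Only two of the four provinces are typed in to keep the example short; the indicator name interpolated is made up, and the time-trend regression is merely a placeholder for the real model.

```stata
clear
input str10 province int year float mangrovecov
"Guayas" 1984  119526.2
"Guayas" 1987  116065.9
"Guayas" 1990  110395.5
"Guayas" 1991 109927.62
"Guayas" 1994         .
"Guayas" 1995  102108.5
"Guayas" 1996         .
"Guayas" 1999    104586
"El Oro" 1984   24455.8
"El Oro" 1987   23402.7
"El Oro" 1990     21317
"El Oro" 1991  20918.09
"El Oro" 1994         .
"El Oro" 1995   17697.8
"El Oro" 1996         .
"El Oro" 1999     18911
end

* flag the rows that will be filled in, before interpolating
gen byte interpolated = missing(mangrovecov)

* linear interpolation within each province
bysort province: ipolate mangrovecov year, gen(mangrovecov1)

* only the unflagged rows are real measurements
count if !interpolated

* placeholder model, fitted with and then without the interpolated rows
regress mangrovecov1 year
regress mangrovecov1 year if !interpolated
```

The flag also makes it easy to report how many observations in any model are measured rather than filled in.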

Oded Mcdossi

Join Date: Jun 2014
Posts: 577

04 Aug 2016, 06:37

So here is the code for the ipolate.

Code:

clear
input str10 province  year  mangrovecov
"El Oro"     1991  20918.09
"El Oro"     1984   24455.8
"El Oro"     1990     21317
"El Oro"     1987   23402.7
"El Oro"     1996         .
"El Oro"     1999     18911
"El Oro"     1994         .
"El Oro"     1995   17697.8
"Esmeraldas" 1999     23189
"Esmeraldas" 1990     27891
"Esmeraldas" 1987   29257.4
"Esmeraldas" 1991  26662.68
"Esmeraldas" 1994         .
"Esmeraldas" 1996         .
"Esmeraldas" 1995  22965.42
"Esmeraldas" 1984   30152.6
"Guayas"     1999    104586
"Guayas"     1991 109927.62
"Guayas"     1990  110395.5
"Guayas"     1987  116065.9
"Guayas"     1994         .
"Guayas"     1996         .
"Guayas"     1995  102108.5
"Guayas"     1984  119526.2
"Manabi"     1990      5830
"Manabi"     1991   4457.22
"Manabi"     1994         .
"Manabi"     1987    6400.7
"Manabi"     1996         .
"Manabi"     1984    7973.4
"Manabi"     1995   4038.32
"Manabi"     1999      1797
end

bys province: ipolate mangrovecov year, g(mangrovecov1) 

assert mangrovecov==mangrovecov1 if !mi(mangrovecov)

Nick Cox

Join Date: Mar 2014
Posts: 35431

04 Aug 2016, 07:23

Thanks for the data. Ecuador!

Here is a version using dataex to generate code for others (SSC; see FAQ Advice #12). I don't see enormous scope for varying the interpolation method but this is interpolation on logarithmic scale followed by back-transformation. The straight line segments on the graph are thus not merely cosmetic but correspond to how the data are being treated.

I used mipolate (SSC) as mentioned in #3 even though ipolate would do the same job in this case.

Code:

set scheme s1color 
* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 province int year float mangrovecov
"Guayas"     1984  119526.2
"Guayas"     1987  116065.9
"Guayas"     1990  110395.5
"Guayas"     1991 109927.62
"Guayas"     1994         .
"Guayas"     1995  102108.5
"Guayas"     1996         .
"Guayas"     1999    104586
"El Oro"     1984   24455.8
"El Oro"     1987   23402.7
"El Oro"     1990     21317
"El Oro"     1991  20918.09
"El Oro"     1994         .
"El Oro"     1995   17697.8
"El Oro"     1996         .
"El Oro"     1999     18911
"Esmeraldas" 1984   30152.6
"Esmeraldas" 1987   29257.4
"Esmeraldas" 1990     27891
"Esmeraldas" 1991  26662.68
"Esmeraldas" 1994         .
"Esmeraldas" 1995  22965.42
"Esmeraldas" 1996         .
"Esmeraldas" 1999     23189
"Manabi"     1984    7973.4
"Manabi"     1987    6400.7
"Manabi"     1990      5830
"Manabi"     1991   4457.22
"Manabi"     1994         .
"Manabi"     1995   4038.32
"Manabi"     1996         .
"Manabi"     1999      1797
end

gen logm = log(mangrove)
* to install mipolate: 
* ssc inst mipolate 
mipolate logm year, by(province) gen(loglinear)
replace loglinear = exp(loglinear)
twoway connected loglinear mangrove year, by(province, note("") yrescale) ///
cmissing(n n) ysc(log) ms(+ O) msize(*1.2)  yla(, ang(h)) xtitle("")

[Image: mangrove.png. Connected plots of observed and log-linearly interpolated mangrove cover against year, one panel per province, logarithmic y axis.]

Mike Tanner

Join Date: Aug 2016

Posts: 45
#7

04 Aug 2016, 07:40

Thanks a lot. I've used it and it works, but by typing that I lose all my independent variables. Is there a way of keeping the loaded dataset?

Nick Cox

Join Date: Mar 2014

Posts: 35431
#8

04 Aug 2016, 08:02

Naturally in your case you should start with the entire dataset that you already have.

The point about the input text is so that (a) I could play with your data to give an answer and (b) anyone can chip in to the discussion and/or adapt the solution for some similar problem without needing to ask you for the dataset. (Same applies, naturally, to Oded's answer.)

Then assuming that you don't have variables logm or loglinear the point to start is the first generate statement. I'd regard exponential growth or decline as the natural first approximation here, rather than linear. But you can always compare Oded's results and mine.
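
A sketch of what starting at the first generate statement looks like. The short input block only stands in for a dataset already in memory; the regressor x1 and its values are hypothetical, and ipolate is used here instead of mipolate so that nothing needs installing.

```stata
* Stand-in for the real dataset: in practice, open your own file here instead
* (x1 is a hypothetical independent variable)
clear
input str10 Province int year float(mangrovecov x1)
"Manabi" 1991 4457.22 10
"Manabi" 1994       . 12
"Manabi" 1995 4038.32 13
"Manabi" 1996       . 15
"Manabi" 1999    1797 20
end

* From here on only new variables are added; nothing in memory is dropped
gen logm = log(mangrovecov)
bysort Province: ipolate logm year, gen(loglinear)
replace loglinear = exp(loglinear)

* x1 is still there alongside the new variables
describe
```

Because there is no clear or input after the data are loaded, every existing variable survives.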

Mike Tanner

Join Date: Aug 2016

Posts: 45
#9

04 Aug 2016, 08:15

Nick/Oded, thanks a lot. I agree that the log of the dependent variable is the better approximation. My next task after interpolating was to do some transformations, so this has been extra helpful.

Mike Tanner

Join Date: Aug 2016

Posts: 45
#10

04 Aug 2016, 08:26

Sorry, after starting from the first generate statement, I get an error saying that the command cmissing doesn't exist.

The code used (modified to the variable names in the dataset) is:

gen logm = log(mangrovecov) mipolate logm year, by(Province) gen(loglinear) replace loglinear = exp(loglinear) twoway connected loglinear mangrovecov year, by(Province, note("") yrescale) /// cmissing(n n) ysc(log) ms(+ O) msize(*1.2) yla(, ang(h)) xtitle("")

Mike Tanner

Join Date: Aug 2016

Posts: 45
#11

04 Aug 2016, 08:29

Solved, sorry!

Nick Cox

Join Date: Mar 2014

Posts: 35431
#12

04 Aug 2016, 08:30

The continuation comment

Code:

///

must occur at the end of a command line as a signal that the next line is a continuation.

Please do read http://www.statalist.org/forums/help#stata to learn how to use CODE delimiters as Oded and I have done.

Without CODE delimiters it's utterly impossible to see where the ends of command lines occur in what you typed to Stata.
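
For instance, a graph command spread over three lines of a do-file might look like the sketch below, which uses the auto dataset shipped with Stata rather than the data in this thread. The continuation marker works in do-files and ado-files but not when typing interactively in the Command window.

```stata
sysuse auto, clear

* one command spread over three physical lines;
* each continuation marker must be the last thing on its line
twoway scatter mpg weight,      ///
    ytitle("Mileage (mpg)")     ///
    xtitle("Weight (lbs.)")
```

If the three lines are pasted as one, the marker ends up in the middle of the line, everything after it is ignored, and whatever is on the following line is then read as a new command.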

Mike Tanner

Join Date: Aug 2016

Posts: 45
#13

04 Aug 2016, 09:26

I'll keep that in mind. Can I ask why you transform the variable to logarithms and then replace loglinear with exp(loglinear)? Is this how you interpolate a variable that follows an exponential growth or decline?

Nick Cox

Join Date: Mar 2014

Posts: 35431
#14

04 Aug 2016, 09:36

It's how I do it and I'd say it was standard. The default of mipolate is identical to the default (and only) behaviour of ipolate -- to interpolate linearly. Neither supplies an option to interpolate on a logarithmic scale; I can't speak for ipolate, but for mipolate I know it's because a command before and a command after make it easy.
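
To make the three steps concrete, here is the same arithmetic done by hand for a single gap, using the Guayas values for 1991 and 1995 from #4 to fill in 1994. It is only an illustration of what linear interpolation of the logarithm implies, not code taken from ipolate or mipolate, and the scalar names are arbitrary.

```stata
* Guayas: observed cover in 1991 and 1995; the target year is 1994
scalar y0 = 109927.62
scalar y1 = 102108.5

* fraction of the way from 1991 to 1995
scalar w = (1994 - 1991) / (1995 - 1991)

* linear interpolation on the raw scale
scalar lin = y0 + w * (y1 - y0)

* linear interpolation of log(y), back-transformed with exp()
scalar loglin = exp(ln(y0) + w * (ln(y1) - ln(y0)))

display "linear:     " lin
display "log-linear: " loglin
```

The log-scale result is a weighted geometric mean of the two neighbouring values, which amounts to assuming a constant proportional rate of change between them. It is never larger than the linear result, and the two are close when the neighbouring values are close.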

Mike Tanner

Join Date: Aug 2016

Posts: 45
#15

04 Aug 2016, 12:14

Thanks for the clarification. Given the data and the expected relations between the dependent and independent variables, I went with a log-log model, fitted with both RE and FE and with clustered standard errors. I get the expected sign in both FE and RE, although with random effects I have more significance. When choosing between RE and FE, does my small N play a role? I'll probably run a Hausman test too.