Problem with Reshaping repeated ID data

Sultan Mehmood

Join Date: Sep 2016
Posts: 35

Problem with Reshaping repeated ID data

23 Jul 2017, 05:54

Dear StataListers,

I hope you are doing well. I have the following dataset (part of it I am copying below):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str52 CountryName str3 CountryCode str131 IndicatorName str25 IndicatorCode double(YR1960 YR1961 YR1962 YR1963)
"Afghanistan" "AFG" "Mobile account (% age 15+)"                                                                        "WP15163_4.1"       . . . .
"Afghanistan" "AFG" "Children in employment, total (% of children ages 7-14)"                                           "SL.TLF.0714.ZS"    . . . .
"Afghanistan" "AFG" "Adjusted net savings, including particulate emission damage (% of GNI)"                            "NY.ADJ.SVNG.GN.ZS" . . . .
"Albania"     "ALB" "Educational attainment, at least completed post-secondary, population 25+, total (%) (cumulative)" "SE.SEC.CUAT.PO.ZS" . . . .
"Albania"     "ALB" "Delay in obtaining an electrical connection (days)"                                                "IC.ELC.DURS"       . . . .
"Albania"     "ALB" "Real effective exchange rate index (2010 = 100)"                                                   "PX.REX.REER"       . . . .
end

I want to do a reshape to the usual long format so, I am able to do a panel data analysis. I have tried many reshaping command variations, for instance the following reshaping code:

Code:

 reshape long YR, i(CountryName) j(Year)

However, I get the error that "variable id does not uniquely identify the observations". I have looked up corresponding help by typing "reshape error" and I do understand that my CountryName variable is non-unique and reshape would work on non-unique CountryName.
Now given that this is the data I have, where CountryName is repeated how do I convert it in long format with CountryName and Year variable so I can do panel analysis on it?

Any help by one of you would be really appreciated! Thank you!

Kind Regards,
Sultan

Last edited by Sultan Mehmood; 23 Jul 2017, 06:04.

Tags: None

William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

23 Jul 2017, 06:14

Thank you for the sample of your data, and for presenting it using the dataex command.

This is a bit confusing because in your sample data, the YR variables only contain missing values, so when I accomplished the reshape, the results were not so interesting to look at.

Nevertheless, what you want can be obtained by creating a made-up identifier equal to the observation number. The reshape command wants the identifier to be distinct so that if you ever need to reshape your data back to wide format, it knows which values came from which observations.

So the following code should point you in a helpful direction on this first step of making panel data.

Code:

generate idnum = _n reshape long YR, i(idnum) j(Year) drop idnum

However, once you have your data in long format, I think you will want to apply reshape wide to your data so that the various "indicators" which will constitute your analysis variables are combined onto a single observation for each combination of Country and year. Doing that will require some further work.
Comment
Sultan Mehmood

Join Date: Sep 2016

Posts: 35
#3

23 Jul 2017, 06:57

Dear Professor William,

Thank you very much for this. I have indeed created this 'dummy' indicator and that does give us data in long format though not by country-year and varaiables as you had pointed out.

More precisely, this is, what we get:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int Year str52 CountryName str3 CountryCode str131 IndicatorName str25 IndicatorCode double YR 2014 "Cuba" "CUB" "PPG, IDA (DIS, current US$)" "DT.DIS.MIDA.CD" . 2015 "Cuba" "CUB" "PPG, IDA (DIS, current US$)" "DT.DIS.MIDA.CD" . 2016 "Cuba" "CUB" "PPG, IDA (DIS, current US$)" "DT.DIS.MIDA.CD" . 1960 "Cuba" "CUB" "Agricultural machinery, tractors" "AG.AGR.TRAC.NO" . 1961 "Cuba" "CUB" "Agricultural machinery, tractors" "AG.AGR.TRAC.NO" 11000 1962 "Cuba" "CUB" "Agricultural machinery, tractors" "AG.AGR.TRAC.NO" 15000 end

Now, as you suggested we should do a change back to wide, to get the appropriate panel data with country-year and indicator variables in the columns. I have tried the following code to perform this:

Code:

reshape wide YR, i(CountryName) j(IndicatorName IndicatorCode)

But, I get the error "IndicatorName not unique within CountryName" which is true. I have also tried the simple command:

Code:

reshape wide

and

Code:

reshape wide YR, i(idnum) j(YR)

error: variable YR contains missing values

But it does not work since we have dropped the observation indicator. So, to get back the wide format we regenerate the variable _n but it of course does not give us the country-year and variable-wise long format I need.
Probably,, I have to use Variablename as 'Stub' not YR. That is, "reshape wide Variableid" but I am unable to translate that in a code that would serve the purposes here.

How might I proceed from here? (since i feel I am stuck).

Thanks again for your help so far.

Kind Regards,
Sultan

Last edited by Sultan Mehmood; 23 Jul 2017, 07:32.
Comment

Sultan Mehmood

Join Date: Sep 2016
Posts: 35

23 Jul 2017, 08:04

Another thing I tried based on Stata Journal Paper by Professor Baum and Professor Cox:
Baum, C. F., & Cox, N. J. (2007). Stata tip 45: Getting those data into shape. Stata Journal, 7(2), 268-271.

Code:

reshape long YR, i(CountryName IndicatorName)
rename _j year
reshape wide a b, i(CountryName year) j(IndicatorName) string

However, I still get the error: variable "IndicatorCode not constant within CountryName year" (first reshape is performed but second wide reshape is not).
Again after the first reshape my data looks like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str52 CountryName str131 IndicatorName int year str3 CountryCode str25 IndicatorCode double YR
"Cuba" "Net flows on external debt, public and publicly guaranteed (PPG) (NFL, current US$)" 2010 "CUB" "DT.NFL.DPPG.CD" 555
"Cuba" "Net flows on external debt, public and publicly guaranteed (PPG) (NFL, current US$)" 2011 "CUB" "DT.NFL.DPPG.CD" 786
"Cuba" "Net flows on external debt, public and publicly guaranteed (PPG) (NFL, current US$)" 2012 "CUB" "DT.NFL.DPPG.CD" .
"Cuba" "Net flows on external debt, public and publicly guaranteed (PPG) (NFL, current US$)" 2013 "CUB" "DT.NFL.DPPG.CD" .
"Cuba" "Net flows on external debt, public and publicly guaranteed (PPG) (NFL, current US$)" 2014 "CUB" "DT.NFL.DPPG.CD" .
"Cuba" "Net flows on external debt, public and publicly guaranteed (PPG) (NFL, current US$)" 2015 "CUB" "DT.NFL.DPPG.CD" .
"Cuba" "Net flows on external debt, public and publicly guaranteed (PPG) (NFL, current US$)" 2016 "CUB" "DT.NFL.DPPG.CD" .
"Cuba" "Net flows on external debt, short-term (NFL, current US$)"                           1960 "CUB" "DT.NFL.DSTC.CD" .
"Cuba" "Net flows on external debt, short-term (NFL, current US$)"                           1961 "CUB" "DT.NFL.DSTC.CD" .
"Cuba" "Net flows on external debt, short-term (NFL, current US$)"                           1962 "CUB" "DT.NFL.DSTC.CD" .
"Cuba" "Net flows on external debt, short-term (NFL, current US$)"                           1963 "CUB" "DT.NFL.DSTC.CD" 456
"Cuba" "Net flows on external debt, short-term (NFL, current US$)"                           1964 "CUB" "DT.NFL.DSTC.CD" 123
"Cuba" "Net flows on external debt, short-term (NFL, current US$)"                           1965 "CUB" "DT.NFL.DSTC.CD" .
end

Last edited by Sultan Mehmood; 23 Jul 2017, 08:16.

Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

23 Jul 2017, 18:24

Getting your data into a panel format is somewhat complicated. Below I've made up a very small simplified example based on yours, and then reshaped it (twice, first long and then wide) with the result being one observation per country/year, with all the variables (what your data calls indicators) for that country/year. So it starts with data that looks like this:

Code:

. list, clean abbreviate(16)

       CountryName   CountryCode   IndicatorName    IndicatorCode   YR2014   YR2015   YR2016  
  1.          Cuba           CUB        PPG, IDA   DT.DIS.MIDA.CD     1234     2345     3456  
  2.          Cuba           CUB    Agricultural   AG.AGR.TRAC.NO    10000    11000    15000

and ends with data that looks like this:

Code:

. list, clean abbreviate(16)

       CountryName   CountryCode   Year   AG_AGR_TRAC_NO   DT_DIS_MIDA_CD  
  1.          Cuba           CUB   2014            10000             1234  
  2.          Cuba           CUB   2015            11000             2345  
  3.          Cuba           CUB   2016            15000             3456

I encourage you to copy the code, paste it into the Do-file editor and execute it, and examine the intermediate outputs, using the help command for reference, so that you can understand what it does and adapt it to meet your needs.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 CountryName str3 CountryCode str20 IndicatorName str25 IndicatorCode double(YR2014 YR2015 YR2016)
"Cuba" "CUB" "PPG, IDA"     "DT.DIS.MIDA.CD"  1234  2345  3456
"Cuba" "CUB" "Agricultural" "AG.AGR.TRAC.NO" 10000 11000 15000
end
list, clean abbreviate(16)
generate varname = subinstr(IndicatorCode,".","_",.)
drop IndicatorName IndicatorCode
generate id = _n
list, clean abbreviate(16)
reshape long YR, i(id) j(Year)
list, clean abbreviate(16)
drop id
rename YR V_
reshape wide V_, i(CountryCode Year) j(varname) string
list, clean abbreviate(16)
order CountryName CountryCode Year
rename (V_*) (*)
list, clean abbreviate(16)

Comment

Sultan Mehmood

Join Date: Sep 2016

Posts: 35
#6

24 Jul 2017, 00:52

Thank you very very much!
Comment
Roger More

Join Date: Jul 2017

Posts: 59
#7

31 Jul 2017, 15:26

Hello Dr. William and Stata Listers ,

I hope all is fine with you.

I am doing practically the identical thing as done in this post (where my raw data is exactly the same structure).

I notice you dropped IndicatorName and only kept Indicator Code. Is there a way in this example where we can keep both variable or ideally have IndicatorName as being a label for IndicatorCode in this example (so it would be easy to understand what the indicator code actually means)?

Thanks.

Cheers,
Roger
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

31 Jul 2017, 18:21

There is no direct way that I know of. I would approach this by having Stata create and display label commands that I would then copy from the Stata Results window into the do-file that does the reshaping.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 CountryName str3 CountryCode str20 IndicatorName str25 IndicatorCode double(YR2014 YR2015 YR2016)
"Cuba" "CUB" "PPG, IDA"     "DT.DIS.MIDA.CD"  1234  2345  3456
"Cuba" "CUB" "Agricultural" "AG.AGR.TRAC.NO" 10000 11000 15000
end
generate varname = substr(subinstr(IndicatorCode,".","_",.),1,80)
generate command = `"label variable "' + varname + " `" + `"""' + IndicatorName + `"""' + "'"
egen distinct = tag(varname)
format command %-5s
list command if distinct, clean noobs

Code:

. list command if distinct, clean noobs

    command                                         
    label variable DT_DIS_MIDA_CD `"PPG, IDA"'      
    label variable AG_AGR_TRAC_NO `"Agricultural"'

Doubtless you are thinking, could Stata not somehow automatically run that code, rather than having to copy-and-paste it into a do-file? The problem is, you may ultimately want to modify some of the labels - like those that were shortened to 80 characters to adhere to Stata's length limit - so just make it easy on yourself by not trying to automate this process to the n-th degree.

Announcement

Problem with Reshaping repeated ID data

Comment

Comment

Comment

Comment

Comment

Comment

Comment