Reshape DataSet: Converting from Wide to Long Format

Camila Caraballo

Join Date: Mar 2024
Posts: 2

Reshape DataSet: Converting from Wide to Long Format

10 Mar 2024, 18:37

I have a dataset with variables such as "CountryName," "SeriesName," and individual columns for each year from 2008 to 2022 (e.g., "YR2008," "YR2009," ...). The data is currently in a wide format, and I want to convert it into a long format where each row represents a unique combination of country, series, and year.

Here is the dataex

Code:

input str48 CountryName str3 CountryCode str62 SeriesName double(YR2008 YR2009)
"Peru" "PER" "Exports of goods and services (% of GDP)" 29.727630209883788 26.442052739027815
"Peru" "PER" "General government final consumption expenditure (% of GDP)" 10.370861790830661 11.466355995306959
"Peru" "PER" "Gross capital formation (% of GDP)" 26.178345935433022 19.978677979793538
"Peru" "PER" "Households and NPISHs final consumption expenditure (% of GDP)" 62.42929924387401 63.782790162195724
"Peru" "PER" "Imports of goods and services (% of GDP)" 28.70613718002149 21.66987687632404
"Chile" "CHL" "Exports of goods and services (% of GDP)" 41.16256079094571 37.11563479541047
"Chile" "CHL" "General government final consumption expenditure (% of GDP)" 11.264929190094698 12.868527794916263
"Chile" "CHL" "Gross capital formation (% of GDP)" 28.89694247135492 22.920007923062457
"Chile" "CHL" "Households and NPISHs final consumption expenditure (% of GDP)" 58.18963299683485 56.666225654931644
"Chile" "CHL" "Imports of goods and services (% of GDP)" 39.51406544923018 29.570396167280666
"Spain" "ESP" "Exports of goods and services (% of GDP)" 25.623929174316228 23.061694174725503
"Spain" "ESP" "General government final consumption expenditure (% of GDP)" 18.823098921085386 20.63969446088787
"Spain" "ESP" "Gross capital formation (% of GDP)" 28.454559137517226 23.3033423951416
"Spain" "ESP" "Households and NPISHs final consumption expenditure (% of GDP)" 57.457813636449664 56.928449121547
"Spain" "ESP" "Imports of goods and services (% of GDP)" 30.359400869368503 23.93318015230197
"Argentina" "ARG" "Exports of goods and services (% of GDP)" 22.060900382624236 19.56098505608181
"Argentina" "ARG" "General government final consumption expenditure (% of GDP)" 13.633513259563115 15.90389674688397
"Argentina" "ARG" "Gross capital formation (% of GDP)" 19.572775877710622 16.053112359211212
"Argentina" "ARG" "Households and NPISHs final consumption expenditure (% of GDP)" 63.074583476516025 62.97814768722907
"Argentina" "ARG" "Imports of goods and services (% of GDP)" 18.341772996413997 14.49614184940606
"" "" "" . .
"" "" "" . .
"" "" "" . .

*using this code:
reshape long YR2008-YR2022, i(CountryName SeriesName) j(Year)

When I try to traspose it with this code, it appears an error that says the id variable does not uniquely identify the observations.

It would be really helpful if someone can identify what i am doing wrong,

Many many thanks!

Last edited by Camila Caraballo; 10 Mar 2024, 18:41.

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 29795
#2

10 Mar 2024, 18:56

Within your example data, the source of the error message you are getting is those three final observations where CountryName and SeriesName are all identical. To use -reshape long-, the -i()- variables must uniquely identify observations. That is, there must be no duplicates. So, at the very least, you have to -drop- all of those observations where CountryName and SeriesName are empty.

After you fix that, the question naturally arises whether you have other situations where there is a duplication of CountryName and SeriesName. It is easy enough to identify whether that is the case:

Code:

duplicates tag CountryName SeriesName, gen(flag) browse if flag

Stata will then show you any such sets of observations, if there are any. And if there are, you will need to figure out why they are there and fix that.

Once you have cleaned up the data set so it is usable in this way, what you want to do faces an obstacle. You can't make a variable name out of SeriesName because the values of SeriesName are too long (max allowable is 32 characters) and contain characters that are not admissible in variable names (only letters, digits, and _ are allowed.) So that has to be fixed up first. Here I have done it by using the -strtoname()- function, which takes care of illegal characters, and then truncated the result of that to 30 characters, which enables it to fit on after the two character YR and stay within the limits.

Then, you need two -reshape-s to get to your goal.

Code:

replace SeriesName = substr(strtoname(SeriesName), 1, 30) reshape long YR, i(CountryName SeriesName) j(Year) reshape wide YR, i(CountryName Year) j(SeriesName) string
Comment
Camila Caraballo

Join Date: Mar 2024

Posts: 2
#3

10 Mar 2024, 19:44

The information you provided about the variable name length and special characters solved my issue when performing the transformation.

I appreciate your prompt response to my question!!
Comment

Announcement

Reshape DataSet: Converting from Wide to Long Format

Comment

Comment