  • Reshape Command

    Hi all,

    I am having trouble converting my data to long format. I have used the following code, but the new Year variable generated does not correspond to the original Year variable in my dataset. I have around 14000 obs from 2018 and 4000 from 2019, but the new year variable splits the observations 50/50 between the two years, so observations are being mismatched. I would appreciate any advice you may have:

    rename YEAR YEAR_temp

    reshape long England_ Region_ Turnover_ RnD_ FemDirector_ FemOwner_ FamOwned_ NoEmployees_ Deprived_ EMDirector_ AI_ Competition_ Profit_ Age_ Sector_ StaffTraining_ LegalStatus_, i(SERIAL) j(Year_info) string

  • #2
    "I have around 14000 obs from 2018 and 4000 from 2019"
    does not make sense in a wide data set. In a wide data set, your list of variables will look like this:
    Code:
    SERIAL England_2018 England_2019 Region_2018 Region_2019 ... etc.
    So simply by virtue of the fact that the data are in wide layout to start, you inevitably have the same number of observations from 2018 and 2019. It may be that many of the *_2019 variables have missing values while the corresponding *_2018 values do not. But that's still the same number of observations.

    And after reshape, you should, indeed, see a 50/50 distribution of 2018 and 2019 in the Year variable that gets created. Of course, wherever there was a missing value to begin with, there will still be a missing value. So if, say, LegalStatus_2019 is one of the variables with a lot of missing data, then in the -reshape-d data set, running -tab LegalStatus Year- will show the 14000/4000 distribution in the number of non-missing values for LegalStatus.
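    Here is a minimal sketch of that point with toy data (variable names and values invented for illustration, not taken from your data set). The -reshape- produces a perfectly balanced Year variable, but restricting the tabulation to non-missing values recovers the imbalance:
    Code:
    clear
    input SERIAL Turnover_2018 Turnover_2019
    1 5 4
    2 8 .
    3 9 7
    end
    reshape long Turnover_, i(SERIAL) j(Year) string
    tab Year                           // 50/50 split, by construction
    tab Year if !missing(Turnover_)    // fewer 2019 rows: the missingness shows here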

    So assuming your starting data look as I say, I think the only thing wrong here is what you expected to find about the Year variable.

    If your starting data do not look the way I said, then please post back showing example data from your data set so we can see what is going on. Be sure to use the -dataex- command to do that. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    By the way, -dataex- will not be able to show all of the variables in that long list. I suggest you just show SERIAL and perhaps three 2018/2019 pairs of other variables, being sure to include at least one variable pair where you have a lot of missing data in 2019. So, perhaps something like
    Code:
    dataex SERIAL England_2018 England_2019 Sector_2018 Sector_2019 LegalStatus_2018 LegalStatus_2019
    (again where I'm using LegalStatus as a stand-in for some variable pair where much of the 2019 data is missing.)



    • #3
      "So assuming your starting data look as I say, I think the only thing wrong here is what you expected to find about the Year variable."
      It is indeed. Thank you for the clarification on the distribution of observations.

      My only confusion is why the new year variable (Year_info) generated in the reshaping process does not take the same values as the original variable (YEAR). I assumed there was a mismatch for that reason.

      Here's an example from a few observations in my dataset:

      Code:
      SERIAL     Year_info  YEAR
      150107905  2018       2018
      150107905  2019       2018
      150107907  2018       2018
      150107907  2019       2018
      150107908  2018       2018
      150107908  2019       2018
      150107953  2018       2018
      150107953  2019       2018
      150108003  2018       2018
      150108003  2019       2018
      150108032  2018       2018
      150108032  2019       2018
      150108066  2018       2018
      150108066  2019       2018

      From what I can see in the Data Editor, "Year_info" alternates between 2018 and 2019 for every observation.

      Here's a snippet from the command you had suggested in case it helps:
      Code:
      dataex SERIAL Turnover_2018 Turnover_2019 StaffTraining_2018 StaffTraining_2019 FamOwned_2018 FamOwned_2019
      150107905 5 4 . . 1 1
      150107907 8 6 . . 1 1
      150107908 8 6 . . 1 1
      150107953 9 7 0 0 2 2
      150108003 6 6 . . 1 1
      150108032 8 . . . 2 .
      150108066 4 3 . . 2 2
      150108069 7 . . . 1 .
      150108070 5 3 . . 1 1
      150108075 9 . . . 1 .
      150108078 7 . 0 . 1 .
      150108084 7 5 . . 1 1
      150108114 5 3 0 . 2 2
      150108142 8 6 . . 1 1
      150108150 6 . . . 1 .



      • #4
        "From what I can see in the Data Editor, 'Year_info' alternates between 2018 and 2019 for every observation."
        Yes, that's correct. That's exactly how -reshape long- creates the new variable you specify in the -j()- option. For each observation in the original wide data set, several new observations are created: one for each value of the -j()- variable. The values of the -j()- variable are precisely the "suffixes" that complete the variable names from the stubs that are listed in the first part of the -reshape- command. In your case, all your variables (except SERIAL and YEAR) occur twice in the original data, once with suffix 2018 and once with suffix 2019. So when you -reshape- to long, each observation becomes two observations, one with Year_info set to 2018 and one with it set to 2019.

        By contrast, the variable YEAR_temp, which does not appear anywhere in the -reshape- command, will simply be given its original value in each of the new observations created from the original observation.
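        A toy sketch of that pass-through behavior (invented values, just to show the mechanics): the stub named in -reshape- is split by the j() suffix, while an unlisted variable like YEAR_temp is copied unchanged into every new row.
        Code:
        clear
        input SERIAL YEAR_temp England_2018 England_2019
        101 2018 1 .
        102 2018 0 1
        end
        reshape long England_, i(SERIAL) j(Year_info) string
        list, sepby(SERIAL)
        Each SERIAL now contributes two rows, with Year_info alternating between "2018" and "2019", while YEAR_temp remains 2018 throughout.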
