Reshaping variable from wide to long and joining it to a pre-existing long variable

Zach Goldberg

Join Date: Jul 2017
Posts: 184

Reshaping variable from wide to long and joining it to a pre-existing long variable

19 Nov 2019, 19:30

Greetings,

I'm running Stata 15.1 on OSX. I suspect the subject line was confusing so allow me to explain: I have a panel dataset where most of the variables have already been shaped from wide to long. One of the variables, 'twitter_', is missing data for a specific year (2013). I subsequently realized that one of the wide variables ('twitter_ever2013'), which was only measured in 2013, could be used to fill in the data for this year. The question is how to reshape 'twitter_ever2013' from wide to long and merge it with 'twitter_use'. Is this possible or do I have to start with a pre-shaped dataset (i.e. from scratch)?

Here is a sample of the data I'm currently looking at:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(twitter_ twitter_ever2013) int year double caseid
. . 2012 989870
. . 2014 989870
. . 2015 989870
. . 2017 989870
0 0 2012 989871
. 0 2014 989871
. 0 2015 989871
. 0 2017 989871
. . 2012 989872
. . 2014 989872
. . 2015 989872
. . 2017 989872
. . 2012 989873
. . 2014 989873
. . 2015 989873
. . 2017 989873
0 0 2012 989874
. 0 2014 989874
. 0 2015 989874
. 0 2017 989874
0 0 2012 989875
0 0 2014 989875
0 0 2015 989875
0 0 2017 989875
0 0 2012 989876
0 0 2014 989876
0 0 2015 989876
0 0 2017 989876
0 0 2012 989877
0 0 2014 989877
0 0 2015 989877
0 0 2017 989877
0 0 2012 989878
0 0 2014 989878
0 0 2015 989878
. 0 2017 989878
. . 2012 989879
. . 2014 989879
. . 2015 989879
. . 2017 989879
0 0 2012 989880
0 0 2014 989880
0 0 2015 989880
0 0 2017 989880
0 1 2012 989881
0 1 2014 989881
. 1 2015 989881
. 1 2017 989881
. 0 2012 989882
. 0 2014 989882
. 0 2015 989882
. 0 2017 989882
0 0 2012 989883
1 0 2014 989883
1 0 2015 989883
0 0 2017 989883
0 0 2012 989884
0 0 2014 989884
0 0 2015 989884
0 0 2017 989884
. . 2012 989885
. . 2014 989885
. . 2015 989885
. . 2017 989885
1 0 2012 989886
1 0 2014 989886
. 0 2015 989886
1 0 2017 989886
. . 2012 989887
. . 2014 989887
. . 2015 989887
. . 2017 989887
0 0 2012 989888
0 0 2014 989888
0 0 2015 989888
0 0 2017 989888
1 1 2012 989889
1 1 2014 989889
1 1 2015 989889
0 1 2017 989889
1 1 2012 989890
. 1 2014 989890
. 1 2015 989890
. 1 2017 989890
. . 2012 989891
. . 2014 989891
. . 2015 989891
. . 2017 989891
0 0 2012 989892
0 0 2014 989892
0 0 2015 989892
1 0 2017 989892
0 0 2012 989893
0 0 2014 989893
0 0 2015 989893
0 0 2017 989893
0 0 2012 989894
. 0 2014 989894
. 0 2015 989894
. 0 2017 989894
end
label values twitter_ TECH5S67
label def TECH5S67 1 "yes", modify

Thanks in advance for your help!

Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

19 Nov 2019, 21:39

Code:

// FIRST VERIFY THAT twitter_ever2013 IS CONSISTENT FOR ALL OBSERVATIONS OF
// A GIVEN CASEID
by caseid (twitter_ever2013), sort: assert twitter_ever2013[1] == twitter_ever2013[_N]

// CREATE A DATA SET WITH JUST THE 2013 INFORMATION
tempfile original
save `original'
keep caseid twitter_ever2013
rename twitter_ever2013 twitter_
by caseid, sort: keep if _n == 1
gen int year = 2013

// PUT THIS BACK WITH THE ORIGINAL DATA
append using `original'
sort caseid year
drop twitter_ever2013

Comment

Zach Goldberg

Join Date: Jul 2017

Posts: 184
#3

19 Nov 2019, 21:50

Thanks Clyde! Did you mean 'merge' as opposed to 'append'? (otherwise, I don't see how the variable will be linked with the others)
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

19 Nov 2019, 22:15

No, I mean -append-. Try the code and you will see that it does what you ask. The variable gets "linked" with the others because it gets -rename-d from twitter_ever2013 to twitter_ shortly before it is -appended-. N.B.: You could do -merge 1:1 caseid year using `original'- instead of -append using `original'- and the result would be the same (except for the creation of the _merge variable), but -merge- is just a slow way to do -append- when there is no overlap between the master and using data sets.
Comment

Zach Goldberg

Join Date: Jul 2017
Posts: 184

19 Nov 2019, 22:46

Hey Clyde, I did as you suggested:

Code:

by caseid (twitter_ever2013), sort: assert twitter_ever2013[1] == twitter_ever2013[_N]
tempfile original
save `original'
keep caseid twitter_ever2013
rename twitter_ever2013 twitter_
by caseid, sort: keep if _n == 1
gen int year = 2013
gen int month = 6
append using `original'
sort caseid year month

However, when I try to correlate the appended variable with other variables, I get a 'no observations' notice:

Code:

.
spearman twitter_ ideo7_ if year==2013
no observations
r(2000);

Here is the data that I attempted to correlate in the above:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(ideo7_ twitter_ caseid) float(year month)
. . 989870 2013  1
. . 989870 2013  2
. . 989870 2013  3
. . 989870 2013  4
. . 989870 2013  5
. . 989870 2013  6
. . 989870 2013  6
. . 989870 2013  7
. . 989870 2013  8
. . 989870 2013  9
. . 989870 2013 10
. . 989870 2013 11
. . 989870 2013 12
. . 989871 2013  1
. . 989871 2013  2
. . 989871 2013  3
. . 989871 2013  4
. . 989871 2013  5
. . 989871 2013  6
. 0 989871 2013  6
. . 989871 2013  7
. . 989871 2013  8
. . 989871 2013  9
. . 989871 2013 10
6 . 989871 2013 11
6 . 989871 2013 12
. . 989872 2013  1
. . 989872 2013  2
. . 989872 2013  3
. . 989872 2013  4
. . 989872 2013  5
. . 989872 2013  6
. . 989872 2013  6
. . 989872 2013  7
. . 989872 2013  8
. . 989872 2013  9
. . 989872 2013 10
. . 989872 2013 11
. . 989872 2013 12
. . 989873 2013  1
. . 989873 2013  2
. . 989873 2013  3
. . 989873 2013  4
. . 989873 2013  5
. . 989873 2013  6
. . 989873 2013  6
. . 989873 2013  7
. . 989873 2013  8
. . 989873 2013  9
. . 989873 2013 10
. . 989873 2013 11
. . 989873 2013 12
. . 989874 2013  1
. . 989874 2013  2
. . 989874 2013  3
. . 989874 2013  4
. . 989874 2013  5
. 0 989874 2013  6
. . 989874 2013  6
. . 989874 2013  7
. . 989874 2013  8
. . 989874 2013  9
. . 989874 2013 10
2 . 989874 2013 11
2 . 989874 2013 12
. . 989875 2013  1
. . 989875 2013  2
. . 989875 2013  3
. . 989875 2013  4
. . 989875 2013  5
. . 989875 2013  6
. 0 989875 2013  6
. . 989875 2013  7
. . 989875 2013  8
. . 989875 2013  9
. . 989875 2013 10
5 . 989875 2013 11
4 . 989875 2013 12
. . 989876 2013  1
. . 989876 2013  2
. . 989876 2013  3
. . 989876 2013  4
. . 989876 2013  5
. . 989876 2013  6
. 0 989876 2013  6
. . 989876 2013  7
. . 989876 2013  8
. . 989876 2013  9
. . 989876 2013 10
4 . 989876 2013 11
4 . 989876 2013 12
. . 989877 2013  1
. . 989877 2013  2
. . 989877 2013  3
. . 989877 2013  4
. . 989877 2013  5
. 0 989877 2013  6
. . 989877 2013  6
. . 989877 2013  7
. . 989877 2013  8
end
label values ideo7_ IDEOL1S61
label def IDEOL1S61 2 "Very liberal", modify
label def IDEOL1S61 4 "Slightly liberal", modify
label def IDEOL1S61 5 "Moderate", modify
label def IDEOL1S61 6 "Slightly conservative", modify

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35698
#6

20 Nov 2019, 00:57

That’s right. You have no observations in which both variables are not missing.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

20 Nov 2019, 05:57

I note that the variable ideo7_ was not included in the sample data in post #1. That omission likely led Clyde to a solution that works for your sample data but does not work for your real data. Perhaps the merge he discussed in post #4 will solve your problem.
1 like
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#8

20 Nov 2019, 10:22

To William Lisowski's insightful comment I will add that the original data example also did not contain any observations with year == 2013, and the way the question was posed led me to think that the only way any 2013 data would be found is in the variable twitter_ever2013. For both the reasons cited in #7 and this, it is now clear that the -append- command is not suitable for the actual data and should be replaced by -merge 1:1 caseid year using `original'-. Given that there is in fact 2013 data already in the original data, this will in fact be a true -merge- and not just a kludgy way to -append-.

I also notice that the new data example has not only year but month as a time variable. So it is going to be necessary to carry that month variable along in the code, and it looks like it also needs to be included as a key variable in the final -merge- as well.

If these general instructions don't enable you to solve the problem, please post back with an example of your data that truly represents what you have. By oversimplifying the example you have steered things towards an inadequate solution. Give a realistic example.

All of that said, if the example shown in #5 is representative, none of this will solve your correlation problem because it still appears that when mergeing the data and taking the month variable into account, there still will be no observation for which both of the variables in the correlation are non-missing. You have twitter_ in some months and ideo_7 in different months, and there is no way to make a correlation out of that arrangement.

Last edited by Clyde Schechter; 20 Nov 2019, 10:28.
1 like
Comment
Zach Goldberg

Join Date: Jul 2017

Posts: 184
#9

20 Nov 2019, 13:16

Hey Clyde,

The 'merge' approach worked. I do have one final if unrelated question: is the 'collapse' command the best or only way to shift back and forth between using annual and monthly data? Specifically, I'd like run a cross-lagged time series regression, where t=year. However, I have two time variables (as shown above): month and year. Thus when I go to tsset my data, I get the 'repeated time values within panel' notice:

Code:

. tsset caseid year repeated time values within panel r(451);

Thus, to return to my question, is collapse the best way to handle this (by handle I mean prepare the data for a time series regression where t=year).

Thanks again!

-Zach
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#10

20 Nov 2019, 13:28

So there are really two questions here. The first is whether you actually need to -tsset- your data, and then, if the answer is yes, the next question is how to go about it.

I'm not sure what a "cross-lagged" time series regression means, but if you need to calculate lagged or leading values (or other time series opreators: see -help tsvarlist-) then, yes, you will need to set a time series variable. (If you do not need to use time-series operators, nor autoregressive structures, then you can just -xtset caseid- and forget about a time variable.)

The obstacle to -tsset-ing your data you have is that year is not the appropriate time variable because your dates are actually monthly. So the first step is to create a real Stata internal format monthly date variable:

Code:

gen mdate = ym(year, month) format mdate %tm

Then you can -tsset caseid mdate- and Stata will not complain (unless you also have repeated observations for the same caseid and monthly datae).

Now, note that when you do this, if you use Stata's lag operator, L.x will refer to the value of x in the preceding month, not year. If you need the value from one year ago, you will need to refer to that as L12.x.

Note: This question is off the topic of the thread. While it is easy to think of these threads as dialogs between questioners and responders, they are also read by others who come to the Forum and search for topics they need information about. When the content of the thread strays from the topic in its title, the time and effort of those who search for the title topic is wasted reading irrelevant material, and those who are looking for information about the off-title topic will be unable to find it with the Forum search engine. So, in the future, if you are changing the subject, please start a new thread.
Comment
Zach Goldberg

Join Date: Jul 2017

Posts: 184
#11

20 Nov 2019, 14:43

Hey Clyde,

Thanks for your reply. I hear your point about the question being off-topic (and I will ensure in the future that I keep everything on topic). At this point, I really only have one final question in regards to the post above: if I use L12.x, wouldn't L12.x still correspond to a specific month as opposed to the year as a whole?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#12

20 Nov 2019, 14:46

Yes, L12.x would correspond to the same month in the preceding year. If you want to aggregate everything up to the level of the year, then you should use -collapse … , by(caseid year)- [with whatever statistics, means, medians, mins, maxes, etc. are appropriate for aggregating these particular variables], and then you can -xtset caseid year-. After that L.x would refer to the previous year's (aggregated) value of x.
Comment
Zach Goldberg

Join Date: Jul 2017

Posts: 184
#13

20 Nov 2019, 14:46

...And if so, is it best to return to the original [wide] dataset, average all the monthly variables (e.g. twitter_52012, twitter_92012, twitter_122012) together to create annual variables (e.g. twitter_2012), and thereafter reshape to long?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#14

20 Nov 2019, 14:53

If your data actually look like what you showed in #5, they are already in long layout. It would be a waste of time (a lot of time, by the way, if your data set is large) to -reshape- back to wide and calculate rowmeans and then return to long layout. -collapse_ will get you your sums, or means, or whatever, and save you the time and trouble of two -reshape-s.

Code:

collapse (mean) twitter_ ideo7_, by(caseid year)

does it all in one line.
Comment
Zach Goldberg

Join Date: Jul 2017

Posts: 184
#15

20 Nov 2019, 15:06

Got it. I know that everything not included in the line is discarded. So if I need my sampling weights, do I collapse them as well? (probably a stupid question, but just wanted to clarify)
Comment

Announcement

Reshaping variable from wide to long and joining it to a pre-existing long variable

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment