My data changed after reshape. Anyone experienced this before?

Nietha Heng

Join Date: Mar 2019
Posts: 5

My data changed after reshape. Anyone experienced this before?

20 Jun 2019, 07:45

Hey folks. Something weird happened to me. I have a dataset:

bcode	bname	live99	live00	live01	live02	live03	live04	live05	live06	live07	live08	live09	live10	live11	live12	live13	live14	live15	live16	live17
E09~	London	53	53	53	57	62	58	.	.	.	.	.	.	.	.	.	.	.	.	.

I run this code:

Code:

reshape long live, i(bcode bname) j(year)

and the result returned this:

bcode	bname	year	live
E09000001	London	99	33
E09000001	London	00	33
E09000001	London	01	33
E09000001	London	02	5966
E09000001	London	03	33
E09000001	London	04	33
E09000001	London	05	.
E09000001	London	06	.
E09000001	London	07	.
E09000001	London	08	.
E09000001	London	09	.
E09000001	London	10	.
E09000001	London	11	.
E09000001	London	12	.
E09000001	London	13	.
E09000001	London	14	.
E09000001	London	15	.
E09000001	London	17	.

Have anyone experience this? I tried to restart STATA and the computer but nothing changed.

If it helps, following is the description of the dataset:

Code:

Contains data
  obs:            19                          
 vars:             4                          
 size:           703                          
--------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------------------------------------------
bcode           str9    %9s                   
bname           str22   %22s                  
year            int     %9.0g                 
live            long    %8.0g      v57        
--------------------------------------------------------------------------------------------------------------------
Sorted by: bcode  bname  year
     Note: Dataset has changed since last saved.

Tags: None

FernandoRios

Join Date: Apr 2014

Posts: 2430
#2

20 Jun 2019, 08:07

Hi Nietha
try the following:

Code:

reshape long live, i(bcode bname) j(year) string

HTH
Fernando
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#3

20 Jun 2019, 08:10

Likely because your 'livexx' variables are actually encoded or labelled variables. Difficult to diagnose exactly unless you provide a dataex example of your data. Please see https://www.statalist.org/forums/help#stata
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

20 Jun 2019, 08:38

If the "wide" data before your reshape represents the "original" data you read in or imported from some other source, then I will build on Jorrit's analysis and suggest that you found that the live99...live17 variables were read in as string variables and you used encode to convert them to numeric values. If so, that was a mistake. The output of help encode tells us

Do not use encode if varname contains numbers that merely happen to be stored as strings; instead, use generate newvar = real(varname) or destring; see real() or [D] destring.
Comment

Nietha Heng

Join Date: Mar 2019
Posts: 5

21 Jun 2019, 00:50

Thanks for the response. I tried Fernando's suggestion but it does not work. But Jorrit was right, the culprit showed up once I did the dataex. the result showed me this:

Code:

input str9 bcode str22 bname long(live1999 live2000 live2001 live2002 live2003 live2004 live2005 live2006 live2007 live2008 live2009 live2010 live2011 live2012 live2013 live2014 live2015 live2016 live2017)
"E09000001" "City of London" 33 33 33 32 33 33 . . . . . . . . . . . . .
end
label values live1999 v3
label def v3 33 "53", modify
label values live2000 v6
label def v6 33 "53", modify
label values live2001 v9
label def v9 33 "53", modify
label values live2002 v12
label def v12 32 "57", modify
label values live2003 v15
label def v15 33 "62", modify
label values live2004 v18
label def v18 33 "58", modify
label values live2005 v21
label values live2006 v24
label values live2007 v27
label values live2008 v30
label values live2009 v33
label values live2010 v36
label values live2011 v39
label values live2012 v42
label values live2013 v45
label values live2014 v48
label values live2015 v51
label values live2016 v54
label values live2017 v57

Now I am unsure about how to proceed, is it because of label? Should I drop the label? I tried to do

Code:

label drop _all

but it altered my number in the end, and I can't seem to get the original back.

And about William's suggestion, should I get back to the original dataset then? I checked my code and yes indeed I used encode previously because I tried destring but the result tells me that all my vars contain nonnumeric characters (although they didn't)

Last edited by Nietha Heng; 21 Jun 2019, 00:53. Reason: Added question to William

Comment

Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#6

21 Jun 2019, 01:04

You should use decode to go back to the original.

Code:

ren live* live_old_* foreach year of numlist 1999/2017{ capture decode live_old_`year', gen(live`year') } destring, replace

Afterwards, check to see if any of your years did not properly decode.
After you decode and destring, you can use your reshape code.
There may be a few more steps because of specifics of your data, but this is the best guess based on the one line of observations.

Edit:
Of course you could also not do this, and instead go back to your original data and not encode it.

And also:
if there is a problem with destring, you'll need to fix your data at that point. Stata tends to have better eyes than most humans, so in most cases where Stata says there are non-numeric characters, it is really true. You can find out most easily by sorting data on one problematic variable. The first or last observations will likely contain some text. Maybe n/a or a ?. Could also be commas/dots not imported correctly, but you can ask about that when you have your data back to where it is not encoded.

Last edited by Jorrit Gosens; 21 Jun 2019, 01:11.
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35432
#7

21 Jun 2019, 01:58

all my vars contain nonnumeric characters (although they didn't)

We need to focus on this. My experience is that Stata is always right on this call, although my bias is increased by personal association with destring.

Further, different encodings of different string variables are a recipe for disaster. Even your short data example shows that 33 sometimes comes from "53", sometimes from "58" and sometimes from "62". reshape can't possibly fix that kind of inconsistency, as in essence it just shuffles the data into a different layout.

In short, you didn't suffer from reshape perversely messing up your data. They were already messed up by unsuitable encoding.

As you fear, you need to go back to your original and work out why you have string variables in the first place.

Code:

destring

declining to act can arise for all kinds of reasons, including

* metadata, spreadsheet style, in the first few rows of an original file

* trailing gunk, ditto, such as side or footnote text

* string characters such as NA that look sensible to you but don't follow Stata's expectations

and several more. There is an overarching survey within https://journals.sagepub.com/doi/abs...867X1801800413 to which you may have access. You will have access to https://www.statalist.org/forums/for...iable-problems

The problemstring program posted there works like this. Here I just load the auto data and create a variable that would be fine to destring, except that there is one problematic value, which problemstring tells you about.

Code:

. sysuse auto, clear (1978 Automobile Data) . gen tocheck = cond(_n == 42, "NA", "666") . problemstring problematic values found: see help for destring tocheck tocheck | Freq. Percent Cum. ------------+----------------------------------- NA | 1 100.00 100.00 ------------+----------------------------------- Total | 1 100.00

NOTE: This took a long time to write, as I was doing something else in the middle. Jorrit Gosens made some of the same points in his post, which I have just seen.

Last edited by Nick Cox; 21 Jun 2019, 02:01.
1 like
Comment
Nietha Heng

Join Date: Mar 2019

Posts: 5
#8

21 Jun 2019, 03:30

Thanks guys! I seem to find the reason why destring didn't work in the first place, and why I used decode which created all the problems. So apparently somehow my values were separated by a space instead of comma for thousands so this is why STATA thinks that it contains the non numeric value. I easily went back to the original dataset and destring using this code:

Code:

forvalues i=99(1)17{ destring live`i', ignore(" ") replace }

I successfully changed the data type, and the reshape has been working wonderfully. Thanks for the dataex suggestion which helped me find the culprit and encode/destring which helped me to fix the problem.

You are right, Nick! STATA recognized my values as non-numeric because of the spaces!
Comment

Announcement

My data changed after reshape. Anyone experienced this before?

Comment

Comment

Comment

Comment

Comment

Comment

Comment