Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • My data changed after reshape. Anyone experienced this before?

    Hey folks. Something weird happened to me. I have a dataset:
    bcode bname live99 live00 live01 live02 live03 live04 live05 live06 live07 live08 live09 live10 live11 live12 live13 live14 live15 live16 live17
    E09~ London 53 53 53 57 62 58 . . . . . . . . . . . . .
    I run this code:
    Code:
    reshape long live, i(bcode bname) j(year)
    and the result returned this:
    bcode bname year live
    E09000001 London 99 33
    E09000001 London 00 33
    E09000001 London 01 33
    E09000001 London 02 5966
    E09000001 London 03 33
    E09000001 London 04 33
    E09000001 London 05 .
    E09000001 London 06 .
    E09000001 London 07 .
    E09000001 London 08 .
    E09000001 London 09 .
    E09000001 London 10 .
    E09000001 London 11 .
    E09000001 London 12 .
    E09000001 London 13 .
    E09000001 London 14 .
    E09000001 London 15 .
    E09000001 London 17 .
    Have anyone experience this? I tried to restart STATA and the computer but nothing changed.

    If it helps, following is the description of the dataset:
    Code:
    Contains data
      obs:            19                          
     vars:             4                          
     size:           703                          
    --------------------------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    --------------------------------------------------------------------------------------------------------------------
    bcode           str9    %9s                   
    bname           str22   %22s                  
    year            int     %9.0g                 
    live            long    %8.0g      v57        
    --------------------------------------------------------------------------------------------------------------------
    Sorted by: bcode  bname  year
         Note: Dataset has changed since last saved.

  • #2
    Hi Nietha
    try the following:
    Code:
    reshape long live, i(bcode bname) j(year) string
    HTH
    Fernando

    Comment


    • #3
      Likely because your 'livexx' variables are actually encoded or labelled variables. Difficult to diagnose exactly unless you provide a dataex example of your data. Please see https://www.statalist.org/forums/help#stata

      Comment


      • #4
        If the "wide" data before your reshape represents the "original" data you read in or imported from some other source, then I will build on Jorrit's analysis and suggest that you found that the live99...live17 variables were read in as string variables and you used encode to convert them to numeric values. If so, that was a mistake. The output of help encode tells us

        Do not use encode if varname contains numbers that merely happen to be stored as strings; instead, use generate newvar = real(varname) or destring; see real() or [D] destring.

        Comment


        • #5
          Thanks for the response. I tried Fernando's suggestion but it does not work. But Jorrit was right, the culprit showed up once I did the dataex. the result showed me this:
          Code:
          input str9 bcode str22 bname long(live1999 live2000 live2001 live2002 live2003 live2004 live2005 live2006 live2007 live2008 live2009 live2010 live2011 live2012 live2013 live2014 live2015 live2016 live2017)
          "E09000001" "City of London" 33 33 33 32 33 33 . . . . . . . . . . . . .
          end
          label values live1999 v3
          label def v3 33 "53", modify
          label values live2000 v6
          label def v6 33 "53", modify
          label values live2001 v9
          label def v9 33 "53", modify
          label values live2002 v12
          label def v12 32 "57", modify
          label values live2003 v15
          label def v15 33 "62", modify
          label values live2004 v18
          label def v18 33 "58", modify
          label values live2005 v21
          label values live2006 v24
          label values live2007 v27
          label values live2008 v30
          label values live2009 v33
          label values live2010 v36
          label values live2011 v39
          label values live2012 v42
          label values live2013 v45
          label values live2014 v48
          label values live2015 v51
          label values live2016 v54
          label values live2017 v57
          Now I am unsure about how to proceed, is it because of label? Should I drop the label? I tried to do
          Code:
          label drop _all
          but it altered my number in the end, and I can't seem to get the original back.

          And about William's suggestion, should I get back to the original dataset then? I checked my code and yes indeed I used encode previously because I tried destring but the result tells me that all my vars contain nonnumeric characters (although they didn't)
          Last edited by Nietha Heng; 21 Jun 2019, 00:53. Reason: Added question to William

          Comment


          • #6
            You should use decode to go back to the original.

            Code:
            ren live* live_old_*
            foreach year of numlist 1999/2017{
            capture decode live_old_`year', gen(live`year')
            }
            destring, replace
            Afterwards, check to see if any of your years did not properly decode.
            After you decode and destring, you can use your reshape code.
            There may be a few more steps because of specifics of your data, but this is the best guess based on the one line of observations.

            Edit:
            Of course you could also not do this, and instead go back to your original data and not encode it.

            And also:
            if there is a problem with destring, you'll need to fix your data at that point. Stata tends to have better eyes than most humans, so in most cases where Stata says there are non-numeric characters, it is really true. You can find out most easily by sorting data on one problematic variable. The first or last observations will likely contain some text. Maybe n/a or a ?. Could also be commas/dots not imported correctly, but you can ask about that when you have your data back to where it is not encoded.
            Last edited by Jorrit Gosens; 21 Jun 2019, 01:11.

            Comment


            • #7
              all my vars contain nonnumeric characters (although they didn't)
              We need to focus on this. My experience is that Stata is always right on this call, although my bias is increased by personal association with destring.

              Further, different encodings of different string variables are a recipe for disaster. Even your short data example shows that 33 sometimes comes from "53", sometimes from "58" and sometimes from "62". reshape can't possibly fix that kind of inconsistency, as in essence it just shuffles the data into a different layout.

              In short, you didn't suffer from reshape perversely messing up your data. They were already messed up by unsuitable encoding.

              As you fear, you need to go back to your original and work out why you have string variables in the first place.

              Code:
              destring
              declining to act can arise for all kinds of reasons, including

              * metadata, spreadsheet style, in the first few rows of an original file

              * trailing gunk, ditto, such as side or footnote text

              * string characters such as NA that look sensible to you but don't follow Stata's expectations

              and several more. There is an overarching survey within https://journals.sagepub.com/doi/abs...867X1801800413 to which you may have access. You will have access to https://www.statalist.org/forums/for...iable-problems

              The problemstring program posted there works like this. Here I just load the auto data and create a variable that would be fine to destring, except that there is one problematic value, which problemstring tells you about.


              Code:
              . sysuse auto, clear
              (1978 Automobile Data)
              
              . gen tocheck = cond(_n == 42, "NA", "666")
              
              . problemstring
              problematic values found: see help for destring
              
              tocheck
              
              
                  tocheck |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                       NA |          1      100.00      100.00
              ------------+-----------------------------------
                    Total |          1      100.00

              NOTE: This took a long time to write, as I was doing something else in the middle. Jorrit Gosens made some of the same points in his post, which I have just seen.
              Last edited by Nick Cox; 21 Jun 2019, 02:01.

              Comment


              • #8
                Thanks guys! I seem to find the reason why destring didn't work in the first place, and why I used decode which created all the problems. So apparently somehow my values were separated by a space instead of comma for thousands so this is why STATA thinks that it contains the non numeric value. I easily went back to the original dataset and destring using this code:
                Code:
                forvalues i=99(1)17{
                 destring live`i', ignore(" ") replace
                }
                I successfully changed the data type, and the reshape has been working wonderfully. Thanks for the dataex suggestion which helped me find the culprit and encode/destring which helped me to fix the problem.

                You are right, Nick! STATA recognized my values as non-numeric because of the spaces!

                Comment

                Working...
                X