Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Excel to STATA string variable problems

    Problem: I have an Excel datasheet with numeric variables that only show numeric values in Excel. When I import into STATA, paste into the data editor, or convert the Excel file-->SAS-->STATA, I still wind up with string variables. I have attempted to use the encode oldvar, gen(newvar) function as well as gen newvar=real(oldvar) function. Some of the numeric values appear in the new variable, but most of them are erased. I'm assuming something is wrong in my Excel datafile, but don't know what to do. I have tried reformatting the data values in Excel and everything appears to be numeric, but I'm still getting a STATA dataset full of string variables for numbers that I can't analyze.

  • #2
    Weird. Have you tied exporting from Excel as comma-separated values? Seems like you've tried some reasonable approaches, but exporting to .csv will ensure any strange formatting goes away and lets you see the raw data in its most raw form. Oh, and tab-delimited is a good approach as well.
    Last edited by ben earnhart; 14 Apr 2015, 18:06.

    Comment


    • #3
      MIke,

      I would suggest that you try to identify non-numeric characters, you can do it using the code below (which was taken from original discussion here):
      Code:
      destring number, generate (newno) force /*generate a new variable*/
      list number if newno>=. /*will show which have nonnumeric*/
      In order to know what's the problem, we would have to have a look at your spreadsheet but it's fairly common as Excel formats may carry characters that will force Stata to treat the imported variables as strings (empty spaces, dots and commas, etc.) There is a lot of info on the net how to prepare your data for import (example).
      Kind regards,
      Konrad
      Version: Stata/IC 13.1

      Comment


      • #4
        The commands (not functions) encode oldvar, gen(newvar)and gen newvar=real(oldvar) are likely to be quite wrong here.

        encode in particular is likely to produce pure garbage. Thus "1", "2", "3", "11", "111", ... would get sorted to "1", "11", "111", "2", "3", ... and then encoded as 1, 2, 3, 4, 5, ... and the numeric values would be taken literally thereafter. That's not what you want.

        The problem is that something in your variables (numeric as you think) is preventing Stata agreeing that they are numeric. It could be something small but systematic such as hyphens "-" for missing. I suggest reading the help for destring and learning to look at the results of

        Code:
        tab oldvar if missing(real(oldvar))
        to see what Stata is rebelling against. charlist (SSC) is also there for problematic cases.
        Last edited by Nick Cox; 15 Apr 2015, 04:31.

        Comment


        • #5
          Here's another possibility I have seen before: If there happens to be a "space" (blank) or other non-displayed character in one of the cells in some row following the last row of your data in Excel, and if you are importing data in such a way that Stata has to figure out for itself which row is the last row in Excel, Stata will believe that that row is actual data, and it will presume the column with that cell as a string variable.

          Regards, Mike

          Comment


          • #6
            Here's a token wrapper for my main suggestion.

            Code:
             
            *! 1.0.0 NJC 15 April 2015 
            program problemstring 
                version 8.2 
                syntax [varlist] [, MAXimum(numlist int >0) ] 
            
                foreach v of local varlist { 
                    capture confirm str variable `v' 
                    if _rc == 0 { 
                        local newvarlist `newvarlist' `v' 
                    } 
                } 
            
                local varlist `newvarlist' 
            
                if "`varlist'" == "" { 
                    di as err "no string variables specified" 
                    exit 100
                } 
            
                if "`maximum'" == "" local maximum = _N  
            
                local flag 0 
                foreach v of local varlist { 
            
                    qui count if missing(real(`v')) 
            
                    if r(N) > 0 & r(N) < `maximum' { 
                        if `flag' == 0 { 
                            di "problematic values found: " /// 
                            "see help for {help destring}" _n 
                        }
                        else di _n 
            
                        di "{title:`v'}" _n 
                        tab `v' if missing(real(`v')) 
                        local flag = 1 
                    } 
                }  
            end
            The idea is that you run problemstring on variables that should be numeric, but are string, and it will show you the values that can't be converted to numeric with information loss. By default it ignores variables whose values are all such, on the presumption that those variables are likely to be genuine string variables.

            Comment

            Working...
            X