Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing Excel file with odd variable/label setup

    Howdy all,

    I'm struggling to correctly import an xlsx file into stata. The file has variable labels in the first row of the file and variable names in the second row.

    If I just use the "firstrow" option when importing, it uses the xlsx labels as stata variable names and uses the xlsx variable names as the first instance of the data.

    If I instead use "firstrow cellrange(A2)", the xlsx variable names are correctly used as stata variable names, but now the xlsx labels are lost in the ether.

    How may I correctly import the file such that I can correctly use the second row as variable names without losing the variable labels in the first row? I would just switch the first and second rows manually, but there are dozens of files that I need to do this with and I've been asked to not touch the data outside of stata. Thanks all for your time


  • #2
    This example may point you in a useful direction.

    Click image for larger version

Name:	example.png
Views:	1
Size:	274.6 KB
ID:	1643263
    Code:
    // first read and save the variable labels
    import excel using example.xlsx, clear cellrange(a1:c1)
    describe
    list
    local i 0
    foreach v of varlist * {
        local lab`++i' = `v'[1]
    }
    // next read the data without the labels
    import excel using example.xlsx, clear cellrange(a2) firstrow
    describe
    // add the labels to the variables
    local i 0
    foreach v of varlist * {
        label variable `v' `"`lab`++i''"'
    }
    describe
    Code:
    . describe
    
    Contains data
     Observations:             2                  
        Variables:             3                  
    ------------------------------------------------------------
    Variable      Storage   Display    Value
        name         type    format    label      Variable label
    ------------------------------------------------------------
    id              str1    %9s                   a label
    x               byte    %10.0g                another label
    y               byte    %10.0g                third label
    ------------------------------------------------------------

    Comment


    • #3
      The simplest way I can think of to do this is:

      Code:
      clear*
      
      import excel using "Odd Variable Label Setup.xlsx", firstrow cellrange(A2)
      
      frame create labels
      frame change labels
      import excel using "Odd Variable Label Setup.xlsx"
      keep in 1
      local labels
      local i = 1
      foreach v of varlist _all {
          local label`i' = `v'[1]
          local ++i
      }
      frame change default
      local i = 1
      foreach v of varlist _all {
          label var `v' `"`label`i''"'
          local ++i
      }
      Added: Crossed with #2 which offers a variant of the same solution. His is simpler in that it does not use frames.
      Last edited by Clyde Schechter; 01 Jan 2022, 13:27.

      Comment


      • #4
        It appears this was answered while I was working on a solution. Here's a variation that uses frames.

        Code:
        sysuse auto, clear
        
        local myxl "mydata.xlsx"
        local myxlsheet "Sheet"
        
        /* create example file */
        export excel `"`myxl'"' in 1/1, sheet("`myxlsheet'") cell(A1) firstrow(varlab) replace
        export excel `"`myxl'"', sheet("`myxlsheet'", modify) cell(A2) firstrow(variable)
        drop _all
        
        /* import data in 2 steps */
        frame create Mylabs
        frame change Mylabs
        import excel `"`myxl'"', sheet("`myxlsheet'") cellrange(A1:L2) clear
        
        frame create Mydata
        frame change Mydata
        import excel `"`myxl'"', sheet("`myxlsheet'") cellrange(A3) clear
        
        qui ds
        local xlvars `r(varlist)'
        foreach v of varlist `xlvars' {
          frame Mylabs {
            local nextname "`=`v'[2]'"
            local nextlab "`=`v'[1]'"
          }
          label var `v' `"`nextlab'"'
          rename `v' `nextname'
        }
        frame drop Mylabs

        Comment


        • #5
          Importing twice may be slow if you have many workbooks and/or many variables - here is a solution which only imports once (using the fake data provided in #4):

          Code:
          import excel `"`myxl'"', sheet("`myxlsheet'") clear
          foreach var of varlist *{
              label variable `var' "`=`var'[1]'"
              rename `var' `=`var'[2]'
          }
          drop in 1/2
          destring *,replace
          Last edited by Ali Atia; 01 Jan 2022, 15:18.

          Comment


          • #6
            The code in #5 will solve the problem as stated in #1 taken literally. But compared to the others, it has a drawback. Given that the first row of the spreadsheet contains variable labels, a single import will cause everything to imported as a string variable. If everything in the data set really is string data, that's fine. But most spreadsheets have some numeric variables, even mostly numeric variables. So the solution in #5 would have to be followed up by additional code to properly convert them to numeric. (Just destringing them all will not do: as this will not deal with date variables.) Moreover, O.P. states that there are multiple spreadsheets that this code needs to apply to. If they all have the same complement of numeric and datetime variables, then writing that code one isn't that big a deal. But if they differ, then each one will require custom code to do that. The other solutions all allow Stata to recognize and handle string and date variables from Excel automatically.
            Last edited by Clyde Schechter; 01 Jan 2022, 15:24.

            Comment


            • #7
              Note that post #2 imports just one row of the worksheet, which seems unlikely to add significantly to the time spent importing the data that comprise the bulk of the worksheet.

              With that said, the first import excel command in post #2 should read
              Code:
              import excel using example.xlsx, clear cellrange(a1:c1) allstring
              so that if a label is entirely numeric (e.g. 0042) it will be preserved verbatim.

              Comment


              • #8
                William Lisowski makes an excellent point. The same problem arises with my code in #3, the difference being that in this case it is the second -import excel- command that needs the -allstring- option.

                Comment


                • #9
                  For my variation in #4, since I import variable labels and names in one pass, the -allstring- option wouldn't realistically be needed, though it couldn't hurt to include.

                  For all variations offered, if you wanted to be sure that legal Stata variable names are used, then they should all pass the name through -strtoname()- at a minimum. It would also not be too hard to make this into a program (or a do-file that accepts arguments) to make easier OP's need to repeat the steps.

                  Comment


                  • #10
                    Thanks so much, all! I look forward to trying this out today

                    Comment

                    Working...
                    X