Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing long text data from Excel

    I have downloaded data from a survey where the first row of the Excel file (csv file to be specific) contains the variable names, and the second row is the raw question text. Some of those survey questions are very long (think multiple paragraphs of text) and Stata is having a hard time reading the resulting file--the variable names translate fine, but then the first 30 or so rows turn into broken up bits of text from survey questions entered as values in some columns, which also means everything gets uploaded as strings.

    I've tried a few things (like changing the delimiter, the row range, etc.) to no avail. Of course, an easy manual solution is to just delete the problem row directly in Excel before importing into Stata (in which case it uploads fine), but is there any other advanced (and more replicable) workaround while importing that could be useful?

  • #2
    Try reading the .csv file in with -import delimited- specifying the -rowrange(3)- and -varnames(1)- options so that Stata starts reading data just from the third row of the file. Assuming it can correctly find where that third row begins, you will then get the data and the variable names, but the problematic second row will be skipped. Now, I suppose that if the text in the second row is sufficiently pathological, Stata may not be able to correctly identify where the third row starts and then this method won't work. But it's worth a try.

    It's hard to know what else, if anything, you can do if that fails. But I might try saving the file from Excel as an actual Excel spreadsheet, i.e. .xlsx. Then import it into Stata with -import excel-. It may be that Excel will be able to parse that second line of the file appropriately, and -import excel- will recognize the structure that the .xlsx file imposes on it, whereas there is just so much that -import delimited- can do with wild text. Of course, this step also lacks self-documentation, so it may not really be any better for your purpose than just deleting that second row from the text file.

    Comment


    • #3
      Thanks, Clyde. It does seem like saving it as an xlsx file improves things, and even though it can't be documented I feel a little better about doing that than about going in and deleting rows from the csv.

      When using -import excel-, is there a way to simultaneously specify -firstrow- so that the first row becomes the variable names, and then skipping the 2nd and 3rd rows (which contain the long survey text)? When I try "cellrange(A3) firstrow" it seems to use the 3rd row as variable names.

      Comment


      • #4
        When using -import excel-, is there a way to simultaneously specify -firstrow- so that the first row becomes the variable names, and then skipping the 2nd and 3rd rows (which contain the long survey text)? When I try "cellrange(A3) firstrow" it seems to use the 3rd row as variable names.
        I don't know any way to do this.

        What about just -import excel ..., ...firstrow...- followed by -drop in 1- to get rid of that mangled second row? Or does -import excel- also produce unsuitable results when you omit the -cellrange(A3) option? If that's the case, I might try something like this:

        Code:
        import excel using my_spreadsheet.xlsx, firstrow clear
        unab vble_names: _all
        import excel using my_spreadsheet.xlsx, cellrange(A3) clear
        unab all_vars: _all
        rename (`all_vars') (`vble_names')
        That way you automate the naming of the variable names. If the spreadsheet is large, it may be slow reading it in twice like this, but you only have to do this once, and then save the result as a Stata data set, and work from that thereafter.

        Comment


        • #5
          If I omit -cellrange(A3) the mangled row is imported fairly cleanly--the main issue becomes that the mangled row of text leads to all variables being imported as strings, so I have to then convert all relevant variables back to numeric. Which is not at all the end of the world! Though I believe that your proposed solution to automate the variable names actually solves this, because the second import (with cellrange(A3)) should omit the text-only rows in a way that enables proper importing of the numeric variables. I will try that and hope for the best! Thanks for the help.

          Comment

          Working...
          X