Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to import a .txt file into Stata that has the first columns delimited by semicolon ";" and the last columns without any delimiter

    Hi guys,

    I'm trying to import a .txt file into Stata. Please see how the .txt file look in the attachment.

    My main struggle is that this file has the first columns delimited with ";" and the last columns without any delimiter. The first line is the name of the variables. The variables marked from A to G are all under "TX_Respostas". And that's this one I need to split up because they are actually separated variables.

    So, I have tried:
    insheet using TS_QUEST_ALUNO.TXT, delimiter(";")

    But this did't work because the variables under "TX_Respostas" came as one single variable. I have also tried the command infix using a dictionary, but this didn't work because I couldn't get rid of the ";" and it came in the middle of the data, messing it up.


    I also cannot save it to Excel and manually add demiliter there due to it very large size: more than 5 million observations.

    So if you could help me on that...

    Thanks.
    Attached Files
    Last edited by Luana Goveia; 03 Apr 2016, 04:14.

  • #2
    Take 1 on your problem

    There are many approaches. A good text editor would allow you to make delimiters consistent with a single command.

    Here is an all-Stata approach that is crude but often effective. Read in the data as a single string variable, then work at it.

    I created this minimal data file as minimal_example.txt

    Code:
    frog;toad;newt;dragon
    42;24;12;13
    3.14159 2.71828 1 0
    Then consider these kinds of manipulation:

    Code:
    infix str data 1-244 using minimal_example.txt
    replace data = subinstr(data, ";", " ", .)
    local names = data[1]
    drop in 1
    split data, destring
    drop data
    rename (data*) (`names')
    Take 2 on your problem

    On a second reading, the problem is not really one of importing data. You just want to split the last variable. But how do you want to split it? This is not clear to me. Is it that each character is to you a separate variable? What does it mean anyhow?


    Last edited by Nick Cox; 03 Apr 2016, 05:12.

    Comment


    • #3
      Yes, Nick Cox, I want to split the "TX_Respostas" variable. And yes, each character is a separate variable.
      Last edited by Luana Goveia; 03 Apr 2016, 05:34.

      Comment


      • #4
        So, that's not an import problem.

        Here's a silly example of technique.

        Code:
        sysuse auto
        d make
        forval j = 1/18 {
             gen m`j' = substr(make, `j', 1)
        }
        edit m1-m18

        Comment


        • #5
          Depending on the system you are using, you may also be able to use some command line tools to preprocess things a bit:

          Code:
          sed 's/;/,/g' originalFile.txt > originalSansSemicolons.txt
          import delimited originalSansSemicolons.txt, delim(",")
          *nix-based operating systems have some really nice tools to stream data and do some basic manipulations. I've had to use the same kind of technique to clean up fixed-width and delimited files that contained unbalanced quotation marks to make it less painful when trying to get the data into Stata. The other option is to define 1 character wide variables where each of the semicolons exist in your dictionary file then drop them once you've read the data into Stata.

          Comment


          • #6
            Here's another way to split a variable, character by character, using moss (from SSC; with Nick Cox). To install it, type in Stata's Command window:

            Code:
            ssc install moss
            In regular expression patterns, the dot matches any character. The parentheses indicate the subexpression to return. So the pattern "(.)" will return each character in the variable.

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str43 TX_Respostas
            "BBECCBABA*AAADBDCAFABACABBAAAAACDABDAAABAAA"
            "AECCDAACCADCBDBFABASDFJ.ASDDDAAAAAADDADAAAA"
            "QWERTYUIOPASDFGHJKLZXCVBNM01234567890.,/?[]"
            end
            
            moss TX_Respostas, match("(.)") regex

            Comment

            Working...
            X