How to import a .txt file into Stata that has the first columns delimited by semicolon ";" and the last columns without any delimiter

Luana Goveia

Join Date: Dec 2015

Posts: 32
#1

How to import a .txt file into Stata that has the first columns delimited by semicolon ";" and the last columns without any delimiter

03 Apr 2016, 04:12

Hi guys,

I'm trying to import a .txt file into Stata. Please see how the .txt file look in the attachment.

My main struggle is that this file has the first columns delimited with ";" and the last columns without any delimiter. The first line is the name of the variables. The variables marked from A to G are all under "TX_Respostas". And that's this one I need to split up because they are actually separated variables.

So, I have tried:
insheet using TS_QUEST_ALUNO.TXT, delimiter(";")

But this did't work because the variables under "TX_Respostas" came as one single variable. I have also tried the command infix using a dictionary, but this didn't work because I couldn't get rid of the ";" and it came in the middle of the data, messing it up.

I also cannot save it to Excel and manually add demiliter there due to it very large size: more than 5 million observations.

So if you could help me on that...

Thanks.
Attached Files

Last edited by Luana Goveia; 03 Apr 2016, 04:14.
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35436
#2

03 Apr 2016, 05:04

Take 1 on your problem

There are many approaches. A good text editor would allow you to make delimiters consistent with a single command.

Here is an all-Stata approach that is crude but often effective. Read in the data as a single string variable, then work at it.

I created this minimal data file as minimal_example.txt

Code:

frog;toad;newt;dragon 42;24;12;13 3.14159 2.71828 1 0

Then consider these kinds of manipulation:

Code:

infix str data 1-244 using minimal_example.txt replace data = subinstr(data, ";", " ", .) local names = data[1] drop in 1 split data, destring drop data rename (data*) (`names')

Take 2 on your problem

On a second reading, the problem is not really one of importing data. You just want to split the last variable. But how do you want to split it? This is not clear to me. Is it that each character is to you a separate variable? What does it mean anyhow?

Last edited by Nick Cox; 03 Apr 2016, 05:12.
Comment
Luana Goveia

Join Date: Dec 2015

Posts: 32
#3

03 Apr 2016, 05:30

Yes, Nick Cox, I want to split the "TX_Respostas" variable. And yes, each character is a separate variable.

Last edited by Luana Goveia; 03 Apr 2016, 05:34.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35436
#4

03 Apr 2016, 05:41

So, that's not an import problem.

Here's a silly example of technique.

Code:

sysuse auto d make forval j = 1/18 { gen m`j' = substr(make, `j', 1) } edit m1-m18
1 like
Comment
wbuchanan

Join Date: Mar 2014

Posts: 1361
#5

03 Apr 2016, 06:47

Depending on the system you are using, you may also be able to use some command line tools to preprocess things a bit:

Code:

sed 's/;/,/g' originalFile.txt > originalSansSemicolons.txt import delimited originalSansSemicolons.txt, delim(",")

*nix-based operating systems have some really nice tools to stream data and do some basic manipulations. I've had to use the same kind of technique to clean up fixed-width and delimited files that contained unbalanced quotation marks to make it less painful when trying to get the data into Stata. The other option is to define 1 character wide variables where each of the semicolons exist in your dictionary file then drop them once you've read the data into Stata.
1 like
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#6

03 Apr 2016, 09:07

Here's another way to split a variable, character by character, using moss (from SSC; with Nick Cox). To install it, type in Stata's Command window:

Code:

ssc install moss

In regular expression patterns, the dot matches any character. The parentheses indicate the subexpression to return. So the pattern "(.)" will return each character in the variable.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str43 TX_Respostas "BBECCBABA*AAADBDCAFABACABBAAAAACDABDAAABAAA" "AECCDAACCADCBDBFABASDFJ.ASDDDAAAAAADDADAAAA" "QWERTYUIOPASDFGHJKLZXCVBNM01234567890.,/?[]" end moss TX_Respostas, match("(.)") regex
1 like
Comment

Announcement

How to import a .txt file into Stata that has the first columns delimited by semicolon ";" and the last columns without any delimiter

Comment

Comment

Comment

Comment

Comment