Writing .dta into csv is messing up the formatting: how to correct this?

Titir Bhattacharya

Join Date: Mar 2019

Posts: 226
#1

Writing .dta into csv is messing up the formatting: how to correct this?

17 Jul 2023, 11:45

Hi,

I just cleaned a data in stata and need to write it into .csv, but the data in .csv becomes all weird in terms of formatting.

This is the code I use to export:

Code:

outsheet using "$data/data.csv", comma

This worked better than

Code:

export delimited using "$data/data.csv", datafmt replace

where the columns and the entries were all messed up and on top of one another.

However, in the data I obtain from outsheet, I still need to adjust the column width to accommodate values. I cannot do that by opening the csv because (1) the data is too large to fully load in excel and if i save it, I'd be losing observations, and (2) I need to send this csv to some agency for them to perform some linkage and they won't accept the data unless all the columns are of appropriate width.

Is there any way to adjust the column widths while exporting/outsheet -ing to csv from stata? Or any other solution to my problem?

Thanks and I await your response!
Tags: None
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1400
#2

17 Jul 2023, 12:01

This is hard to understand without a data extract that reproduces the problem. What do you mean "the columns and the entries were all messed up and on top of one another"? A csv file is a plain text file that can be opened in any text editor. Please see what it looks like in a simple text editor, and that should help you diagnose whether the problem is in the CSV file itself, or in something Excel is doing while opening it.
1 like
Comment
Titir Bhattacharya

Join Date: Mar 2019

Posts: 226
#3

17 Jul 2023, 12:11

Originally posted by Hemanshu Kumar View Post

This is hard to understand without a data extract that reproduces the problem. What do you mean "the columns and the entries were all messed up and on top of one another"? A csv file is a plain text file that can be opened in any text editor. Please see what it looks like in a simple text editor, and that should help you diagnose whether the problem is in the CSV file itself, or in something Excel is doing while opening it.

the problem remains even when I open in wordpad. This is what it looks like, and this is exactly how it opens in excel. The version I get from outsheet comes in expected format, only the column widths are a problem. Let me know if you require further information to answer this, thanks
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1400
#4

17 Jul 2023, 12:25

I'm not sure what the intended output is from the image you have above. But if your issue is that part of the address seems be bleeding onto the second line, I can think of two sorts of things for you to try:
include the option quote when you do export delimited -- assuming some of your addresses have commas in them which are perhaps being treated as delimiters because the address is not within quotes

check that your text editor isn't wrapping the text at the right edge of the window or at a specified column number (I don't have wordpad so I can't tell you where this option can be changed)
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#5

17 Jul 2023, 12:34

I suspect that your address variable contains linebreaks which will not show up in a simple -list-ing or -browse-ing the data. If that's the case you have some options:

1) replace linebreaks (carriage return and linefeed characters) with a single space if they aren't needed
2) keep them, but wrap your values in quotes as suggested by Hemanshu
3) use a different export format, such as JSON (though I suspect your need of a csv file is non-negotiable).
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#6

17 Jul 2023, 12:38

Yeah, this looks like text wrapping. I would just add to #4 that if there is a newline character in a string (i.e. \n) or a carriage return character (i.e. \r), then parts of an address could be split up on a line. That said, I'm not sure why output would be immune to this kind of thing.

Here is a possible work around:

Code:

clear global data = "YOUR/PATH/HERE" set obs 100 gen x1 = runiform() gen x2 = runiform() gen x3 = runiform() local colwitdh = 30 foreach var of varlist _all { capture confirm numeric variable `var' if !_rc { tostring `var', force replace } replace `var' = ((`colwitdh' - strlen(`var')) * " ") + `var' } export delimited using "$data/data.csv", datafmt replace

This converts each variable to a string, then manually pads each string with spaces. This is less than ideal though, and will still force you to manually format the first row of column names. I would recommend figuring out what is going wrong with your data instead.

Edit: actually doesn't work with this line:

Code:

outsheet using "$data/data.csv", comma

Last edited by Daniel Schaefer; 17 Jul 2023, 12:56.
Comment
Titir Bhattacharya

Join Date: Mar 2019

Posts: 226
#7

17 Jul 2023, 12:41

Originally posted by Hemanshu Kumar View Post

I'm not sure what the intended output is from the image you have above. But if your issue is that part of the address seems be bleeding onto the second line, I can think of two sorts of things for you to try:
include the option quote when you do export delimited -- assuming some of your addresses have commas in them which are perhaps being treated as delimiters because the address is not within quotes

check that your text editor isn't wrapping the text at the right edge of the window or at a specified column number (I don't have wordpad so I can't tell you where this option can be changed)

so currently the excel looks like this.
one address bleeds into the other, there some gap in between two lines of same address. But then the entries are all on top of one another.

I need my excel to look like following:
which I obtain with the outsheet command, the only issue here is the column widths need to be adjusted to fit the text
Comment
Titir Bhattacharya

Join Date: Mar 2019

Posts: 226
#8

17 Jul 2023, 12:52

Originally posted by Daniel Schaefer View Post

Yeah, this looks like text wrapping. I would just add to #4 that if there is a newline character in a string (i.e. \n) or a carriage return character (i.e. \r), then parts of an address could be split up on a line. That said, I'm not sure why output would be immune to this kind of thing.

Here is a possible work around:

Code:

clear global data = "YOUR/PATH/HERE" set obs 100 gen x1 = runiform() gen x2 = runiform() gen x3 = runiform() local colwitdh = 30 foreach var of varlist _all { capture confirm numeric variable `var' if !_rc { tostring `var', force replace } replace `var' = ((`colwitdh' - strlen(`var')) * " ") + `var' } export delimited using "$data/data.csv", datafmt replace

This converts each variable to a string, then manually pads each string with spaces. This is less than ideal though, and will still force you to manually format the first row of column names. I would recommend figuring out what is going wrong with your data instead.

Edit: works with this line at the end too:

Code:

outsheet using "$data/data.csv", comma

Forgot to replace it above.

Also, note that if you open the file in excel, excel will remove the padding characters.

thanks Daniel, I'll check this out
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#9

17 Jul 2023, 12:58

Correction, the work around in #6 does not work with outsheet. I apologize, I tested the process incorrectly. All of the white space characters were deleted. This is a clue as to why outsheet works when export delimited does not: it looks like outsheet deletes all whitespace and special characters. So it definitely looks like OP has whitespace return characters in their address strings.
Comment
Titir Bhattacharya

Join Date: Mar 2019

Posts: 226
#10

17 Jul 2023, 13:13

yeah I just found that out as well. your process works with export delimited but I would still need to format the column width which is a problem. I can't edit the file in excel as I'll lose observations

I'm having hard time figuring out what to do
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#11

17 Jul 2023, 13:22

Seems like you need something like this:

Code:

replace ADDRESS = subinstr(ADDRESS, char(15), "", .)

Which deletes the ascii carriage return character (this is what char(15) is), but it depends on what the offending character is and whether or not the character in question is ascii or unicode encoded. If it's unicode encoded, you need uchar() instead. I think the hard part is going to be tracking down the offending character. I've done this in python before knowing that windows uses \n\r for newlines, whereas linux and macos use \n only, but it's a little harder without knowing more details about where the offending character is coming from. See https://www.ascii-code.com/ for ascii character codes.
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#12

17 Jul 2023, 13:36

Code:

gen clean_addr = ustrregexra(address, "[\s]+", " ", .)

This will sanitize all instances of one or more whitespace characters to a single, plain space. This should solve your embedded linebreak issue.
3 likes
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#13

17 Jul 2023, 15:22

For the more general problem of identifying "offending characters" in other contexts, Robert Picard's -chartab- is a handy tool, available from SSC. It will tell you all the characters that appear in a variable, or list of variables. You don't have to know anything about them in advance. And the results come out showing the names of the characters along with the appearance of the character, and its decimal and hexadecimal representations. So then you are equipped to use whatever string functions are appropriate to the particular problem they have caused. Non-printing characters are a particularly frustrating fact of life in computer work--you can't see them, but the computer can. They mess up all kinds of string operations, and to fix the problem you need to discover not only that they are there, but which ones they are. -chartab- makes the job much easier.

But in this case, assuming that the only non-printing characters here are the whitespace characters, the solution in #12 will do the trick without any further ado. (Pun intended.)
4 likes
Comment

Announcement

Writing .dta into csv is messing up the formatting: how to correct this?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment