Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Writing .dta into csv is messing up the formatting: how to correct this?

    Hi,

    I just cleaned a data in stata and need to write it into .csv, but the data in .csv becomes all weird in terms of formatting.

    This is the code I use to export:

    Code:
    outsheet  using "$data/data.csv", comma
    This worked better than
    Code:
    export delimited using "$data/data.csv", datafmt replace
    where the columns and the entries were all messed up and on top of one another.

    However, in the data I obtain from outsheet, I still need to adjust the column width to accommodate values. I cannot do that by opening the csv because (1) the data is too large to fully load in excel and if i save it, I'd be losing observations, and (2) I need to send this csv to some agency for them to perform some linkage and they won't accept the data unless all the columns are of appropriate width.

    Is there any way to adjust the column widths while exporting/outsheet -ing to csv from stata? Or any other solution to my problem?

    Thanks and I await your response!

  • #2
    This is hard to understand without a data extract that reproduces the problem. What do you mean "the columns and the entries were all messed up and on top of one another"? A csv file is a plain text file that can be opened in any text editor. Please see what it looks like in a simple text editor, and that should help you diagnose whether the problem is in the CSV file itself, or in something Excel is doing while opening it.

    Comment


    • #3
      Originally posted by Hemanshu Kumar View Post
      This is hard to understand without a data extract that reproduces the problem. What do you mean "the columns and the entries were all messed up and on top of one another"? A csv file is a plain text file that can be opened in any text editor. Please see what it looks like in a simple text editor, and that should help you diagnose whether the problem is in the CSV file itself, or in something Excel is doing while opening it.
      the problem remains even when I open in wordpad. This is what it looks like, and this is exactly how it opens in excel. The version I get from outsheet comes in expected format, only the column widths are a problem. Let me know if you require further information to answer this, thanks
      Click image for larger version

Name:	Screenshot 2023-07-17 190904.png
Views:	1
Size:	10.2 KB
ID:	1720850

      Comment


      • #4
        I'm not sure what the intended output is from the image you have above. But if your issue is that part of the address seems be bleeding onto the second line, I can think of two sorts of things for you to try:
        • include the option quote when you do export delimited -- assuming some of your addresses have commas in them which are perhaps being treated as delimiters because the address is not within quotes
        • check that your text editor isn't wrapping the text at the right edge of the window or at a specified column number (I don't have wordpad so I can't tell you where this option can be changed)

        Comment


        • #5
          I suspect that your address variable contains linebreaks which will not show up in a simple -list-ing or -browse-ing the data. If that's the case you have some options:

          1) replace linebreaks (carriage return and linefeed characters) with a single space if they aren't needed
          2) keep them, but wrap your values in quotes as suggested by Hemanshu
          3) use a different export format, such as JSON (though I suspect your need of a csv file is non-negotiable).

          Comment


          • #6
            Yeah, this looks like text wrapping. I would just add to #4 that if there is a newline character in a string (i.e. \n) or a carriage return character (i.e. \r), then parts of an address could be split up on a line. That said, I'm not sure why output would be immune to this kind of thing.

            Here is a possible work around:

            Code:
            clear
            global data = "YOUR/PATH/HERE"
            set obs 100
            gen x1 = runiform()
            gen x2 = runiform()
            gen x3 = runiform()
            
            local colwitdh = 30
            foreach var of varlist _all {
                capture confirm numeric variable `var'
                if !_rc {
                    tostring `var', force replace
                }
                replace `var' = ((`colwitdh' - strlen(`var')) * " ") + `var'
            }
            
            export delimited using "$data/data.csv", datafmt replace
            This converts each variable to a string, then manually pads each string with spaces. This is less than ideal though, and will still force you to manually format the first row of column names. I would recommend figuring out what is going wrong with your data instead.

            Edit: actually doesn't work with this line:
            Code:
            outsheet using "$data/data.csv", comma
            Last edited by Daniel Schaefer; 17 Jul 2023, 12:56.

            Comment


            • #7
              Originally posted by Hemanshu Kumar View Post
              I'm not sure what the intended output is from the image you have above. But if your issue is that part of the address seems be bleeding onto the second line, I can think of two sorts of things for you to try:
              • include the option quote when you do export delimited -- assuming some of your addresses have commas in them which are perhaps being treated as delimiters because the address is not within quotes
              • check that your text editor isn't wrapping the text at the right edge of the window or at a specified column number (I don't have wordpad so I can't tell you where this option can be changed)
              so currently the excel looks like this.
              one address bleeds into the other, there some gap in between two lines of same address. But then the entries are all on top of one another.
              Click image for larger version

Name:	Screenshot 2023-07-17 193248.png
Views:	1
Size:	15.2 KB
ID:	1720861


              I need my excel to look like following:
              Click image for larger version

Name:	Screenshot 2023-07-17 193934.png
Views:	1
Size:	5.7 KB
ID:	1720862
              which I obtain with the outsheet command, the only issue here is the column widths need to be adjusted to fit the text

              Comment


              • #8
                Originally posted by Daniel Schaefer View Post
                Yeah, this looks like text wrapping. I would just add to #4 that if there is a newline character in a string (i.e. \n) or a carriage return character (i.e. \r), then parts of an address could be split up on a line. That said, I'm not sure why output would be immune to this kind of thing.

                Here is a possible work around:

                Code:
                clear
                global data = "YOUR/PATH/HERE"
                set obs 100
                gen x1 = runiform()
                gen x2 = runiform()
                gen x3 = runiform()
                
                local colwitdh = 30
                foreach var of varlist _all {
                capture confirm numeric variable `var'
                if !_rc {
                tostring `var', force replace
                }
                replace `var' = ((`colwitdh' - strlen(`var')) * " ") + `var'
                }
                
                export delimited using "$data/data.csv", datafmt replace
                This converts each variable to a string, then manually pads each string with spaces. This is less than ideal though, and will still force you to manually format the first row of column names. I would recommend figuring out what is going wrong with your data instead.

                Edit: works with this line at the end too:

                Code:
                outsheet using "$data/data.csv", comma
                Forgot to replace it above.

                Also, note that if you open the file in excel, excel will remove the padding characters.
                thanks Daniel, I'll check this out

                Comment


                • #9
                  Correction, the work around in #6 does not work with outsheet. I apologize, I tested the process incorrectly. All of the white space characters were deleted. This is a clue as to why outsheet works when export delimited does not: it looks like outsheet deletes all whitespace and special characters. So it definitely looks like OP has whitespace return characters in their address strings.

                  Comment


                  • #10
                    yeah I just found that out as well. your process works with export delimited but I would still need to format the column width which is a problem. I can't edit the file in excel as I'll lose observations

                    I'm having hard time figuring out what to do

                    Comment


                    • #11
                      Seems like you need something like this:

                      Code:
                      replace ADDRESS  = subinstr(ADDRESS, char(15), "", .)
                      Which deletes the ascii carriage return character (this is what char(15) is), but it depends on what the offending character is and whether or not the character in question is ascii or unicode encoded. If it's unicode encoded, you need uchar() instead. I think the hard part is going to be tracking down the offending character. I've done this in python before knowing that windows uses \n\r for newlines, whereas linux and macos use \n only, but it's a little harder without knowing more details about where the offending character is coming from. See https://www.ascii-code.com/ for ascii character codes.

                      Comment


                      • #12
                        Code:
                        gen clean_addr = ustrregexra(address, "[\s]+", " ", .)
                        This will sanitize all instances of one or more whitespace characters to a single, plain space. This should solve your embedded linebreak issue.

                        Comment


                        • #13
                          For the more general problem of identifying "offending characters" in other contexts, Robert Picard's -chartab- is a handy tool, available from SSC. It will tell you all the characters that appear in a variable, or list of variables. You don't have to know anything about them in advance. And the results come out showing the names of the characters along with the appearance of the character, and its decimal and hexadecimal representations. So then you are equipped to use whatever string functions are appropriate to the particular problem they have caused. Non-printing characters are a particularly frustrating fact of life in computer work--you can't see them, but the computer can. They mess up all kinds of string operations, and to fix the problem you need to discover not only that they are there, but which ones they are. -chartab- makes the job much easier.

                          But in this case, assuming that the only non-printing characters here are the whitespace characters, the solution in #12 will do the trick without any further ado. (Pun intended.)

                          Comment

                          Working...
                          X