Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem with import delim, values usually separated by commas, sometimes by quotations as well

    Apologies if this has been asked before; I couldn't find a similar situation:

    I'm developing a do file to take CSV's from our information management system, merge them into one large database, and create reports based on the descriptive statistics. I've run into a problem importing one of the databases to Stata, and it seems this has to do with how the system-generated CSV's handle commas within an observation.

    The code is just your standard import delim:

    import delim "$repdir\registros.csv"

    The database is the initial registry when someone requests our organization's services for the first time, and includes a short questionnaire with several open fields. This is how some observations are imported (database is in Spanish):

    clave_registro
    14731
    14732
    PLANTAS EXÓTICAS Y CAPACITACIÓN DE ORQUÍDEAS",Tengo una idea de negocio que me gustaría desarrollar,,0,0,0,0,0,,,,,Ninguna de las anteriores,,No,,,,,,,,,,0,,,,,Si,No,Sin IDE,,Activo
    14737
    14740

    Using Excel's text import function to look at how the database is structured, it looks like when the system generates the CSV, it usually separates each observation with only a comma, but when the observation itself contains a comma, it adds quotation marks around the observation. The problem, it seems, is that Stata only recognizes the comma as the delimiter, as the quotations are not present for all observations, and in ignoring the quotations fails to separate the observations correctly.

    I've tried using the delimiters("chars") option with several different specifications, but with no luck.

    I use Stata/MP 13.1, 64-bit version. Also, I've never had this issue with Excel 2013.

  • #2
    Welcome to Statalist.

    Since you don't mention them, have you looked into various settings of the stripquotes and bindquotes options to the import delimited command? You should understand that Excel is not the only program that produces "csv" files, and there is no consistent standard among programs for how they are produced.

    You would probably be better advised to have your information management system produce, not comma-separated-value files, but rather choose another character as the delimiter between values, so this problem won't arise. Using the tab character is generally useful, and another popular choice is the "pipe" character ("|"), since neither of these commonly appear as part of a text data element.

    Beyond that, please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post, looking especially at sections 9-12 on how to best pose your question. It would be particularly helpful to post a small hand-made example, with just a few observations, showing the data before importing and how you expect it to look after importing. In particular, please read FAQ #12 and use CODE delimiters when posting to Statalist.
    Last edited by William Lisowski; 16 Dec 2016, 07:18.

    Comment


    • #3
      I will say that csv files with quoted string variables that have embedded commas are very difficult to work with, in my experience. I have found that the -stripquotes- and -bindquotes()- options do not always resolve the problem. My workaround when I cannot get -import delimited- to correctly read such files is to use Stat Transfer instead, which seems to handle them without difficulty.

      Comment


      • #4
        Dear Mr Schechter,

        Did you find a solution to the problem mentioned above in the meantime?

        I experience similar problems when importing a CSV file that contains data in quotation marks.

        Comment


        • #5
          You could also try -insheet- which sometimes does better than -import delimited-. However, I wonder if your file is usable at all. The rows don't have the same number of items and the "PLANTAS" cell doesn't have a leading quote to go with the trailing quote. Every line must have the same number of items. CSV files are widely denounced as non-standard, but as long as quoted strings do not contain quotes or newlines, there shouldn't be difficulties. Stata does have peculiarity (if space is the delimiter) that a leading space on a line is interpreted as a cell with a missing value.

          Comment

          Working...
          X