Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing Delimited File - How To Selectively Apply Delimiters?

    Good day!

    I am trying to import a .csv file into Stata and attempt to use commas as variable delimiters, which generally succeeds in assigning the observations´ values into the appropriate variable columns. However one string variable contains a large amount of textual content, which sometimes includes commas in standard grammatical use (and always includes the text in quotation marks); naturally, importing the file will consider these as delimiters and thus the affected observations will fall into mismatched variables columns.

    To provide a hypothetical example (with each row of values resembling a typical unit of observation from the dataset):

    (Instances of correct matching into 5 variables):
    2105, 1965, 1, 3, Solution to the drought crisis
    2194, 1967, 1, 4, Budgetary reallocations to combat urban poverty

    (Instance of incorrect matching into 6 or 7 variables instead of 5):
    2561, 1973, 2, 4, "Proposal for new spending package, which was blocked by the treasury"
    2671, 1981, 3, 5, "The PM proposed a new national action plan, which was applauded by the governing coalition, but rejected by the opposition."

    Reading the Stata documentation I have unfortunately not found an appropriate way to solve this problem beyond handcoding the commas into alternative signs and deleting the quotation marks, which however is not feasible as a pathway for the entire dataset due to its size. Might it for example be possible to code this variable in such a way that different delimiter rules would be applied from those of the other variables in question?

    I would be very grateful if someone in this forum could advise me how I should approach this problem and try to remedy it. Please excuse me for asking such a - as I assume - relatively straightforward question, but I am truly uncertain how to solve this problem.
    Last edited by Sebastian Gabel; 12 Feb 2024, 04:32.

  • #2
    Hi Sebastian,
    I'm sure others will respond with more direct experience of such issues. But to get things started, I would suggest at least one of:
    • Use of an alternative delimiter for the entire dataset, such as tab or semicolon
    • Importing these free-text columns separately, thus avoiding the need for a delimiter.
    In particular, my second suggestion sort-of addresses the thread title -- you would effectively be applying a delimiter selectively. It depends how easy it is to separate out the free-text columns from the source data prior to import, however.

    Finally, I would say that, whatever your method of data import, you should consider the possibility of non-standard characters hiding within your free-text columns, which might lead to import or formatting issues.



    Comment


    • #3
      Are you using -insheet- or -import delimited-? What is the exact command? Is it possible there is a data line with unmatched quotes? Stata should be able to handle commas in quoted strings. See, for example, section 2.6 of https://www.ietf.org/rfc/rfc4180.txt

      Comment


      • #4
        Dear David and Daniel,

        thank you very much for your responses and suggestions for dealing with my situation!

        In the attempt to separately import data from the discussed dataset, the problem eventually disappeared upon reinstalling and reimporting the dataset. As you, David, suggested, there must have been an issue with unexpected characters emerging in the formatting process that occurred to as of yet unknown reasons.

        In any case, the matter has been solved. Thanks again!

        Comment

        Working...
        X