Good day!
I am trying to import a .csv file into Stata using commas as variable delimiters, which generally succeeds in assigning each observation's values to the appropriate variable columns. However, one string variable contains a large amount of text, which sometimes includes commas in ordinary grammatical use (and such text is always enclosed in quotation marks). On import, these commas are treated as delimiters, so the affected observations are split across mismatched variable columns.
To provide a hypothetical example (with each row of values resembling a typical unit of observation from the dataset):
(Instances of correct matching into 5 variables):
2105, 1965, 1, 3, Solution to the drought crisis
2194, 1967, 1, 4, Budgetary reallocations to combat urban poverty
(Instances of incorrect matching into 6 or 7 variables instead of 5):
2561, 1973, 2, 4, "Proposal for new spending package, which was blocked by the treasury"
2671, 1981, 3, 5, "The PM proposed a new national action plan, which was applauded by the governing coalition, but rejected by the opposition."
Reading the Stata documentation, I have unfortunately not found a way to solve this problem other than manually replacing the commas with other characters and deleting the quotation marks, which is not feasible for the entire dataset given its size. Might it be possible, for example, to apply different delimiter rules to this variable than to the other variables?
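To make the intended behaviour concrete, here is a minimal sketch in Python rather than Stata, purely as an illustration of the result I am hoping for. It assumes the file looks like the hypothetical rows above (a space after each comma, and double quotes around any text that contains commas); the names SAMPLE and parse_rows are made up for this example. A quote-aware parser keeps the commas inside the quotation marks as part of the text, so each row should come out with five fields.

```python
import csv
import io

# Hypothetical sample mirroring the rows shown in this post.
SAMPLE = '''2105, 1965, 1, 3, Solution to the drought crisis
2561, 1973, 2, 4, "Proposal for new spending package, which was blocked by the treasury"
2671, 1981, 3, 5, "The PM proposed a new national action plan, which was applauded by the governing coalition, but rejected by the opposition."
'''


def parse_rows(text):
    """Split comma-delimited text into rows, treating quoted text as one field."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=",",
        quotechar='"',
        # The sample has a space after each comma; without this option the
        # opening quote would not sit at the start of the field and would
        # not be recognised as a quote.
        skipinitialspace=True,
    )
    return [row for row in reader]


rows = parse_rows(SAMPLE)
for row in rows:
    # Print the number of fields and the text variable for each row.
    print(len(row), row[-1])
```

What I am looking for is the equivalent of this quote-aware behaviour when importing the file in Stata.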
I would be very grateful if someone in this forum could advise me how I should approach this problem and try to remedy it. Please excuse me for asking such a - as I assume - relatively straightforward question, but I am truly uncertain how to solve this problem.