Hi there! I am trying to transform a dataset, which is not in the most friendly format for analysis, into something usable for my research. The current dataset (which lists various reemployment projects, with funding received as well as counties and states affected) is something like this:
My aim is to instead obtain a dataset at county level, i.e., summing the funding received on all projects for each county.
The main difficulty with this task is that several of the observations, i.e. projects, affect multiple states, implying that the list of counties affected contains the names of all counties regardless of the state they belong to. The only way to (manually) identify the counties pertaining to each state is to identify a change in alphabetical order: as you can see for the #3 observation, e.g., counties are listed in alphabetical order all the way until Campbell, signalling that counties from Campbell onwards pertain to Tennessee (as opposed to Kentucky).
Therefore, my proposed strategy in order to separate the counties affected and at the same time assign them to the correct state is the following: I change the string variable "counties affected" so as to replace commas with periods, whenever a change in the alphabetical order followed occurs (this relies on the assumption that Stata can identify such a change); then I split the variable in columns on the basis of the periods, attributing the relative state to each, and so on and so forth until I expand the observations for each of the counties affected, in each state.
However, it is with the first step that I am unsure how to proceed, having never performed some sort of text analysis like this with Stata before. I would be very grateful if you had any suggestions or even just an opinion as to the feasibility of such a task.
(I am relatively comfortable with the last steps of the procedure - i.e. after the commas have been replaced by periods - but, if you feel that there might be more straightforward ways to perform the whole procedure, please let me know)
To clarify - with the first step I would want to achieve something like this (with respect to the previous table):
Thank you all for any input you might have!
Project ID | States affected | Counties affected |
1 | AL,MS,TN | Walker,Tishomingo,Lawrence |
2 | KY,NC,WV | Floyd,Magoffin,Pike,Caldwell,Boone,Fayette,Jackson ,Kanawha,Lincoln,Logan,Preston,Roane |
3 | KY,TN | Clay,McCreary,Wayne,Whitley,Campbell,Fentress,Morg an,Scott |
My aim is to instead obtain a dataset at county level, i.e., summing the funding received on all projects for each county.
The main difficulty with this task is that several of the observations, i.e. projects, affect multiple states, implying that the list of counties affected contains the names of all counties regardless of the state they belong to. The only way to (manually) identify the counties pertaining to each state is to identify a change in alphabetical order: as you can see for the #3 observation, e.g., counties are listed in alphabetical order all the way until Campbell, signalling that counties from Campbell onwards pertain to Tennessee (as opposed to Kentucky).
Therefore, my proposed strategy in order to separate the counties affected and at the same time assign them to the correct state is the following: I change the string variable "counties affected" so as to replace commas with periods, whenever a change in the alphabetical order followed occurs (this relies on the assumption that Stata can identify such a change); then I split the variable in columns on the basis of the periods, attributing the relative state to each, and so on and so forth until I expand the observations for each of the counties affected, in each state.
However, it is with the first step that I am unsure how to proceed, having never performed some sort of text analysis like this with Stata before. I would be very grateful if you had any suggestions or even just an opinion as to the feasibility of such a task.
(I am relatively comfortable with the last steps of the procedure - i.e. after the commas have been replaced by periods - but, if you feel that there might be more straightforward ways to perform the whole procedure, please let me know)
To clarify - with the first step I would want to achieve something like this (with respect to the previous table):
Project ID | States affected | Counties affected |
1 | AL,MS,TN | Walker.Tishomingo.Lawrence |
2 | KY,NC,WV | Floyd,Magoffin,Pike.Caldwell.Boone,Fayette,Jackson ,Kanawha,Lincoln,Logan,Preston,Roane |
3 | KY,TN | Clay,McCreary,Wayne,Whitley.Campbell,Fentress,Morg an,Scott |
Thank you all for any input you might have!

Comment