Identify change in alphabetical order within string variable

Valentina Farinelli

Join Date: Mar 2020
Posts: 5

Identify change in alphabetical order within string variable

24 Feb 2021, 11:32

Hi there! I am trying to transform a dataset, which is not in the most friendly format for analysis, into something usable for my research. The current dataset (which lists various reemployment projects, with funding received as well as counties and states affected) is something like this:

Project ID	States affected	Counties affected
1	AL,MS,TN	Walker,Tishomingo,Lawrence
2	KY,NC,WV	Floyd,Magoffin,Pike,Caldwell,Boone,Fayette,Jackson ,Kanawha,Lincoln,Logan,Preston,Roane
3	KY,TN	Clay,McCreary,Wayne,Whitley,Campbell,Fentress,Morg an,Scott

My aim is to instead obtain a dataset at county level, i.e., summing the funding received on all projects for each county.

The main difficulty with this task is that several of the observations, i.e. projects, affect multiple states, implying that the list of counties affected contains the names of all counties regardless of the state they belong to. The only way to (manually) identify the counties pertaining to each state is to identify a change in alphabetical order: as you can see for the #3 observation, e.g., counties are listed in alphabetical order all the way until Campbell, signalling that counties from Campbell onwards pertain to Tennessee (as opposed to Kentucky).

Therefore, my proposed strategy in order to separate the counties affected and at the same time assign them to the correct state is the following: I change the string variable "counties affected" so as to replace commas with periods, whenever a change in the alphabetical order followed occurs (this relies on the assumption that Stata can identify such a change); then I split the variable in columns on the basis of the periods, attributing the relative state to each, and so on and so forth until I expand the observations for each of the counties affected, in each state.

However, it is with the first step that I am unsure how to proceed, having never performed some sort of text analysis like this with Stata before. I would be very grateful if you had any suggestions or even just an opinion as to the feasibility of such a task.
(I am relatively comfortable with the last steps of the procedure - i.e. after the commas have been replaced by periods - but, if you feel that there might be more straightforward ways to perform the whole procedure, please let me know)

To clarify - with the first step I would want to achieve something like this (with respect to the previous table):

Project ID	States affected	Counties affected
1	AL,MS,TN	Walker.Tishomingo.Lawrence
2	KY,NC,WV	Floyd,Magoffin,Pike.Caldwell.Boone,Fayette,Jackson ,Kanawha,Lincoln,Logan,Preston,Roane
3	KY,TN	Clay,McCreary,Wayne,Whitley.Campbell,Fentress,Morg an,Scott

Thank you all for any input you might have!

Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35448

24 Feb 2021, 12:16

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte projectid str8 statesaffected str87 countiesaffected
1 "AL,MS,TN" "Walker,Tishomingo,Lawrence"                                                             
2 "KY,NC,WV" "Floyd,Magoffin,Pike,Caldwell,Boone,Fayette,Jackson ,Kanawha,Lincoln,Logan,Preston,Roane"
3 "KY,TN"    "Clay,McCreary,Wayne,Whitley,Campbell,Fentress,Morg an,Scott"                            
end

rename (*affected) (*)
gen n_counties = length(counties) - length(subinstr(counties, ",", "", .))
split counties, parse(,)
drop counties
reshape long counties, i(projectid) j(which)
drop if missing(counties)
replace counties = subinstr(counties, " ", "", .)
replace states = subinstr(states, ",", " ", .)
bysort projectid (which) : gen guess = 1 + sum(counties < counties[_n-1])
gen wanted = word(state, guess)

bysort projectid (guess) : gen bad = guess[_N] != wordcount(states)
list, sepby(projectid) 

     +----------------------------------------------------------------------------+
     | projec~d   which     states   n_coun~s     counties   guess   wanted   bad |
     |----------------------------------------------------------------------------|
  1. |        1       1   AL MS TN          2       Walker       1       AL     0 |
  2. |        1       2   AL MS TN          2   Tishomingo       2       MS     0 |
  3. |        1       3   AL MS TN          2     Lawrence       3       TN     0 |
     |----------------------------------------------------------------------------|
  4. |        2       1   KY NC WV         11        Floyd       1       KY     0 |
  5. |        2       2   KY NC WV         11     Magoffin       1       KY     0 |
  6. |        2       3   KY NC WV         11         Pike       1       KY     0 |
  7. |        2       4   KY NC WV         11     Caldwell       2       NC     0 |
  8. |        2       5   KY NC WV         11        Boone       3       WV     0 |
  9. |        2       6   KY NC WV         11      Fayette       3       WV     0 |
 10. |        2       7   KY NC WV         11      Jackson       3       WV     0 |
 11. |        2       8   KY NC WV         11      Kanawha       3       WV     0 |
 12. |        2       9   KY NC WV         11      Lincoln       3       WV     0 |
 13. |        2      10   KY NC WV         11        Logan       3       WV     0 |
 14. |        2      11   KY NC WV         11      Preston       3       WV     0 |
 15. |        2      12   KY NC WV         11        Roane       3       WV     0 |
     |----------------------------------------------------------------------------|
 16. |        3       1      KY TN          7         Clay       1       KY     0 |
 17. |        3       2      KY TN          7     McCreary       1       KY     0 |
 18. |        3       3      KY TN          7        Wayne       1       KY     0 |
 19. |        3       4      KY TN          7      Whitley       1       KY     0 |
 20. |        3       5      KY TN          7     Campbell       2       TN     0 |
 21. |        3       6      KY TN          7     Fentress       2       TN     0 |
 22. |        3       7      KY TN          7       Morgan       2       TN     0 |
 23. |        3       8      KY TN          7        Scott       2       TN     0 |
     +----------------------------------------------------------------------------+

The algorithm can easily fail: what about Autauga Bacon Campbell for AL GA TN?

Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10089

24 Feb 2021, 12:29

Another way

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte projectid str8 statesaffected str87 countiesaffected
1 "AL,MS,TN" "Walker.Tishomingo.Lawrence"                                                             
2 "KY,NC,WV" "Floyd,Magoffin,Pike.Caldwell.Boone,Fayette,Jackson ,Kanawha,Lincoln,Logan,Preston,Roane"
3 "KY,TN"    "Clay,McCreary,Wayne,Whitley.Campbell,Fentress,Morg an,Scott"                            
end

split statesaffected, p(,) g(state)
reshape long state, i(projectid)
drop _j statesaffected
replace countiesaffected= ustrregexra(trim(countiesaffected),"\s","",.)
replace countiesaffected= ustrregexra(trim(countiesaffected),"(\w+)([\.\,\s])","\$1,",.)
split countiesaffected , p(,) g(county)
reshape long county, i(projectid state) j(which)
drop if missing(county)
drop if missing(state)
bys projectid (state which):gen sid= sum(state!=state[_n-1])
bys projectid state (which):gen cid= sum(county<county[_n-1])+1
keep if sid==cid
drop which sid cid countiesaffected

Res.:

Code:

. l, sepby(state)

     +-------------------------------+
     | projec~d   state       county |
     |-------------------------------|
  1. |        1      AL       Walker |
     |-------------------------------|
  2. |        1      MS   Tishomingo |
     |-------------------------------|
  3. |        1      TN     Lawrence |
     |-------------------------------|
  4. |        2      KY        Floyd |
  5. |        2      KY     Magoffin |
  6. |        2      KY         Pike |
     |-------------------------------|
  7. |        2      NC     Caldwell |
     |-------------------------------|
  8. |        2      WV        Boone |
  9. |        2      WV      Fayette |
 10. |        2      WV      Jackson |
 11. |        2      WV      Kanawha |
 12. |        2      WV      Lincoln |
 13. |        2      WV        Logan |
 14. |        2      WV      Preston |
 15. |        2      WV        Roane |
     |-------------------------------|
 16. |        3      KY         Clay |
 17. |        3      KY     McCreary |
 18. |        3      KY        Wayne |
 19. |        3      KY      Whitley |
     |-------------------------------|
 20. |        3      TN     Campbell |
 21. |        3      TN     Fentress |
 22. |        3      TN       Morgan |
 23. |        3      TN        Scott |
     +-------------------------------+

.

Comment

Valentina Farinelli

Join Date: Mar 2020

Posts: 5
#4

25 Feb 2021, 15:47

Thank you both for your replies, they've been super helpful! Sorry for taking some time to get back - I've actually tried to play around a bit with both solutions first. Unfortunately it happened just as Nick Cox said: there's a vast majority of cases in which the state change can't be captured by an alphabetical order change (e.g. Autaga Bacon Campbell), so I'm still trying to work out some other solution/ compromise!
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10089
#5

25 Feb 2021, 17:37

Download the full list of counties from the Census Bureau, split the column labeled "Areaname" into state and county and merge on both variables with your list, keeping matches. This should be after the line

drop if missing(state)

in the code in #3.
1 like
Comment

Announcement

Identify change in alphabetical order within string variable

Comment

Comment

Comment

Comment