Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identify change in alphabetical order within string variable

    Hi there! I am trying to transform a dataset, which is not in the most friendly format for analysis, into something usable for my research. The current dataset (which lists various reemployment projects, with funding received as well as counties and states affected) is something like this:

    Project ID States affected Counties affected
    1 AL,MS,TN Walker,Tishomingo,Lawrence
    2 KY,NC,WV Floyd,Magoffin,Pike,Caldwell,Boone,Fayette,Jackson ,Kanawha,Lincoln,Logan,Preston,Roane
    3 KY,TN Clay,McCreary,Wayne,Whitley,Campbell,Fentress,Morg an,Scott

    My aim is to instead obtain a dataset at county level, i.e., summing the funding received on all projects for each county.

    The main difficulty with this task is that several of the observations, i.e. projects, affect multiple states, implying that the list of counties affected contains the names of all counties regardless of the state they belong to. The only way to (manually) identify the counties pertaining to each state is to identify a change in alphabetical order: as you can see for the #3 observation, e.g., counties are listed in alphabetical order all the way until Campbell, signalling that counties from Campbell onwards pertain to Tennessee (as opposed to Kentucky).

    Therefore, my proposed strategy in order to separate the counties affected and at the same time assign them to the correct state is the following: I change the string variable "counties affected" so as to replace commas with periods, whenever a change in the alphabetical order followed occurs (this relies on the assumption that Stata can identify such a change); then I split the variable in columns on the basis of the periods, attributing the relative state to each, and so on and so forth until I expand the observations for each of the counties affected, in each state.

    However, it is with the first step that I am unsure how to proceed, having never performed some sort of text analysis like this with Stata before. I would be very grateful if you had any suggestions or even just an opinion as to the feasibility of such a task.
    (I am relatively comfortable with the last steps of the procedure - i.e. after the commas have been replaced by periods - but, if you feel that there might be more straightforward ways to perform the whole procedure, please let me know)



    To clarify - with the first step I would want to achieve something like this (with respect to the previous table):

    Project ID States affected Counties affected
    1 AL,MS,TN Walker.Tishomingo.Lawrence
    2 KY,NC,WV Floyd,Magoffin,Pike.Caldwell.Boone,Fayette,Jackson ,Kanawha,Lincoln,Logan,Preston,Roane
    3 KY,TN Clay,McCreary,Wayne,Whitley.Campbell,Fentress,Morg an,Scott



    Thank you all for any input you might have!

  • #2
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte projectid str8 statesaffected str87 countiesaffected
    1 "AL,MS,TN" "Walker,Tishomingo,Lawrence"                                                             
    2 "KY,NC,WV" "Floyd,Magoffin,Pike,Caldwell,Boone,Fayette,Jackson ,Kanawha,Lincoln,Logan,Preston,Roane"
    3 "KY,TN"    "Clay,McCreary,Wayne,Whitley,Campbell,Fentress,Morg an,Scott"                            
    end
    
    rename (*affected) (*)
    gen n_counties = length(counties) - length(subinstr(counties, ",", "", .))
    split counties, parse(,)
    drop counties
    reshape long counties, i(projectid) j(which)
    drop if missing(counties)
    replace counties = subinstr(counties, " ", "", .)
    replace states = subinstr(states, ",", " ", .)
    bysort projectid (which) : gen guess = 1 + sum(counties < counties[_n-1])
    gen wanted = word(state, guess)
    
    bysort projectid (guess) : gen bad = guess[_N] != wordcount(states)
    list, sepby(projectid) 
    
         +----------------------------------------------------------------------------+
         | projec~d   which     states   n_coun~s     counties   guess   wanted   bad |
         |----------------------------------------------------------------------------|
      1. |        1       1   AL MS TN          2       Walker       1       AL     0 |
      2. |        1       2   AL MS TN          2   Tishomingo       2       MS     0 |
      3. |        1       3   AL MS TN          2     Lawrence       3       TN     0 |
         |----------------------------------------------------------------------------|
      4. |        2       1   KY NC WV         11        Floyd       1       KY     0 |
      5. |        2       2   KY NC WV         11     Magoffin       1       KY     0 |
      6. |        2       3   KY NC WV         11         Pike       1       KY     0 |
      7. |        2       4   KY NC WV         11     Caldwell       2       NC     0 |
      8. |        2       5   KY NC WV         11        Boone       3       WV     0 |
      9. |        2       6   KY NC WV         11      Fayette       3       WV     0 |
     10. |        2       7   KY NC WV         11      Jackson       3       WV     0 |
     11. |        2       8   KY NC WV         11      Kanawha       3       WV     0 |
     12. |        2       9   KY NC WV         11      Lincoln       3       WV     0 |
     13. |        2      10   KY NC WV         11        Logan       3       WV     0 |
     14. |        2      11   KY NC WV         11      Preston       3       WV     0 |
     15. |        2      12   KY NC WV         11        Roane       3       WV     0 |
         |----------------------------------------------------------------------------|
     16. |        3       1      KY TN          7         Clay       1       KY     0 |
     17. |        3       2      KY TN          7     McCreary       1       KY     0 |
     18. |        3       3      KY TN          7        Wayne       1       KY     0 |
     19. |        3       4      KY TN          7      Whitley       1       KY     0 |
     20. |        3       5      KY TN          7     Campbell       2       TN     0 |
     21. |        3       6      KY TN          7     Fentress       2       TN     0 |
     22. |        3       7      KY TN          7       Morgan       2       TN     0 |
     23. |        3       8      KY TN          7        Scott       2       TN     0 |
         +----------------------------------------------------------------------------+
    The algorithm can easily fail: what about Autauga Bacon Campbell for AL GA TN?

    Comment


    • #3
      Another way

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte projectid str8 statesaffected str87 countiesaffected
      1 "AL,MS,TN" "Walker.Tishomingo.Lawrence"                                                             
      2 "KY,NC,WV" "Floyd,Magoffin,Pike.Caldwell.Boone,Fayette,Jackson ,Kanawha,Lincoln,Logan,Preston,Roane"
      3 "KY,TN"    "Clay,McCreary,Wayne,Whitley.Campbell,Fentress,Morg an,Scott"                            
      end
      
      split statesaffected, p(,) g(state)
      reshape long state, i(projectid)
      drop _j statesaffected
      replace countiesaffected= ustrregexra(trim(countiesaffected),"\s","",.)
      replace countiesaffected= ustrregexra(trim(countiesaffected),"(\w+)([\.\,\s])","\$1,",.)
      split countiesaffected , p(,) g(county)
      reshape long county, i(projectid state) j(which)
      drop if missing(county)
      drop if missing(state)
      bys projectid (state which):gen sid= sum(state!=state[_n-1])
      bys projectid state (which):gen cid= sum(county<county[_n-1])+1
      keep if sid==cid
      drop which sid cid countiesaffected
      Res.:

      Code:
      . l, sepby(state)
      
           +-------------------------------+
           | projec~d   state       county |
           |-------------------------------|
        1. |        1      AL       Walker |
           |-------------------------------|
        2. |        1      MS   Tishomingo |
           |-------------------------------|
        3. |        1      TN     Lawrence |
           |-------------------------------|
        4. |        2      KY        Floyd |
        5. |        2      KY     Magoffin |
        6. |        2      KY         Pike |
           |-------------------------------|
        7. |        2      NC     Caldwell |
           |-------------------------------|
        8. |        2      WV        Boone |
        9. |        2      WV      Fayette |
       10. |        2      WV      Jackson |
       11. |        2      WV      Kanawha |
       12. |        2      WV      Lincoln |
       13. |        2      WV        Logan |
       14. |        2      WV      Preston |
       15. |        2      WV        Roane |
           |-------------------------------|
       16. |        3      KY         Clay |
       17. |        3      KY     McCreary |
       18. |        3      KY        Wayne |
       19. |        3      KY      Whitley |
           |-------------------------------|
       20. |        3      TN     Campbell |
       21. |        3      TN     Fentress |
       22. |        3      TN       Morgan |
       23. |        3      TN        Scott |
           +-------------------------------+
      
      .

      Comment


      • #4
        Thank you both for your replies, they've been super helpful! Sorry for taking some time to get back - I've actually tried to play around a bit with both solutions first. Unfortunately it happened just as Nick Cox said: there's a vast majority of cases in which the state change can't be captured by an alphabetical order change (e.g. Autaga Bacon Campbell), so I'm still trying to work out some other solution/ compromise!

        Comment


        • #5
          Download the full list of counties from the Census Bureau, split the column labeled "Areaname" into state and county and merge on both variables with your list, keeping matches. This should be after the line

          drop if missing(state)
          in the code in #3.

          Comment

          Working...
          X