Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • select two countries by region based on third value

    Hi,

    I have panel data similar to example census data in sysuse:

    Code:
    sysuse census, clear
    and I would like to select first two in each region with highest divorce in string variable:

    Code:
    gsort region -divorce
    so for example in region "NE" would be "NY" and "PA", and record it into variable score, that would have ideally first two entries "NY","PA", then blanks then below for next region "OH","IL" etc.

    the only thing that crosses my mind is something like this:

    Code:
    sysuse census.dta, clear
    
    levelsof region, local(region)
    gen score = ""
    
    foreach l of local region {
    sum divorce if region == `l'
    replace score = state2 if divorce == r(max) & region==`l'
    
    sum divorce if region==`l' & score==""
    replace score = state2 if divorce == r(max) & region==`l'
    
    }
    
    gsort region -divorce

    But in my data i have also third level of years in the panel and this loop thing becomes complicated, is there any simpler way to do the same?

    Thank you

  • #2
    You can do this more directly without any loops at all (except as always what Stata does on your behalf).

    0. Segregate any missing values as otherwise they will mess up what follows if present.

    1. Sort on the variable of interest within groups: the two highest values will be the last two, provided as above that missing values have been set on one side.


    Code:
    sysuse census, clear
    
    gen isOK = !missing(divorce)
    
    bysort isOK region (divorce) : gen flag = isOK & _n >= _N - 1
    
    gen which = state if flag
    
    list region state divorce which if flag, sepby(region)
    
         +-------------------------------------------------+
         | region    state          divorce          which |
         |-------------------------------------------------|
      8. | NE        Pennsylvania    34,922   Pennsylvania |
      9. | NE        New York        61,972       New York |
         |-------------------------------------------------|
     20. | N Cntrl   Illinois        50,997       Illinois |
     21. | N Cntrl   Ohio            58,809           Ohio |
         |-------------------------------------------------|
     36. | South     Florida         71,579        Florida |
     37. | South     Texas           96,809          Texas |
         |-------------------------------------------------|
     49. | West      Washington      28,642     Washington |
     50. | West      California     133,541     California |
         +-------------------------------------------------+
    Now comes the bonus. Your needed code is going to be almost the same: you just need something like


    Code:
    gen isOK = !missing(whatever)
    
    bysort isOK variableA variableB (whatever) : gen flag = isOK & _n >= _N - 1
    -- so one more variable in the first part of the by: command.
    Last edited by Nick Cox; 06 Feb 2024, 05:54.

    Comment


    • #3
      Thanks a lot, i guessed there is some trick with _n _N but could not figure it out..

      Many thanks!

      Comment

      Working...
      X