select two countries by region based on third value

Neko Novi

Join Date: Mar 2017

Posts: 22
#1

select two countries by region based on third value

06 Feb 2024, 04:14

Hi,

I have panel data similar to example census data in sysuse:

Code:

sysuse census, clear

and I would like to select first two in each region with highest divorce in string variable:

Code:

gsort region -divorce

so for example in region "NE" would be "NY" and "PA", and record it into variable score, that would have ideally first two entries "NY","PA", then blanks then below for next region "OH","IL" etc.

the only thing that crosses my mind is something like this:

Code:

sysuse census.dta, clear levelsof region, local(region) gen score = "" foreach l of local region { sum divorce if region == `l' replace score = state2 if divorce == r(max) & region==`l' sum divorce if region==`l' & score=="" replace score = state2 if divorce == r(max) & region==`l' } gsort region -divorce

But in my data i have also third level of years in the panel and this loop thing becomes complicated, is there any simpler way to do the same?

Thank you
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35698

06 Feb 2024, 04:40

You can do this more directly without any loops at all (except as always what Stata does on your behalf).

0. Segregate any missing values as otherwise they will mess up what follows if present.

1. Sort on the variable of interest within groups: the two highest values will be the last two, provided as above that missing values have been set on one side.

Code:

sysuse census, clear

gen isOK = !missing(divorce)

bysort isOK region (divorce) : gen flag = isOK & _n >= _N - 1

gen which = state if flag

list region state divorce which if flag, sepby(region)

     +-------------------------------------------------+
     | region    state          divorce          which |
     |-------------------------------------------------|
  8. | NE        Pennsylvania    34,922   Pennsylvania |
  9. | NE        New York        61,972       New York |
     |-------------------------------------------------|
 20. | N Cntrl   Illinois        50,997       Illinois |
 21. | N Cntrl   Ohio            58,809           Ohio |
     |-------------------------------------------------|
 36. | South     Florida         71,579        Florida |
 37. | South     Texas           96,809          Texas |
     |-------------------------------------------------|
 49. | West      Washington      28,642     Washington |
 50. | West      California     133,541     California |
     +-------------------------------------------------+

Now comes the bonus. Your needed code is going to be almost the same: you just need something like

Code:

gen isOK = !missing(whatever)

bysort isOK variableA variableB (whatever) : gen flag = isOK & _n >= _N - 1

-- so one more variable in the first part of the by: command.

Last edited by Nick Cox; 06 Feb 2024, 04:54.

Comment

Neko Novi

Join Date: Mar 2017

Posts: 22
#3

06 Feb 2024, 04:49

Thanks a lot, i guessed there is some trick with _n _N but could not figure it out..

Many thanks!
Comment

Announcement