Create a dummy variable based on ranking of variables

Stephen Okiya

Join Date: Jul 2025
Posts: 280

30 Mar 2024, 13:19

Dear Stata users,
I am using Stata 18 and would like to create a dummy variable that equals 1 if at least one of the three variables remittances, govt_aid and hum_aid is among the three largest values across the variables empl through hum_aid, and 0 otherwise.
The example dataset is below.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input double HHID str5 strata double(rand empl agric loans assets remittances govt_aid hum_aid) long unstable_inc
 1 "Urban" 23145 1000  750    0 500    0    0    0 0
 2 "Rural" 13755 1750  650  350   0  200    0    0 0
 3 "Rural" 17890  480  400  300   0 1500 1200  800 1
 4 "Urban" 25634  630  400    0   0 1250 1000 1200 1
 5 "Rural" 56231 1200    0    0 700    0    0    0 0
 6 "Rural" 96541 1000  750 1200   0  500    0    0 0
 7 "Urban" 23654 1700    0  500   0  700    0    . 0
 8 "Urban" 52361  500    0  250 500  900  900 1400 1
 9 "Rural" 74125  300  750  500 150 1200 1300 1500 1
10 "Rural" 96124    0    0    0   0    0    0    0 0
11 "Urban" 37851  400  300  700 500 1000  800  200 0
12 "Rural" 85321  900  900    0 750    0    0 1100 0
13 "Urban" 64123    . 1400  700 600  800  400 1200 1
14 "Urban" 65241  800  400  400 500 1100  900  400 1
15 "Urban" 10789  350  450  750   0    0  850  950 1
16 "Rural" 36789  850  850  500 600 1050  850    0 1
end

My desired variable is unstable_inc.
Thanks in advance!


Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

30 Mar 2024, 14:04

I'm not sure I understand what you want. As I understand your problem description, in HHIDs 11 and 12, we should have unstable_inc == 1 because remittances and hum_aid are, respectively, the largest among the variable values. But you have unstable_inc == 0. I'm going to ignore your unstable_inc variable on the assumption that you made a mistake doing the calculations. The following code does what you described in words, although it disagrees with unstable_inc in HHIDs 11 and 12:

Code:

rename empl-hum_aid amt=
reshape long amt, i(HHID) j(vble) string
drop if missing(amt)
bysort HHID (amt): gen byte top_three = (_N-_n < 3)
by HHID: egen byte wanted = max(inlist(vble, "remittances", "govt_aid", "hum_aid") ///
    & top_three)
drop top_three
reshape wide
rename amt* *

Now, there is another glitch remaining to resolve. Although it does not happen in the example data, it is possible that one of the three focal categories (remittances, govt_aid, hum_aid) is tied for third place with one or more of the non-focal categories, and the first and second place categories are both non-focal. Then depending on how you rank the tied values, that tied focal value might be called rank 3 or rank 4 or even higher, and the choice of ranking would change the value of unstable_inc. The above code will break such ties randomly and irreproducibly, so it is not ideal for the purpose unless you can guarantee that you will never encounter such ties. Safer would be to state a rule that covers this situation and implement that rule in the code.
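To see whether such ties occur in a given dataset, a check along the following lines might help. It is only a sketch: it assumes the original wide data from #1 are in memory, the variable name tie_at_third is hypothetical, and it flags any tie between the third and fourth largest amounts, whether or not a focal category is involved.

```stata
* Sketch only: flag households with a tie at third place.
* Assumes the original wide data from post #1 are in memory.
* tie_at_third is a hypothetical variable name.
preserve
foreach v in empl agric loans assets remittances govt_aid hum_aid {
    rename `v' amt`v'
}
reshape long amt, i(HHID) j(vble) string
drop if missing(amt)
* After sorting, amt[_N-2] is the third-largest amount and amt[_N-3] the fourth-largest
bysort HHID (amt): gen byte tie_at_third = (_N >= 4) & (amt[_N-2] == amt[_N-3])
* Keep one row per household and list the flagged ones
by HHID: keep if _n == 1
list HHID if tie_at_third
restore
```

Because the work is wrapped in preserve and restore, the data in memory are left unchanged once the list has been displayed.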

Note: This is one of many situations in Stata where having the data in long layout makes a computation much easier to do.

Last edited by Clyde Schechter; 30 Mar 2024, 14:07.
Stephen Okiya

Join Date: Jul 2025

Posts: 280
#3

30 Mar 2024, 14:47

Thanks so much Clyde Schechter for your proposed solution and catching the error in my data example. You are right - values for unstable_inc are inaccurate in HHIDs 11 and 12.

About the ties: there is a high chance that some will occur at third place, and in those cases unstable_inc should take the value 1.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

30 Mar 2024, 15:59

So we can implement that with

Code:

rename empl-hum_aid amt=
reshape long amt, i(HHID) j(vble) string
drop if missing(amt)
bysort HHID (amt): gen byte top_three = amt >= amt[_N-2]
by HHID: egen byte wanted = max(inlist(vble, "remittances", "govt_aid", "hum_aid") ///
    & top_three)
drop top_three
reshape wide
rename amt* *

Now, having implemented this, I realize that I misspoke when I said that there are no instances of such ties in your example data. There are, indeed, such instances, in particular HHIDs 5 and 10. In those cases, there are 2 or no non-zero sources of income, respectively, and so one of the focal sources, although it is zero, is now tied for third place.

I don't know how that works for your purposes. On the one hand, there is something odd about counting something that is 0 as being among the top three. On the other hand, with 2 or no other sources of income, it seems reasonable to consider the household as having unstable income. Needless to say, this decision should not be based on my intuitions, as this is not an area where I have anything more than a layman's understanding. So you may want to think about whether to modify this further so that a value of zero is never counted as being in the top three. If you decide you want to do that, the command that generates top_three above need only be modified as follows:

Code:

bysort HHID (amt): gen byte top_three = (amt >= amt[_N-2]) & (amt > 0)
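As a cross-check on the tie-inclusive rule, the same idea can be sketched in wide layout by counting, for each focal source, how many sources are strictly larger. This is an illustration, not a replacement for the code above: wanted_wide and n_greater are hypothetical names, and the sources are listed explicitly so the loop does not depend on variable order. One difference to keep in mind is that for a household with fewer than three nonmissing amounts, amt[_N-2] is missing in the long-layout code, so the two approaches may disagree there.

```stata
* Hypothetical wide-layout cross-check of the tie-inclusive rule.
* Assumes the data are in wide layout with the original variable names.
local sources empl agric loans assets remittances govt_aid hum_aid
gen byte wanted_wide = 0
foreach f in remittances govt_aid hum_aid {
    * Count the sources that are strictly larger than the focal source
    gen byte n_greater = 0
    foreach v of local sources {
        replace n_greater = n_greater + 1 if `v' > `f' & !missing(`v', `f')
    }
    * Fewer than three strictly larger sources: focal source is in, or tied for, the top three
    replace wanted_wide = 1 if n_greater < 3 & !missing(`f')
    * To ignore zero amounts, use this line instead of the one above:
    * replace wanted_wide = 1 if n_greater < 3 & !missing(`f') & `f' > 0
    drop n_greater
}
```

If wanted from the long-layout code is also in memory, running assert wanted == wanted_wide is one way to compare the two results.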
Stephen Okiya

Join Date: Jul 2025

Posts: 280
#5

30 Mar 2024, 23:22

Thanks a ton, Clyde Schechter, not only for providing a more appropriate approach but also for the thought-provoking possibilities in the data. I will explore the dataset to see what would be the best fit.