  • Create a dummy variable based on ranking of variables

    Dear Stata users,
    I am using Stata 18 and would like to create a dummy variable that equals 1 if at least one of the three variables remittances, govt_aid, and hum_aid is among the top 3 values of the variables empl through hum_aid.
    The example dataset is below.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double HHID str5 strata double(rand empl agric loans assets remittances govt_aid hum_aid) long unstable_inc
     1 "Urban" 23145 1000  750    0 500    0    0    0 0
     2 "Rural" 13755 1750  650  350   0  200    0    0 0
     3 "Rural" 17890  480  400  300   0 1500 1200  800 1
     4 "Urban" 25634  630  400    0   0 1250 1000 1200 1
     5 "Rural" 56231 1200    0    0 700    0    0    0 0
     6 "Rural" 96541 1000  750 1200   0  500    0    0 0
     7 "Urban" 23654 1700    0  500   0  700    0    . 0
     8 "Urban" 52361  500    0  250 500  900  900 1400 1
     9 "Rural" 74125  300  750  500 150 1200 1300 1500 1
    10 "Rural" 96124    0    0    0   0    0    0    0 0
    11 "Urban" 37851  400  300  700 500 1000  800  200 0
    12 "Rural" 85321  900  900    0 750    0    0 1100 0
    13 "Urban" 64123    . 1400  700 600  800  400 1200 1
    14 "Urban" 65241  800  400  400 500 1100  900  400 1
    15 "Urban" 10789  350  450  750   0    0  850  950 1
    16 "Rural" 36789  850  850  500 600 1050  850    0 1
    end
    My desired variable is unstable_inc.
    Thanks in advance!

  • #2
    I'm not sure I understand what you want. As I understand your problem description, in HHIDs 11 and 12, we should have unstable_inc == 1 because remittances and hum_aid are, respectively, the largest among the variable values. But you have unstable_inc == 0. I'm going to ignore your unstable_inc variable on the assumption that you made a mistake doing the calculations. The following code does what you described in words, although it disagrees with unstable_inc in HHIDs 11 and 12:
    Code:
    rename empl-hum_aid amt=
    reshape long amt, i(HHID) j(vble) string
    drop if missing(amt)
    bysort HHID (amt): gen byte top_three = (_N-_n < 3)
    by HHID: egen byte wanted = max(inlist(vble, "remittances", "govt_aid", "hum_aid") ///
        & top_three)
    drop top_three
    reshape wide
    rename amt* *
    Now, there is another glitch remaining to resolve. Although it does not happen in the example data, it is possible that one of the three focal categories (remittances, govt_aid, hum_aid) is tied for third place with one or more of the non-focal categories, and the first and second place categories are both non-focal. Then depending on how you rank the tied values, that tied focal value might be called rank 3 or rank 4 or even higher, and the choice of ranking would change the value of unstable_inc. The above code will break such ties randomly and irreproducibly, so it is not ideal for the purpose unless you can guarantee that you will never encounter such ties. Safer would be to state a rule that covers this situation and implement that rule in the code.
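Before settling on a tie rule, it may help to check whether third-place ties occur at all. The following is only a rough sketch of one way to flag them. The three households (HHIDs 101 to 103) are made up for illustration and are not from the example data, and the variable name tie_third is hypothetical.

```stata
* Made-up data: HHIDs 101-103 are hypothetical, not from the thread's example
clear
input double HHID double(empl agric loans assets remittances govt_aid hum_aid)
101 900 800 500 0 500 0 0
102 900 800 700 0 500 0 0
103   0   0   0 0   0 0 0
end

* Same long layout as in the code above
rename empl-hum_aid amt=
reshape long amt, i(HHID) j(vble) string
drop if missing(amt)

* Within each household, sorted by amount, the third-largest value sits at
* position _N-2 and the fourth-largest at _N-3. Equal values there mean
* third place is tied. (tie_third is a hypothetical name.)
bysort HHID (amt): gen byte tie_third = (_N >= 4) & (amt[_N-2] == amt[_N-3])

* Keep one row per household and show the flag
bysort HHID: keep if _n == 1
list HHID tie_third, noobs
```

In this made-up data, household 101 has loans (non-focal) tied with remittances (focal) for third place, which is exactly the problematic case described above. Household 103 is flagged as well because all of its amounts are zero.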

    Note: This is one of many situations in Stata where having the data in long layout makes a computation much easier to do.
    Last edited by Clyde Schechter; 30 Mar 2024, 15:07.



    • #3
      Thanks so much, Clyde Schechter, for your proposed solution and for catching the error in my data example. You are right: the values of unstable_inc are inaccurate in HHIDs 11 and 12.

      About the ties: there is a high chance that there will be some ties for third place, and in those cases unstable_inc should take the value 1.



      • #4
        So we can implement that with
        Code:
        rename empl-hum_aid amt=
        reshape long amt, i(HHID) j(vble) string
        drop if missing(amt)
        bysort HHID (amt): gen byte top_three = amt >= amt[_N-2]
        by HHID: egen byte wanted = max(inlist(vble, "remittances", "govt_aid", "hum_aid") ///
            & top_three)
        drop top_three
        reshape wide
        rename amt* *
        Now, having implemented this, I realize that I misspoke when I said that there are no instances of such ties in your example data. There are, indeed, such instances. In particular HHIDs 5 and 10. In both of those cases, there are 2 or no non-zero sources of income, respectively, and so one of the focused sources, although it is zero, is now tied for third place.

        I don't know how that works for your purposes. On the one hand, there is something odd about counting something that is 0 as being among the top three. On the other hand, with 2 or no other sources of income, it seems reasonable to consider the household as having unstable income. Needless to say, this decision should not be based on my intuitions, as this is not an area where I have anything more than a layman's understanding. So you may want to think about whether you want to further modify this to exclude counting anything as top three when it is zero. If you decide you want to do that, the bold-faced command above (the -bysort- line that generates top_three) need only be modified as follows:
        Code:
        bysort HHID (amt): gen byte top_three = (amt >= amt[_N-2]) & (amt > 0) 
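To make the difference between the two rules concrete, here is a small sketch that applies both of them side by side to HHIDs 5, 10, and 11 from the example data. The variable names focal, top3_ties, top3_pos, wanted_ties, and wanted_pos are hypothetical and chosen only for this illustration.

```stata
* Three households copied from the example data (HHIDs 5, 10, 11)
clear
input double HHID double(empl agric loans assets remittances govt_aid hum_aid)
 5 1200   0   0 700    0   0   0
10    0   0   0   0    0   0   0
11  400 300 700 500 1000 800 200
end

rename empl-hum_aid amt=
reshape long amt, i(HHID) j(vble) string
drop if missing(amt)

* Marker for the three focal income sources (hypothetical helper variable)
gen byte focal = inlist(vble, "remittances", "govt_aid", "hum_aid")

* Rule A: everything tied for third place counts as top three, zeros included
bysort HHID (amt): gen byte top3_ties = amt >= amt[_N-2]
* Rule B: same, except that a zero amount never counts as top three
bysort HHID (amt): gen byte top3_pos = (amt >= amt[_N-2]) & (amt > 0)

bysort HHID: egen byte wanted_ties = max(focal & top3_ties)
bysort HHID: egen byte wanted_pos  = max(focal & top3_pos)

* One row per household for comparison
bysort HHID: keep if _n == 1
list HHID wanted_ties wanted_pos, noobs
```

Under these assumptions the two rules should disagree only for HHIDs 5 and 10, where the focal sources are zero, and agree for HHID 11, where remittances and govt_aid are clearly among the three largest amounts.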



        • #5
          Thanks a ton, Clyde Schechter, not only for providing a more appropriate approach but also for the thought-provoking possibilities in the data. I will explore the dataset to see what would be the best fit.

