Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • find the second maximum value in row

    Dear Stata Users,

    Using -rowmax()- of egen function, we can find the maximum value in varlist for each observation (row). And now I want to find the second maximum value besides the first in each row. Is there any way to solve the problem? Commands and data example are as follows. Thank you.

    Code:
    egen firstmax=rowmax(h1a h1b h1c h1d h1e h1f h1n h1o h1p h1q)
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long(h1a h1b h1c h1d h1e h1f) int h1g long h1h int h1i long h1j int h1k long h1l int h1m long(h1n h1o h1p h1q)
     40000  2000 0      0 20000 0 0 2000 0  3000    0  300     0  7000 0     0     0
         0  3000 0   1000     0 0 0    0 0     0 2220  325     0     0 0     0     0
         0 50000 0      0     0 0 0    0 0     0    0  900     0     0 0     0 20000
     30000 15000 0      0     0 0 0    0 0  2500    0    0     0     0 0     0     0
         0     0 0      0     0 0 0    0 0     0    0    0     0     0 0 50000     0
         0 50000 0  36500  2000 0 0    0 0  3420    0    0   960     0 0     0     0
      1500 30000 0      0     0 0 0    0 0     0    0 1600     0  1000 0     0     0
     50000     0 0      0     0 0 0    0 0 30000 1200    0  3600     0 0     0     0
     60000 30000 0      0     0 0 0    0 0 10000    0  630 10000 10000 0     0  5500
     70000     0 0      0   500 0 0 1500 0     0    0    0     0     0 0     0     0
     20000  4000 0      0     0 0 0 5000 0  2400    0  500     0  1000 0     0     0
      6000  5500 0      0     0 0 0    0 0  2600    0  375     0   500 0     0     0
         0     0 0      0     0 0 0    0 0     0    0    0     0     0 0  5000     0
     90000     0 0      0     0 0 0 4500 0     0    0    0     0     0 0     0     0
     40000  1800 0  60000     0 0 0    0 0     0    0  375     0     0 0     0     0
    100000     0 0      0     0 0 0 5000 0     0    0  500     0     0 0     0     0
     50000 22000 0  30000     0 0 0    0 0  1800    0  600     0   200 0  2000     0
         0   500 0      0     0 0 0    0 0   500    0  500     0     0 0  1000     0
     30000 30000 0 300000  3000 0 0    0 0     0    0    0     0  1500 0     0     0
     20000     0 0      0     0 0 0  500 0   200    0    0     0     0 0     0     0
    end

  • #2
    So, there are two obstacles to doing this. One is intrinsic to the problem, the other is having the data in wide layout. The intrinsic difficulty is that you have a series of h1* variables, but some of them play no role in the calculation of firstmax and secondmax, so that has to be accounted for somehow in the code.

    This works:

    Code:
    // GO TO LONG LAYOUT
    gen long obs_no = _n
    reshape long h1, i(obs_no) j(suffix) string
    
    // INDICATE WHICH OBSERVATIONS (ORIGINALLY VARIABLES)
    // ARE INCLUDED IN THE CALCULATION
    gen byte include = !inlist(suffix, "g", "h", "i", "j", "k", "l", "m")
    
    // FIND THE FIST AND SECOND MAX
    by obs_no include (h1), sort: gen firstmax = h1[_N]
    by obs_no include (h1): gen secondmax = h1[_N-1]
    
    // CLOBBER FIRST AND SECOND MAXES THAT ARISE
    // FROM EXCLUDED VARIABLES
    by obs_no (include): replace firstmax = firstmax[_N]
    by obs_no (include): replace secondmax = secondmax[_N]
    
    //  CLEAN UP
    drop include
    
    //    AND IF THERE IS SOME REAL REASON TO GO BACK TO WIDE LAYOUT
    //    YOU CAN NOW DO:
    reshape wide
    I don't know what your plans for this data are, but if you have additional analysis in mind, you are most likely better off skiping the final -reshape wide- command and leaving the data in long layout. There are only a few things in Stata that can be done easily (or even at all) in wide layout; Stata is mostly optimized for working with long data. So I recommend you keep things long unless you have specific analyses that work well with wide. If, in the end, you need to return to wide layout to create a layout that is easily read by human eyes, you can do that after all the analyses are done.

    Comment


    • #3
      Dear Clyde Schechter , thank you very much. The data I showed above is extracted from a household income survey. Variables from h1a to h1q indicate separate income sources. I want to know each household's first and second main income sources (have the largest and second largest values). Before I open this topic, I have tried to drop the first main income sources and re-generate the secondmax variable using the -rowmax()- function, see the code below. However, I don't want to drop any variables (actually I replace it with missing values) before taking the next step. And your answer demonstrate again that maybe I have to destroy something before stepping forward. Thank you again.

      Code:
      egen incfirst=rowmax(h1a-h1q)  /*first main income sources*/
      
      foreach v of varlist h1a-h1r {
       replace `v'=. if `v'==incfirst
       }
       
      egen incsecond=rowmax(h1a-h1q) /*second main income sources*/

      Comment


      • #4
        And your answer demonstrate again that maybe I have to destroy something before stepping forward.
        I don't understand this. The code that I wrote in #2 destroys nothing at all. Every bit of information in the original data is preserved intact throughout.

        By the way, the code you show in #3, in addition to destroying some of the data, is also incorrect. It will not produce the right results if the first and second largest sources of income are tied, because it will replace with missing all values that are equal to the largest source and then find the next largest remaining. But if the first and second largest are tied, then that next largest remaining source will be the third, or fourth, or conceivably even lower ranked.
        Last edited by Clyde Schechter; 12 Sep 2018, 00:26.

        Comment


        • #5
          See https://www.stata-journal.com/sjpdf....iclenum=pr0046 for a review of working rowwise. The command rowsort helps here:


          Code:
          . rowsort h1*, gen(H1-H17) descending 
          
          . list H1 H2 
          
               +----------------+
               |     H1      H2 |
               |----------------|
            1. |  40000   20000 |
            2. |   3000    2220 |
            3. |  50000   20000 |
            4. |  30000   15000 |
            5. |  50000       0 |
               |----------------|
            6. |  50000   36500 |
            7. |  30000    1600 |
            8. |  50000   30000 |
            9. |  60000   30000 |
           10. |  70000    1500 |
               |----------------|
           11. |  20000    5000 |
           12. |   6000    5500 |
           13. |   5000       0 |
           14. |  90000    4500 |
           15. |  60000   40000 |
               |----------------|
           16. | 100000    5000 |
           17. |  50000   30000 |
           18. |   1000     500 |
           19. | 300000   30000 |
           20. |  20000     500 |
               +----------------+
          As the maximum occurs in different variables (and is not guaranteed to occur just once) dropping the variable with the maximum is not going to work.

          In strategic terms I agree with Clyde. Long layout is generally preferable in Stata.

          Comment


          • #6
            Dear Mr. Clyde Schechter, I am sorry, the word 'destory' was misused, I actually mean some 'dirty work' (maybe also misused) such as dropping certain variable or changing data format.

            Nick Cox, thank you for introducing -rowsort-, it works very well. However, there's something wrong with the option -descending-, it should be wrote in brief as -descend-. (installed through SSC, and version information *! NJC 1.2.0 22 November 2005)

            And just as what you've said in rowsort remarks:
            rowsort loops over observations and may be relatively slow. It may be faster to reshape, sort within blocks, and reshape again.
            there's a consensus on how to find the first and second maximum value between you and Schechter. And my solution in #3 ignores the tied data.
            Last edited by Chen Samulsion; 12 Sep 2018, 01:56.

            Comment


            • #7
              My code in #5 was an exact copy of code that works. As flagged by the link, I am using rowsort from the Stata Journal which starts

              Code:
              *! NJC 2.0.0 30 January 2009 
              * NJC 1.2.0 22 November 2005 
              * NJC 1.1.0 21 November 2000 
              program rowsort  
                  version 9 
                  syntax varlist [if] [in], Generate(str) [ Descending HIGHmissing ]
              There is no reason to use the older rowsort from SSC unless you're using Stata 7 or 8.

              Comment

              Working...
              X