Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • return first non-zero value

    Dear All, Suppose that I have this dataset
    Code:
    clear
    input id x
    1 .
    1 0
    1 3
    1 4
    2 0
    2 . 
    2 0
    2 3
    end
    For each `id', I'd like to return the first non-zero value. Any suggesitons? Thanks.
    Ho-Chuan (River) Huang
    Stata 19.0, MP(4)

  • #2
    In the data you show, if I take your request literally, the first non-zero value for both id's is a missing value. I'm going to assume that you really mean first non-zero and non-missing value.

    Code:
    gen byte not_these = missing(x) | x == 0
    gen long obs_no = _n
    
    by id (not_these obs_no), sort: gen wanted = x[1]
    If my assumption is wrong and missing value is acceptable as a non-zero value for your purposes, then just change the first command to eliminate -missing(x) | -.

    Comment


    • #3
      Not quite clear what you mean by "return": would collapse to a single observation per id be what you want, or do you want to add a new variable to your existing dataset?

      In the latter case, this seems to do what you want.
      Code:
      generate seq = _n
      generate fnz = .
      bysort id (seq): replace fnz = cond(!missing(fnz[_n-1]),fnz[_n-1],cond(x!=0,x,.))
      bysort id (seq): replace fnz = fnz[_N]
      drop seq
      list, sepby(id) noobs
      Code:
      . list, sepby(id) noobs
      
        +--------------+
        | id   x   fnz |
        |--------------|
        |  1   .     3 |
        |  1   0     3 |
        |  1   3     3 |
        |  1   4     3 |
        |--------------|
        |  2   0     3 |
        |  2   .     3 |
        |  2   0     3 |
        |  2   3     3 |
        +--------------+

      Comment


      • #4
        Code:
        bys id: gen wanted = total(x*(sum(x/x)==1))

        Comment


        • #5
          Here's yet another take. The idea of "first" is empty without some idea of an order within groups. Here the criterion is taken to be at least 1 but not missing. Substitute your own criterion if different.


          Code:
          clear
          input id x
          1 .
          1 0
          1 3
          1 4
          2 0
          2 . 
          2 0
          2 3
          end
          
          gen long seq = _n 
          
          bysort id (seq): egen when = min(cond(inrange(x, 1, .), seq, .)) 
          gen wanted = x[when]
          
          list , sepby(id) 
          
               +------------------------------+
               | id   x   seq   when   wanted |
               |------------------------------|
            1. |  1   .     1      3        3 |
            2. |  1   0     2      3        3 |
            3. |  1   3     3      3        3 |
            4. |  1   4     4      3        3 |
               |------------------------------|
            5. |  2   0     5      8        3 |
            6. |  2   .     6      8        3 |
            7. |  2   0     7      8        3 |
            8. |  2   3     8      8        3 |
               +------------------------------+
          Code:
          
          


          The cond() trick is written up in Section 9 of https://www.stata-journal.com/articl...article=dm0055

          Related questions:

          https://www.stata.com/support/faqs/d...ing-last-date/

          https://www.stata.com/support/faqs/d...t-occurrences/


          Comment


          • #6
            1. A typo in #4. The code should be:
            Code:
            bys id: egen awanted = sum(x*(sum(x/x)==1))
            2. In #5,
            Code:
            inrange(x, 1, .)
            could be replaced by
            Code:
            !inlist(0,.)
            or
            Code:
            x/x != 1

            That helps the code applicable for more general cases, where x may be less than 1 (including being minus).

            Comment


            • #7
              #6

              Code:
              !inlist(0, .) 
              will be fine so long as there no extended missing values .a to .z to exclude.

              The last line is a clever trick; the worry is that it needs the reader to puzzle out what is going on.

              Comment


              • #8
                Nick, fine, then:
                Code:
                !inlist(0, .) & !inrange(.a, .z)
                I'm no big fan of code golf, but her one-liners are absolutely brilliant. I stand in awe every time.

                Credit where credit is due.

                Comment


                • #9
                  #8 Joseph Coveney I agree: it's a clever trick. The zone between clever tricks and unclear code is large and tempting to me too.

                  Code:
                  if x != 0 & !missing(x)
                  is another way to do it.

                  Note that in #6 #7 #8 we all mean

                  Code:
                   
                   !inlist(x, 0,.)

                  Comment


                  • #10
                    Clyde Schechter , William Lisowski , Nick Cox , Joseph Coveney
                    How about we start running "competitions" on Stata code, say:
                    1. Most efficient Stata code doing something in shortest way possible.
                    2. Most obscure Stata code, which still does something useful.
                    3. Most readable and self explanatory Stata code?

                    At some point William posted something on languages which are on purpose designed to be obscure, and following the links I read something about competitions in C (I think) like the ones above.

                    Regarding Romalpa Akzo one liners, apart from the considerations of Nick and Joseph regarding readability vs efficiency, there is one more consideration. Understanding what Romalpa did is not easy and might take some time, but if one does understand, this understanding is helpful because it clarified how Stata thinks.

                    On another level, it is funny how so many people did this apparently simple task in so many different ways.

                    To me, the natural way to do this is:

                    Code:
                    . gen firstnomiss = x
                    (2 missing values generated)
                    
                    . bysort id: replace firstnomiss = firstnomiss[_n-1] if !missing(firstnomiss[_n-1]) & firstnomiss[_n-1]!=0
                    (1 real change made)
                    
                    . by id: replace firstnomiss = firstnomiss[_N]
                    (5 real changes made)

                    Comment


                    • #11
                      @Joro Kolev: Your code depends on stable sorting. As implied in #5 why is anyone interested in first occurrence without information on what defines order of occurrence.

                      Comment


                      • #12
                        I automatically assumed what you questioned in #5, that the sort has a meaning, and has to be preserved.

                        However what you are saying indeed raises 2 questions regarding how Stata operates, to which I do not know the answers.

                        Questions 1) How does Stata determine the current sort of the data? In this simple example, I firstly tried to do what I had in mind without any sorting, and Stata incorrectly claimed that my data is not sorted

                        Code:
                        . gen firstnomiss = x
                        (2 missing values generated)
                        
                        . by id: replace firstnomiss = firstnomiss[_n-1] if !missing(firstnomiss[_n-1]) & firstnomiss[_n-1]!=0
                        not sorted
                        r(5);
                        what Stata claims is incorrect, the data is sorted by id.

                        Question 2) There is the grave warning in the help file of -sort-, that the order of ties with respect to the sorting variable might not be preserved, unless you specify the -sort, stable- option. I do diligently specify the option -,stable- when I deal with data big enough so that I cannot eyeball the data... But in such small example datasets that we are dealing here on Statalist, I cannot recall any example of when sorting without the -,stable- option, messed up the original order of the ties. E.g., this was the case here, I eyeballed the sorted data, after Stata made me sort it, and the ties were sorted in the original order.

                        Then Question 2) is Does Stata ever at all mess up the order of the ties if you do not specify the -sort,stable- option? Is yes, when does this happen?

                        I think Nick says that my code should have been as follows, to avoid the possibility that Stata messes up the order of the ties:

                        Code:
                        . sort id, stable
                        
                        . gen firstnomiss = x
                        (2 missing values generated)
                        
                        . by id: replace firstnomiss = firstnomiss[_n-1] if !missing(firstnomiss[_n-1]) & firstnomiss[_n-1]!=0
                        (1 real change made)
                        
                        . by id: replace firstnomiss = firstnomiss[_N]
                        (5 real changes made)

                        Originally posted by Nick Cox View Post
                        @Joro Kolev: Your code depends on stable sorting. As implied in #5 why is anyone interested in first occurrence without information on what defines order of occurrence.

                        Comment


                        • #13
                          Does Stata ever at all mess up the order of the ties if you do not specify the -sort,stable- option? Is yes, when does this happen?
                          Yes, in the following example.
                          Code:
                          . sysuse auto, clear
                          (1978 Automobile Data)
                          
                          . sort price
                          
                          . sort foreign
                          
                          . list foreign price in 1/10, clean noobs
                          
                               foreign    price  
                              Domestic    7,827  
                              Domestic   11,497  
                              Domestic    4,733  
                              Domestic    4,647  
                              Domestic    4,425  
                              Domestic    6,342  
                              Domestic    3,955  
                              Domestic    6,486  
                              Domestic    4,504  
                              Domestic   15,906  
                          
                          .
                          Added in edit: I would not expect Stata developers to try to explain precisely under what circumstances the stable option is unneeded, because as night follows day someone would rely on that and have their code break when Stata changed the sort algorithms. But — especially as we see users with millions of observations in their data, perhaps imported from a non-Stata source that was sorted beforehand — it would be helpful to have something like the following
                          1. sort varlist, confirm to assert the dataset in memory is sorted by a particular set of variables ...
                          2. which Stata then confirms leaving the dataset unchanged, using code that is designed for efficiency in large datasets ...
                          3. and if the sort order is confirmed, sets the metadata appropriately so the by prefix recognizes it as sorted by the listed variables
                          Last edited by William Lisowski; 09 May 2021, 08:03.

                          Comment


                          • #14
                            So here for added fun I generated some more formidable data, that looks more or less like the data originally presented.

                            1. These methods are not equivalent.

                            2. These methods are not equally fast.

                            Code:
                            . clear
                            
                            . * Generate the data
                            . set obs 10000000
                            number of observations (_N) was 0, now 10,000,000
                            
                            . gen x =  runiformint(0,4)
                            
                            . replace x = . if x==1
                            (1,999,616 real changes made, 1,999,616 to missing)
                            
                            . egen id = seq(), block(10)
                            
                            . gen originaln = _n
                            
                            . 
                            . timer clear
                            
                            . 
                            . * Clyde
                            . timer on 1
                            
                            . gen byte not_these = missing(x) | x == 0
                            
                            . gen long obs_no = _n
                            
                            . by id (not_these obs_no), sort: gen wanted = x[1]
                            (580 missing values generated)
                            
                            . timer off 1
                            
                            . sort originaln
                            
                            . 
                            . 
                            . *William
                            . timer on 2
                            
                            . generate seq = _n
                            
                            . generate fnz = .
                            (10,000,000 missing values generated)
                            
                            . bysort id (seq): replace fnz = cond(!missing(fnz[_n-1]),fnz[_n-1],cond(x!=0,x,.))
                            (9,332,432 real changes made)
                            
                            . bysort id (seq): replace fnz = fnz[_N]
                            (666,498 real changes made)
                            
                            . timer off 2
                            
                            . drop seq
                            
                            . sort originaln
                            
                            . 
                            . *Nick
                            . timer on 3
                            
                            . gen long seq = _n 
                            
                            . bysort id (seq): egen when = min(cond(inrange(x, 1, .), seq, .)) 
                            (1070 missing values generated)
                            
                            . gen bwanted = x[when]
                            (1,070 missing values generated)
                            
                            . timer off 3
                            
                            . sort originaln
                            
                            . 
                            . *Romalpa
                            . timer on 4
                            
                            . bys id: egen awanted = sum(x*(sum(x/x)==1))
                            
                            . timer off 4
                            
                            . sort originaln
                            
                            . 
                            . *Joro
                            . timer on 5
                            
                            . sort id, stable
                            
                            . gen firstnomiss = x
                            (1,999,616 missing values generated)
                            
                            . by id: replace firstnomiss = firstnomiss[_n-1] if !missing(firstnomiss[_n-1]) & firstnomiss[_n-1]!=0
                            (6,664,517 real changes made)
                            
                            . by id: replace firstnomiss = firstnomiss[_N]
                            (666,977 real changes made, 238 to missing)
                            
                            . timer off 5
                            
                            . sort originaln
                            
                            . 
                            . summ wanted fnz bwanted awanted firstnomiss
                            
                                Variable |        Obs        Mean    Std. Dev.       Min        Max
                            -------------+---------------------------------------------------------
                                  wanted |  9,999,420    3.000113    .8168021          0          4
                                     fnz |  9,998,930     3.00026    .8165521          2          4
                                 bwanted |  9,998,930     3.00026    .8165521          2          4
                                 awanted | 10,000,000    2.999939     .817098          0          4
                             firstnomiss |  9,999,450    3.000104    .8168175          0          4
                            
                            . 
                            . timer list 
                               1:     22.35 /        1 =      22.3490
                               2:      9.66 /        1 =       9.6570
                               3:     29.38 /        1 =      29.3820
                               4:     11.42 /        1 =      11.4210
                               5:      9.34 /        1 =       9.3440
                            
                            .

                            Comment


                            • #15
                              I am pleased that my results and Nick's (fnz and bwanted) agree and produce answers between 2 and 4.

                              I believe that Clyde's results would be improved by
                              Code:
                              by id (not_these obs_no), sort: gen wanted = x[1] if not_these[1]==0
                              which would take care of the cases for which all observations were zero or missing and the first observation was a zero.

                              I believe the Romalpa's results suffer only from not having ensured a stable sort.

                              I believe that Joro's results would be improved by
                              Code:
                              generate firstnonmiss = x if x!=0
                              by id: replace firstnomiss = firstnomiss[_n-1] if !missing(firstnomiss[_n-1])
                              where the second command now omits checking for a value of 0.

                              Comment

                              Working...
                              X