return first non-zero value

River Huang

Join Date: Mar 2016

Posts: 1906
#1

return first non-zero value

07 May 2021, 18:21

Dear All, Suppose that I have this dataset

Code:

clear input id x 1 . 1 0 1 3 1 4 2 0 2 . 2 0 2 3 end

For each `id', I'd like to return the first non-zero value. Any suggesitons? Thanks.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#2

07 May 2021, 18:43

In the data you show, if I take your request literally, the first non-zero value for both id's is a missing value. I'm going to assume that you really mean first non-zero and non-missing value.

Code:

gen byte not_these = missing(x) | x == 0 gen long obs_no = _n by id (not_these obs_no), sort: gen wanted = x[1]

If my assumption is wrong and missing value is acceptable as a non-zero value for your purposes, then just change the first command to eliminate -missing(x) | -.
2 likes
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

07 May 2021, 18:44

Not quite clear what you mean by "return": would collapse to a single observation per id be what you want, or do you want to add a new variable to your existing dataset?

In the latter case, this seems to do what you want.

Code:

generate seq = _n
generate fnz = .
bysort id (seq): replace fnz = cond(!missing(fnz[_n-1]),fnz[_n-1],cond(x!=0,x,.))
bysort id (seq): replace fnz = fnz[_N]
drop seq
list, sepby(id) noobs

Code:

. list, sepby(id) noobs

  +--------------+
  | id   x   fnz |
  |--------------|
  |  1   .     3 |
  |  1   0     3 |
  |  1   3     3 |
  |  1   4     3 |
  |--------------|
  |  2   0     3 |
  |  2   .     3 |
  |  2   0     3 |
  |  2   3     3 |
  +--------------+

Comment

Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#4

08 May 2021, 01:07

Code:

bys id: gen wanted = total(x*(sum(x/x)==1))
3 likes
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35432

08 May 2021, 04:07

Here's yet another take. The idea of "first" is empty without some idea of an order within groups. Here the criterion is taken to be at least 1 but not missing. Substitute your own criterion if different.

Code:

clear
input id x
1 .
1 0
1 3
1 4
2 0
2 . 
2 0
2 3
end

gen long seq = _n 

bysort id (seq): egen when = min(cond(inrange(x, 1, .), seq, .)) 
gen wanted = x[when]

list , sepby(id) 

     +------------------------------+
     | id   x   seq   when   wanted |
     |------------------------------|
  1. |  1   .     1      3        3 |
  2. |  1   0     2      3        3 |
  3. |  1   3     3      3        3 |
  4. |  1   4     4      3        3 |
     |------------------------------|
  5. |  2   0     5      8        3 |
  6. |  2   .     6      8        3 |
  7. |  2   0     7      8        3 |
  8. |  2   3     8      8        3 |
     +------------------------------+

Code:

The cond() trick is written up in Section 9 of https://www.stata-journal.com/articl...article=dm0055

Related questions:

https://www.stata.com/support/faqs/d...ing-last-date/

https://www.stata.com/support/faqs/d...t-occurrences/

Comment

Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#6

08 May 2021, 19:51

1. A typo in #4. The code should be:

Code:

bys id: egen awanted = sum(x*(sum(x/x)==1))

2. In #5,

Code:

inrange(x, 1, .)

could be replaced by

Code:

!inlist(0,.)

or

Code:

x/x != 1

That helps the code applicable for more general cases, where x may be less than 1 (including being minus).
2 likes
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35432
#7

09 May 2021, 03:00

#6

Code:

!inlist(0, .)

will be fine so long as there no extended missing values .a to .z to exclude.

The last line is a clever trick; the worry is that it needs the reader to puzzle out what is going on.
1 like
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4374
#8

09 May 2021, 04:25

Nick, fine, then:

Code:

!inlist(0, .) & !inrange(.a, .z)

I'm no big fan of code golf, but her one-liners are absolutely brilliant. I stand in awe every time.

Credit where credit is due.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35432
#9

09 May 2021, 04:39

#8 Joseph Coveney I agree: it's a clever trick. The zone between clever tricks and unclear code is large and tempting to me too.

Code:

if x != 0 & !missing(x)

is another way to do it.

Note that in #6 #7 #8 we all mean

Code:

!inlist(x, 0,.)
Comment
Joro Kolev

Join Date: Aug 2018

Posts: 3047
#10

09 May 2021, 06:01

Clyde Schechter , William Lisowski , Nick Cox , Joseph Coveney
How about we start running "competitions" on Stata code, say:
1. Most efficient Stata code doing something in shortest way possible.
2. Most obscure Stata code, which still does something useful.
3. Most readable and self explanatory Stata code?

At some point William posted something on languages which are on purpose designed to be obscure, and following the links I read something about competitions in C (I think) like the ones above.

Regarding Romalpa Akzo one liners, apart from the considerations of Nick and Joseph regarding readability vs efficiency, there is one more consideration. Understanding what Romalpa did is not easy and might take some time, but if one does understand, this understanding is helpful because it clarified how Stata thinks.

On another level, it is funny how so many people did this apparently simple task in so many different ways.

To me, the natural way to do this is:

Code:

. gen firstnomiss = x (2 missing values generated) . bysort id: replace firstnomiss = firstnomiss[_n-1] if !missing(firstnomiss[_n-1]) & firstnomiss[_n-1]!=0 (1 real change made) . by id: replace firstnomiss = firstnomiss[_N] (5 real changes made)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35432
#11

09 May 2021, 06:40

@Joro Kolev: Your code depends on stable sorting. As implied in #5 why is anyone interested in first occurrence without information on what defines order of occurrence.
Comment
Joro Kolev

Join Date: Aug 2018

Posts: 3047
#12

09 May 2021, 07:18

I automatically assumed what you questioned in #5, that the sort has a meaning, and has to be preserved.

However what you are saying indeed raises 2 questions regarding how Stata operates, to which I do not know the answers.

Questions 1) How does Stata determine the current sort of the data? In this simple example, I firstly tried to do what I had in mind without any sorting, and Stata incorrectly claimed that my data is not sorted

Code:

. gen firstnomiss = x (2 missing values generated) . by id: replace firstnomiss = firstnomiss[_n-1] if !missing(firstnomiss[_n-1]) & firstnomiss[_n-1]!=0 not sorted r(5);

what Stata claims is incorrect, the data is sorted by id.

Question 2) There is the grave warning in the help file of -sort-, that the order of ties with respect to the sorting variable might not be preserved, unless you specify the -sort, stable- option. I do diligently specify the option -,stable- when I deal with data big enough so that I cannot eyeball the data... But in such small example datasets that we are dealing here on Statalist, I cannot recall any example of when sorting without the -,stable- option, messed up the original order of the ties. E.g., this was the case here, I eyeballed the sorted data, after Stata made me sort it, and the ties were sorted in the original order.

Then Question 2) is Does Stata ever at all mess up the order of the ties if you do not specify the -sort,stable- option? Is yes, when does this happen?

I think Nick says that my code should have been as follows, to avoid the possibility that Stata messes up the order of the ties:

Code:

. sort id, stable . gen firstnomiss = x (2 missing values generated) . by id: replace firstnomiss = firstnomiss[_n-1] if !missing(firstnomiss[_n-1]) & firstnomiss[_n-1]!=0 (1 real change made) . by id: replace firstnomiss = firstnomiss[_N] (5 real changes made)

Originally posted by Nick Cox View Post

@Joro Kolev: Your code depends on stable sorting. As implied in #5 why is anyone interested in first occurrence without information on what defines order of occurrence.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#13

09 May 2021, 07:44

Does Stata ever at all mess up the order of the ties if you do not specify the -sort,stable- option? Is yes, when does this happen?

Yes, in the following example.

Code:

. sysuse auto, clear (1978 Automobile Data) . sort price . sort foreign . list foreign price in 1/10, clean noobs foreign price Domestic 7,827 Domestic 11,497 Domestic 4,733 Domestic 4,647 Domestic 4,425 Domestic 6,342 Domestic 3,955 Domestic 6,486 Domestic 4,504 Domestic 15,906 .

Added in edit: I would not expect Stata developers to try to explain precisely under what circumstances the stable option is unneeded, because as night follows day someone would rely on that and have their code break when Stata changed the sort algorithms. But — especially as we see users with millions of observations in their data, perhaps imported from a non-Stata source that was sorted beforehand — it would be helpful to have something like the following
sort varlist, confirm to assert the dataset in memory is sorted by a particular set of variables ...

which Stata then confirms leaving the dataset unchanged, using code that is designed for efficiency in large datasets ...

and if the sort order is confirmed, sets the metadata appropriately so the by prefix recognizes it as sorted by the listed variables

Last edited by William Lisowski; 09 May 2021, 08:03.
2 likes
Comment

Joro Kolev

Join Date: Aug 2018
Posts: 3047

#14

09 May 2021, 09:22

So here for added fun I generated some more formidable data, that looks more or less like the data originally presented.

1. These methods are not equivalent.

2. These methods are not equally fast.

Code:

. clear

. * Generate the data
. set obs 10000000
number of observations (_N) was 0, now 10,000,000

. gen x =  runiformint(0,4)

. replace x = . if x==1
(1,999,616 real changes made, 1,999,616 to missing)

. egen id = seq(), block(10)

. gen originaln = _n

. 
. timer clear

. 
. * Clyde
. timer on 1

. gen byte not_these = missing(x) | x == 0

. gen long obs_no = _n

. by id (not_these obs_no), sort: gen wanted = x[1]
(580 missing values generated)

. timer off 1

. sort originaln

. 
. 
. *William
. timer on 2

. generate seq = _n

. generate fnz = .
(10,000,000 missing values generated)

. bysort id (seq): replace fnz = cond(!missing(fnz[_n-1]),fnz[_n-1],cond(x!=0,x,.))
(9,332,432 real changes made)

. bysort id (seq): replace fnz = fnz[_N]
(666,498 real changes made)

. timer off 2

. drop seq

. sort originaln

. 
. *Nick
. timer on 3

. gen long seq = _n 

. bysort id (seq): egen when = min(cond(inrange(x, 1, .), seq, .)) 
(1070 missing values generated)

. gen bwanted = x[when]
(1,070 missing values generated)

. timer off 3

. sort originaln

. 
. *Romalpa
. timer on 4

. bys id: egen awanted = sum(x*(sum(x/x)==1))

. timer off 4

. sort originaln

. 
. *Joro
. timer on 5

. sort id, stable

. gen firstnomiss = x
(1,999,616 missing values generated)

. by id: replace firstnomiss = firstnomiss[_n-1] if !missing(firstnomiss[_n-1]) & firstnomiss[_n-1]!=0
(6,664,517 real changes made)

. by id: replace firstnomiss = firstnomiss[_N]
(666,977 real changes made, 238 to missing)

. timer off 5

. sort originaln

. 
. summ wanted fnz bwanted awanted firstnomiss

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      wanted |  9,999,420    3.000113    .8168021          0          4
         fnz |  9,998,930     3.00026    .8165521          2          4
     bwanted |  9,998,930     3.00026    .8165521          2          4
     awanted | 10,000,000    2.999939     .817098          0          4
 firstnomiss |  9,999,450    3.000104    .8168175          0          4

. 
. timer list 
   1:     22.35 /        1 =      22.3490
   2:      9.66 /        1 =       9.6570
   3:     29.38 /        1 =      29.3820
   4:     11.42 /        1 =      11.4210
   5:      9.34 /        1 =       9.3440

.

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#15

09 May 2021, 10:18

I am pleased that my results and Nick's (fnz and bwanted) agree and produce answers between 2 and 4.

I believe that Clyde's results would be improved by

Code:

by id (not_these obs_no), sort: gen wanted = x[1] if not_these[1]==0

which would take care of the cases for which all observations were zero or missing and the first observation was a zero.

I believe the Romalpa's results suffer only from not having ensured a stable sort.

I believe that Joro's results would be improved by

Code:

generate firstnonmiss = x if x!=0 by id: replace firstnomiss = firstnomiss[_n-1] if !missing(firstnomiss[_n-1])

where the second command now omits checking for a value of 0.
2 likes
Comment

Announcement