Ignoring missing values with the 'diff' fcn of the 'egen' command

Robert Eldritch

#1

07 Jun 2018, 12:13

I want to be able to ignore missing values when using the following code to flag observations that are different on specified variables: egen newvar = diff(varlist). I'm working with string variables.

Is there a more efficient way to do this than by creating a long list of 'if' terms (I have 116 variables)?

Thanks!

Nick Cox


07 Jun 2018, 12:41

Here's a quick hack. See also https://www.stata-journal.com/sjpdf....iclenum=pr0046 for some general strategies.

Code:

clear
input str2 (foo bar bazz)
"ab" "ab" "cd"
"" "bc" "bc" 
"de" "de" "de"
"" "" "" 
end 

gen different = 0 
gen first = "" 

quietly foreach v of var foo bar bazz { 
    replace first = `v' if `v' != "" & first == "" 
    replace different = 1 if `v' != "" & `v' != first 
} 

list 


     +-------------------------------------+
     | foo   bar   bazz   differ~t   first |
     |-------------------------------------|
  1. |  ab    ab     cd          1      ab |
  2. |        bc     bc          0      bc |
  3. |  de    de     de          0      de |
  4. |                           0         |
     +-------------------------------------+
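With many variables, the loop does not need each name typed out: foreach accepts a hyphenated range or a wildcard as its varlist. A minimal sketch, assuming hypothetical variable names s1 to s4 that stand in for the 116 variables and that sit next to each other in the dataset:

```stata
clear
input str2 (s1 s2 s3 s4)
"ab" "ab" ""   "ab"
""   "bc" "cd" ""
""   ""   ""   ""
end

gen different = 0
gen first = ""

* s1-s4 means every variable from s1 through s4 in dataset order;
* s* would instead match every variable whose name starts with s
quietly foreach v of varlist s1-s4 {
    * remember the first non-empty value seen in this observation
    replace first = `v' if `v' != "" & first == ""
    * flag the observation if a later non-empty value disagrees with it
    replace different = 1 if `v' != "" & `v' != first
}

list
```

Only the varlist changes; the loop body is the same as in the three-variable example.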

Robert Eldritch

#3

07 Jun 2018, 19:14

Thanks, Nick! I also wanted to see what the first non-missing value was, so this is perfect.
Claire McKenna

#4

13 Mar 2022, 13:02

Nick, I'm trying to apply this code to a row of 30 variables (indicating labor force status); I want to generate a variable indicating whether there is any change in status over time. I'm getting "type mismatch" (the 30 variables in the varlist, the equivalent of your foo bar bazz, are byte variables). How do I resolve this? [Update: I changed "" to . and it seems to have worked.]

Last edited by Claire McKenna; 13 Mar 2022, 13:15. Reason: Figured it out!

Nick Cox


13 Mar 2022, 13:28

Claire McKenna

The question in #1 was about string variables. If you have numeric variables only then the egen function diff() will work. I don't recollect ever using it. I just discovered that it has different ideas about missing values from those I would imagine using myself.

The code here ignores missing values (including any .a through .z if included in the data) and declares that observations are different on the variables specified if (and only if) there are different non-missing values in each observation.

Code:

clear
input foo bar bazz
1 2 3 
2 2 2 
3 3 3 
. . . 
1 1 . 
1 . . 
. . . 
end 

gen first = . 
gen different = 0 

quietly foreach v of var foo bar bazz { 
    replace first = `v' if `v' < . & first == . 
    replace different = 1 if `v' < .  & `v' != first 
} 

egen DIFFERENT = diff(foo bar bazz)

list, sep(0)

     +------------------------------------------------+
     | foo   bar   bazz   first   differ~t   DIFFER~T |
     |------------------------------------------------|
  1. |   1     2      3       1          1          1 |
  2. |   2     2      2       2          0          0 |
  3. |   3     3      3       3          0          0 |
  4. |   .     .      .       .          0          0 |
  5. |   1     1      .       1          0          1 |
  6. |   1     .      .       1          0          1 |
  7. |   .     .      .       .          0          0 |
     +------------------------------------------------+
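For numeric variables, the same flag can also be sketched without a loop. This is an alternative, not the code above: the egen functions rowmin() and rowmax() ignore missing values, so an observation has differing non-missing values exactly when its row minimum and row maximum differ. The variable names rmin, rmax and different2 are made up for the illustration.

```stata
clear
input foo bar bazz
1 2 3
2 2 2
. . .
1 1 .
1 . .
end

* rowmin() and rowmax() ignore missing values within each observation
egen rmin = rowmin(foo bar bazz)
egen rmax = rowmax(foo bar bazz)

* if every value is missing, rmin and rmax are both missing and
* compare as equal, so the flag is 0 for such observations
gen different2 = rmin != rmax

list, sep(0)
```

Unlike the loop, this does not record the first non-missing value, and it does not apply to string variables.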

That said, you're evidently holding panel data in wide layout (format, structure) -- which is usually a bad idea in Stata. A reshape long is advisable.
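A minimal sketch of such a reshape, assuming hypothetical wide variables status1 to status3 (standing in for the 30 status variables) and a person identifier id:

```stata
clear
input id status1 status2 status3
1 1 1 2
2 3 . 3
end

* wide to long: one observation per person and time point;
* the new variable time takes the numeric suffixes 1, 2, 3
reshape long status, i(id) j(time)

sort id time
list, sepby(id)
```

Each person now has one observation per time point, with the status held in a single variable; missing statuses are kept as observations with missing values.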

Claire McKenna

#6

13 Mar 2022, 13:44

Thanks, Nick. egen diff picks up a transition to/from missing as a change, which I want to avoid. Your code resolves the issue. Related to this, is there a way to generate a variable that picks up the number of transitions between statuses? It would count the number of times over the row when there is a transition between different non-missing responses.

Re wide v long, I didn't know that about Stata. Do you have any documentation that helps to explain?
Nick Cox

#7

13 Mar 2022, 16:14

A long layout being better is something that runs through Stata. Just about any time series or panel procedure assumes that each variable observed over time is held in a single column (variable), with one row (observation) per time point, not spread across a row as separate variables.

The number of transitions is just -- for a long layout --

Code:

bysort id (time) : gen transitions = sum(whatever != whatever[_n-1])
by id: replace transitions = transitions[_N] - 1

but that code is not subtle about missing values.
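One way to make the count ignore missing values is to carry the last non-missing status forward and compare each non-missing status with it. This is a sketch under assumptions, not the code above: the variable names id, time, status, last and change are hypothetical, and a transition is taken to mean a non-missing status that differs from the previous non-missing status of the same person.

```stata
clear
input id time status
1 1 1
1 2 .
1 3 1
1 4 2
2 1 .
2 2 3
2 3 .
2 4 3
3 1 1
3 2 2
3 3 1
end

* carry the last non-missing status forward within each person;
* replace works observation by observation, so gaps of any length fill
bysort id (time) : gen last = status
by id : replace last = last[_n-1] if missing(last)

* a change needs a non-missing status now and a non-missing earlier
* status to compare with; in the first observation of each person
* last[_n-1] is missing, so change is 0 there
by id : gen byte change = !missing(status, last[_n-1]) & status != last[_n-1]

* total number of such changes for each person
by id : egen transitions = total(change)

list, sepby(id)
```

With these made-up data, person 1 has one transition (1 to 2, with the gap ignored), person 2 has none and person 3 has two.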
Claire McKenna

#8

18 Mar 2022, 10:27

How would I do the equivalent of egen group, looking over a column of multiple observations, over time, for a single person? (It's very possible I'm just thinking about things in the wrong way; need to reorient from wide to long)

Last edited by Claire McKenna; 18 Mar 2022, 10:31.
Nick Cox

#9

18 Mar 2022, 10:30

#8 Please spell that out in terms of inputs and what you want from them.
Claire McKenna

#10

18 Mar 2022, 12:50

I just found this, and it seems to contain answers to my questions: https://www.ls3.soziologie.uni-muenc...tacommands.pdf. Thank you anyway, Nick!
