  • Ignoring missing values with the 'diff' fcn of the 'egen' command

    I want to be able to ignore missing values when using the following code to flag observations that are different on specified variables: egen newvar = diff(varlist). I'm working with string variables.

    Is there a more efficient way to do this than by creating a long list of 'if' terms (I have 116 variables)?

    Thanks!

  • #2
    Here's a quick hack. See also https://www.stata-journal.com/sjpdf....iclenum=pr0046 for some general strategies.

    Code:
    clear
    input str2 (foo bar bazz)
    "ab" "ab" "cd"
    "" "bc" "bc" 
    "de" "de" "de"
    "" "" "" 
    end 
    
    gen different = 0 
    gen first = "" 
    
    quietly foreach v of var foo bar bazz { 
        replace first = `v' if `v' != "" & first == "" 
        replace different = 1 if `v' != "" & `v' != first 
    } 
    
    list 
    
    
         +-------------------------------------+
         | foo   bar   bazz   differ~t   first |
         |-------------------------------------|
      1. |  ab    ab     cd          1      ab |
      2. |        bc     bc          0      bc |
      3. |  de    de     de          0      de |
      4. |                           0         |
         +-------------------------------------+
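
    With many variables (116 in #1), the same loop can take a variable range or wildcard and avoid a typed-out list. The following is only a sketch under assumptions: the names s1-s4 and the toy data are hypothetical, and it presumes the variables of interest sit next to each other in the dataset.

    ```stata
    clear
    * hypothetical toy data: four string variables standing in for the 116
    input str2 (s1 s2 s3 s4)
    "ab" "" "ab" "ab"
    "" "bc" "" "cd"
    end

    gen different = 0
    gen first = ""

    * s1-s4 means every variable from s1 to s4 in dataset order
    quietly foreach v of varlist s1-s4 {
        * remember the first non-empty value in each observation
        replace first = `v' if `v' != "" & first == ""
        * flag any later non-empty value that differs from it
        replace different = 1 if `v' != "" & `v' != first
    }

    list
    ```

    Here the first observation is not flagged, because its only non-empty value is "ab", and the second is flagged, because "bc" and "cd" differ.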



    • #3
      Thanks, Nick! I also wanted to see what the first non-missing value was too, so this is perfect.



      • #4
        Nick, I'm trying to apply this code to a row of 30 variables (indicating labor force status); I want to generate a variable indicating whether there was any change in status over time. I'm getting "type mismatch" (the 30 variables in the varlist, the equivalent of your foo bar bazz, are "byte" variables). How do I resolve this? [Update: I changed "" to . and it seems to have worked.]
        Last edited by Claire McKenna; 13 Mar 2022, 13:15. Reason: Figured it out!



        • #5
          Claire McKenna

          The question in #1 was about string variables. If you have numeric variables only then the egen function diff() will work. I don't recollect ever using it. I just discovered that it has different ideas about missing values from those I would imagine using myself.

          The code here ignores missing values (including any of .a through .z present in the data) and flags an observation as different on the specified variables if (and only if) it contains at least two distinct non-missing values.


          Code:
          clear
          input foo bar bazz
          1 2 3 
          2 2 2 
          3 3 3 
          . . . 
          1 1 . 
          1 . . 
          . . . 
          end 
          
          gen first = . 
          gen different = 0 
          
          quietly foreach v of var foo bar bazz { 
              replace first = `v' if `v' < . & first == . 
              replace different = 1 if `v' < .  & `v' != first 
          } 
          
          egen DIFFERENT = diff(foo bar bazz)
          
          list, sep(0)
          
               +------------------------------------------------+
               | foo   bar   bazz   first   differ~t   DIFFER~T |
               |------------------------------------------------|
            1. |   1     2      3       1          1          1 |
            2. |   2     2      2       2          0          0 |
            3. |   3     3      3       3          0          0 |
            4. |   .     .      .       .          0          0 |
            5. |   1     1      .       1          0          1 |
            6. |   1     .      .       1          0          1 |
            7. |   .     .      .       .          0          0 |
               +------------------------------------------------+

          That said, you're evidently holding panel data in wide layout (format, structure) -- which is usually a bad idea in Stata. A reshape long is advisable.
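
          As an illustration of that reshape, here is a minimal sketch. The variable names (id, status1-status3, wave) and the data are assumptions made for the example, not taken from the dataset discussed in #4.

          ```stata
          clear
          * hypothetical wide data: one row per person, one status variable per wave
          input id status1 status2 status3
          1 1 1 2
          2 3 . 3
          end

          * one row per person-wave; the numeric suffix of status becomes wave
          reshape long status, i(id) j(wave)

          sort id wave
          list, sepby(id)
          ```

          After the reshape there is one observation per person and wave, and missing statuses are kept as observations with missing values.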



          • #6
            Thanks, Nick. egen diff picks up a transition to/from missing as a change, which I want to avoid. Your code resolves the issue. Related to this, is there a way to generate a variable that counts the number of transitions between statuses? That is, it would count the number of times across the row that there is a transition between different non-missing responses.

            Re wide v long, I didn't know that about Stata. Do you have any documentation that helps to explain?



            • #7
              A long layout being better is something that runs through Stata. Just about any time series or panel procedure assumes that each variable measured over time is held as a single column (variable), with the different times in separate rows (observations), and not spread across several columns.

              The number of transitions is just -- for a long layout --

              Code:
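              * running count of observations whose value differs from the previous one within id
              bysort id (time) : gen transitions = sum(whatever != whatever[_n-1])
              * a non-missing first value is counted against its missing predecessor, so subtract 1
              by id : replace transitions = transitions[_N] - 1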
              but that code is not subtle about missing values.
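
              One way to make the count ignore missing values is to compare each non-missing value only with the previous non-missing value for the same id. This is a sketch and not the code from this thread: the variable names present and change and the toy data are assumptions for illustration.

              ```stata
              clear
              * hypothetical long data with gaps
              input id time whatever
              1 1 1
              1 2 .
              1 3 1
              1 4 2
              2 1 .
              2 2 3
              2 3 3
              3 1 1
              3 2 2
              3 3 1
              end

              * separate missing from non-missing observations within each id
              gen byte present = !missing(whatever)

              * within the non-missing observations of each id, in time order,
              * flag a value that differs from the previous non-missing value
              bysort id present (time) : gen byte change = present & _n > 1 & whatever != whatever[_n-1]

              * total the flags within id
              egen transitions = total(change), by(id)

              sort id time
              list, sepby(id)
              ```

              With these data id 1 has one transition (1 to 2, skipping the gap), id 2 has none and id 3 has two.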



              • #8
                How would I do the equivalent of egen group, looking over a column of multiple observations, over time, for a single person? (It's very possible I'm just thinking about things in the wrong way; need to reorient from wide to long)
                Last edited by Claire McKenna; 18 Mar 2022, 10:31.



                • #9
                  #8 Please spell that out in terms of inputs and what you want from them.



                  • #10
                    I just found this, and it seems to contain answers to my questions: https://www.ls3.soziologie.uni-muenc...tacommands.pdf. Thank you anyway, Nick!

