Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Can anyone help me fix the problem of this egen code?

    I have a large panel data set. For each id, it has tons of observations on different days (several years, so this is a very large data set). Now I need a variable that is the sum of a given time period for each id.

    Thus, I use
    Code:
    egen newvar = total(total) if inrange(date, mdy(10, 19, 2005), mdy(10, 9, 2006)), by(id)
    By definition, I think for each id, it should contains the same value of newvar. But it is not true. WHY?

    And this code doesn't give me the desired outcome. It looks likes I bungled this code.
    Last edited by Yao Zhao; 14 Mar 2020, 22:59.

  • #2
    You are misusing -if- here. The -if- clause results in newvar only being calculated for observations in the data set for which date is between 19oct2005 and 9oct2006. Other observations just get a missing value. That is what -if- does: it applies the command only to those observations meeting the -if- condition.

    What you want is for every observation to receive the value of total, even observations outside that date range, but you want the total to be calculated only using those observations in the date range. The syntax that will do that is:

    Code:
    by id, sort: egen newvar = total(cond(inrange(date, mdy(10, 19, 2005), mdy(10, 9, 2006)), total, .))
    See -help cond()- for more information about the cond function.

    As an aside, it is generally a good idea to avoid using variable names that are the same as a Stata command or function name: it makes the code confusing. Why not change the name of that total variable to something a bit longer but more descriptive, such as total_cost, or total_revenue, or total_infections, or total_whatever_it_is? Alternatively, the mere fact that you are adding this thing up within groups suggests that it really isn't a full total at all. So maybe rename it to subtotal? Or if it represents the total of something over a single date, how about daily_total or total_today?

    Comment


    • #3
      Your code is correct. Thanks. But after seeing -help cond- file, I'm still a little bit confused about your code here.

      Let's say for id 1, he has 4 observations, and the first 2 are not in the range and the latter 2 are in the range.

      Then -cond- gives the first 2 obs missing values first. And -cond- gives the latter 2 the total variable value. And finally, we sum these value?

      How does your code work? What's the 1st step to analyze?

      Comment


      • #4
        See https://www.stata-journal.com/articl...article=dm0055 sections 9 and 10. Missing values are ignored in computing totals. That's the nub.

        Comment


        • #5
          Aha, thank you Nick. This article is the exact what I need. Especially in the introduction section, you summarize the types of problem succinctly.

          Comment


          • #6
            Hi Nick and Clyde,

            After I see the article, I try to use 2 other methods to get the same result. Divide 0 method does work.

            Code:
            by id, sort: egen contreblock3_3 = total(total/(inrange(date, mdy(10, 19, 2005), mdy(10, 9, 2006))))
            assert contreblock3 == contreblock3_3
            But the section 8 method failed.

            Code:
            by id, sort: egen contreblock3_2 = total(total) if inrange(date, mdy(10, 19, 2005), mdy(10, 9, 2006))
            by id (contreblock3_2), sort: replace contreblock3_2 = contreblock3_2[1]
            assert contreblock3 == contreblock3_2
            It's false. I checked the data set. It looks like there are some missing values in this method, but no missing values in other 2 methods. The reason of missing values is probably some groups doesn't have this time range at all.

            Comment


            • #7
              Indeed: the two methods are not the same. generate with if will produce missing values in observations not included by the if condition. egen, total() will produce zeros if asked to add up all missings, unless you specify the missing option.

              Careful study of this minimal example shows the issues.


              Code:
              clear 
              input x y foo 
              1 42 1
              2  . 0 
              end 
              
              egen wanted1 = total(y / foo), by(x)
              
              egen wanted2 = total(cond(foo, y, .)), by(x)
              
              egen wanted3 = total(y) if foo, by(x)
              bysort x (wanted3) : replace wanted3 = wanted3[1] if missing(wanted3)
              
              list
              
                   +--------------------------------------------+
                   | x    y   foo   wanted1   wanted2   wanted3 |
                   |--------------------------------------------|
                1. | 1   42     1        42        42        42 |
                2. | 2    .     0         0         0         . |
                   +--------------------------------------------+
              The paper cited clearly is incomplete in that it doesn't survey all possible complications and exceptions, which would be hard.

              Comment

              Working...
              X