  • Combination of bysort, replace and count if

    Hello, I am following up on some readings and came across this code:

    bysort v001: gen dif = 0
    replace dif = 1 if v001 == v001[_n-1] & wt2 != wt2[_n-1]
    browse if dif == 1
    count if dif==1


    I can see the code is trying to check for an error or consistency, and replacing missing values.

    But could you please clarify what it is exactly doing, and what answer one should be looking for in the two conditions:
    count if dif ==1 or if dif == 0?

    if count if dif == 1 reports 7,000 cases, what does it mean?

    Thanks in advance...Cy

  • #2
    Yes, it checks consistency but it does not replace missing values.

    However, the code is not well arranged. The bysort belongs on the -replace- line rather than on line 1, where it only sorts the data. Here is a working example:

    Code:
    clear
    input v001 wt2
    1 18
    1 18
    1 18
    2 25
    2 27
    2 27
    3 14
    3 15
    3 16
    end
    
    gen dif = 0
    bysort v001: replace dif = 1 if v001 == v001[_n-1] & wt2 != wt2[_n-1]
    The "gen dif = 0" simply sets up a constant.

    The next line reads: "do this within every unique value of v001: replace dif with 1 if the following two conditions are both true, (1) the v001 value of the current case (_n, omitted here because it's implied) agrees with the value in the previous row (_n - 1); (2) the wt2 value of the current case disagrees with the value in the previous row."

    After running the code, the data look like this:

    Code:
         +------------------+
         | v001   wt2   dif |
         |------------------|
      1. |    1    18     0 |
      2. |    1    18     0 |
      3. |    1    18     0 |
         |------------------|
      4. |    2    25     0 |
      5. |    2    27     1 |
      6. |    2    27     0 |
         |------------------|
      7. |    3    14     0 |
      8. |    3    15     1 |
      9. |    3    16     1 |
         +------------------+
    Within each level of v001, dif is 1 whenever wt2 differs from the previous row's wt2.

    if count if dif == 1 reports 7,000 cases, what does it mean?
    It means that 7,000 cases have a wt2 value that differs from that of the preceding case within the same v001 value.
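
    If the underlying question is whether wt2 is constant within each v001, a check that does not depend on the order of cases within a group may be easier to interpret. A minimal sketch, assuming the example data above are in memory; the names varies and one_per_group are illustrative only:

    Code:
    * Sort wt2 within v001, then compare the smallest and largest values in the group
    bysort v001 (wt2): gen byte varies = wt2[1] != wt2[_N]
    * Tag one case per v001 group so that groups, not cases, are counted
    egen one_per_group = tag(v001)
    count if varies & one_per_group
    Because missing values sort last, a group that mixes missing and nonmissing wt2 would also be marked as varying.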


    • #3
      Originally posted by Yawo Kokuvi
      if count if dif == 1 reports 7,000 cases, what does it mean?
      Taking the code literally as typed in #1, it means that 7,000 cases have the same value of v001 as the previous observation but a different value of wt2. (The very first observation is not flagged as long as its v001 is nonmissing, because v001[0] is missing and the equality test fails.) The -bysort- on the first line does sort the data on v001, but only as a side effect of creating dif, so it is clearer to prefix the -replace- with -bysort- as Ken suggests.
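
      One way to see how the code as typed in #1 relates to Ken's version is to run both on the example data from #2 and compare the results. This is only a sketch; dif1 and dif2 are illustrative names:

      Code:
      * Version as typed in #1: -bysort- on the -gen-, plain -replace-
      bysort v001: gen dif1 = 0
      replace dif1 = 1 if v001 == v001[_n-1] & wt2 != wt2[_n-1]
      * Ken's arrangement: -by- on the -replace- (data are already sorted on v001)
      gen dif2 = 0
      by v001: replace dif2 = 1 if v001 == v001[_n-1] & wt2 != wt2[_n-1]
      * Stops with an error if the two versions disagree in any observation
      assert dif1 == dif2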


      • #4
        It is also worth noting that the first two lines of code can be simplified and reduced to a single line that, to me at least, seems more self-explanatory:
        Code:
        bysort v001: gen dif = (wt2 != wt2[_n-1]) &( _n > 1)
        More generally, although you will frequently see people writing Stata code like this:
        Code:
        gen variable = 0
        replace variable = 1 if some_condition
        in Stata this is better expressed with:
        Code:
        gen variable = some_condition
        This is because Stata, unlike some other statistical packages, has logical expressions. Any logical expression will, in Stata, be interpreted as 0 if it is false and 1 if it is true. So this single line does exactly the same thing as the two lines. But it is more compact and transparent. I think the practice of doing it with a -gen- followed by a -replace- is a habit that some people who are accustomed to working with some alternative software have developed, and they have never mastered the concept of logical expressions and their evaluation as 0 or 1.

        Of course, this kind of brevity is only possible when the variable can be properly defined in terms of a single logical expression. There are situations where that is not possible and -gen- followed by a series of -replace- commands is needed. But these are not all that common.
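
        One point to keep in mind with either style is how missing values enter the logical expression. A short sketch, assuming a numeric wt2 as in #2; the cutoff of 20 and the names high and high2 are purely illustrative:

        Code:
        * Stata treats numeric missing as larger than any number,
        * so high is 1 (not missing) wherever wt2 is missing
        gen high = wt2 > 20
        * Adding an -if- qualifier leaves high2 missing where wt2 is missing
        gen high2 = wt2 > 20 if !missing(wt2)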

        Added: I should also point out that the results of this code (and also of the original version in #1) are not deterministic: running the code repeatedly on the same data will not always produce the same results. This is because v001 does not by itself uniquely determine the sort order of the data in the -bysort v001- command. For example, if in the data shown in #2 we were to interchange the fourth and fifth or the fourth and sixth observations, the data would be ordered differently, yet both arrangements are well sorted on v001. In situations like this, Stata chooses at random among the sort orders that are consistent with the command. Since the setting of dif to 1 depends on which observation precedes which within the v001 group, an interchange of observations like the one just described would result in different observations bearing dif == 1, and possibly in a different count of dif == 1 observations. For that reason, it is usually better to sort the data deterministically by specifying enough sort variables that the sort order is uniquely determined by them. Alternatively, the -stable- option tells Stata to preserve the previously existing order of observations within the groups defined by the sort variables.
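
        Either approach might be sketched as follows. This assumes the current order of the observations is the one you want to keep within each v001; obsno, dif_a and dif_b are illustrative names:

        Code:
        * Option 1: record the current order and use it as a tie-breaker
        gen long obsno = _n
        bysort v001 (obsno): gen dif_a = (wt2 != wt2[_n-1]) & (_n > 1)

        * Option 2: a stable sort, followed by -by- on the already sorted data
        sort v001, stable
        by v001: gen dif_b = (wt2 != wt2[_n-1]) & (_n > 1)
        Sorting on variables that identify observations in the data themselves is generally preferable when such variables exist, since the result then no longer depends on how the file happened to be ordered.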
        Last edited by Clyde Schechter; 08 Jul 2023, 19:02.


        • #5
          Ken and Leonardo: Thanks so much for your responses - your explanations make sense. Clyde - I appreciate your observations. To sort the data uniquely I need three variables: v001, v002, and v003. I will try those out. Cheers, Cy


          • #6
            Clyde, I was wondering if you could elaborate on a piece of code you posted above in #4.

            Code:
            bysort v001: gen dif = (wt2 != wt2[_n-1]) &( _n > 1)
            I'm curious why the last bit of code "&(_n > 1)" is necessary. Given that the bysort prefix is used, I thought that this would tell Stata to set "dif" to 1 if wt2 does not equal wt2 from the previous observation within each group of v001. I've run some similar code without the last qualifier, "&(_n > 1)", and found that I'm incorrectly flagging differences because Stata is comparing my wt2 to the previous wt2 even if v001 isn't the same (I'm using the same variable names as this example for simplicity). I've looked through the bysort manual entry and haven't been able to figure this out, so I appreciate any help in understanding why that code is needed - thanks!
            Last edited by Elizabeth Teas; 02 Aug 2023, 09:16.


            • #7
              _n == 1 in the first observation of each v001 group. Suppose we are looking at such an observation and there is no -& (_n > 1)- restriction. What is wt2[_n-1]? Since _n == 1, it is wt2[0]. But there is no 0th observation in the v001 group (or anywhere else in Stata): Stata's subscripts are 1-based. Stata does not make the mistake of looking at the last observation of the preceding v001 group. Because there is no wt2[0], wt2[_n-1] evaluates to missing. The first observation's own wt2 will, in general, not be missing, so it would be counted as different from its predecessor, which, at least in the context of the original post in this thread, is not wanted. It probably isn't what you want either.
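
              A small illustration may help. This is a sketch with made-up data in which wt2 is constant within each v001, so nothing should be flagged; prev, dif_noguard and dif are illustrative names:

              Code:
              clear
              input v001 wt2
              1 18
              1 18
              2 25
              2 25
              end

              * prev is missing in the first observation of each v001 group
              bysort v001: gen prev = wt2[_n-1]
              * Without the guard, the first observation of each group gets 1
              bysort v001: gen dif_noguard = (wt2 != wt2[_n-1])
              * With the guard, the first observation of each group gets 0
              bysort v001: gen dif = (wt2 != wt2[_n-1]) & (_n > 1)
              list, sepby(v001)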
