Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to count the number of distinct values (of a variable) that satisfy a given condition

    Dear all,

    I am using a dataset that is such that for a given value of the identifying variable ("var_id") we observe several values for another variable, let's say "var_of_interest".

    In particular, the dataset is like

    var_id var_of_interest
    11 PI
    11 PA
    11 CE
    11 BA
    12 SP
    12 SP
    12 BA
    12 RN

    I would like to create a variable ("new_var") that counts the number of distinct values of "var_of_interest" that are different from "BA".

    In particular, I would like the dataset to be like the following

    var_id var_of_interest new_var
    11 PI 3
    11 PA 3
    11 CE 3
    11 BA 3
    12 SP 2
    12 SP 2
    12 BA 2
    12 RN 2

    where "new_var" is the variable that retrieves the quantity of interest for each different id (from the variable "var_id").

    Can you help me with that?

    Thank you very much.

    Any help is greatly appreciated.


    Obs: here is the code to import the fictitious data:

    clear
    input var_id str2 var_of_interest
    11 "PI"
    11 "PA"
    11 "CE"
    11 "BA"
    12 "SP"
    12 "SP"
    12 "BA"
    12 "RN"
    end

  • #2
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float var_id str2 var_of_interest
    11 "PI"
    11 "PA"
    11 "CE"
    11 "BA"
    12 "SP"
    12 "SP"
    12 "BA"
    12 "RN"
    end
    
    bys var_id: egen wanted= total(var_of_interest!="BA")
    Res.:

    Code:
    . l, sepby(var_id)
    
         +----------------------------+
         | var_id   var_of~t   wanted |
         |----------------------------|
      1. |     11         PI        3 |
      2. |     11         PA        3 |
      3. |     11         CE        3 |
      4. |     11         BA        3 |
         |----------------------------|
      5. |     12         SP        3 |
      6. |     12         SP        3 |
      7. |     12         BA        3 |
      8. |     12         RN        3 |
         +----------------------------+

    Comment


    • #3
      Code:
      bys var_id var_of_interest: gen nvals = _n == 1  if var_of_interest!="BA"
      bys var_id: egen wanted = total(nvals)
      drop nvals
      See also:
      https://www.stata.com/support/faqs/d...-observations/

      Comment


      • #4
        Thank you very much Ali Atia and Andrew Musau !!

        I am curious about why the solution of Andrew did return a wrong value for var_id == 12 (the correct value for 'var_of_interest' in this case is 2 and not 3).

        The solution from Ali works perfectly!

        Thanks again!

        Comment


        • #5
          Try something like this:

          Code:
          . gen  new_var = . 
          (8 missing values generated)
          
          . forvalues i=11/12 {
            2. levelsof var_of_interest if var_id==`i' & var_of_interest!="BA"
            3. replace  new_var = r(r) if var_id==`i'
            4. }
          `"CE"' `"PA"' `"PI"'
          (4 real changes made)
          `"RN"' `"SP"'
          (4 real changes made)
          
          . 
          . list, sepby(var_id)
          
               +-----------------------------+
               | var_id   var_of~t   new_var |
               |-----------------------------|
            1. |     11         PI         3 |
            2. |     11         PA         3 |
            3. |     11         CE         3 |
            4. |     11         BA         3 |
               |-----------------------------|
            5. |     12         SP         2 |
            6. |     12         SP         2 |
            7. |     12         BA         2 |
            8. |     12         RN         2 |
               +-----------------------------+

          Comment


          • #6
            I did not notice that you have duplicates. -egen(total)- will just count how many observations are within a group. You can tag distinct observations (or first occurrences).

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input float var_id str2 var_of_interest
            11 "PI"
            11 "PA"
            11 "CE"
            11 "BA"
            12 "SP"
            12 "SP"
            12 "BA"
            12 "RN"
            end
            
            egen fob= tag(var_id var_of_interest)
            bys var_id: egen wanted=total(cond(fob & var_of_interest!="BA", 1, 0))
            Res.:

            Code:
            . l, sepby(var_id)
            
                 +----------------------------------+
                 | var_id   var_of~t   fob   wanted |
                 |----------------------------------|
              1. |     11         PI     1        3 |
              2. |     11         PA     1        3 |
              3. |     11         CE     1        3 |
              4. |     11         BA     1        3 |
                 |----------------------------------|
              5. |     12         SP     1        2 |
              6. |     12         SP     0        2 |
              7. |     12         BA     1        2 |
              8. |     12         RN     1        2 |
                 +----------------------------------+
            Last edited by Andrew Musau; 28 Jan 2021, 17:49.

            Comment


            • #7
              Thank you very much Andrew Musau and Joro Kolev , your solutions worked like a charm!!

              Now I understood Andrew the point of using "egen ... total" in the presence of duplicates, many thanks!

              Comment


              • #8
                Both Ali's solution in #3 and Andrew's solution in #6 are preferable because they do not rely on looping over observations, and looping over observations is relatively slow. They both rely on the same idea of tagging an observation per group. Most probably Ali's solution would be the fastest over a large dataset.

                Here is a variation of an egen solution like Andrew's, but more similar to Ali's way of tagging. The point is that -egen, tag- also accepts if/in conditions:

                Code:
                . egen tagnoba = tag( var_id var_of_interest) if var_of_interest!="BA"
                
                . egen new_var = total(tagnoba), by( var_id )
                
                . list, sepby(var_id)
                
                     +---------------------------------------+
                     | var_id   var_of~t   tagnoba   new_var |
                     |---------------------------------------|
                  1. |     11         PI         1         3 |
                  2. |     11         PA         1         3 |
                  3. |     11         CE         1         3 |
                  4. |     11         BA         0         3 |
                     |---------------------------------------|
                  5. |     12         SP         1         2 |
                  6. |     12         SP         0         2 |
                  7. |     12         BA         0         2 |
                  8. |     12         RN         1         2 |
                     +---------------------------------------+


                Comment


                • #9
                  The tag() function of egen goes back a long way in Stata (1999) and just encapsulates a longer-standing trick that I guess is in the manuals somewhere and/or was discussed on Statalist in the 1990s (all the old posts have disappeared).

                  In essence the nearest simple equivalent is

                  Code:
                  bysort varlist : gen tag = _n == 1

                  where varlist means a variable list, not something you type directly.

                  But -- as documented -- the
                  egen function tag() returns 0 for observations excluded by an if or in condition, which people are understandably surprised by if they expect the call to return missings for excluded observations.

                  This was designed in as what seemed more useful behaviour. for most problems. I suppose there should be an option for those who want to insist on missings in that circumstance, but I don't recall much grumbling about it;

                  Comment


                  • #10
                    So what is the correct syntax if I want the "new_var" to count the number of distinct values of "var_of_interest" in general?

                    And how can I find out the frequency of each specific value of "var_of_interest" within the same "var_id"?

                    Comment


                    • #11
                      I have a very similar problem where I have not found the solution yet.

                      In my data (example from other post) I also have multiple entries for the same population ("entity") but also for different years ("year").

                      1. With "egen tag" and "egen ndistinct" I can find out how many different "firms" each combination of "entity" and "year" (group?) has.
                      ( I will have to do this not only for "firm" but also for "value" and other variables)

                      2. Now I would like to know the frequencies for each value in "ndistinct" over all the combinations of "entity" and "year".
                      (To get an idea of the variation within different variables and think about how to merge them in the most sensible way).

                      3. Last I need to make sure there is only one observation per combinations of "entity" and "year" and therefore combine (merge, collapse, append...?) the values of other variables. If there is a most frequent value, choose that one, if there is not, choose the highest or first, or in some other cases the mean or sum (to be defined for each variable).

                      I really struggle to find a way how to do that. I hope my example is clear.

                      Original data
                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input byte entity int year str1 firm float value
                      1 2010 "A" 15
                      1 2010 "A"  8
                      1 2010 "B" 12
                      1 2011 "B" 25
                      1 2012 "B"  8
                      2 2010 "A"  7
                      2 2011 "A"  5
                      2 2011 "A" 12
                      2 2011 "C" 13
                      2 2012 "A" 19
                      2 2012 "B" 25
                      2 2011 "B" 14
                      2 2012 "C" 18
                      2 2012 "D" 16
                      end
                      
                      sort entity year
                      list, sepby(entity year)
                      1. Question
                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input byte entity int year str1 firm float value
                      1 2010 "A" 15
                      1 2010 "A"  8
                      1 2010 "B" 12
                      1 2011 "B" 25
                      1 2012 "B"  8
                      2 2010 "A"  7
                      2 2011 "A"  5
                      2 2011 "A" 12
                      2 2011 "C" 13
                      2 2012 "A" 19
                      2 2012 "B" 25
                      2 2011 "B" 14
                      2 2012 "C" 18
                      2 2012 "D" 16
                      end
                      
                      egen tag = tag(firm entity year)
                      egen ndistinct = total(tag), by(entity year)
                      
                      sort entity year
                      list, sepby(entity year)
                      2. Question -> This is the table I would like to obtain
                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input byte ndistinct freq
                      1 3
                      2 1
                      3 1
                      4 1
                      end
                      sort ndistinct freq
                      list, sepby(ndistinct freq)
                      3. Question?

                      See also:
                      https://www.statalist.org/forums/for...cases-by-group
                      https://www.statalist.org/forums/for...r-each-subject
                      https://www.statalist.org/forums/for...f-similar-type

                      Comment


                      • #12
                        This is one of 6 threads on which this question is posted. Answered at https://www.statalist.org/forums/for...r-each-subject.

                        Comment

                        Working...
                        X