  • Calculating number of repeated values across variables

    I am having trouble figuring out how to generate a variable that counts how many times a value is repeated consecutively across a set of ordered variables, in a dataset like the one below. I do not want the total number of times a value appears across the variables, but the number of times it appears in adjacent variables. For example, I want to know how many 1s are repeated consecutively, how many 2s, how many 3s, etc.

    Example: For the first observation (record_id 2), I want to create one variable with the value 6, the number of consecutive 1s, a second variable with the value 2 for the number of consecutive 2s, etc.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int record_id byte(fint1 fint2 fint3 fint4 fint5 fint6 fint7 fint8 fint9 fint10 fint11)
     2 3 1 1 1 1 1 1 2 2 1 2
     3 5 4 4 5 4 4 5 4 4 4 4
     6 4 1 2 2 1 2 3 2 2 5 4
     7 3 1 1 1 1 1 2 1 1 1 1
    11 4 2 2 2 2 2 2 3 3 2 2
    15 1 1 1 2 1 2 1 1 1 1 3
    19 2 1 1 2 1 1 2 2 1 1 1
    20 3 1 1 1 1 1 1 1 1 1 1
    21 4 1 1 1 2 2 1 2 2 2 1
    26 1 1 3 3 3 3 4 4 3 3 3
    27 3 1 1 1 1 1 1 1 2 2 3
    28 3 2 3 2 2 2 2 2 2 3 4
    31 3 2 1 1 1 1 1 1 1 1 1
    34 1 1 1 1 1 1 1 1 1 1 1
    43 2 1 1 1 1 1 2 2 2 1 1
    end
    Appreciate any ideas anyone might have!

  • #2
    See the -anycount()- function of egen. For example:

    Code:
    egen wanted1= anycount(fint*), values(1)
    EDITED: Actually, this gives the total number of 1s across the variables, not the number in consecutive variables. Instead, you can reshape the data to long layout and create a spell variable.
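    To make the difference between the total and the consecutive count concrete, here is a minimal check on the first observation of the example data (record_id 2). The variable name total_1s is made up for illustration.

```stata
* Minimal sketch: total count of 1s for record_id 2 only
clear
input int record_id byte(fint1 fint2 fint3 fint4 fint5 fint6 fint7 fint8 fint9 fint10 fint11)
2 3 1 1 1 1 1 1 2 2 1 2
end

* anycount() counts every variable equal to 1, adjacent or not
egen total_1s = anycount(fint*), values(1)
list record_id total_1s, noobs
```

    This record has 1s in fint2 through fint7 and again in fint10, so total_1s is 7, whereas the longest consecutive stretch of 1s is 6.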
    Last edited by Andrew Musau; 06 Feb 2024, 13:39.

    • #3
      Your question is a little unclear. For example, in the first observation, the "number of 2's in a row" could be 2 (starting at fint8) or just 1 (starting at fint11). I'll assume that what you want is the length of the longest run of consecutive fint variables with that value. You explicitly state that you want this for values 1 and 2, but at no extra cost you can get it for every value that occurs in any of the fint variables.

      As with many things in Stata, this is nearly impossible to do in wide layout, but is just a few simple lines of code in long layout. So:
      Code:
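      reshape long fint, i(record_id) j(seq)
      
      * number the runs: the counter goes up each time fint changes within a record
      by record_id (seq): gen run = sum(fint != fint[_n-1])
      * length of each run = number of observations sharing the same run number
      by record_id run (seq), sort: gen run_length = _N
      
      * for each value that occurs, keep the longest run of that value per record
      levelsof fint, local(fints)
      foreach f of local fints {
          by record_id: egen repeated_values_`f' = max(cond(fint == `f', run_length, .))
      }
      
      drop run run_length
      reshape wide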
      Just as going to long layout facilitated this task, it is likely that whatever else you are going to do with this data will also be better done in the long layout. So unless you know for sure that you need these fint variables to be wide for some specific reason, you should skip that final -reshape wide- and just go forward with the long data.
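      As a rough illustration of how the run and run_length variables behave, the sketch below applies the first steps to record_id 2 alone and lists the result. It uses -bysort- instead of relying on the sort order left by -reshape-; that is a choice made here for a standalone example, not part of the code above.

```stata
* Minimal sketch: inspect the runs for record_id 2 only
clear
input int record_id byte(fint1 fint2 fint3 fint4 fint5 fint6 fint7 fint8 fint9 fint10 fint11)
2 3 1 1 1 1 1 1 2 2 1 2
end

reshape long fint, i(record_id) j(seq)

* run goes up by 1 whenever fint differs from the previous value
bysort record_id (seq): gen run = sum(fint != fint[_n-1])
* run_length is the number of observations in each run
bysort record_id run (seq): gen run_length = _N

list seq fint run run_length, noobs sepby(run)
```

      For this record the runs are 3 | 1 1 1 1 1 1 | 2 2 | 1 | 2, so run takes the values 1 through 5 and the longest run of 1s has run_length 6, the longest run of 2s has run_length 2. A value that never appears in a record (4 and 5 here) ends up missing in the corresponding repeated_values_ variable, because -egen-'s max() returns missing when all of its arguments are missing.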

      Added: Crossed with #2.
      Last edited by Clyde Schechter; 06 Feb 2024, 13:42.

      • #4
        Thank you so much Clyde, that code is exactly what I needed! I figured that working in long layout might be a better way to go, but I wasn't sure how to approach it in either format. Thanks again!
