Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to combine observations?

    Hi everyone,

    I have some panel data and I need to combine observations that share an id and a date. The id is stored in a variable ("unique_id"). Each id may have multiple dates, and so will have multiple records (i.e. longitudinal data). However, right now I have multiple observations with the same id and same date, but with different values on some dummy variables. I want to combine all of the observations that share an id and a date in such a way that if at least observation has a value of "1" in a dummy variable, the combined variable will also have a value of "1" in that dummy.

    Do you have any suggestions?

    Thanks,

    Kris

  • #2
    OK. So let's give that "dummy" variable a name. I'll call it x. Let's also name your date variable. I'll call it date. You want to thin your data to a single observation per combination of unique_id and date, with the combined value of x being 1 if any of the x's is 1, and zero otherwise.

    Code:
    by unique_id date, sort: egen new_x = max(x == 1)
    by unique_id date: keep if _n == 1
    drop x
    rename new_x x
    Note: You don't say what you want to do if x is missing for every observation for a given unique_id and date. The above code sets the combined value to zero, not to missing. Perhaps this situation never arises in your data. If it does, and if the proper solution is to set x to missing value in this circumstance, then it would be as follows:

    Code:
    by unique_id date, sort: egen new_x = max(x==1)
    by unique_id date,: egen all_missing = min(missing(x))
    by unique_id date: keep if _n == 1
    replace new_x = . if all_missing == 1
    drop x all_missing
    rename new_x x

    Comment


    • #3
      Worked perfectly. Thanks!!

      Comment


      • #4
        Hello everyone,
        I have panel data from 2001 to 2020 where I want to combine certain observations as shown below

        Associate
        Associate confirmed
        Associate not confirmed
        Researchers
        Researchers A
        Researchers B
        Researchers tenured
        Researchers non-tenured
        Ordinary professors
        Full professors
        Extraordinary professors

        Here each observation has a panel dataset from 2001 to 2020 with various universities. I want to combine observations with years and university names. something summing up like: Associate (Associate, the associate confirmed, and Associate not confirmed), likewise with other observations.
        Is there any solution for this?

        Thank you.

        Comment


        • #5
          Clyde Schechter Thanks

          I've been using similar code (see below) - but I need to copy and paste this for 20 times for different variables(see bold) eg Dvt, MI, heartfailure etc.
          Is there a shorter way to do this ?Rather than having to do this for 20 times.
          Many thanks


          bys hospitalid2 date: egen new_var = max(dvt == 1)
          by hospitalid2 date: egen all_missing = min(missing(dvt))
          by hospitalid2 date: keep if _n == 1
          replace new_var = . if all_missing == 1
          drop dvt all_missing

          rename new_var dvt
          Last edited by Denise Vella; 07 Dec 2022, 04:22.

          Comment


          • #6
            Code:
            by unique_id date, sort: egen new_x = max(x == 1)
            
            by unique_id date, sort: egen new_x = max(x == 1) 
            
            by unique_id date: keep if _n == 1 For each set which have the same ID and date - keep the first observation....
            This means generate a new variable new_x.

            When variable x == 1, find the maximum when they have the same unique_id and date.

            For each set which have the same ID and date - keep the first observation....

            #5 That starts right and gets a little confused.

            x == 1 is evaluated as 1 if x is 1 and as 0 otherwise.

            Taking the maximum over possible 1s and 0s -- for each (identifier, date) combination -- yields 1 if any value is 1 and 0 otherwise (if all values are 0).

            So for each (identifier, date) combination we have either all 1s or all 0s, and since they are all the same we need only keep one observation for our purpose. There are two systematic ways to keep one observation -- whenever several are identical -- to keep the first or to keep the last. It doesn't matter which we use -- it's like saying we have 1 1 1 1 1 1 here and 0 0 0 there, so which should we chose of the 1s and the 0s. You need not care.

            Here a demonstration of a very simple case. Adding dates doesn't change the main idea.

            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input float(id x)
            1 1
            1 0
            1 0
            2 0
            2 0
            2 0
            end
            
            bysort id : egen any1 = max(x == 1)
            
            list, sepby(id)
            
                 +---------------+
                 | id   x   any1 |
                 |---------------|
              1. |  1   1      1 |
              2. |  1   0      1 |
              3. |  1   0      1 |
                 |---------------|
              4. |  2   0      0 |
              5. |  2   0      0 |
              6. |  2   0      0 |
                 +---------------+
            
            bysort id : keep if _n == 1
            
            l id any1
            
                 +-----------+
                 | id   any1 |
                 |-----------|
              1. |  1      1 |
              2. |  2      0 |
                 +-----------+
            For the correspondence any <=> max, all <=> min, see also https://www.stata.com/support/faqs/d...ble-recording/



            .



            Comment


            • #7
              I've been using similar code (see below) - but I need to copy and paste this for 20 times for different variables(see bold) eg Dvt, MI, heartfailure etc.
              Is there a shorter way to do this ?Rather than having to do this for 20 times.
              Code:
              foreach v of varlist Dvt MI heartfailure /*etc.*/ {
                  by hospitalid2 date, sort: egen byte has_`v' = max(`v' == 1)
                  by hospitalid2 date: gen byte all_missing = missing(`v'[1])
                  replace has_`v' = . if all_missing
                  drop all_missing
              }

              Comment

              Working...
              X