Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • "Compress" lines when missing values

    Dear Statalists,

    for every id in my dataset there are var1 to var7 that represent different values. Some of the values in var1 to var7 are missing.
    Now I would like to "compress" each observation/each line, so that first the non-missing values occur and then the missing values.

    For example, for id=1, 3 of the 7 values are missing: var1, var3 and var7. I would like to replace var1 by var 2, var2 by var4, var3 by var5 and var4 by var 6, so that the 3 missing values are "at the end"

    Any ideas how to do this systematically for a large dataset?


    Code:
    clear
    input byte(id var1 var2 var3 var4 var5 var6 var7)
    1  . 15 . 17 18 19  .
    2 25 26 . 30  .  . 12
    end

  • #2
    not completely clear, but I think that -rowsort- (from the Stata Journal) will do what you want; use -search- to find and download

    Comment


    • #3
      Thanks, Rich!

      rowsort would work, but I would like to keep the order of the numbers instead of putting them in ascending or descending order. I just want the missing values to be at the end,

      Do you know a similar command to rowsort that keeps the order of the non-missing values?

      Comment


      • #4
        I don't know about a command, but here is one technique.
        Code:
        clear
        input byte(id var1 var2 var3 var4 var5 var6 var7)
        1  . 15 . 17 18 19  .
        2 25 26 . 30  .  . 12
        end
        
        reshape long var, i(id) j(j)
        bys id: gen tosort = sum(!missing(var)) if !missing(var)
        bys id (tosort): gen jj = _n
        drop j tosort
        reshape wide var, i(id) j(jj)
        list
        
        
             +-----------------------------------------------------+
             | id   var1   var2   var3   var4   var5   var6   var7 |
             |-----------------------------------------------------|
          1. |  1     15     17     18     19      .      .      . |
          2. |  2     25     26     30     12      .      .      . |
             +-----------------------------------------------------+

        Comment


        • #5
          I found another way, but there is still one problem:

          Imagine I only care about var1 to var6. There are also var7 to var11, and if var1 to var6 have missing values, I would like to "refill" them by the values from var7-var10, while keeping the order of non-missing values. To do so, I have another variable called "start_missing". In my dataset, ID1 has missing values from var3 on, so that "start_missing" has a value of 3.

          The problem with my code is, that within the forvalues loop, my data gets overwritten.

          Can you help me out on this?




          Code:
          clear
          input byte(id var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11 start_missing)
          1 13 15  .  . . . 1 2 3 . . 3
          2 25 26 30 54 . . 4 5 . . . 5
          end
          
          
          forvalues i = 1/6 {
          replace var`i' = var7 if `i'== start_missing
          replace var`i+1' = var8 if `i'== start_missing
          replace var`i+2' = var8 if `i'== start_missing
          replace var`i+3' = var9 if `i'== start_missing
          replace var`i+4' = var10 if `i'== start_missing
          replace var`i+5' = var11 if `i'== start_missing
          }



          Comment


          • #6
            Your data examples are all integers when they are not missing. So you can concatenate all the values into a string, remove the missings and then read them out again.

            Code:
            clear
            input byte(id var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11 start_missing)
            1 13 15  .  . . . 1 2 3 . . 3
            2 25 26 30 54 . . 4 5 . . . 5
            end
            
            egen all = concat(var*), p(" ") 
            replace all = trim(itrim(subinstr(all, ".", "", .))) 
            
            quietly forval j = 1/11 { 
                replace var`j' = real(word(all, `j')) 
            }
            
            list all var* 
            
                 +------------------------------------------------------------------------------------------------+
                 |             all   var1   var2   var3   var4   var5   var6   var7   var8   var9   var10   var11 |
                 |------------------------------------------------------------------------------------------------|
              1. |     13 15 1 2 3     13     15      1      2      3      .      .      .      .       .       . |
              2. | 25 26 30 54 4 5     25     26     30     54      4      5      .      .      .       .       . |
                 +------------------------------------------------------------------------------------------------+

            Comment


            • #7
              Nick, thank you so much! This is really what I was looking for.

              Comment

              Working...
              X