Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Compute share of missing values by ID above certain threshold

    Hi all,

    I would like to compute a share of ID with missing values (of varlist) above certain threshold and export in excel file.
    To visualize what I have for example:
    ID year var1 var2 var3
    1 2001 . 4 7
    1 2002 . 4 7
    1 2003 . 4 .
    1 2004 333 4 .
    1 2005 333 4 .
    1 2006 333 4 .
    2 2004 55 4 .
    2 2005 6 . 8
    2 2006 . . 8
    3 2003 7 . .
    3 2004 7 . 5
    And for instance, if I say I want to generate dummy variable to treat missing var1-3 above 50% of number of observations by each ID (-> and round them), then:

    ID year var1 var2 var3 threshold missing_var1 missing_var2 missing_var3
    1 2001 . 4 7 3 1 0 1
    1 2002 . 4 7 3 1 0 1
    1 2003 . 4 . 3 1 0 1
    1 2004 333 4 . 3 1 0 1
    1 2005 333 4 . 3 1 0 1
    1 2006 333 4 . 3 1 0 1
    2 2004 55 4 . 2 0 1 0
    2 2005 6 . 8 2 0 1 0
    2 2006 . . 8 2 0 1 0
    3 2003 7 . . 1 0 1 1
    3 2004 7 . 5 1 0 1 1
    missing_var should be either 0 or 1 by each ID; threshold should be calculated by 50% * (nb of obs by ID) and round them.
    And based on this, I would like to compute share of missing values at ID-level. So in this example, share of missing values of var1 would be 1/3; of var2 would be 2/3 and var3 would be 2/3.

    Could someone help me to implement this in stata, please ?

    Thank you !!

  • #2
    Code:
    ** Example generated by -dataex-. For more info, type help dataex
    clear
    input byte id int(year var1) byte(var2 var3)
    1 2001   . 4 7
    1 2002   . 4 7
    1 2003   . 4 .
    1 2004 333 4 .
    1 2005 333 4 .
    1 2006 333 4 .
    2 2004  55 4 .
    2 2005   6 . 8
    2 2006   . . 8
    3 2003   7 . .
    3 2004   7 . 5
    end
    
    frame put *, into(count)
    frame count{
        reshape long var, i(id year) j(which)
        bys id which: gen halfobs=_N/2
        by id which: egen miss=total(missing(var))
        contract id which halfobs miss, freq(wanted)
        bys which (id): replace wanted= sum((miss>=halfobs)/_N)
        by which: replace wanted= wanted[_N]
    }
    You can additionally use contract if you want to export the shares per variable (identifier denoted by 'which' in my code).

    Res.:

    Code:
    . frame count: format wanted %4.3f
    
    . frame count: list, sepby(which)
    
         +--------------------------------------+
         | id   which   halfobs   miss   wanted |
         |--------------------------------------|
      1. |  1       1         3      3    0.333 |
      2. |  2       1       1.5      1    0.333 |
      3. |  3       1         1      0    0.333 |
         |--------------------------------------|
      4. |  1       2         3      0    0.667 |
      5. |  2       2       1.5      2    0.667 |
      6. |  3       2         1      2    0.667 |
         |--------------------------------------|
      7. |  1       3         3      4    0.667 |
      8. |  2       3       1.5      1    0.667 |
      9. |  3       3         1      1    0.667 |
         +--------------------------------------+
    Last edited by Andrew Musau; 27 Mar 2025, 08:28.

    Comment


    • #3
      Andrew Musau Thank you very much for your reply!
      I actually don't get what 'which' is.. And I actually wanted to computed nb of missing values by ID for each variables (var1-3) and compute their share for each variables as well.
      Sorry if i wasn't clear..

      And I am new to frame command, but can it be exported in excel and then go back to the dataset that I was using?

      Comment


      • #4
        Originally posted by Anne-Claire Jo View Post
        Andrew MusauAnd I actually wanted to computed nb of missing values by ID for each variables (var1-3) and compute their share for each variables as well.
        I think #2 does what you ask for, and the shares 0.667,0.667, 0.333 = \(\frac{2}{3}, \frac{2}{3}, \frac{1}{3}\) correspond to your calculation. If this is not the case, point out where the error is.

        I actually don't get what 'which' is

        As I stated, which is the variable identifier. So, which = 1 refers to 'var1', which = 2 refers to 'var2', and so on. Perhaps you used 'var1', 'var2', and 'var3' as proxies for some variable names. That will affect the reshape command, as the stubname in this case is 'var'. The code below assumes that your variables do not have a common stubname.

        And I am new to frame command, but can it be exported in excel and then go back to the dataset that I was using?
        A frame is just a separate dataset in memory that allows you to work with multiple datasets simultaneously without overwriting or replacing the active dataset. frames were introduced in Stata 16 to facilitate data management, merging, and analysis across multiple datasets. So the active dataset is never affected. To export the dataset to Excel, just use export excel as usual.


        Code:
        frame count: export excel using myfile, replace
        Here is the modification to accommodate variable names without a common stubname.

        Code:
        ** Example generated by -dataex-. For more info, type help dataex
        clear
        input byte id int(year alpha) byte(beta gamma)
        1 2001   . 4 7
        1 2002   . 4 7
        1 2003   . 4 .
        1 2004 333 4 .
        1 2005 333 4 .
        1 2006 333 4 .
        2 2004  55 4 .
        2 2005   6 . 8
        2 2006   . . 8
        3 2003   7 . .
        3 2004   7 . 5
        end
        
        frame put *, into(count)
        frame count{
            rename (alpha-gamma) var=
            reshape long var, i(id year) j(which) string
            bys id which: gen halfobs=_N/2
            by id which: egen miss=total(missing(var))
            contract id which halfobs miss, freq(wanted)
            bys which (id): replace wanted= sum((miss>=halfobs)/_N)
            by which: replace wanted= wanted[_N]
        }

        Res.:

        Code:
        . frame count: format wanted %4.3f
        
        . frame count: list, sepby(which)
        
             +--------------------------------------+
             | id   which   halfobs   miss   wanted |
             |--------------------------------------|
          1. |  1   alpha         3      3    0.333 |
          2. |  2   alpha       1.5      1    0.333 |
          3. |  3   alpha         1      0    0.333 |
             |--------------------------------------|
          4. |  1    beta         3      0    0.667 |
          5. |  2    beta       1.5      2    0.667 |
          6. |  3    beta         1      2    0.667 |
             |--------------------------------------|
          7. |  1   gamma         3      4    0.667 |
          8. |  2   gamma       1.5      1    0.667 |
          9. |  3   gamma         1      1    0.667 |
             +--------------------------------------+
        
        .


        Comment


        • #5
          Andrew Musau
          Thanks for your reply!

          I honestly don't feel very much confident with understanding the codes you provided - sorry, I am not very strong with stata..
          Is your codes do similar output as the codes below? :
          Code:
              
          foreach var of global vars_to_check{
              preserve
              
              bys $id: gen halfobs = _N / 2
              // Count nb of missing obs of variable by ID
              egen miss_`var' = total(missing(`var')), by($id)
              * == 1 if nb of missing obs is larger than 50% of its nb of observations
              bys $id: gen missing_`var' = (miss_`var' >= halfobs)
              * drop duplicates
              bys $id: keep if _n == 1
              * count total nb of firms with missing variable
              egen nb_missing_`var' = total(missing_`var')
              * Compute share
              gen s_m_v2_`var' = nb_missing_`var' / _N
              
              restore
          }
          you can ignore details on variable names, maybe just the underlying logic.
          Maybe I am hard-coding or coding in inefficient way what you are doing..

          Comment


          • #6
            Here is some code that is similar to yours, Anne-Claire Jo , but without the need to -preserve- and -restore-. Andrew's code achieves the same thing, but this might be easier to follow.

            Code:
            global vars_to_check alpha beta gamma
            global id id
            
            bys $id: gen halfobs = _N / 2
            egen tagged = tag(id) // == 1 for one observation per firm
            egen num_firms = total(tagged)
            
            foreach var of global vars_to_check {
                // Count nb of missing obs of variable by ID
                egen miss_`var' = total(missing(`var')), by($id)
            }
            
            keep if tagged
            drop tagged
            
            foreach var of global vars_to_check {
                * == 1 if nb of missing obs is larger than 50% of its nb of observations
                gen missing_`var' = (miss_`var' >= halfobs)
                * count total nb of firms with missing variable
                egen nb_missing_`var' = total(missing_`var')
                * Compute share
                gen s_m_v2_`var' = nb_missing_`var' / num_firms
            }
            which produces one observation for each firm, with the following values of shares:

            Code:
            . list s_m_v2_* in 1, abbrev(12) noobs
            
              +-------------------------------------------+
              | s_m_v2_alpha   s_m_v2_beta   s_m_v2_gamma |
              |-------------------------------------------|
              |     .3333333      .6666667       .6666667 |
              +-------------------------------------------+

            Comment


            • #7
              Thanks for your help !! It is clear

              Comment

              Working...
              X