Compute share of missing values by ID above certain threshold

Anne-Claire Jo

Join Date: Feb 2021
Posts: 132

Compute share of missing values by ID above certain threshold

27 Mar 2025, 07:26

Hi all,

I would like to compute a share of ID with missing values (of varlist) above certain threshold and export in excel file.
To visualize what I have for example:

ID	year	var1	var2	var3
1	2001	.	4	7
1	2002	.	4	7
1	2003	.	4	.
1	2004	333	4	.
1	2005	333	4	.
1	2006	333	4	.
2	2004	55	4	.
2	2005	6	.	8
2	2006	.	.	8
3	2003	7	.	.
3	2004	7	.	5

And for instance, if I say I want to generate dummy variable to treat missing var1-3 above 50% of number of observations by each ID (-> and round them), then:

ID	year	var1	var2	var3	threshold	missing_var1	missing_var2	missing_var3
1	2001	.	4	7	3	1	0	1
1	2002	.	4	7	3	1	0	1
1	2003	.	4	.	3	1	0	1
1	2004	333	4	.	3	1	0	1
1	2005	333	4	.	3	1	0	1
1	2006	333	4	.	3	1	0	1
2	2004	55	4	.	2	0	1	0
2	2005	6	.	8	2	0	1	0
2	2006	.	.	8	2	0	1	0
3	2003	7	.	.	1	0	1	1
3	2004	7	.	5	1	0	1	1

missing_var should be either 0 or 1 by each ID; threshold should be calculated by 50% * (nb of obs by ID) and round them.
And based on this, I would like to compute share of missing values at ID-level. So in this example, share of missing values of var1 would be 1/3; of var2 would be 2/3 and var3 would be 2/3.

Could someone help me to implement this in stata, please ?

Thank you !!

Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10058

27 Mar 2025, 08:07

Code:

** Example generated by -dataex-. For more info, type help dataex
clear
input byte id int(year var1) byte(var2 var3)
1 2001   . 4 7
1 2002   . 4 7
1 2003   . 4 .
1 2004 333 4 .
1 2005 333 4 .
1 2006 333 4 .
2 2004  55 4 .
2 2005   6 . 8
2 2006   . . 8
3 2003   7 . .
3 2004   7 . 5
end

frame put *, into(count)
frame count{
    reshape long var, i(id year) j(which)
    bys id which: gen halfobs=_N/2
    by id which: egen miss=total(missing(var))
    contract id which halfobs miss, freq(wanted)
    bys which (id): replace wanted= sum((miss>=halfobs)/_N)
    by which: replace wanted= wanted[_N]
}

You can additionally use contract if you want to export the shares per variable (identifier denoted by 'which' in my code).

Res.:

Code:

. frame count: format wanted %4.3f

. frame count: list, sepby(which)

     +--------------------------------------+
     | id   which   halfobs   miss   wanted |
     |--------------------------------------|
  1. |  1       1         3      3    0.333 |
  2. |  2       1       1.5      1    0.333 |
  3. |  3       1         1      0    0.333 |
     |--------------------------------------|
  4. |  1       2         3      0    0.667 |
  5. |  2       2       1.5      2    0.667 |
  6. |  3       2         1      2    0.667 |
     |--------------------------------------|
  7. |  1       3         3      4    0.667 |
  8. |  2       3       1.5      1    0.667 |
  9. |  3       3         1      1    0.667 |
     +--------------------------------------+

Last edited by Andrew Musau; 27 Mar 2025, 08:28.

Comment

Anne-Claire Jo

Join Date: Feb 2021

Posts: 132
#3

27 Mar 2025, 08:59

Andrew Musau Thank you very much for your reply!
I actually don't get what 'which' is.. And I actually wanted to computed nb of missing values by ID for each variables (var1-3) and compute their share for each variables as well.
Sorry if i wasn't clear..

And I am new to frame command, but can it be exported in excel and then go back to the dataset that I was using?
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10058

27 Mar 2025, 13:11

Originally posted by Anne-Claire Jo View Post

Andrew MusauAnd I actually wanted to computed nb of missing values by ID for each variables (var1-3) and compute their share for each variables as well.

I think #2 does what you ask for, and the shares 0.667,0.667, 0.333 = \(\frac{2}{3}, \frac{2}{3}, \frac{1}{3}\) correspond to your calculation. If this is not the case, point out where the error is.

I actually don't get what 'which' is

As I stated, which is the variable identifier. So, which = 1 refers to 'var1', which = 2 refers to 'var2', and so on. Perhaps you used 'var1', 'var2', and 'var3' as proxies for some variable names. That will affect the reshape command, as the stubname in this case is 'var'. The code below assumes that your variables do not have a common stubname.

And I am new to frame command, but can it be exported in excel and then go back to the dataset that I was using?

A frame is just a separate dataset in memory that allows you to work with multiple datasets simultaneously without overwriting or replacing the active dataset. frames were introduced in Stata 16 to facilitate data management, merging, and analysis across multiple datasets. So the active dataset is never affected. To export the dataset to Excel, just use export excel as usual.

Code:

frame count: export excel using myfile, replace

Here is the modification to accommodate variable names without a common stubname.

Code:

** Example generated by -dataex-. For more info, type help dataex
clear
input byte id int(year alpha) byte(beta gamma)
1 2001   . 4 7
1 2002   . 4 7
1 2003   . 4 .
1 2004 333 4 .
1 2005 333 4 .
1 2006 333 4 .
2 2004  55 4 .
2 2005   6 . 8
2 2006   . . 8
3 2003   7 . .
3 2004   7 . 5
end

frame put *, into(count)
frame count{
    rename (alpha-gamma) var=
    reshape long var, i(id year) j(which) string
    bys id which: gen halfobs=_N/2
    by id which: egen miss=total(missing(var))
    contract id which halfobs miss, freq(wanted)
    bys which (id): replace wanted= sum((miss>=halfobs)/_N)
    by which: replace wanted= wanted[_N]
}

Res.:

Code:

. frame count: format wanted %4.3f

. frame count: list, sepby(which)

     +--------------------------------------+
     | id   which   halfobs   miss   wanted |
     |--------------------------------------|
  1. |  1   alpha         3      3    0.333 |
  2. |  2   alpha       1.5      1    0.333 |
  3. |  3   alpha         1      0    0.333 |
     |--------------------------------------|
  4. |  1    beta         3      0    0.667 |
  5. |  2    beta       1.5      2    0.667 |
  6. |  3    beta         1      2    0.667 |
     |--------------------------------------|
  7. |  1   gamma         3      4    0.667 |
  8. |  2   gamma       1.5      1    0.667 |
  9. |  3   gamma         1      1    0.667 |
     +--------------------------------------+

.

Comment

Anne-Claire Jo

Join Date: Feb 2021
Posts: 132

30 Mar 2025, 17:17

Andrew Musau
Thanks for your reply!

I honestly don't feel very much confident with understanding the codes you provided - sorry, I am not very strong with stata..
Is your codes do similar output as the codes below? :

Code:

    
foreach var of global vars_to_check{
    preserve
    
    bys $id: gen halfobs = _N / 2
    // Count nb of missing obs of variable by ID
    egen miss_`var' = total(missing(`var')), by($id)
    * == 1 if nb of missing obs is larger than 50% of its nb of observations
    bys $id: gen missing_`var' = (miss_`var' >= halfobs)
    * drop duplicates
    bys $id: keep if _n == 1
    * count total nb of firms with missing variable
    egen nb_missing_`var' = total(missing_`var')
    * Compute share
    gen s_m_v2_`var' = nb_missing_`var' / _N
    
    restore
}

you can ignore details on variable names, maybe just the underlying logic.
Maybe I am hard-coding or coding in inefficient way what you are doing..

Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1309

31 Mar 2025, 04:21

Here is some code that is similar to yours, Anne-Claire Jo , but without the need to -preserve- and -restore-. Andrew's code achieves the same thing, but this might be easier to follow.

Code:

global vars_to_check alpha beta gamma
global id id

bys $id: gen halfobs = _N / 2
egen tagged = tag(id) // == 1 for one observation per firm
egen num_firms = total(tagged)

foreach var of global vars_to_check {
    // Count nb of missing obs of variable by ID
    egen miss_`var' = total(missing(`var')), by($id)
}

keep if tagged
drop tagged

foreach var of global vars_to_check {
    * == 1 if nb of missing obs is larger than 50% of its nb of observations
    gen missing_`var' = (miss_`var' >= halfobs)
    * count total nb of firms with missing variable
    egen nb_missing_`var' = total(missing_`var')
    * Compute share
    gen s_m_v2_`var' = nb_missing_`var' / num_firms
}

which produces one observation for each firm, with the following values of shares:

Code:

. list s_m_v2_* in 1, abbrev(12) noobs

  +-------------------------------------------+
  | s_m_v2_alpha   s_m_v2_beta   s_m_v2_gamma |
  |-------------------------------------------|
  |     .3333333      .6666667       .6666667 |
  +-------------------------------------------+

Comment

Anne-Claire Jo

Join Date: Feb 2021

Posts: 132
#7

31 Mar 2025, 07:30

Thanks for your help !! It is clear
Comment

Announcement

Compute share of missing values by ID above certain threshold

Comment

Comment

Comment

Comment

Comment

Comment