How to ask Stata to keep only the most recent nonmissing observations of a variable?

Hugo Denis

Join Date: Apr 2022

Posts: 27
#1

How to ask Stata to keep only the most recent nonmissing observations of a variable?

25 Apr 2022, 09:01

Dear Statalisters,

I hope you are all fine. I am currently using a file gathering data on employment by sector for a set of countries and years. Here's an example generated by the command -dataex-. I had to alter the data with random numbers for the sake of confidentiality:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str3 country int year str2 sector double employ "C1" 2015 " 1" 52 "C1" 2016 " 1" 56 "C1" 2017 " 1" . "C1" 2018 " 1" . "C1" 2015 " 2" 23 "C1" 2016 " 2" 28 "C1" 2017 " 2" 22 "C1" 2018 " 2" 54 "C1" 2019 " 2" 43 "C1" 2015 " 3" . "C1" 2016 " 3" . "C1" 2017 " 3" 65 "C1" 2018 " 3" 58 "C1" 2019 " 3" 26 "C1" 2015 " 4" . "C1" 2017 " 4" . "C1" 2018 " 4" . "C1" 2019 " 4" . "C1" 2015 " 5" 53 "C1" 2016 " 5" 55 end

I would like to ask Stata to keep only observations for employ with the most recent nonmissing observations in each sector number. In the case where observations are missing for all years (as in sector 4 in the example displayed above), then I would like Stata to keep the missing value for the most recent year. What I would like should look like something like this :

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str3 country int year str2 sector double employ "C1" 2016 " 1" 56 "C1" 2019 " 2" 43 "C1" 2019 " 3" 26 "C1" 2019 " 4" . "C1" 2016 " 5" 55 end

Could anyone help me on this request? Any advice would be greatly appreciated.

Regards,

Hugo
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#2

25 Apr 2022, 10:15

Thank you for using -dataex- on your very first post!

Code:

gen byte non_missing_employ = !missing(employ) by country sector (non_missing_employ year), sort: keep if _n == _N
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35241
#3

25 Apr 2022, 11:02

Clyde Schechter gives excellent advice as usual, but it's hard for me to imagine that the missing values are worth keeping. If not then

Code:

drop if missing(employ) bysort country_sector (year) : keep if _n == _N

gets you there a little more directly.
1 like
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#4

25 Apr 2022, 11:08

If the missing values are not worth keeping, then Nick's approach in #3 is simpler and therefore preferred. But in #1, O.P. explicitly asks that if there are no non-missing values, the observation with the most recent date is to be preserved. So I assume that, for some reason, at least some observations with missing values are needed going forward.
1 like
Comment
Hugo Denis

Join Date: Apr 2022

Posts: 27
#5

25 Apr 2022, 15:24

Clyde & Nick: I thank both of you for your suggestions. Your code worked perfectly well. Indeed, I'd like to know the number of missing observations remaining after this little dataset transformation. But I agree on the fact that they do not have any intrinsic value per se.

Thanks again for your code !
Comment
Hugo Denis

Join Date: Apr 2022

Posts: 27
#6

01 May 2022, 20:35

Dear Statalisters,

I hope you are all fine. I'd like to ask you a slightly different question from the one I asked in #1. I think it is appropriate to ask it on this thread rather than on a new one.

Once I ran the code given above by Clyde, I noticed than the number of observations for employ I ended up with was greater than the number of observations for employ in 2019. This is due to the fact that there were sectors who weren't even available in 2019 (not even as missing values), as in sector 1. The code naturally displayed those omitted sectors when it was available in previous years, as expected in the second dataex provided in #1.

But now I would like to stay within the number of observations of 2019 only, i.e. I would like to keep the number of total observations (nonmissing + missing) fixed. Let's suppose my data is summarized in:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str3 country int year str2 sector double employ "C1" 2015 " 1" 52 "C1" 2016 " 1" 56 "C1" 2017 " 1" . "C1" 2018 " 1" . "C1" 2015 " 2" 23 "C1" 2016 " 2" 28 "C1" 2017 " 2" 22 "C1" 2018 " 2" 54 "C1" 2019 " 2" 43 "C1" 2015 " 3" . "C1" 2016 " 3" . "C1" 2017 " 3" 65 "C1" 2018 " 3" 58 "C1" 2019 " 3" 26 "C1" 2015 " 4" . "C1" 2017 " 4" . "C1" 2018 " 4" 54 "C1" 2019 " 4" . "C1" 2015 " 5" 53 "C1" 2016 " 5" 55 end

There are only three observations for year 2019. I would like to replace missing 2019 values by previous nonmissing values but only for the sectors where there is an observation in 2019 (e.g. sector 1 shouldn't be included). What I would like now should look like :

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str3 country int year str2 sector double employ "C1" 2019 " 2" 43 "C1" 2019 " 3" 26 "C1" 2018 " 4" 54 end

Any help would be appreciated.

Hugo
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9970

02 May 2022, 01:11

Code:

bys country sector (year): replace employ= employ[_n-1] if missing(employ) & !missing(employ[_n-1])

Comment

Hugo Denis

Join Date: Apr 2022

Posts: 27
#8

04 May 2022, 05:00

Andrew : The code was exactly what I needed, thanks a lot !

The only problem is that the replaced values now take "2019" as a year, which is technically wrong. Is there any way I can keep the previous year in the var "year" whenever employ is replaced?

Thanks again,

Hugo
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9970

04 May 2022, 08:54

Originally posted by Hugo Denis View Post

The only problem is that the replaced values now take "2019" as a year, which is technically wrong. Is there any way I can keep the previous year in the var "year" whenever employ is replaced?

Wouldn't this just amount to duplicating observations?

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str3 country int year str2 sector double employ
"C1" 2015 " 1" 52
"C1" 2016 " 1" 56
"C1" 2017 " 1"  .
"C1" 2018 " 1"  .
"C1" 2015 " 2" 23
"C1" 2016 " 2" 28
"C1" 2017 " 2" 22
"C1" 2018 " 2" 54
"C1" 2019 " 2" 43
"C1" 2015 " 3"  .
"C1" 2016 " 3"  .
"C1" 2017 " 3" 65
"C1" 2018 " 3" 58
"C1" 2019 " 3" 26
"C1" 2015 " 4"  .
"C1" 2017 " 4"  .
"C1" 2018 " 4"  54
"C1" 2019 " 4"  .
"C1" 2015 " 5" 53
"C1" 2016 " 5" 55
end


gen tag= missing(employ)
bys country sector (year): replace employ= employ[_n-1] if missing(employ) & !missing(employ[_n-1])
bys country sector (year): replace year= year[_n-1] if tag & !tag[_n-1]

Res.:

Code:

. duplicates list country year sector employ

Duplicates in terms of country year sector employ

  +--------------------------------------------------+
  | group:   obs:   country   year   sector   employ |
  |--------------------------------------------------|
  |      1      2        C1   2016        1       56 |
  |      1      3        C1   2016        1       56 |
  |      2     17        C1   2018        4       54 |
  |      2     18        C1   2018        4       54 |
  +--------------------------------------------------+

Announcement

How to ask Stata to keep only the most recent nonmissing observations of a variable?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment