Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to ask Stata to keep only the most recent nonmissing observations of a variable?

    Dear Statalisters,

    I hope you are all fine. I am currently using a file gathering data on employment by sector for a set of countries and years. Here's an example generated by the command -dataex-. I had to alter the data with random numbers for the sake of confidentiality:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str3 country int year str2 sector double employ
    "C1" 2015 " 1" 52
    "C1" 2016 " 1" 56
    "C1" 2017 " 1"    .
    "C1" 2018 " 1"    .
    "C1" 2015 " 2" 23
    "C1" 2016 " 2" 28
    "C1" 2017 " 2" 22
    "C1" 2018 " 2" 54
    "C1" 2019 " 2" 43
    "C1" 2015 " 3"    .
    "C1" 2016 " 3"    .
    "C1" 2017 " 3" 65
    "C1" 2018 " 3" 58
    "C1" 2019 " 3" 26
    "C1" 2015 " 4"    .
    "C1" 2017 " 4"    .
    "C1" 2018 " 4"    .
    "C1" 2019 " 4"    .
    "C1" 2015 " 5" 53
    "C1" 2016 " 5" 55
    end

    I would like to ask Stata to keep only observations for employ with the most recent nonmissing observations in each sector number. In the case where observations are missing for all years (as in sector 4 in the example displayed above), then I would like Stata to keep the missing value for the most recent year. What I would like should look like something like this :

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str3 country int year str2 sector double employ
    "C1" 2016 " 1" 56
    "C1" 2019 " 2" 43
    "C1" 2019 " 3" 26
    "C1" 2019 " 4"    .
    "C1" 2016 " 5" 55
    end
    Could anyone help me on this request? Any advice would be greatly appreciated.

    Regards,

    Hugo


  • #2
    Thank you for using -dataex- on your very first post!

    Code:
    gen byte non_missing_employ = !missing(employ)
    by country sector (non_missing_employ year), sort: keep if _n == _N

    Comment


    • #3
      Clyde Schechter gives excellent advice as usual, but it's hard for me to imagine that the missing values are worth keeping. If not then

      Code:
      drop if missing(employ) 
      bysort country_sector (year) : keep if _n == _N
      gets you there a little more directly.

      Comment


      • #4
        If the missing values are not worth keeping, then Nick's approach in #3 is simpler and therefore preferred. But in #1, O.P. explicitly asks that if there are no non-missing values, the observation with the most recent date is to be preserved. So I assume that, for some reason, at least some observations with missing values are needed going forward.

        Comment


        • #5
          Clyde & Nick: I thank both of you for your suggestions. Your code worked perfectly well. Indeed, I'd like to know the number of missing observations remaining after this little dataset transformation. But I agree on the fact that they do not have any intrinsic value per se.

          Thanks again for your code !

          Comment


          • #6
            Dear Statalisters,

            I hope you are all fine. I'd like to ask you a slightly different question from the one I asked in #1. I think it is appropriate to ask it on this thread rather than on a new one.

            Once I ran the code given above by Clyde, I noticed than the number of observations for employ I ended up with was greater than the number of observations for employ in 2019. This is due to the fact that there were sectors who weren't even available in 2019 (not even as missing values), as in sector 1. The code naturally displayed those omitted sectors when it was available in previous years, as expected in the second dataex provided in #1.

            But now I would like to stay within the number of observations of 2019 only, i.e. I would like to keep the number of total observations (nonmissing + missing) fixed. Let's suppose my data is summarized in:

            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str3 country int year str2 sector double employ
            "C1" 2015 " 1" 52
            "C1" 2016 " 1" 56
            "C1" 2017 " 1"  .
            "C1" 2018 " 1"  .
            "C1" 2015 " 2" 23
            "C1" 2016 " 2" 28
            "C1" 2017 " 2" 22
            "C1" 2018 " 2" 54
            "C1" 2019 " 2" 43
            "C1" 2015 " 3"  .
            "C1" 2016 " 3"  .
            "C1" 2017 " 3" 65
            "C1" 2018 " 3" 58
            "C1" 2019 " 3" 26
            "C1" 2015 " 4"  .
            "C1" 2017 " 4"  .
            "C1" 2018 " 4"  54
            "C1" 2019 " 4"  .
            "C1" 2015 " 5" 53
            "C1" 2016 " 5" 55
            end

            There are only three observations for year 2019. I would like to replace missing 2019 values by previous nonmissing values but only for the sectors where there is an observation in 2019 (e.g. sector 1 shouldn't be included). What I would like now should look like :

            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str3 country int year str2 sector double employ
            "C1" 2019 " 2" 43
            "C1" 2019 " 3" 26
            "C1" 2018 " 4" 54
            end
            Any help would be appreciated.

            Hugo

            Comment


            • #7
              Code:
              bys country sector (year): replace employ= employ[_n-1] if missing(employ) & !missing(employ[_n-1])

              Comment


              • #8
                Andrew : The code was exactly what I needed, thanks a lot !

                The only problem is that the replaced values now take "2019" as a year, which is technically wrong. Is there any way I can keep the previous year in the var "year" whenever employ is replaced?

                Thanks again,

                Hugo

                Comment


                • #9
                  Originally posted by Hugo Denis View Post
                  The only problem is that the replaced values now take "2019" as a year, which is technically wrong. Is there any way I can keep the previous year in the var "year" whenever employ is replaced?
                  Wouldn't this just amount to duplicating observations?


                  Code:
                  * Example generated by -dataex-. For more info, type help dataex
                  clear
                  input str3 country int year str2 sector double employ
                  "C1" 2015 " 1" 52
                  "C1" 2016 " 1" 56
                  "C1" 2017 " 1"  .
                  "C1" 2018 " 1"  .
                  "C1" 2015 " 2" 23
                  "C1" 2016 " 2" 28
                  "C1" 2017 " 2" 22
                  "C1" 2018 " 2" 54
                  "C1" 2019 " 2" 43
                  "C1" 2015 " 3"  .
                  "C1" 2016 " 3"  .
                  "C1" 2017 " 3" 65
                  "C1" 2018 " 3" 58
                  "C1" 2019 " 3" 26
                  "C1" 2015 " 4"  .
                  "C1" 2017 " 4"  .
                  "C1" 2018 " 4"  54
                  "C1" 2019 " 4"  .
                  "C1" 2015 " 5" 53
                  "C1" 2016 " 5" 55
                  end
                  
                  
                  gen tag= missing(employ)
                  bys country sector (year): replace employ= employ[_n-1] if missing(employ) & !missing(employ[_n-1])
                  bys country sector (year): replace year= year[_n-1] if tag & !tag[_n-1]
                  Res.:

                  Code:
                  . duplicates list country year sector employ
                  
                  Duplicates in terms of country year sector employ
                  
                    +--------------------------------------------------+
                    | group:   obs:   country   year   sector   employ |
                    |--------------------------------------------------|
                    |      1      2        C1   2016        1       56 |
                    |      1      3        C1   2016        1       56 |
                    |      2     17        C1   2018        4       54 |
                    |      2     18        C1   2018        4       54 |
                    +--------------------------------------------------+

                  Comment

                  Working...
                  X