
  • Keep observations for consecutive years

    Dear Statalist,

    I'm new here.
    I have a very large panel. The main problem is that there is no common start and end date (for example, one unit in the dataset may be observed from 1978 to 2000 and another from 1985 to 2010). I want to keep only those units that have every year reported between their start and end dates, whatever those two dates are. I can write the code when the start and end dates are the same for everyone, but in this case I cannot figure out the solution.

    Any thoughts?

    Thank you!
    Nicolò

  • #2
    Hi Nicolo,

    If the rest of your data set is complete (i.e. no missing data between the start and end date for each variable), the solution is quite simple:

    Code:
    drop if missing(var1) | missing(var2) | missing(var3)
    This drops an observation unless all three variables have non-missing values. If you want to keep some observations with missing values, however, it becomes a bit more complicated, and it is likely easier done by hand. In your example, it would look something like this:
    Code:
    drop if year<1985 | year>2000

    I hope this helps,

    Tim
    Last edited by Tim Umbach; 14 Oct 2017, 04:43.



    • #3
      Hi Tim,

      Thank you very much for the advice.
      However, what I mean is different; I did not express myself clearly, sorry.
      The dataset is composed of workers, the date each one entered the job market (the first date reported), and then information about wages until 2012. But obviously workers enter at different points in their lives, and some may have retired. Therefore every worker has a different date of entry into and a different date of "exit" from the job market.
      Sometimes, however, the dates are not reported correctly; the problem is not that I have missing values, but simply that there is a jump in the year. Something like this, for example:
      Year Worker_ID
      1990 1
      1991 1
      1993 1
      1995 1
      1996 1
      I don't have missing values; the year is simply not reported. Obviously the dates differ from worker to worker. So I need to keep only the workers who have all the years reported (let's suppose from 1990 to 1996), taking into account that the first and last years are different for almost every worker.
      Maybe this is exactly what you said; I'm a new Stata user!

      Thank you very much!
      Nicolò



      • #4
        Welcome to Statalist, Nicolò.

        This turns out to be easy if you see the shortcut: if a worker has no missing years between the earliest and the last, the worker will have as many observations as the difference between the two years, plus 1. (This does assume that no worker has two or more observations for the same year.) Then you need to know that the automatic variable _N is the number of observations in the dataset, or in the by-group when using the by prefix. Below I show code on your example, together with a second example where there is no missing year.
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input byte id int year
        1 1990
        1 1991
        1 1993
        1 1995
        1 1996
        2 1991
        2 1992
        2 1993
        2 1994
        2 1995
        2 1996
        2 1997
        end
        * Within each id (sorted by year), drop the worker unless last year - first year + 1 equals _N
        by id (year), sort: drop if year[_N]-year[1]+1 != _N
        list, clean
        Code:
        . list, clean
        
               id   year  
          1.    2   1991  
          2.    2   1992  
          3.    2   1993  
          4.    2   1994  
          5.    2   1995  
          6.    2   1996  
          7.    2   1997
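
        If some workers could have more than one observation for the same year, the observation-count shortcut above would no longer be reliable. A minimal sketch of a gap-based alternative, assuming the same id and year variables as in the example (gap and has_gap are names introduced here purely for illustration):

        ```stata
        clear
        input byte id int year
        1 1990
        1 1991
        1 1993
        1 1995
        1 1996
        2 1991
        2 1992
        2 1993
        2 1994
        2 1995
        2 1996
        2 1997
        end
        * gap is 1 when the year jumps by more than one from the previous
        * observation of the same worker (missing for each worker's first row)
        bysort id (year): generate byte gap = (year - year[_n-1] > 1) if _n > 1
        * has_gap is 1 on every row of a worker with at least one jump
        egen byte has_gap = max(gap), by(id)
        drop if has_gap == 1
        drop gap has_gap
        list, clean
        ```

        On the example data this should drop worker 1 (jumps from 1991 to 1993 and from 1993 to 1995) and keep worker 2. A worker with a single observation has no jump and is kept.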
        Last edited by William Lisowski; 14 Oct 2017, 11:10.



        • #5
          Hi William,

          thank you very much, this is clear and it worked perfectly.

          Regards,
          Nicolò
