Generate single date variable from year and month variables

Sandra Bloem

Join Date: Jun 2020

Posts: 106
#1

Generate single date variable from year and month variables

01 Apr 2021, 11:58

Hi there,

I am struggling to create a date variable. I have found some information on using -mdy- and -date-, but this won't seem to work with the type of data I have.

I have two variables available (storage type double): year and month. I need to create a new variable that combines these two together basically (I don't have data on the day)

For some observations I don't have any data (so year and month are missing), and for some observations only 'month' has a missing value.
When only the year is available, the new date variable should contain the year value (i.e. in this case the new variable should not get a missing value).

Could someone help me? Thank you in advance (:

Here is some example data.

Code:

clear all input double(id year month) 1 1992 4 2 1995 11 3 . . 4 2007 3 5 2004 . end
Tags: None
Andrew Musau

Join Date: Oct 2014

Posts: 10036
#2

01 Apr 2021, 12:01

For some observations I don't have any data (so year and month are missing), and for some observations only 'month' has a missing value.
When only the year is available, the new date variable should contain the year value (i.e. in this case the new variable should not get a missing value).

How is it then a year-month variable? The values do mean something. 2020m3 is exactly 3 months from 2019m12. What should 2018 represent?

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input double(id year month) 1 1992 4 2 1995 11 3 . . 4 2007 3 5 2004 . end gen wanted= ym(year, month) format wanted %tm

Res.:

Code:

. l +-----------------------------+ | id year month wanted | |-----------------------------| 1. | 1 1992 4 1992m4 | 2. | 2 1995 11 1995m11 | 3. | 3 . . . | 4. | 4 2007 3 2007m3 | 5. | 5 2004 . . | +-----------------------------+ .

Last edited by Andrew Musau; 01 Apr 2021, 12:04.
1 like
Comment
Sandra Bloem

Join Date: Jun 2020

Posts: 106
#3

01 Apr 2021, 12:08

Andrew Musau

Thank you so much! This definitely helps a lot.

I have one question remaining. For observation no. 5, the wanted variable has a missing value, while it should specify the year 2004. Is there a way to do this?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35372
#4

01 Apr 2021, 12:17

If you don't know the month, you don't know the monthly date. Your only alternatives are (1) imputing a notional month, preferably with a sound reason for doing that (2) imputing month 1 to 12 randomly, so that across observations you don't have bias (3) finding out the correct value (usually but not always impossible).
1 like
Comment
Sandra Bloem

Join Date: Jun 2020

Posts: 106
#5

01 Apr 2021, 12:34

Nick Cox

Okay, thank you for the clarification.

Would it be possible to obtain the proportions of the available months (i.e. the values 1–12) and then use this to replace the missing values, such that the proportions of each month stay the same?
If for example May (i.e. 5) is the month 14% of the time, then the missed values should be replaced with "5" 14% of the time.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35372

01 Apr 2021, 13:46

Yes, you can do that. To cut to the essence of the problem, let's imagine a problem of imputing quarters 1 to 4. Then "your value of 4" is 12.

First, let's create a dataset with known and unknown quarters:

Code:

 
. clear

. set obs 20
number of observations (_N) was 0, now 20

. set seed 2803

. gen quarter = cond(_n == 1, 1, cond(_n <= 3, 2, cond(_n <= 6, 3, cond(_n <= 10, 4, .))))
(10 missing values generated)

. tab quarter, missing

    quarter |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1        5.00        5.00
          2 |          2       10.00       15.00
          3 |          3       15.00       30.00
          4 |          4       20.00       50.00
          . |         10       50.00      100.00
------------+-----------------------------------
      Total |         20      100.00

The empirical probabilities are 0.1 0.2 0.3 0.4 because we created values with a deterministic rule. In practice, you would estimate the probabilities from your data.

Mata has a convenient function which for some reason is not (yet) in Stata. We create a variable and then fill in with values created in Mata.

Code:

 

. mata : st_addvar("byte", "imputed")
  2

. mata : st_store(., "imputed", rdiscrete(20, 1, (0.1, 0.2, 0.3, 0.4)))

The arguments for rdiscrete() are number of rows, number of columns, vector of probabilities of values 1 up.

Code:

. tab imputed

    imputed |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          3       15.00       15.00
          2 |          6       30.00       45.00
          3 |          2       10.00       55.00
          4 |          9       45.00      100.00
------------+-----------------------------------
      Total |         20      100.00

. gen better = cond(quarter < ., quarter, imputed)

. l

     +----------------------------+
     | quarter   imputed   better |
     |----------------------------|
  1. |       1         4        1 |
  2. |       2         4        2 |
  3. |       2         2        2 |
  4. |       3         3        3 |
  5. |       3         2        3 |
     |----------------------------|
  6. |       3         4        3 |
  7. |       4         1        4 |
  8. |       4         1        4 |
  9. |       4         1        4 |
 10. |       4         3        4 |
     |----------------------------|
 11. |       .         4        4 |
 12. |       .         4        4 |
 13. |       .         2        2 |
 14. |       .         2        2 |
 15. |       .         2        2 |
     |----------------------------|
 16. |       .         4        4 |
 17. |       .         4        4 |
 18. |       .         4        4 |
 19. |       .         4        4 |
 20. |       .         2        2 |
     +----------------------------+

All sorts of small twists are possible. In practice, we only need to impute as many values as are missing.

Comment

Sandra Bloem

Join Date: Jun 2020

Posts: 106
#7

01 Apr 2021, 13:59

Nick Cox

Wow, thank you so much for the detailed explanation. I really appreciate it. This helps me a ton (:

Have a blessed day!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35372
#8

01 Apr 2021, 14:07

https://www.stata-journal.com/articl...article=pr0032 is a little dated in terms of code detail but otherwise another version of the main idea;.
1 like
Comment
Sandra Bloem

Join Date: Jun 2020

Posts: 106
#9

01 Apr 2021, 14:09

Nick Cox

Thank you for all the information (:
Comment

Announcement