Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generate single date variable from year and month variables

    Hi there,

    I am struggling to create a date variable. I have found some information on using -mdy- and -date-, but this won't seem to work with the type of data I have.

    I have two variables available (storage type double): year and month. I need to create a new variable that combines these two together basically (I don't have data on the day)

    For some observations I don't have any data (so year and month are missing), and for some observations only 'month' has a missing value.
    When only the year is available, the new date variable should contain the year value (i.e. in this case the new variable should not get a missing value).


    Could someone help me? Thank you in advance (:



    Here is some example data.


    Code:
    clear all
    input double(id year month)
    1 1992 4
    2 1995 11
    3 . .
    4 2007 3
    5 2004 .
    end

    Click image for larger version

Name:	Schermafbeelding 2021-04-01 om 19.47.22.png
Views:	1
Size:	103.8 KB
ID:	1601010


  • #2
    For some observations I don't have any data (so year and month are missing), and for some observations only 'month' has a missing value.
    When only the year is available, the new date variable should contain the year value (i.e. in this case the new variable should not get a missing value).
    How is it then a year-month variable? The values do mean something. 2020m3 is exactly 3 months from 2019m12. What should 2018 represent?


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double(id year month)
    1 1992  4
    2 1995 11
    3    .  .
    4 2007  3
    5 2004  .
    end
    
    gen wanted= ym(year, month)
    format wanted %tm
    Res.:

    Code:
    . l
    
         +-----------------------------+
         | id   year   month    wanted |
         |-----------------------------|
      1. |  1   1992       4    1992m4 |
      2. |  2   1995      11   1995m11 |
      3. |  3      .       .         . |
      4. |  4   2007       3    2007m3 |
      5. |  5   2004       .         . |
         +-----------------------------+
    
    .
    Last edited by Andrew Musau; 01 Apr 2021, 12:04.

    Comment


    • #3
      Andrew Musau

      Thank you so much! This definitely helps a lot.

      I have one question remaining. For observation no. 5, the wanted variable has a missing value, while it should specify the year 2004. Is there a way to do this?

      Comment


      • #4
        If you don't know the month, you don't know the monthly date. Your only alternatives are (1) imputing a notional month, preferably with a sound reason for doing that (2) imputing month 1 to 12 randomly, so that across observations you don't have bias (3) finding out the correct value (usually but not always impossible).

        Comment


        • #5
          Nick Cox

          Okay, thank you for the clarification.

          Would it be possible to obtain the proportions of the available months (i.e. the values 1–12) and then use this to replace the missing values, such that the proportions of each month stay the same?
          If for example May (i.e. 5) is the month 14% of the time, then the missed values should be replaced with "5" 14% of the time.

          Comment


          • #6
            Yes, you can do that. To cut to the essence of the problem, let's imagine a problem of imputing quarters 1 to 4. Then "your value of 4" is 12.

            First, let's create a dataset with known and unknown quarters:

            Code:
             
            . clear
            
            . set obs 20
            number of observations (_N) was 0, now 20
            
            . set seed 2803
            
            . gen quarter = cond(_n == 1, 1, cond(_n <= 3, 2, cond(_n <= 6, 3, cond(_n <= 10, 4, .))))
            (10 missing values generated)
            
            . tab quarter, missing
            
                quarter |      Freq.     Percent        Cum.
            ------------+-----------------------------------
                      1 |          1        5.00        5.00
                      2 |          2       10.00       15.00
                      3 |          3       15.00       30.00
                      4 |          4       20.00       50.00
                      . |         10       50.00      100.00
            ------------+-----------------------------------
                  Total |         20      100.00
            The empirical probabilities are 0.1 0.2 0.3 0.4 because we created values with a deterministic rule. In practice, you would estimate the probabilities from your data.

            Mata has a convenient function which for some reason is not (yet) in Stata. We create a variable and then fill in with values created in Mata.

            Code:
             
            
            . mata : st_addvar("byte", "imputed")
              2
            
            . mata : st_store(., "imputed", rdiscrete(20, 1, (0.1, 0.2, 0.3, 0.4)))
            The arguments for rdiscrete() are number of rows, number of columns, vector of probabilities of values 1 up.

            Code:
            . tab imputed
            
                imputed |      Freq.     Percent        Cum.
            ------------+-----------------------------------
                      1 |          3       15.00       15.00
                      2 |          6       30.00       45.00
                      3 |          2       10.00       55.00
                      4 |          9       45.00      100.00
            ------------+-----------------------------------
                  Total |         20      100.00
            
            . gen better = cond(quarter < ., quarter, imputed)
            
            . l
            
                 +----------------------------+
                 | quarter   imputed   better |
                 |----------------------------|
              1. |       1         4        1 |
              2. |       2         4        2 |
              3. |       2         2        2 |
              4. |       3         3        3 |
              5. |       3         2        3 |
                 |----------------------------|
              6. |       3         4        3 |
              7. |       4         1        4 |
              8. |       4         1        4 |
              9. |       4         1        4 |
             10. |       4         3        4 |
                 |----------------------------|
             11. |       .         4        4 |
             12. |       .         4        4 |
             13. |       .         2        2 |
             14. |       .         2        2 |
             15. |       .         2        2 |
                 |----------------------------|
             16. |       .         4        4 |
             17. |       .         4        4 |
             18. |       .         4        4 |
             19. |       .         4        4 |
             20. |       .         2        2 |
                 +----------------------------+
            All sorts of small twists are possible. In practice, we only need to impute as many values as are missing.

            Comment


            • #7
              Nick Cox

              Wow, thank you so much for the detailed explanation. I really appreciate it. This helps me a ton (:

              Have a blessed day!

              Comment


              • #8
                https://www.stata-journal.com/articl...article=pr0032 is a little dated in terms of code detail but otherwise another version of the main idea;.

                Comment


                • #9
                  Nick Cox

                  Thank you for all the information (:

                  Comment

                  Working...
                  X