Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generating missing values by date

    Hi all,

    I'm currently working on a school-month panel dataset with some data about individual courses in those schools. The way we received the data, the dates are mostly consistent and comprehensive, but there are some random gaps that persist for each school. However, we want there to still be rows for these date gaps (within each school) and just have the course values be missing data.

    Here's a small example:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str11 Schools byte(Math Science Engineering) int Date
    "Millerville" 43 45 87 19905
    "Millerville" 66 65 56 20148
    "Millerville" 56 76 34 20149
    "Millerville" 45 78 34 20150
    "Millerville" 56 37 34 20151
    "Millerville" 77 58 56 20152
    "Millerville" 87 96 45 20154
    "Millerville" 88 74 45 20155
    "Springer"    90 64 65 19997
    "Springer"    44 25 55 20179
    "Springer"    54 64 15 20180
    "Springer"    54 78 26 20181
    "Springer"    34 65 37 20182
    "Allston"     65 45 36 20089
    "Allston"     52 29 75 20120
    "Allston"     23 64 86 20209
    "Allston"     53 78 45 20240
    "Allston"     86 93 75 20270
    "Allston"     57 47 45 20301
    "Phoenix"     45 45 34 20089
    "Phoenix"     34 23 99 20240
    "Phoenix"     66 54 34 20270
    "Phoenix"     45 78 54 20301
    "Phoenix"     96 44 46 20332
    "Phoenix"     46 75 65 20362
    end
    format %tdnn/dd/CCYY Date

  • #2
    No point to doing that so far as I can see. What would you do with the observations with missing values?

    Comment


    • #3
      Good question. I would guess that the person who requested it wants to be able to easily/visually see where we have missing months? Or maybe tally up total missings by site, or by month, etc...

      Comment


      • #4
        You can do it but you've something much more serious to worry about. Looking at your data carefully I see a funny mix of dates some of which are the first day of each month -- so these data could be monthly, just presented in terms of the those daily dates -- and some others. So you have to work at that. You may even need to look at the data source.

        Code:
        . format Date %td 
        
        . gen MDate = mofd(Date) 
        
        . format MDate %tm 
        
        . list, sepby(Schools) 
        
             +---------------------------------------------------------------+
             |     Schools   Math   Science   Engine~g        Date     MDate |
             |---------------------------------------------------------------|
          1. | Millerville     43        45         87   01jul2014    2014m7 |
          2. | Millerville     66        65         56   01mar2015    2015m3 |
          3. | Millerville     56        76         34   02mar2015    2015m3 |
          4. | Millerville     45        78         34   03mar2015    2015m3 |
          5. | Millerville     56        37         34   04mar2015    2015m3 |
          6. | Millerville     77        58         56   05mar2015    2015m3 |
          7. | Millerville     87        96         45   07mar2015    2015m3 |
          8. | Millerville     88        74         45   08mar2015    2015m3 |
             |---------------------------------------------------------------|
          9. |    Springer     90        64         65   01oct2014   2014m10 |
         10. |    Springer     44        25         55   01apr2015    2015m4 |
         11. |    Springer     54        64         15   02apr2015    2015m4 |
         12. |    Springer     54        78         26   03apr2015    2015m4 |
         13. |    Springer     34        65         37   04apr2015    2015m4 |
             |---------------------------------------------------------------|
         14. |     Allston     65        45         36   01jan2015    2015m1 |
         15. |     Allston     52        29         75   01feb2015    2015m2 |
         16. |     Allston     23        64         86   01may2015    2015m5 |
         17. |     Allston     53        78         45   01jun2015    2015m6 |
         18. |     Allston     86        93         75   01jul2015    2015m7 |
         19. |     Allston     57        47         45   01aug2015    2015m8 |
             |---------------------------------------------------------------|
         20. |     Phoenix     45        45         34   01jan2015    2015m1 |
         21. |     Phoenix     34        23         99   01jun2015    2015m6 |
         22. |     Phoenix     66        54         34   01jul2015    2015m7 |
         23. |     Phoenix     45        78         54   01aug2015    2015m8 |
         24. |     Phoenix     96        44         46   01sep2015    2015m9 |
         25. |     Phoenix     46        75         65   01oct2015   2015m10 |
             +---------------------------------------------------------------+

        Comment


        • #5
          Yes, you're right--that's because I was looking at two data sources, one of which did not have dates, and just tried to hand-create some dates in Excel for the purpose of the -dataex- here. So I went and did the actual merging and it should be accurate here (some slightly different data because it was changed/sorted a bit differently now that it's merged but functionally it's all the same). They should all be the 1st of the month, essentially it's just representing a month.

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str11 Schools byte(Math Science Engineering) int Date float(Month Year)
          "Millerville" 43 45 87 19905  7 2014
          "Millerville" 66 65 56 20148  3 2015
          "Millerville" 45 78 34 20179  4 2015
          "Millerville" 56 37 34 20209  5 2015
          "Millerville" 77 58 56 20240  6 2015
          "Millerville" 87 96 45 20270  7 2015
          "Millerville" 88 74 45 20301  8 2015
          "Springer"    90 64 65 19997 10 2014
          "Springer"    44 25 55 20028 11 2014
          "Springer"    54 64 15 20270  7 2015
          "Springer"    54 78 26 20301  8 2015
          "Springer"    34 65 37 20332  9 2015
          "Phoenix"     45 45 34 20120  2 2015
          "Phoenix"     34 23 99 20148  3 2015
          "Phoenix"     66 54 34 20270  7 2015
          "Phoenix"     45 78 54 20301  8 2015
          "Phoenix"     96 44 46 20332  9 2015
          "Phoenix"     46 75 65 20362 10 2015
          "Phoenix"     59 58 51 20393 11 2015
          "Phoenix"     60 22 51 20423 12 2015
          end
          format %td Date

          Comment


          • #6
            OK, so you can do something like this


            Code:
            . gen MDate = mofd(Date) 
            
            . format MDate %tm
            
            . egen NSchools = group(Schools), label 
            
            . tsset NSchools MDate 
                   panel variable:  NSchools (unbalanced)
                    time variable:  MDate, 2014m7 to 2015m12, but with gaps
                            delta:  1 month
            
            . tsfill
            That last step using tsfill is a good way to make a big dataset even bigger without adding information. Even the number of missing values can be calculated directly as the complement of the number of present values.

            Comment


            • #7
              Thanks Nick. It seems like -tsfill- is exactly what was needed.

              Would it be most efficient, from here, to treat NSchools as the new school category variable instead of the original Schools? Given that NSchools still retains school names after -tsfill-, and Schools has missing values now where the dataset was expanded, that would seem to be best. NSchools is a float and not a string, so I'll try to convert it back to string after the -tsfill-

              Comment


              • #8
                What's efficient, or indeed what means (storage? speed? code length?), is not the main point. To use tsfill you need to use tsset and to use tsset you must have a numeric panel identifier variable. I don't know why you want to convert NSchools to string. If you are sure it's not needed you just drop it and work with Schools. But then you'll not be able to use anything that depends on tsset. Conversely the same long string values again and again are usually more space-consuming than a numeric variable with value labels. If you're worried about efficiency in the sense of storage or speed you should want to keep your dataset as small as possible.
                Last edited by Nick Cox; 19 Jun 2018, 12:27.

                Comment


                • #9
                  Perhaps this is just how I was taught, but my initial thought about converting NSchools to string is because if I want to do some comparisons of scores by school, then the school variable should be string...but as I'm writing this out, I'm realizing that that isn't correct.

                  Comment


                  • #10
                    I didn't suggest dropping the original string variable. It's still there if you want to use it.

                    Comment


                    • #11
                      But the original string variable now has missings where -tsfill- added in missing values, right? Although I suppose it won't be an issue if someone is conducting analyses on the non-missing data. Probably best to just leave both the string and float variables in there.

                      Comment


                      • #12
                        You can interpolate the strings. Known problem, known solution. https://www.statalist.org/forums/for...-interpolation

                        Comment

                        Working...
                        X