Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Some dates are not getting sorted in ascending order using sort command

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str7 date_string long countyfips
    "3/15/20" 1001
    "3/16/20" 1001
    "3/17/20" 1001
    "3/18/20" 1001
    "3/19/20" 1001
    "3/20/20" 1001
    "3/21/20" 1001
    "3/22/20" 1001
    "3/23/20" 1001
    "3/24/20" 1001
    "3/25/20" 1001
    "3/26/20" 1001
    "3/27/20" 1001
    "3/28/20" 1001
    "3/29/20" 1001
    "3/30/20" 1001
    "3/31/20" 1001
    "4/1/20"  1001
    "4/10/20" 1001
    "4/11/20" 1001
    "4/12/20" 1001
    "4/13/20" 1001
    "4/14/20" 1001
    "4/15/20" 1001
    "4/16/20" 1001
    "4/17/20" 1001
    "4/18/20" 1001
    "4/19/20" 1001
    "4/2/20"  1001
    "4/20/20" 1001
    "4/21/20" 1001
    "4/22/20" 1001
    "4/23/20" 1001
    "4/24/20" 1001
    "4/25/20" 1001
    "4/26/20" 1001
    "4/27/20" 1001
    "4/28/20" 1001
    "4/29/20" 1001
    "4/3/20"  1001
    "4/30/20" 1001
    "4/4/20"  1001
    "4/5/20"  1001
    "4/6/20"  1001
    "4/7/20"  1001
    "4/8/20"  1001
    "4/9/20"  1001
    "5/1/20"  1001
    "5/2/20"  1001
    "5/3/20"  1001
    end
    As can be seen, there is some problem with the first week of April. The command sort is properly sorting for all other days except the first week of April. This problem is uniform across all counties. The same dates, viz. the first week of April are getting missorted. What could be going wrong? The dates for Feb, March, May etc are in the same format but there is no problem for these months.

    I have tried encode and sorted using the newly generated long format. The issue still persists.

    I have tried using date function: gen date = date(date_string, "MDY"). It generates missing values even though there are no missing values in date_string.

    Thanks.

  • #2
    Yes, string variables that look like dates will not generally sort in correct chronological order. Dates stored as string variables actually are pretty much useless in Stata. And -encoding- them doesn't help. In fact, I can't think of a single situation in which applying -encode- to a date variable will be useful. You were on the right track with using the -date()- function. What you missed is that these string dates are not in "MDY" format because the "MDY' format requires a four-digit year. On the assumption that these dates are all in the current century, you can fix this with:
    Code:
    gen date = date(date_string, "MD20Y")
    format date %td
    sort date

    Comment


    • #3
      Your last paragraph is the best idea, but needs a tweak to make century explicit. This is all documented under


      Code:
      help date()

      You need date(date_string, "MDY", 2025).

      The first two solutions don't work for compelling reasons. For example, as a string 3 sorts after 19 and after 29, and so on, because sort is taking strings literally and has zero awareness that the data are dates. encode produces nonsense here for the same reason: it works with alphabetical order, so that e.g. "1/1/23" sorts before "31/12/99" even though many small children can explain why that is wrong for dates.

      For dates you need date functions unless you understand exactly what you are doing otherwise. Even then, you probably need date functions.

      Added: Crossed with #2. If he's Clyde, I must be Bonnie.

      Comment


      • #4
        The issue is that the 2-digit year is ambiguous, so you need to give Stata a hint about what century you're in. This is explained clearly in -help date-. Two different options are provided, yielding the same end result.

        Code:
        gen date1 = date(date_string, "MD20Y")
        gen date2 = date(date_string, "MDY", 2023)
        format date1 date2 %td
        assert date1==date2
        sort date1
        Note that -encode- is always inappropriate for dates, and generally for converting string representations of numbers to numeric values.
        Last edited by Leonardo Guizzetti; 19 Apr 2023, 10:56. Reason: Edit: crossed with Bonnie & Clyde

        Comment


        • #5
          Great, that worked! Thanks a lot Clyde, Bonnie and Leonardo. Specifying the era in 4 digits - whether the Great Depression or the Great Recession - is important in Stata, because sorting literally means numerical ascending order rather than actual date/time chronology.

          Comment


          • #6
            Bonnie
            ???

            Comment


            • #7
              See also https://journals.sagepub.com/doi/epd...867X1801800413 for an overview that includes this problem.

              Comment


              • #8
                Thanks Nick! Looks like this comprehensive guide has everything that we need to deal with those strings.

                Comment

                Working...
                X