Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating duration variable (HH:MM:SS) to include durations > 24:00:00

    Hello everyone,

    I have an administrative dataset with a variable containing durations a person spends in detention. The detention can be for up to 72 hours. There are ~250,000 observations. I have calculated the durations in Excel (HH:MM:SS). I am struggling to import the Excel durations into a manner that makes sense in Stata. This specific issue is how to represent periods beyond 24:00:00.

    I understand, at a very basic level, this is because, as Nick Cox has explained in a post some years ago "[d]urations are often best not regarded as date-times at all", in part because of Stata's anchoring of durations around 1 January 1960 and also because Stata works like a 24 hour clock.

    Here's how a snapshot of the data looks in Stata on import from Excel. My aim is to get these durations converted and formatted in a manner Stata make sense of so, ultimatey, I can construct a model that predicts how DURATIONVARIABLE relates to various independent variables in the dataset.

    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str10 CASENUMBER str9 DURATIONVARIABLE
    "1536866819" "2:02:10"
    "2858545465" "7:56:07"
    "2495167939" "12:09:29"
    "1376276231" "5:56:43"
    "4398713795" "13:24:53"
    "1983011019" "32:39:34"
    "1095268814" "5:28:22"
    "39381691" "12:52:29"
    "1562593376" "12:09:42"
    "3009819123" "8:29:43"
    "1950424550" "23:08:02"
    "3548316279" "5:29:01"
    "3384273420" "19:53:32"
    "2498842354" "35:30:55"
    "4852829451" "5:13:02"
    "75308479" "34:05:44"
    "281657202" "21:25:52"
    "1589321058" "14:19:53"
    "5081152858" "19:49:34"
    "4563397312" "12:15:28"

    end
    [/CODE]

    Any thoughts greatly appreciated!

    Richard

  • #2
    Your data example is very clear but I am not so clear what you want to do in Stata. But for modelling a numeric variable is crucial. Here is duration in seconds as an integer. Hours and minutes are clearly alternative units.


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str10 CASENUMBER str9 DURATIONVARIABLE
    "1536866819" "2:02:10"
    "2858545465" "7:56:07"
    "2495167939" "12:09:29"
    "1376276231" "5:56:43"
    "4398713795" "13:24:53"
    "1983011019" "32:39:34"
    "1095268814" "5:28:22"
    "39381691" "12:52:29"
    "1562593376" "12:09:42"
    "3009819123" "8:29:43"
    "1950424550" "23:08:02"
    "3548316279" "5:29:01"
    "3384273420" "19:53:32"
    "2498842354" "35:30:55"
    "4852829451" "5:13:02"
    "75308479" "34:05:44"
    "281657202" "21:25:52"
    "1589321058" "14:19:53"
    "5081152858" "19:49:34"
    "4563397312" "12:15:28"
    
    end
    
    split DURATIONVARIABLE, parse(":") destring gen(DUR)
    
    gen long duration = 3600 * DUR1 + 60 * DUR2 + DUR3 
    format duration %12.0f 
    
    list 
    
         +-------------------------------------------------------+
         | CASENUMBER   DURATI~E   DUR1   DUR2   DUR3   duration |
         |-------------------------------------------------------|
      1. | 1536866819    2:02:10      2      2     10       7330 |
      2. | 2858545465    7:56:07      7     56      7      28567 |
      3. | 2495167939   12:09:29     12      9     29      43769 |
      4. | 1376276231    5:56:43      5     56     43      21403 |
      5. | 4398713795   13:24:53     13     24     53      48293 |
         |-------------------------------------------------------|
      6. | 1983011019   32:39:34     32     39     34     117574 |
      7. | 1095268814    5:28:22      5     28     22      19702 |
      8. |   39381691   12:52:29     12     52     29      46349 |
      9. | 1562593376   12:09:42     12      9     42      43782 |
     10. | 3009819123    8:29:43      8     29     43      30583 |
         |-------------------------------------------------------|
     11. | 1950424550   23:08:02     23      8      2      83282 |
     12. | 3548316279    5:29:01      5     29      1      19741 |
     13. | 3384273420   19:53:32     19     53     32      71612 |
     14. | 2498842354   35:30:55     35     30     55     127855 |
     15. | 4852829451    5:13:02      5     13      2      18782 |
         |-------------------------------------------------------|
     16. |   75308479   34:05:44     34      5     44     122744 |
     17. |  281657202   21:25:52     21     25     52      77152 |
     18. | 1589321058   14:19:53     14     19     53      51593 |
     19. | 5081152858   19:49:34     19     49     34      71374 |
     20. | 4563397312   12:15:28     12     15     28      44128 |
         +-------------------------------------------------------+

    Comment


    • #3
      If you are looking for a way to convert DURATIONVARIABLE to a Stata datetime variable and then format it so it will look like 35:30:55 when Stata exhibits it, I'm pretty sure that cannot be done. If that is what you really need, then I think the best you can do is convert it to a Stata clock variable--which you can use for sorting and calculations, but then always show the DURATIONVARIABLE itself when -list-ing or -display-ing, etc.

      However, you might consider instead doing this:
      Code:
      split DURATIONVARIABLE, parse(:) gen(component) destring
      gen component0 = floor(component1/24), before(component1)
      replace component1 = mod(component1, 24)
      assert component0 <= 3
      
      gen duration = clockdiff_frac(cofd(td(1jan1960)), ///
      cofC(Cdhms(td(1jan1960)+component0, component1, component2, component3)), "h")
      format duration %5.4f
      This calculates a new variable, duration, which is the number of hours corresponding to duration, with the minutes and seconds shown as a decimal fraction. This variable can be used for chronological sorting and calculations, and is a reasonable way, in most contexts, of reporting a time duration. So instead of saying that somebody was detained for 32 hours, 39 minutes, and 34 seconds, saying they were detained for 32.6594 hours is mathematically equivalent (well, to 4 decimal places--you can change the display format to give you more or fewer, as you prefer) and is easy for everyone to grasp. You might even want, in some contexts, to round that off to the nearest half or quarter hour, which you can easily do with Stata's -round()- function.

      Added: Crossed with #2, where a similar overall approach is taken, but the results given in seconds rather than hours. FWIW, I think given the range of values of your variable, presenting the results in units of hours is better than seconds. Everyone has an intuitive grasp of what 32.7 hours is, whereas few people will grasp 117,574 seconds.
      Last edited by Clyde Schechter; 20 Nov 2024, 16:32.

      Comment


      • #4
        Indeed. I agree with Clyde Schechter and doubt that seconds are what anyone wants to see, but it's your choice what else to use. Use (e.g.)

        Code:
        gen double hours = seconds/3600

        Comment


        • #5
          Many thanks to Nick Cox and @Clyde Schechter for such interest and prompt replies. What a rich resource this forum is.

          Substantively:

          Replying to #2

          This seems a very sensible approach. Practically, when I try running the Code though, I'm struggling. Stata is communicating the following to me:


          ". split DURATIONVARIABLE, parse( : ) gen(component) destring
          variables born as string:
          component1 component2 component3
          component1: all characters numeric; replaced as int
          (312852 missing values generated)
          component2: all characters numeric; replaced as byte
          (312852 missing values generated)
          component3: all characters numeric; replaced as byte
          (312852 missing values generated)

          . gen component0 = floor(component1/24), before(component1)
          (312,852 missing values generated)

          . replace component1 = mod(component1, 24)
          (493 real changes made)

          . assert component0 <= 3
          312,857 contradictions in 315,089 observations
          assertion is false
          r(9);

          .
          . gen duration = clockdiff_frac(cofd(td(1jan1960)), ///
          too few ')' or ']'
          r(132);

          . cofC(Cdhms(td(1jan1960)+component0, component1, component2, component3)), "h"
          > )
          command cofC is unrecognized
          r(199);

          . format duration %5.4f
          variable duration not found
          r(111);

          . datatext
          command datatext is unrecognized
          r(199);"


          Replying to #3

          Apologies, for clarity: in Stata, I would like to create some basic descriptive statistics (e.g. average duration in dentention) but ultimatley some regression models that explore the relationship between duration in detention with offender and offence variables.

          Am I write in thinking that, using your method, the result is similar to Clyde's #3 in that we're arriving at a numeric integar and, having got in duration seconds, my next step would be to convert into hours/minutes?

          Thanks again - and for your patience with this follow up!
          Last edited by Richard James; 20 Nov 2024, 17:51.

          Comment


          • #6
            Replying to @Nick Cox @ #2

            Many thanks: your code worked successfully on my first dataset. Then, curiously, when I appplied it to the second dataset (observations for later period) I'm getting a type mismatch midway through. Any thoughts?

            . split DURATIONVARIABLE, parse(":") destring gen(DUR)
            variables born as string:
            DUR1 DUR2 DUR3
            DUR1: contains nonnumeric characters; no replace
            DUR2: all characters numeric; replaced as byte
            (103096 missing values generated)
            DUR3: all characters numeric; replaced as byte
            (103096 missing values generated)

            .
            . gen long duration = 3600 * DUR1 + 60 * DUR2 + DUR3
            type mismatch
            r(109);
            Last edited by Richard James; 20 Nov 2024, 17:50.

            Comment


            • #7
              Replying to #6:

              Evidently matters are different in the second dataset. As the split command failed to destring the hour component some other characters must be present other than numeric characters and the colon. You will need to look more closely at the data. Perhaps the problem lies in characters such as h hr hrs hour hours, none of which will be ignored unless you say so.

              The occurrence of missing values also suggests that the data are often if not always quite different from your data example.

              The generate command fails for the reason given by the previous output: DUR1 is string. However, the missing values on the other variables would also be infectious.

              Comment


              • #8
                Let's go over what's happening in #5. There are several issues.

                First all those missing values. It seems to be the case that in 312852 observations in your data set, the value of DURATIONVARIABLE is missing. Your example data in #1 did not show any missing values, so I assumed that the full data set has none. In any case, those are just notifications, not error messages, and aren't really problems. You can ignore those messages.

                Let's turn our attention to
                Code:
                . assert component0 <= 3
                312,857 contradictions in 315,089 observations
                assertion is false
                r(9);
                There are two different things going on here. First, we already know that 312852 observations have missing values for component0. And, in Stata, missing value is larger than any real number, hence not <= 3. So those are not problems. But that still leaves contradictions unaccounted for. Where are they coming from? You said in #1 that the detention times ranged up to 3 days. But Stata thinks otherwise. So after running the code up to, but not including that -assert- command, you should run:
                Code:
                list if component0 > 3 & !missing(component0)
                This will show you the five offending observations. Then you have to decide if your expectation that the durations would be non longer than 3 days was incorrect in the first place, or if these are data errors that need to be fixed. If the former, change -assert component0 <= 3- to -assert component0 <= # | missing(component0)-. replacing # by whatever the maximum number of days is really supposed to be.

                Next up:
                Code:
                . gen duration = clockdiff_frac(cofd(td(1jan1960)), ///
                too few ')' or ']'
                r(132);
                
                . cofC(Cdhms(td(1jan1960)+component0, component1, component2, component3)), "h"
                > )
                command cofC is unrecognized
                r(199);
                This is all one mistake, though there are two possibilities of which mistake it could be. The /// sequence at the end of the first line tells Stata that the command continues on the next line--without that (or some other method of explicitly telling Stata to seek the rest of the command on the next line) Stata assumes the end of the line entails the end of the command. Well, you can see for yourself that the code on that line is not a complete command. It doesn't even reach the end of the list of arguments for clockdiff_frac(), which Stata recognizes because there is no closing parenthesis to match the clockdiff_frac opening parenthesis. And then when you try to run the next line, Stata rightly complains that it doesn't start with a command, because -cofC()- is a function, not a command. The problem is that you tried to run the two lines of a single command separately.

                So there are two ways you could have gotten into this problem. One is that you are entering the commands into the Command window. The problem with that is that /// is not allowed in the Command window. This code was written and tested in a do-file. And I think here on Statalist wherever you are shown code, you should assume that it was written for use in a do-file, not in the Command window. That's certainly the case with my code, but I think it applies broadly on Statalist. Certainly any code containing /// is simply not runnable in the Command window. The other way you might have gotten into this problem is that you are running the code in a do-file, but you are running it one line at a time. That works in some cases and in others it does not. It does not work when, as here, some of the commands extend over more than one line. (It also fails when the code relies on the use of local macros--but there aren't any of those in this code, so if you don't know about local macros, don't worry about it at this time.)

                Either way, the code shown in #3 must be run from a do-file, not from the Command window. And if you want to run it one command at a time, you can do that, but for commands that are extended over more than one line, you must run all of those lines together.

                Code:
                . format duration %5.4f
                variable duration not found
                r(111);
                This error is a "cascade" from the preceding problem. Because the command creating variable duration triggered an error message and was not actually run, variable duration does not exist, so Stata is unhappy. But when you fix that problem, variable duration will be there and Stata will be happy to format it for you.

                Finally,
                Code:
                . datatext
                command datatext is unrecognized
                r(199);
                Well, it speaks for itself. There is no Stata command called -datatext-, nor, as far as I know, any user-written command of that name. I suspect that this is a typo and you meant -dataex-.



                Comment

                Working...
                X