Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Excel import dates problem

    Dear All, the following Excel file contains a single cell with date January 23 2012 1:45



    The following code is used to import the data:
    Code:
    import excel http://www.radyakin.org/statalist/2014/20140421_2255_excel_date.xlsx, firstrow clear
    It is imported by Stata 13.0 and 13.1 (Jan 15 2014) (both Windows) as:

    Code:
    . list
    
         +----------------+
         |       received |
         |----------------|
      1. | 1/23/2012 1:44 |
         +----------------+
    
    . describe , full
    
    Contains data
      obs:             1                          
     vars:             1                          
     size:             8                          
    -----------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    -----------------------------------------------------------------------------------------------------
    received        double  %tc..                 received
    -----------------------------------------------------------------------------------------------------
    Sorted by:  
         Note:  dataset has changed since last saved
    Why 1:44 and not 1:45?

    Tried:
    • adding option allstring does not help.
    • the problem is same regardless of .xls or .xlsx extension.
    Considered:
    • The time zone should not be a problem since the difference is just 1 minute.
    • The leap seconds are also not 60 (so far).
    • whatsnew doesn't list relevant bug fixes for March update relative to January (installed, can't update).

    Thank you, Sergiy Radyakin

  • #2
    I get the same behaviour on an updated Stata 13.1 for Mac. Stat/Transfer version 12 converts it correctly.

    Comment


    • #3
      It seems to be off by one millisecond. If it's a bug, you can use ODBC until StataCorp puts out a fix.

      .version13.1

      .
      .clear*

      .setmoreoff

      .setlinesize79

      .
      .importexcel///
      >http://www.radyakin.org/statalist/2014/20140421_2255_excel_date.xlsx,
      >///
      >firstrowclear

      .
      .list,noobs

      +----------------+
      |received|
      |----------------|
      |1/23/20121:44|
      +----------------+

      .
      .formatreceived%tcCCYY-NN-DD_HH:MM:SS.sss

      .list,noobs

      +-------------------------+
      |received|
      |-------------------------|
      |2012-01-2313:44:59.999|
      +-------------------------+

      .
      .formatreceived%tCCCYY-NN-DD_HH:MM:SS.sss

      .list,noobs

      +-------------------------+
      |received|
      |-------------------------|
      |2012-01-2313:44:35.999|
      +-------------------------+

      .
      .
      .tempfiletmpfil0

      .copyhttp://www.radyakin.org/statalist/2014/20140421_2255_excel_date.xlsx///
      >`tmpfil0'

      .
      .localdsnExcelFiles;DBQ=`tmpfil0';

      .odbcexec("SELECT*FROM[Sheet1$];"),dsn("`dsn'")

      1.+----------------------------------------------------------------+
      |received|2012-01-2313:45:00|
      +----------------------------------------------------------------+



      .
      .odbcload,table(Sheet1$)dsn("`dsn'")clear

      .describe

      Containsdata
      obs:1
      vars:1
      size:8
      -------------------------------------------------------------------------------
      storagedisplayvalue
      variablenametypeformatlabelvariablelabel
      -------------------------------------------------------------------------------
      receiveddouble%tc
      -------------------------------------------------------------------------------
      Sortedby:
      Note:datasethaschangedsincelastsaved

      .list,noobs

      +--------------------+
      |received|
      |--------------------|
      |23jan201213:45:00|
      +--------------------+

      .
      .formatreceived%tcCCYY-NN-DD_HH:MM:SS.sss

      .list,noobs

      +-------------------------+
      |received|
      |-------------------------|
      |2012-01-2313:45:00.000|
      +-------------------------+

      .
      .formatreceived%tCCCYY-NN-DD_HH:MM:SS.sss

      .list,noobs

      +-------------------------+
      |received|
      |-------------------------|
      |2012-01-2313:44:36.000|
      +-------------------------+

      .
      .exit

      endofdo-file

      Comment


      • #4
        Dear Joseph, thank you very much for the explanation. It seems the problem is related to precision since it occurs for some values and not the other, and does not accumulate over time (as it would be if the problem was e.g. leap seconds). Here are a few data points (each value should be an exact hour):

        Code:
                              received  
          1.   2012-01-23 00:00:00.000  
          2.   2012-01-23 00:59:59.999  
          3.   2012-01-23 02:00:00.000  
          4.   2012-01-23 03:00:00.000  
          5.   2012-01-23 03:59:59.999  
          6.   2012-01-23 05:00:00.000  
          7.   2012-01-23 06:00:00.000  
          8.   2012-01-23 06:59:59.999  
          9.   2012-01-23 08:00:00.000  
         10.   2012-01-23 09:00:00.000  
         11.   2012-01-23 09:59:59.999  
         12.   2012-01-23 11:00:00.000  
         13.   2012-01-23 12:00:00.000  
         14.   2012-01-23 12:59:59.999  
         15.   2012-01-23 14:00:00.000  
         16.   2012-01-23 15:00:00.000  
         17.   2012-01-23 15:59:59.999  
         18.   2012-01-23 17:00:00.000
        So it seems that the problem is caused by the value being stored actually as e.g. 15:59:59.999, but then Excel rounds it to nearest when displayed, but Stata rounds it to lower value - hence visible difference.

        This discussion is relevant. And this description is too. However I couldn't find a reliable source, which could confirm that Excel rounds the values to the nearest, while Stata rounds them downwards.

        Best, Sergiy Radyakin

        Comment


        • #5
          Sergiy, It does seem to be a precision issue. Here's what I think is going on.

          Despite storing the values in double-precision floating point, Stata stores time as integer milliseconds, which can always be represented precisely, so there's no problem there.

          Excel stores date-time values in double-precision floating point arithmetic, too. But, in contrast, it splits the day and time on either side of a decimal point, and keeps the time component of date-time values as decimal fractions (of a day). (See the column probe_nr in the attached Microsoft® Excel® workbook; I've listed it below, as well.) Excel is then forced to round during preparation of the text to display in the cell in order to compensate for the inherent imprecision in the underlying double-precision value.

          Stata doesn't need to round for internally created time values, but when it encounters a fractional millisecond imported from outside, it appears to treat time in the conventional manner, namely, by truncating to the most recent value. You can think of this more familiarly in the manner how we declare someone's age: if you're just one day shy of your fortieth birthday, then you're still only thirty-nine years old.

          I'm not so sure that I would call that a bug that needs to be fixed, but, in any event, if you don't want to go the ODBC route (which appears to have the conversion built-in), then you can always round the date-time values to the nearest integer after importing them. See below for an example that uses import excel with the attached Excel file.

          .version13.1

          .
          .clear*

          .setmoreoff

          .importexcel"F:\DateProbes.xlsx",sheet("DateProbes")firstrow

          .
          .dropprobe_hr

          .formatprobe_dt%tcCCYY-NN-DD_HH:MM:SS.sss

          .
          .clonevarprobe_dp=probe_dt

          .formatprobe_dp%18.4f

          .
          .list,noobsseparator(0)

          +-----------------------------------------------------------------+
          |probe_dtprobe_nrprobe_dp|
          |-----------------------------------------------------------------|
          |2013-04-2300:59:59.99941387.04166666671682297999999.9998|
          |2013-04-2302:00:00.00041387.08333333331682301600000.0002|
          |2013-04-2303:00:00.00041387.1251682305200000.0000|
          |2013-04-2303:59:59.99941387.16666666671682308799999.9998|
          |2013-04-2305:00:00.00041387.20833333331682312400000.0002|
          |2013-04-2306:00:00.00041387.251682316000000.0000|
          |2013-04-2306:59:59.99941387.29166666671682319599999.9998|
          |2013-04-2308:00:00.00041387.33333333331682323200000.0002|
          |2013-04-2309:00:00.00041387.3751682326800000.0000|
          |2013-04-2309:59:59.99941387.41666666671682330399999.9998|
          |2013-04-2311:00:00.00041387.45833333331682334000000.0002|
          |2013-04-2312:00:00.00041387.51682337600000.0000|
          |2013-04-2312:59:59.99941387.54166666671682341199999.9998|
          |2013-04-2314:00:00.00041387.58333333331682344800000.0002|
          |2013-04-2315:00:00.00041387.6251682348400000.0000|
          |2013-04-2315:59:59.99941387.66666666671682351999999.9998|
          |2013-04-2317:00:00.00041387.70833333331682355600000.0002|
          +-----------------------------------------------------------------+

          .
          .quietlyreplaceprobe_dt=round(probe_dt,1)

          .listprobe_dt,noobsseparator(0)

          +-------------------------+
          |probe_dt|
          |-------------------------|
          |2013-04-2301:00:00.000|
          |2013-04-2302:00:00.000|
          |2013-04-2303:00:00.000|
          |2013-04-2304:00:00.000|
          |2013-04-2305:00:00.000|
          |2013-04-2306:00:00.000|
          |2013-04-2307:00:00.000|
          |2013-04-2308:00:00.000|
          |2013-04-2309:00:00.000|
          |2013-04-2310:00:00.000|
          |2013-04-2311:00:00.000|
          |2013-04-2312:00:00.000|
          |2013-04-2313:00:00.000|
          |2013-04-2314:00:00.000|
          |2013-04-2315:00:00.000|
          |2013-04-2316:00:00.000|
          |2013-04-2317:00:00.000|
          +-------------------------+

          .
          .exit

          endofdo-file
          Attached Files

          Comment


          • #6
            The difference in the data conversions is caused by how Microsoft Excel displays datetime values. Excel rounds datetime values to the nearest millisecond. Excel stores dates and times as a floating-point number representing the number of days since 1900-Jan-0, plus a fractional portion of a 24 hour day:

            Code:
                  ddddddd.tttttt
            The fraction portion of the number, 0.tttttt, represents the fractional portion of a 24 hour day. For example, 12:00 PM is stored as 0.5, or 50% of a 24 hour day. The formula for computing hours : minutes : seconds : milliseconds(1/1000 of a second) parts of the time are:

            Code:
                  hours        = MOD(INT(0.tttttttt*24), 24)
                  minutes      = MOD(INT(0.tttttttt*24*60), 60)
                  seconds      = MOD(INT(0.tttttttt*24*60*60), 60)
                  milliseconds = MOD(INT(0.tttttttt*24*60*60*1000), 1000)
            The remaining part is a fraction of a millisecond. We can compute it up to two decimal places using the below code

            Code:
                  remain       = MOD(INT(0.tttttttt*24*60*60*1000*100), 100)/100
            Now, let's look at an example of this problem in Excel and Stata. Download the attached file Date Conversion Problem.xlsx to your machine. First, open Date Conversion Problem.xlsx in Excel and look at cell B2, which is the first datetime cell in the Out Time column. The cell displays 3/19/2006 10:53.

            Next, import the same file into Stata. Use the below syntax to import the data.

            Code:
                  import excel using "Date Conversion Problem.xlsx", clear firstrow
            If you look at the same cell in Stata's Data Editor you will see the value 3/19/2006 10:52 displayed. There appears to be a minute difference in the two datetime values.

            Let's take a closer look at the data. If we change the format of cell B2 in Excel to m/d/yyyy h:mm:ss.000" (i.e., look at the time at full resolution up to millisecond) cell B2 displays 3/19/2006 10:53:00.000. If we change the format of variable OutTime in Stata to %tcnn/dd/ccYY_hh:MM.SS.sss, the corresponding cell displays 3/19/2006 10:52.59.999. The difference between the two datetime values is less then 1 millisecond instead of the preceived 1 minute.

            Losing 1 millisecond seems like a lot of precision lost when all computations are carried out in double precision. Let's investigate a bit further.

            If we enter the formula for

            Code:
                  hours : minutes : seconds : milliseconds : remain
            in Excel, enter cells C2/D2/E2/F2/G2 as

            Code:
                  C2 =MOD(INT((B2-INT(B2))*24), 24),                <--hours
                  D2 =MOD(INT((B2-INT(B2))*24*60), 60),             <--minutes
                  E2 =MOD(INT((B2-INT(B2))*24*60*60), 60)           <--seconds
                  F2 =MOD(INT((B2-INT(B2))*24*60*60*1000), 1000)    <--milliseconds
                  G2 =MOD(INT((B2-INT(B2))*24*60*60*1000*100), 100)/100 <--remain
            we get.

            Code:
                  C2 : 10
                  D2 : 52
                  E2 : 59
                  F2 : 999
                  G2 : 0.99
            The number in cell G2 matches the number in Stata.

            Code:
                  10 hours : 52 minutes : 59 seconds : 999 milliseconds : 0.99 remain
            Microsoft Excel decides to display datetime value rounded to the nearest
            millisecond; cell B2's true value is

            Code:
                  3/19/2006 10:52:59.999 plus 0.99 milliseconds,
            The millisecond field is rounded, which causes carry over effect all the way to the minute field. On the other hand, Stata does not round the time to display it, hence the difference.

            Stata currently always truncates datetime information if asked to display it at a lower resolution. For example, if you tell Stata to display a datetime which is really 3/19/2006 10:52:45.789 to only minute resolution, Stata won't round the 45.789 seconds up to another whole minute. You haven't really reached the 53rd minute in 10am, so Stata will just display 3/19/2006 10:52. Likewise, even if you have the value 3/19/2006 10:52:59.999 and ask Stata to display it at minute resolution, you wil still see 3/19/2006 10:52. This is particularly important at higher time boundaries, such as days, months, and years. Consider the datetime value 12/31/2006 23:59:59.999. It is important that Stata truncate, not round, no matter what the requested resolution is. You are still in the 31st day of December, 2006, in the 23rd hour, 59th minute, and 59th second, and that is what Stata will tell you no matter what time unit you request be displayed. Stata won't round up.

            Since all data precision is retained after importing, Stata users can easily match Excel displayed datetimes up to minute level by rounding the value.

            Code:
                  replace OutTime = round(OutTime, 1000)
            You can also round the value to the second level:

            Code:
                  replace OutTime = round(OutTime, 1)
            Attached Files

            Comment


            • #7
              Dear Joseph and Kevin, thank you very much for the responses. It is all clear for me now. To put you in context - I was showing "the new and powerful Excel import command in Stata" to a user and couldn't immediately combat the comment "oh, but the number in Stata is different from what I typed in Excel".

              The surprising part of it is that Stata does not do the suggested round(OutTime, 1000). This would be straightforward and safe, since Excel does not allow fractions of milliseconds, and since Stata knows it imports an Excel file, it should not attempt to interpret fractional parts of milliseconds as meaningful data, but safely round to nearest millisecond at import-time.

              I write 'safely' since I still can't come up with an example when this rounding would distort the data.

              An attempt to reveal the fractional parts of the milliseconds in Excel didn't succeed (000 is valid, 0000 is not):



              From this and a few web pages I conclude (hopefully correctly) that the user of Excel can never input time with precision of beyond 1 millisecond and can never see time with precision beyond 1 millisecond.

              Other programs working with Excel files seem to conform to the same rule as Excel itself, knowing the limitation of the format, round the number to nearest, not downwards. And this is likely what ODBC driver is doing as well.

              At the end of the day we might know what is happening, but wouldn't it be better to just round up the number instead?

              Thank you, Sergiy Radyakin

              Comment


              • #8
                Just out of curiosity, does anbody know how Stat Transfer handles this situation?

                Comment


                • #9
                  StatTransfer Twelve handles it as expected - correctly. In my example:

                  I have forced double storage type in transfer parameters, but even without forcing the variable is created as double.

                  Best, Sergiy

                  Comment


                  • #10
                    Thanks.

                    Comment


                    • #11
                      The fractional milliseconds do make a difference when adding or subtracting datetime values. Also, Stata reads and writes data to Excel so writing a value out to Excel and then reading a different value back into Stata is not a good idea.

                      Comment

                      Working...
                      X