  • Float vs. double - data precision

    Dear all,

    I have two variables in my dataset; call them xvar and yvar:

    Code:
    xvar    float   %9.0g
    yvar    double  %12.0g
    I use the command

    Code:
    list xvar yvar if xvar < yvar & yvar != .
    It listed the following observations in the log window:
    Code:
    xvar  yvar
    5.1 5.1
    To the naked eye, they are the same value. I need to delete the observations that meet the condition

    Code:
    if xvar < yvar & yvar != .
    My questions are:
    1. Should I change the variables to the same type?
    2. Should I use double or float?
    Thanks,

    Rochelle



  • #2
    The output of help precision explains what is going on and how to avoid the problem you have encountered. Note particularly the first example, a version of which is displayed below. The lesson is that the way data is displayed is not necessarily an exact representation of the value stored, so what "to the naked eye" appears to be the same need not be so.

    Code:
    . set obs 5
    number of observations (_N) was 0, now 5
    
    . generate x = 1.1
    
    . list
    
         +-----+
         |   x |
         |-----|
      1. | 1.1 |
      2. | 1.1 |
      3. | 1.1 |
      4. | 1.1 |
      5. | 1.1 |
         +-----+
    
    . count if x==1.1
      0
    
    .
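
    The count is 0 because x is stored as a float, while the literal 1.1 is evaluated in double precision. A minimal sketch of the remedy described in help precision, assuming the dataset created above is still in memory: round the literal to float precision with float() before comparing. With the five observations above, this should count all 5.

    Code:
    * x is a float; float() rounds the literal 1.1 to float precision first
    count if x == float(1.1)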



    • #3
      Well, as you've already identified, the problem is that the two variables offer different levels of precision. The decimal number 5.1 has no exact representation in binary (just as 1/3 has no exact representation in decimal), so when Stata creates variables with this value, it stores the closest number it can represent in binary within the precision of the storage type specified. You can see how that works out in this case by applying a display format that allows for many decimal places (which does nothing to change the internal representation of the number, just the number of digits displayed by the -list- command).

      Code:
      . clear*
      
      . set obs 1
      number of observations (_N) was 0, now 1
      
      . gen xvar = 5.1
      
      . gen double yvar = 5.1
      
      .
      . list if xvar < yvar & yvar != .
      
           +-------------+
           | xvar   yvar |
           |-------------|
        1. |  5.1    5.1 |
           +-------------+
      
      .
      . format xvar yvar %25.24f
      
      .
      . list
      
           +-----------------------------------------------------+
           |                     xvar                       yvar |
           |-----------------------------------------------------|
        1. | 5.099999904632568400e+00   5.099999999999999600e+00 |
           +-----------------------------------------------------+
      
      .
      . recast double xvar
      
      .
      . list
      
           +-----------------------------------------------------+
           |                     xvar                       yvar |
           |-----------------------------------------------------|
        1. | 5.099999904632568400e+00   5.099999999999999600e+00 |
           +-----------------------------------------------------+
      As you can see from the last two commands, -recasting- xvar as a double does not resolve the problem: when Stata adds the extra binary digits to what it already has, it does not know that xvar was originally an attempt to represent 5.1. It just extends with more zero digits what it already had.

      In general if you are comparing two variables, one a float, and the other a double, your comparison is only good out to float precision. So it is best to actually limit the comparison to that:

      Code:
      . assert float(xvar) == float(yvar)
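
      Applied to the original task of deleting observations, the same idea might look like the sketch below. It assumes, as in the question, that xvar is the float and yvar the double. Note one deliberate difference from the original condition: !missing(yvar) excludes extended missing values (.a, .b, ...) as well as ., whereas yvar != . excludes only the system missing value.

      Code:
      * compare at float precision so that values equal at float level
      * are not treated as xvar < yvar; skip observations with missing yvar
      drop if xvar < float(yvar) & !missing(yvar)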
      When creating new variables, the default storage type for Stata is -float-, and nearly all commands that create new variables will observe that default unless you specify otherwise. Float precision is good enough for nearly all practical purposes. There are very few measurements that can be made with greater than that amount of accuracy. There are really two circumstances where doubles are needed and should be used:

      1. To hold a Stata clock or Clock variable you must use a double or you will incur significant and easily noticeable errors.
      2. When a variable will be used in a long series of calculations, each of which adds a bit of truncation or rounding error, or where the calculation combines numerical values of greatly different magnitudes, then a double is needed to accurately hold the results. Most users of Stata will encounter this particular situation only infrequently; it is of importance primarily to people who write their own maximum likelihood estimation programs.
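
      Both circumstances can be illustrated with a small sketch. The timestamp, the number of observations, and the variable names below are arbitrary choices for illustration, not taken from the original question. In the first part, the float copy of the clock value should display a time that differs visibly from the one supplied; in the second, the float running sum should drift away from the double one.

      Code:
      clear
      set obs 100000

      * 1. A clock value is a count of milliseconds since 01jan1960,
      *    far too large for a float to hold exactly
      gen float  t_float  = clock("14feb2014 10:30:15", "DMYhms")
      gen double t_double = clock("14feb2014 10:30:15", "DMYhms")
      format t_float t_double %tc
      list t_float t_double in 1

      * 2. A running sum that re-reads its own stored value accumulates
      *    float rounding error at every step
      gen float  s_float  = 0.1
      gen double s_double = 0.1
      replace s_float  = s_float[_n-1]  + 0.1 in 2/L
      replace s_double = s_double[_n-1] + 0.1 in 2/L
      format s_float s_double %20.10f
      list s_float s_double in L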

      So my general recommendation is to stick with float in most situations. When you do comparisons between a float and a double, compare them at the float level by explicitly using float(variable_name). Bear in mind, though, that comparisons of exact equality will often fail if the two sides of the comparison are computed differently, even though they theoretically evaluate to the same number:

      Code:
      . assert 5.1 == 4.7 + 0.4
      assertion is false
      r(9);
      
      . assert float(5.1) == float(4.7) + float(0.4)
      assertion is false
      r(9);
      
      . assert float(5.1) == float(4.7 + 0.4)
      So, in general, exact comparisons of floating point numbers (whether floats or doubles) are not reliable.
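
      One common alternative to an exact test is to check that the two values agree within a tolerance. A minimal sketch follows; the tolerance 1e-8 is an arbitrary assumption and should be chosen to suit the scale and precision of the data at hand.

      Code:
      * absolute tolerance: the two sides differ only by rounding error
      assert abs(5.1 - (4.7 + 0.4)) < 1e-8

      * relative tolerance: reldif(x, y) is |x - y| / (|y| + 1)
      assert reldif(5.1, 4.7 + 0.4) < 1e-8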

      Added: Crossed with William Lisowski's response.



      • #4
        Thank you William and Clyde for being so helpful !

        Have a fantastic weekend !
