  • Not a number

    Hi,

    While trying to import a CSV dataset with Covid data for France, I realized there are numeric values in the data editor that are displayed as "1.#QNAN". This is a quiet NaN, a "not a number" value, prescribed by the IEEE 754 standard for floating-point numbers. If I remember correctly, NaNs are encoded in double precision using a bit pattern that cannot occur for "normal" numbers (unlike Stata, which uses "regular" large values to encode missing values).

    Here is an example CSV which will show the problem I now face:

    Code:
    id,n
    1,10
    2,.
    3,NaN
    Now:

    Code:
    import delim nan.csv, clear
    list in 3
    di n[3]
    sca x=n[3]
    di x
    • -list- prints 1.#QNAN as expected.
    • -di n[3]- prints . (that is, a missing).
    • The scalar is also a missing.

    So a NaN is converted to a missing. But, if I do -list if mi(n)-, I get only the second row. So a NaN is not considered a missing!
    Then, how am I supposed to filter on NaN values in the dataset? (for instance to replace them, maybe by... a missing)

    In R, there is the -is.nan- function, and in C there is -isnan-, but in Stata, I don't know any equivalent. Is there a way to check whether a variable is NaN in an -if- clause?

    Afterthought
    There is a way: list if mi(n+0) & !mi(n). But it's not clean. Anything better?
    Last edited by Jean-Claude Arbaut; 26 Aug 2020, 05:02.

  • #2
    I am disappointed to say that I am not able to reproduce your problem on my system.
    Code:
    . about
    
    Stata/SE 16.1 for Mac (64-bit Intel)
    Revision 13 Aug 2020
    I think my experience is only indicative of a failure to reproduce your problem. Disappointing, because I think this is an interesting puzzle.

    Below is what was intended as a self-contained reproducible example of an approach I wanted to explore, which started by generating a CSV apparently identical to the one you showed in post #1. But even when I replaced my generated CSV with a copy-and-paste from post #1, the results of the subsequent code were unchanged.
    Code:
    input str20 line
    "id,n"
    "1,10"
    "2,."
    "3,NaN"
    end
    file open csv using foo.csv, write text replace
    forvalues i=1/4 {
        file write csv (line[`i']) _newline
    }
    file close csv
    type foo.csv
    
    import delimited foo.csv, clear
    generate m1 = missing(n)
    generate m2 = n==.
    generate m3 = n>.
    generate sn = strofreal(n)
    list
    Code:
    . type foo.csv
    id,n
    1,10
    2,.
    3,NaN
    
    . 
    . import delimited foo.csv, clear
    (2 vars, 3 obs)
    
    . generate m1 = missing(n)
    
    . generate m2 = n==.
    
    . generate m3 = n>.
    
    . generate sn = strofreal(n)
    
    . list
    
         +-----------------------------+
         | id    n   m1   m2   m3   sn |
         |-----------------------------|
      1. |  1   10    0    0    0   10 |
      2. |  2    .    1    1    0    . |
      3. |  3    .    1    1    0    . |
         +-----------------------------+

    • #3
      Let me add something I came across online. In the context of C++

      Any comparison operation (==, <= etc) on a NAN returns false, even comparing its equality to itself, except for != which always returns true (even when comparing to itself).
      Perhaps
      Code:
      list if n!=n
      would find your NaNs.

      • #4
        Jean-Claude Arbaut, what is the output of

        Code:
        about
        query compilenumber
        in your Stata? I cannot reproduce your -import delimited- issue in Stata 16.1 MP. Also, would you please post nan.csv as an attachment instead of copying and pasting it?

        In terms of catching NaN after the fact, William Lisowski's second suggestion is good, i.e.,

        Code:
        list if n!=n

        • #5
          Hmm this is odd. I am able to reproduce Jean-Claude's problem, using a copy paste method into a CSV file, and the foo.csv method by William in #2.

          Code:
          . about
          Stata/MP 16.1 for Windows (64-bit x86-64)
          Revision 13 Aug 2020
          
          . query compilenumber
          Compile number 842
          Code:
           
          import delimited foo.csv, clear
          generate m1 = missing(n)
          generate m2 = n==.
          generate m3 = n>.
          generate m4 = n!=n
          generate sn = strofreal(n)
          list
          Code:
          . import delimited foo.csv, clear
          (2 vars, 3 obs)
          
          . generate m1 = missing(n)
          
          . generate m2 = n==.
          
          . generate m3 = n>.
          
          . generate m4 = n!=n
          
          . generate sn = strofreal(n)
          
          . list
          
               +--------------------------------------------+
               | id         n   m1   m2   m3   m4        sn |
               |--------------------------------------------|
            1. |  1        10    0    0    0    0        10 |
            2. |  2         .    1    1    0    0         . |
            3. |  3   1.#QNAN    0    1    0    0   1.#QNAN |
               +--------------------------------------------+

          • #6
            Leonardo Guizzetti, that's interesting. My Stata version/compilenumber/OS is exactly the same as yours, yet I am not able to reproduce it. The issue is entirely possible after looking at the code; I am just baffled by the nondeterminism of the behavior. Also, try the following to see if it catches the NaN after the fact.

            Code:
             generate m5 = (n==n)

            • #7
              Yes, this error is peculiar. Your condition is always true in the same dataset.

              Code:
                   +-------------------+
                   | id         n   m5 |
                   |-------------------|
                1. |  1        10    1 |
                2. |  2         .    1 |
                3. |  3   1.#QNAN    1 |
                   +-------------------+

              • #8
                Also worth noting, I think: this problem seems new in version 16, as the variable -n- gets imported as a string variable under version 15. That behaviour does not appear to be preserved under version control in Stata 16.

                • #9
                  I figured it out: the behavior depends on what the default storage type is.

                  Code:
                  set type double
                  enables us to reproduce the issue. Neither test, (n==n) nor (n!=n), will work, since Stata's expression handling has checks that prevent NaN from being passed down to the evaluation stage. Currently, the following works because the missing() function passes the NaN unchanged to the internals, which causes the contradiction we can use to flag the NaN.

                  Code:
                  list if missing(n)==0 & (n >= .)

                  We will look into this on the -import delimited- side. We will also consider adding an isnan() function (this will most likely happen). Although Stata should not produce NaN in any of its own functions and routines, there is no way to prevent third-party software from producing Stata datasets containing NaN. We should be able to deal with them.
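
                  As a rough sketch of how the condition above might be used to clean a dataset (untested beyond what this thread reports for Stata 16.1 with -set type double-; the variable name isnan is only illustrative):

                  Code:
                  * flag observations where missing() and the comparison with . disagree
                  generate byte isnan = missing(n)==0 & (n >= .)
                  * overwrite the flagged values with the system missing value
                  replace n = . if isnan
                  * if the replacement worked, this should count zero observations
                  count if missing(n)==0 & (n >= .)

                  This relies on the current inconsistency between missing() and the relational operators, so it may stop flagging anything once that behavior changes.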
                  Last edited by Hua Peng (StataCorp); 26 Aug 2020, 10:13.

                  • #10
                    Added in edit: Crossed with post #9.

                    Added in edit after reading post #9. I agree entirely with the sentiment in the final two sentences. We've seen other cases on Statalist where programs other than Stata produced "Stata" datasets that weren't *really* Stata datasets in some important ways.

                    =======

                    I note that the suggestion from post #6 that was tested in post #7 wasn't the recommendation from post #4. Nevertheless, the result for m5 in post #7 conflicts with the expected results quoted in post #3, so I wouldn't expect the post #4 recommendation to identify the NaN.

                    It does appear from post #5 that the nonsensical-appearing
                    Code:
                    generate m6 = n==. & !missing(n)
                    would identify the NaN, based on the performance of m1 and m2 in post #5.

                    Another approach would be
                    Code:
                    generate m = strofreal(n)=="1.#QNAN"
                    which has the possible advantage of being based on the interpretation by strofreal() of the binary value as being that known as 1.#QNAN rather than being based on nonsensical-appearing Stata logic that could in theory change in the future. But then, it appears that NaN representations as text vary, per the Wikipedia article at https://en.wikipedia.org/wiki/NaN .

                    I'm just sorry that my Stata won't let me play along on this by creating NaN values in import delimited.
                    Last edited by William Lisowski; 26 Aug 2020, 10:24.

                    • #11
                      Updating my example from post #2 to include set type double results in almost reproducing the results suggested by post #1.
                      Code:
                      . set type double
                      
                      . import delimited foo.csv, clear
                      (2 vars, 3 obs)
                      
                      . generate m1 = missing(n)
                      
                      . generate m2 = n==.
                      
                      . generate m3 = n>.
                      
                      . generate sn = strofreal(n,"%8.0f")
                      
                      . format %8.0f n
                      
                      . list
                      
                           +-------------------------------+
                           | id     n   m1   m2   m3    sn |
                           |-------------------------------|
                        1. |  1    10    0    0    0    10 |
                        2. |  2     .    1    1    0     . |
                        3. |  3   nan    0    1    0   nan |
                           +-------------------------------+
                      
                      . describe
                      
                      Contains data
                        obs:             3                          
                       vars:             6                          
                      -----------------------------------------------------------------
                                    storage   display    value
                      variable name   type    format     label      variable label
                      -----------------------------------------------------------------
                      id              byte    %8.0g                
                      n               double  %8.0f                
                      m1              double  %10.0g                
                      m2              double  %10.0g                
                      m3              double  %10.0g                
                      sn              str3    %9s                  
                      -----------------------------------------------------------------
                      I note that omitting the %8.0f formatting of n results in it being presented as a blank rather than nan. I couldn't quickly find a format that would yield 1.#QNAN; perhaps the NaN-to-text conversion depends on OS-level code.

                      Added in edit: crossed with #12 below, which confirms that string representation of NaN depends on the operating system.
                      Last edited by William Lisowski; 26 Aug 2020, 10:50.

                      • #12
                        The approach of

                        Code:
                         generate m = strofreal(n)=="1.#QNAN"
                        will not be portable, since NaN has different string representations on Linux console, Linux GUI/Mac, and Windows. It can depend on the language as well. See the "Display" section in https://en.wikipedia.org/wiki/NaN#Display
                        Last edited by Hua Peng (StataCorp); 26 Aug 2020, 10:40.

                        • #13
                          First I want to acknowledge that this is a bug in -import delimited-.

                          Now I will shed some light on what -import delimited- is doing and why the behavior is different in Stata 16 vs Stata 15. As some of you might already know, -import delimited- is written in Java. The code was modified for Stata 16 to use Java's Double.parseDouble() to parse numeric values. The reason for the change was purely to increase performance. A side effect of this change was that Double.parseDouble() understands "NaN", "+NaN", and "-NaN". Unfortunately, we did not recognize that at the time, and NaN was stored directly in the Stata dataset instead of being translated to a Stata missing value.

                          We will get this fixed in an upcoming update, so that "NaN", "+NaN", and "-NaN" will be stored as a Stata missing value.
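
                          Until the update is available, one possible workaround is to read the affected column as a string and convert it explicitly. This is only a sketch: it assumes the file and column layout of nan.csv from post #1 (n is the second column), and the name n_num is illustrative.

                          Code:
                          * read column 2 as a string so that no NaN can be stored
                          import delimited nan.csv, clear stringcols(2)
                          * convert to numeric; the NaN spellings are skipped and left missing
                          generate double n_num = real(n) if !inlist(strlower(strtrim(n)), "nan", "+nan", "-nan")
                          drop n
                          rename n_num n

                          Values that real() cannot parse, such as ".", also end up as missing.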

                          • #14
                            James Hassell (StataCorp) Thank you very much for the clarification!

                            I already knew about -import delimited- using Java, but actually it's almost explicit in the documentation:

                            encoding(encoding) specifies the encoding of the text file to be read. The default is encoding("latin1"). Specify
                            encoding("utf-8") for files to be encoded in UTF-8. import delimited uses Java encoding. A list of available encodings
                            can be found at https://docs.oracle.com/javase/8/doc...oding.doc.html.

                            Hua Peng (StataCorp) Sorry to answer a bit late. I think it doesn't matter any longer, but for the record, here is what I get:

                            Code:
                            . about
                            
                            Stata/SE 16.1 for Windows (64-bit x86-64)
                            Revision 13 Aug 2020
                            Copyright 1985-2019 StataCorp LLC
                            
                            Total physical memory: 16.00 GB
                            Available physical memory: 5.06 GB
                            
                            Stata license: Single-user perpetual
                            Serial number: xxxxxxxxxxxx
                            Licensed to: Jean-Claude Arbaut
                            Home
                            
                            .
                            . query compilenumber
                            Compile number 842
                            And yes, I always -set type double- in my profile, to avoid any precision loss, but I would never have thought about trying without it.
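
                            For anyone who wants to keep -set type double- in a profile, a sketch of switching the default type only around the import, based on what posts #2 and #9 report. It has not been checked that this avoids the NaN in every case, and the imported variable may then be stored as float, with the precision loss that implies.

                            Code:
                            * remember the current default storage type
                            local oldtype = c(type)
                            set type float
                            import delimited nan.csv, clear
                            * restore the previous default
                            set type `oldtype'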

                            Many thanks to all contributors.

                            Jean-Claude Arbaut
                            Last edited by Jean-Claude Arbaut; 26 Aug 2020, 12:45.

                            • #15
                              A similar bug in import sas is described at

                              https://www.statalist.org/forums/for...-in-import-sas
