Not a number

Jean-Claude Arbaut

Join Date: Jul 2017

Posts: 209
#1

26 Aug 2020, 04:10

Hi,

While trying to import a CSV dataset with Covid data for France, I realized that some numeric values are displayed in the data editor as "1.#QNAN". This is a quiet NaN, a "not a number" value prescribed by the IEEE 754 standard for floating-point numbers. If I remember correctly, NaNs are encoded in double precision using bit patterns that cannot represent "normal" numbers (unlike Stata, which uses "regular" large values to encode missings).
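To make the encoding point concrete, here is a minimal Java sketch (Java rather than Stata, only because the bit patterns are easy to inspect there). It prints the raw 64-bit pattern of a quiet NaN next to that of 2^1023. The class name NanBits and its helpers are made up for this illustration, and treating 2^1023 as Stata's system missing "." for doubles is an assumption based on my reading of the Stata data types documentation, not something established in this thread.

```java
class NanBits {
    // Raw 64-bit pattern of a double, as a 16-digit hex string.
    static String hexBits(double x) {
        return String.format("%016x", Double.doubleToRawLongBits(x));
    }

    // IEEE 754 NaN test on the raw bits:
    // all 11 exponent bits set and a nonzero fraction field.
    static boolean isNaNBits(double x) {
        long bits = Double.doubleToRawLongBits(x);
        long exponent = (bits >>> 52) & 0x7ffL;
        long fraction = bits & 0x000fffffffffffffL;
        return exponent == 0x7ffL && fraction != 0L;
    }

    public static void main(String[] args) {
        double quietNaN = Double.NaN;
        // Assumption: Stata's system missing "." for doubles is 2^1023,
        // an ordinary (very large) finite double.
        double assumedStataMissing = Math.scalb(1.0, 1023);

        System.out.println(hexBits(quietNaN));
        System.out.println(hexBits(assumedStataMissing));
        System.out.println(isNaNBits(quietNaN));
        System.out.println(isNaNBits(assumedStataMissing));
    }
}
```

The NaN pattern has every exponent bit set, which no finite number uses, whereas the large value is a regular finite double with the largest normal exponent.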

Here is an example CSV which will show the problem I now face:

Code:

id,n
1,10
2,.
3,NaN

Now:

Code:

import delim nan.csv, clear
list in 3
di n[3]
sca x=n[3]
di x

-list- prints 1.#QNAN as expected.

-di n[3]- prints . (that is, a missing).

The scalar is also a missing.

So a NaN is converted to a missing. But, if I do -list if mi(n)-, I get only the second row. So a NaN is not considered a missing!
Then, how am I supposed to filter on NaN values in the dataset? (for instance to replace them, maybe by... a missing)

In R, there is the -is.nan- function, and in C there is -isnan-, but in Stata, I don't know any equivalent. Is there a way to check whether a variable is NaN in an -if- clause?

Afterthought
There is a way: list if mi(n+0) & !mi(n). But it's not clean. Anything better?

Last edited by Jean-Claude Arbaut; 26 Aug 2020, 05:02.

1 like

William Lisowski

Join Date: Dec 2014
Posts: 10150

26 Aug 2020, 07:16

I am disappointed to say that I am not able to reproduce your problem on my system.

Code:

. about

Stata/SE 16.1 for Mac (64-bit Intel)
Revision 13 Aug 2020

I think my experience is only indicative of a failure to reproduce your problem. Disappointing, because I think this is an interesting puzzle.

Below is what was intended as a self-contained reproducible example of an approach I wanted to explore, which started by generating a CSV apparently identical to the one you showed in post #1. But even when I replaced my generated CSV with a copy-and-paste from post #1, the results of the subsequent code were unchanged.

Code:

input str20 line
"id,n"
"1,10"
"2,."
"3,NaN"
end
file open csv using foo.csv, write text replace
forvalues i=1/4 {
    file write csv (line[`i']) _newline
}
file close csv
type foo.csv

import delimited foo.csv, clear
generate m1 = missing(n)
generate m2 = n==.
generate m3 = n>.
generate sn = strofreal(n)
list

Code:

. type foo.csv
id,n
1,10
2,.
3,NaN

. 
. import delimited foo.csv, clear
(2 vars, 3 obs)

. generate m1 = missing(n)

. generate m2 = n==.

. generate m3 = n>.

. generate sn = strofreal(n)

. list

     +-----------------------------+
     | id    n   m1   m2   m3   sn |
     |-----------------------------|
  1. |  1   10    0    0    0   10 |
  2. |  2    .    1    1    0    . |
  3. |  3    .    1    1    0    . |
     +-----------------------------+

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

26 Aug 2020, 07:26

Let me add something I came across online, in the context of C++:

Any comparison operation (==, <= etc) on a NAN returns false, even comparing its equality to itself, except for != which always returns true (even when comparing to itself).

Perhaps

Code:

list if n!=n

would find your NaNs.
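For reference, the comparison rules quoted above can be checked in any language that follows IEEE 754 directly. Below is a small Java sketch (the class name NanCompare and the helper isNaNBySelfCompare are invented for this illustration). It only demonstrates the IEEE 754 behaviour in Java; whether Stata's expression evaluator follows the same rules is a separate question that this sketch cannot answer.

```java
class NanCompare {
    // Self-inequality test: under IEEE 754 comparison rules,
    // x != x is true only when x is NaN.
    static boolean isNaNBySelfCompare(double x) {
        return x != x;
    }

    public static void main(String[] args) {
        double nan = Double.NaN;
        System.out.println(nan == nan);                // false
        System.out.println(nan != nan);                // true
        System.out.println(nan < 1.0);                 // false
        System.out.println(nan >= 1.0);                // false
        System.out.println(isNaNBySelfCompare(nan));   // true
        System.out.println(isNaNBySelfCompare(10.0));  // false
        System.out.println(Double.isNaN(nan));         // true
    }
}
```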
Comment
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#4

26 Aug 2020, 07:50

Jean-Claude Arbaut, what is the output of

Code:

about
query compilenumber

in your Stata? I cannot reproduce your -import delimited- issue in Stata 16.1 MP. Also, would you please post nan.csv as an attachment instead of copy/paste?

In terms of catching NaN after the fact, William Lisowski's second suggestion is good, i.e.,

Code:

list if n!=n
1 like
Comment

Leonardo Guizzetti

Join Date: Jul 2016
Posts: 2402

26 Aug 2020, 08:24

Hmm, this is odd. I am able to reproduce Jean-Claude's problem, using both a copy-and-paste into a CSV file and the foo.csv method by William in #2.

Code:

. about
Stata/MP 16.1 for Windows (64-bit x86-64)
Revision 13 Aug 2020

. query compilenumber
Compile number 842

Code:

 
import delimited foo.csv, clear
generate m1 = missing(n)
generate m2 = n==.
generate m3 = n>.
generate m4 = n!=n
generate sn = strofreal(n)
list

Code:

. import delimited foo.csv, clear
(2 vars, 3 obs)

. generate m1 = missing(n)

. generate m2 = n==.

. generate m3 = n>.

. generate m4 = n!=n

. generate sn = strofreal(n)

. list

     +--------------------------------------------+
     | id         n   m1   m2   m3   m4        sn |
     |--------------------------------------------|
  1. |  1        10    0    0    0    0        10 |
  2. |  2         .    1    1    0    0         . |
  3. |  3   1.#QNAN    0    1    0    0   1.#QNAN |
     +--------------------------------------------+

Comment

Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#6

26 Aug 2020, 08:54

Leonardo Guizzetti, that's interesting. My Stata version/compilenumber/OS is exactly the same as yours, yet I am not able to reproduce. Having looked at the code, the issue is entirely possible; I am just baffled by the nondeterminism of the behavior. Also, try the following to see if it catches the NaN after the fact.

Code:

generate m5 = (n==n)
Comment

Leonardo Guizzetti

Join Date: Jul 2016
Posts: 2402

26 Aug 2020, 09:10

Yes, this error is peculiar. Your condition is always true in the same dataset.

Code:

     +-------------------+
     | id         n   m5 |
     |-------------------|
  1. |  1        10    1 |
  2. |  2         .    1 |
  3. |  3   1.#QNAN    1 |
     +-------------------+

Comment

Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#8

26 Aug 2020, 09:14

Also worth noting, I think: this problem seems new in version 16, as the variable -n- gets imported as a string variable under version 15. That behaviour is apparently not preserved under version control in Stata 16.
1 like
Comment
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#9

26 Aug 2020, 09:46

I figured it out: the behavior depends on what the default storage type is.

Code:

set type double

enables us to reproduce the issue. Neither the (n==n) test nor the (n!=n) test will work, since Stata's expression handling has checks that prevent NaN from being passed down to the evaluation stage. Currently, the following works because the missing() function passes the NaN unchanged to the internals, which produces a contradiction we can use to flag the NaN.

Code:

list if missing(n)==0 & (n >= .)

We will look into this on the -import delimited- side. And we will consider (most likely will happen) adding an isnan() function. Although Stata should not produce NaN in any of its own functions and routines, there is no way to prevent third-party software from producing Stata datasets containing NaN. We should be able to deal with them.

Last edited by Hua Peng (StataCorp); 26 Aug 2020, 10:13.
3 likes
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#10

26 Aug 2020, 10:20

Added in edit: Crossed with post #9.

Added in edit after reading post #9. I agree entirely with the sentiment in the final two sentences. We've seen other cases on Statalist where programs other than Stata produced "Stata" datasets that weren't *really* Stata datasets in some important ways.

=======

I note that the suggestion from post #6 that was tested in post #7 wasn't the recommendation from post #4. Nevertheless, the result for m5 in post #7 conflicts with the expected results quoted in post #3, so I wouldn't expect the post #4 recommendation to identify the NaN.

It does appear from post #5 that the nonsensical-appearing

Code:

generate m6 = n==. & !missing(n)

would identify the NaN, based on the performance of m1 and m2 in post #5.

Another approach would be

Code:

generate m = strofreal(n)=="1.#QNAN"

which has the possible advantage of being based on the interpretation by strofreal() of the binary value as being that known as 1.#QNAN rather than being based on nonsensical-appearing Stata logic that could in theory change in the future. But then, it appears that NaN representations as text vary, per the Wikipedia article at https://en.wikipedia.org/wiki/NaN .

I'm just sorry that my Stata won't let me play along on this by creating NaN values in import delimited.

Last edited by William Lisowski; 26 Aug 2020, 10:24.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

#11

26 Aug 2020, 10:34

Updating my example from post #2 to include set type double results in almost reproducing the results suggested by post #1.

Code:

. set type double

. import delimited foo.csv, clear
(2 vars, 3 obs)

. generate m1 = missing(n)

. generate m2 = n==.

. generate m3 = n>.

. generate sn = strofreal(n,"%8.0f")

. format %8.0f n

. list

     +-------------------------------+
     | id     n   m1   m2   m3    sn |
     |-------------------------------|
  1. |  1    10    0    0    0    10 |
  2. |  2     .    1    1    0     . |
  3. |  3   nan    0    1    0   nan |
     +-------------------------------+

. describe

Contains data
  obs:             3                          
 vars:             6                          
-----------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------
id              byte    %8.0g                
n               double  %8.0f                
m1              double  %10.0g                
m2              double  %10.0g                
m3              double  %10.0g                
sn              str3    %9s                  
-----------------------------------------------------------------

I note that omitting the %8.0f formatting of n results in it being presented as a blank rather than nan. I couldn't quickly find a format that would yield 1.#QNAN; perhaps the NaN-to-text conversion depends on OS-level code.

Added in edit: crossed with #12 below, which confirms that string representation of NaN depends on the operating system.

Last edited by William Lisowski; 26 Aug 2020, 10:50.

Comment

Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#12

26 Aug 2020, 10:34

The approach of

Code:

generate m = strofreal(n)=="1.#QNAN"

will not be portable, since NaN has different string representations on Linux console, Linux GUI/Mac, and Windows. It can depend on the language as well. See the "Display" section in https://en.wikipedia.org/wiki/NaN#Display

Last edited by Hua Peng (StataCorp); 26 Aug 2020, 10:40.
2 likes
Comment
James Hassell (StataCorp)

StataCorp Employee

Join Date: Apr 2015

Posts: 74
#13

26 Aug 2020, 11:46

First I want to acknowledge that this is a bug in -import delimited-.

Now I will shed some light on what -import delimited- is doing and why the behavior is different in Stata 16 vs Stata 15. As some of you might already know, -import delimited- is written in Java. The code was modified for Stata 16 to use Java's Double.parseDouble() to parse numeric values. The reason for the change was purely to increase performance. A side effect of this change was that Double.parseDouble() understands "NaN", "+NaN", and "-NaN". Unfortunately, we did not recognize that at the time, and NaN was stored directly into the Stata dataset instead of being translated to a Stata missing value.

We will get this fixed in an upcoming update, so that "NaN", "+NaN", and "-NaN" will be stored as a Stata missing value.
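The parsing side effect described above is easy to see in plain Java. The following is only a sketch under stated assumptions, not StataCorp's code: the class NanParse, the method parseField, and the constant MISSING are all hypothetical names, and using 2^1023 as the stand-in for a Stata missing value is an assumption. The only library behaviour relied on is that Double.parseDouble() accepts an optionally signed "NaN" and that Double.isNaN() detects the result.

```java
class NanParse {
    // Hypothetical stand-in for a Stata missing value (assumed here to be 2^1023);
    // the importer's real internal representation is not shown in this thread.
    static final double MISSING = Math.scalb(1.0, 1023);

    // Parse one CSV field and translate any NaN result to MISSING,
    // so that a NaN never reaches the stored dataset.
    static double parseField(String field) {
        double value = Double.parseDouble(field.trim());
        return Double.isNaN(value) ? MISSING : value;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"10", "NaN", "+NaN", "-NaN"}) {
            double raw = Double.parseDouble(s);
            System.out.println(s + " -> raw isNaN=" + Double.isNaN(raw)
                    + ", translated isNaN=" + Double.isNaN(parseField(s)));
        }
    }
}
```

The check sits immediately after parsing, so the translation happens once per field and nothing downstream needs to know about NaN.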
4 likes
Comment
Jean-Claude Arbaut

Join Date: Jul 2017

Posts: 209
#14

26 Aug 2020, 12:40

James Hassell (StataCorp) Thank you very much for the clarification!

I already knew about -import delimited- using Java, but actually it's almost explicit in the documentation:

encoding(encoding) specifies the encoding of the text file to be read. The default is encoding("latin1"). Specify
encoding("utf-8") for files to be encoded in UTF-8. import delimited uses Java encoding. A list of available encodings
can be found at https://docs.oracle.com/javase/8/doc...oding.doc.html.

Hua Peng (StataCorp) Sorry to answer a bit late. I think it doesn't matter any longer, but for the record, here is what I get:

Code:

. about

Stata/SE 16.1 for Windows (64-bit x86-64)
Revision 13 Aug 2020
Copyright 1985-2019 StataCorp LLC

Total physical memory: 16.00 GB
Available physical memory: 5.06 GB

Stata license: Single-user perpetual
Serial number: xxxxxxxxxxxx
  Licensed to: Jean-Claude Arbaut
               Home

.
. query compilenumber
Compile number 842

And yes, I always -set type double- in my profile, to avoid any precision loss, but I would never have thought about trying without it.

Many thanks to all contributors.

Jean-Claude Arbaut

Last edited by Jean-Claude Arbaut; 26 Aug 2020, 12:45.
2 likes
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#15

04 Sep 2020, 12:46

A similar bug in import sas is described at

https://www.statalist.org/forums/for...-in-import-sas
1 like
Comment
