BCF's use of signalling NaNs #145

jmarshall · 2016-04-13T15:41:40Z

From §6.3.3:

BCF2 supports the full range of values from -Infinity to +Infinity, including NaN. BCF2 needs to represent missing values for single precision floating point numbers. This is accomplished by writing the NaN value as the quiet NaN (qNaN), while the MISSING value is encoded as a signaling NaN. From the NaN wikipedia entry, we have:

For example, a bit-wise example of a IEEE floating-point standard single precision (32-bit) NaN would be: s111 1111 1axx xxxx xxxx xxxx xxxx xxxx where s is the sign (most often ignored in applications), a determines the type of NaN, and x is an extra payload (most often ignored in applications). If a = 1, it is a quiet NaN; if a is zero and the payload is nonzero, then it is a signaling NaN.

Unfortunately using signalling NaNs was probably a mistake. On some platforms, operating on them (even just copying them about) converts them to quiet NaNs, and it is hard to control this. (See e.g. early paragraphs of the C committee's N1011 and Kahan's IEEE754 Status notes's comments on signalling NaNs.)

This happens on 32-bit x87, where just loading values on to the FP stack converts SNaNs to QNaNs. This causes the test suite failures in samtools/bcftools#389.

This could be worked around by very careful programming that avoided treating these values as floating point and so avoided FP operations that might convert MISSING etc to QNaNs. However this would be quite painful, and the pain would be repeated in every implementation that wanted to make this work on x86, not just htslib/bcftools.

Implementations could accept that 0x7F800001 (missing) might get converted to 0x7FC00001 in memory and treat either as MISSING for their computations, and then be careful to ensure they are the proper SNaN value when writing them out to BCF files. (And similarly for the other BCF reserved NaN special-purpose values.) So the pain and care would be limited to writing out BCF.

Ideally we might change the BCF specification to allow particular QNaN values for these special-purpose BCF values, perhaps in addition to the current SNaN values. There are a number of options, including:

(The do nothing approach.) Leave BCF's use of signalling NaNs as is; implementations will have to have extra code and carefulness at least when writing BCF files.
(The compatible approach.) Change the table in §6.3.3 to allow either 0x7F800001 or 0x7FC00001 for missing, and similarly for end-of-vector and the other reserved codes, interchangeably.

There is a risk with this that the quiet 0x7FC00001 values are not very distinctive and may be in use as ordinary (non-BCF-special) NaNs. But hopefully all such real NaNs are using the most plain value, 0x7FC00000.
(The distinctive approach.) Invent distinctive new quiet NaN values such as 0x7FCBCF01…0x7FCBCF07 for the BCF special-purpose values, and transition to them from the current 0x7Fx0000x values in some way.

The text was updated successfully, but these errors were encountered:

droazen · 2016-05-24T01:56:24Z

@jmarshall I agree that signaling NaN was an incredibly poor choice due to its tendency to get auto-converted to quiet NaN. I think that before we decide on how radical the fix should be, however, we need more information on how bad the problem is: how many/which architectures are known to be affected? Are only 32-bit architectures affected, or are there 64-bit architectures affected as well? If client code just passes these NaN values around in variables and does things like bcf_float_is_missing() on them, is there any danger of auto-conversion of signaling -> quiet NaN?

jmarshall · 2016-05-24T12:51:25Z

It is of course completely platform-dependent. What has been observed so far is that Intel 387 arithmetic converts SNaNs to QNaNs even just when copying values around between variables (due to the copying being done via the 387 floating-point stack — which might or might not happen depending on optimisation options and vagaries). Intel SSE arithmetic has not been observed to change SNaNs.

On 32-bit x86, 387 arithmetic is the default (for GCC), and programs are affected. On 64-bit x86-64, SSE arithmetic is the default (for GCC), and programs have not been observed to be affected. With GCC, -m64 -mfpmath=387 produces 64-bit programs that are affected.

jmarshall · 2016-06-10T11:59:34Z

On 64-bit SSE-arithmetic x86-64 (i.e., the most common platform), just copying values has not been observed to convert SNaN to QNaN, but doing arithmetic on them does. For example, changing line 758 of vcfnorm.c thus

- dst->qual = line->qual;
+ dst->qual = line->qual + 0.0;

makes the bcftools test suite fail (in bcftools norm -m- test/norm.split.vcf) due to the QUAL field now being plain nan rather than ..

This can be avoided on this platform by alway checking bcf_float_is_missing() etc before doing arithmetic, and it seems that htslib/bcftools do always do this. It is a bit of a pain and undercuts the point of making the special values be NaNs; OTOH you have to do these checks before doing arithmetic on integer fields etc too (where the special values are just particular integer values, as there are mostly no NaN equivalents), so float is not much worse…

jmarshall · 2016-06-10T12:00:17Z

Broadites: what does Java do in this area?

droazen · 2016-06-10T16:35:47Z

@jmarshall According to my tests, adding 0.0 to a signalling NaN float value in Java 8 does not cause it to auto-convert to quiet NaN -- the value after adding 0.0f as determined by a call to Float.floatToRawIntBits() (which preserves distinct NaN values rather than collapsing them all to 0x7FC00000) is bit-identical to the original.

Still, this new evidence of the brittleness of these values on x86-64 is very concerning, and probably forces us to go with either option 2 or 3 from your original post above. Option 2 is tempting provided that we can somehow prove that "real" NaNs will never use 0x7FC00001.

lbergelson · 2016-06-10T16:48:46Z

It's also possible that behavior in java will be different on different architectures or across jvms.

…cial value 0x7f800001 (NaN), used to indicate a missing value, was changed to 0x7fc00001 (also NaN) when the function le_to_float returned. This problem might be specific to MSVC and the calling convention where the return value is passed on the FPU register stack. In that process the value seems to be converted from signaling NaN to quiet NaN. The fix is to keep the value as uint32 while testing for special values. Reported as a bug in samtools: samtools/hts-specs#145

anderskaplan · 2017-03-12T17:03:52Z

According to my tests (Windows, MSVC, 32-bit mode) the problem appears when floats are returned from functions, not when they are copied. With the default calling convention in my environment, the callee stores the return value on the FPU register stack using the fld instruction, and the caller retrieves it using the fstp instruction. The conversion happens somewhere between those two.

I have submitted a pull request to htslib that handles the problem by keeping the values as uint32's while testing for the special values:
samtools/htslib#485

pd3 · 2017-03-13T14:31:34Z

Thank you for this. I am wondering if the change could be made at the level of htslib/vcf.h instead, so that the rest of the code remains unchanged? Do the bcf_float_set_* defines work on your machine? Unfortunately I don't have a platform to reproduce this problem.

anderskaplan · 2017-03-14T19:18:13Z

bcf_float_set works fine. The problem is with le_to_float. That's the function that returns the signaling NaN on the FPU stack, which gets converted to a quiet NaN by the FPU.

As the code that calls le_to_float is in vcf.c, I can see no other alternatives than to modify that source file.

A tentative replacement of samtools#485 which requires modifications in user code rather than a single API-level change. This needs to be verified on the affected platform, works fine for me. See also samtools/hts-specs#145 for details

pd3 · 2017-03-15T07:45:15Z

I believe it should be possible to do this at the API level. Can you please test this? pd3/htslib@83ded78

anderskaplan · 2017-03-15T15:18:05Z

No, this won't help. Like I said, the damage has already been done when the uint_8 data was first converted to a float.

But another alternative would be to modify the signature of le_to_float, so that it takes a float pointer as a parameter and returns void. Changing that signature will of course affect the implementation in vcf.c as well.

pd3 · 2017-03-15T15:28:42Z

I was talking rubbish, sorry. I understand the problem now, ignore what I suggested above please.

anderskaplan · 2017-03-15T20:47:13Z

:)

daviesrob · 2017-03-16T12:31:07Z

I think some htslib implementation details may have leaked into the specification discussion...

Anyway, I think this has been fairly comprehensively covered here, especially in @jmarshall's original comment. The only way that implementations can portably work with the current specification is to effectively treat all floating-point values as a bunch of bytes, up until the point where the value is actually needed. Anyone trying to use the value would have to check that it isn't a missing value code before trying to convert it. Missing values would have to be set by calling a function to set the appropriate binary value, and not by trying to pass it in as a float (htslib already has this).

Maybe we should bring this up at the next file formats meeting? I'd favour John's option 2, as it's likely to be the least work. We would probably want to ensure that all missing values are written in the current representation, but we count bit 22 as "don't care" when checking for missing values.

jmarshall added the vcf label Apr 13, 2016

jmarshall mentioned this issue Apr 13, 2016

Test suite fails on some architectures samtools/bcftools#389

Open

jmarshall modified the milestone: June 2016 meeting May 25, 2016

anderskaplan mentioned this issue Mar 12, 2017

NaN issue samtools/htslib#485

Merged

daviesrob mentioned this issue Jul 17, 2017

Failing test_vcf_api/test_vcf_sweep tests on i386 samtools/htslib#565

Open

ginggs mentioned this issue Feb 18, 2018

New test problem compared to other issues with 1.6 on armel, armhf and mipsel architecture samtools/bcftools#744

Open

rishidev removed this from the June 2016 meeting milestone Jan 30, 2020

randall77 mentioned this issue Feb 25, 2020

cmd/compile: constant propagation in compiler converts signaling NaN to quiet NaN golang/go#36400

Closed

jmarshall mentioned this issue Aug 9, 2021

samtools stat -p -t 12_3reads.bed fails on i686 samtools/samtools#1478

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BCF's use of signalling NaNs #145

BCF's use of signalling NaNs #145

jmarshall commented Apr 13, 2016

droazen commented May 24, 2016

jmarshall commented May 24, 2016

jmarshall commented Jun 10, 2016

jmarshall commented Jun 10, 2016

droazen commented Jun 10, 2016

lbergelson commented Jun 10, 2016

anderskaplan commented Mar 12, 2017

pd3 commented Mar 13, 2017

anderskaplan commented Mar 14, 2017 •

edited

Loading

pd3 commented Mar 15, 2017

anderskaplan commented Mar 15, 2017 •

edited

Loading

pd3 commented Mar 15, 2017

anderskaplan commented Mar 15, 2017

daviesrob commented Mar 16, 2017

BCF's use of signalling NaNs #145

BCF's use of signalling NaNs #145

Comments

jmarshall commented Apr 13, 2016

droazen commented May 24, 2016

jmarshall commented May 24, 2016

jmarshall commented Jun 10, 2016

jmarshall commented Jun 10, 2016

droazen commented Jun 10, 2016

lbergelson commented Jun 10, 2016

anderskaplan commented Mar 12, 2017

pd3 commented Mar 13, 2017

anderskaplan commented Mar 14, 2017 • edited Loading

pd3 commented Mar 15, 2017

anderskaplan commented Mar 15, 2017 • edited Loading

pd3 commented Mar 15, 2017

anderskaplan commented Mar 15, 2017

daviesrob commented Mar 16, 2017

anderskaplan commented Mar 14, 2017 •

edited

Loading

anderskaplan commented Mar 15, 2017 •

edited

Loading