-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BCF's use of signalling NaNs #145
Comments
@jmarshall I agree that signaling NaN was an incredibly poor choice due to its tendency to get auto-converted to quiet NaN. I think that before we decide on how radical the fix should be, however, we need more information on how bad the problem is: how many/which architectures are known to be affected? Are only 32-bit architectures affected, or are there 64-bit architectures affected as well? If client code just passes these NaN values around in variables and does things like |
It is of course completely platform-dependent. What has been observed so far is that Intel 387 arithmetic converts SNaNs to QNaNs even just when copying values around between variables (due to the copying being done via the 387 floating-point stack — which might or might not happen depending on optimisation options and vagaries). Intel SSE arithmetic has not been observed to change SNaNs. On 32-bit x86, 387 arithmetic is the default (for GCC), and programs are affected. On 64-bit x86-64, SSE arithmetic is the default (for GCC), and programs have not been observed to be affected. With GCC, |
On 64-bit SSE-arithmetic x86-64 (i.e., the most common platform), just copying values has not been observed to convert SNaN to QNaN, but doing arithmetic on them does. For example, changing line 758 of vcfnorm.c thus - dst->qual = line->qual;
+ dst->qual = line->qual + 0.0; makes the bcftools test suite fail (in This can be avoided on this platform by alway checking |
Broadites: what does Java do in this area? |
@jmarshall According to my tests, adding 0.0 to a signalling NaN float value in Java 8 does not cause it to auto-convert to quiet NaN -- the value after adding 0.0f as determined by a call to Still, this new evidence of the brittleness of these values on x86-64 is very concerning, and probably forces us to go with either option 2 or 3 from your original post above. Option 2 is tempting provided that we can somehow prove that "real" NaNs will never use 0x7FC00001. |
It's also possible that behavior in java will be different on different architectures or across jvms. |
…cial value 0x7f800001 (NaN), used to indicate a missing value, was changed to 0x7fc00001 (also NaN) when the function le_to_float returned. This problem might be specific to MSVC and the calling convention where the return value is passed on the FPU register stack. In that process the value seems to be converted from signaling NaN to quiet NaN. The fix is to keep the value as uint32 while testing for special values. Reported as a bug in samtools: samtools/hts-specs#145
…cial value 0x7f800001 (NaN), used to indicate a missing value, was changed to 0x7fc00001 (also NaN) when the function le_to_float returned. This problem might be specific to MSVC and the calling convention where the return value is passed on the FPU register stack. In that process the value seems to be converted from signaling NaN to quiet NaN. The fix is to keep the value as uint32 while testing for special values. Reported as a bug in samtools: samtools/hts-specs#145
According to my tests (Windows, MSVC, 32-bit mode) the problem appears when floats are returned from functions, not when they are copied. With the default calling convention in my environment, the callee stores the return value on the FPU register stack using the fld instruction, and the caller retrieves it using the fstp instruction. The conversion happens somewhere between those two. I have submitted a pull request to htslib that handles the problem by keeping the values as uint32's while testing for the special values: |
Thank you for this. I am wondering if the change could be made at the level of htslib/vcf.h instead, so that the rest of the code remains unchanged? Do the |
bcf_float_set works fine. The problem is with le_to_float. That's the function that returns the signaling NaN on the FPU stack, which gets converted to a quiet NaN by the FPU. As the code that calls le_to_float is in vcf.c, I can see no other alternatives than to modify that source file. |
A tentative replacement of samtools#485 which requires modifications in user code rather than a single API-level change. This needs to be verified on the affected platform, works fine for me. See also samtools/hts-specs#145 for details
I believe it should be possible to do this at the API level. Can you please test this? pd3/htslib@83ded78 |
No, this won't help. Like I said, the damage has already been done when the uint_8 data was first converted to a float. But another alternative would be to modify the signature of le_to_float, so that it takes a float pointer as a parameter and returns void. Changing that signature will of course affect the implementation in vcf.c as well. |
I was talking rubbish, sorry. I understand the problem now, ignore what I suggested above please. |
:) |
I think some htslib implementation details may have leaked into the specification discussion... Anyway, I think this has been fairly comprehensively covered here, especially in @jmarshall's original comment. The only way that implementations can portably work with the current specification is to effectively treat all floating-point values as a bunch of bytes, up until the point where the value is actually needed. Anyone trying to use the value would have to check that it isn't a missing value code before trying to convert it. Missing values would have to be set by calling a function to set the appropriate binary value, and not by trying to pass it in as a float (htslib already has this). Maybe we should bring this up at the next file formats meeting? I'd favour John's option 2, as it's likely to be the least work. We would probably want to ensure that all missing values are written in the current representation, but we count bit 22 as "don't care" when checking for missing values. |
From §6.3.3:
Unfortunately using signalling NaNs was probably a mistake. On some platforms, operating on them (even just copying them about) converts them to quiet NaNs, and it is hard to control this. (See e.g. early paragraphs of the C committee's N1011 and Kahan's IEEE754 Status notes's comments on signalling NaNs.)
This happens on 32-bit x87, where just loading values on to the FP stack converts SNaNs to QNaNs. This causes the test suite failures in samtools/bcftools#389.
This could be worked around by very careful programming that avoided treating these values as floating point and so avoided FP operations that might convert MISSING etc to QNaNs. However this would be quite painful, and the pain would be repeated in every implementation that wanted to make this work on x86, not just htslib/bcftools.
Implementations could accept that 0x7F800001 (missing) might get converted to 0x7FC00001 in memory and treat either as MISSING for their computations, and then be careful to ensure they are the proper SNaN value when writing them out to BCF files. (And similarly for the other BCF reserved NaN special-purpose values.) So the pain and care would be limited to writing out BCF.
Ideally we might change the BCF specification to allow particular QNaN values for these special-purpose BCF values, perhaps in addition to the current SNaN values. There are a number of options, including:
(The do nothing approach.) Leave BCF's use of signalling NaNs as is; implementations will have to have extra code and carefulness at least when writing BCF files.
(The compatible approach.) Change the table in §6.3.3 to allow either 0x7F800001 or 0x7FC00001 for missing, and similarly for end-of-vector and the other reserved codes, interchangeably.
There is a risk with this that the quiet 0x7FC00001 values are not very distinctive and may be in use as ordinary (non-BCF-special) NaNs. But hopefully all such real NaNs are using the most plain value, 0x7FC00000.
(The distinctive approach.) Invent distinctive new quiet NaN values such as 0x7FCBCF01…0x7FCBCF07 for the BCF special-purpose values, and transition to them from the current 0x7Fx0000x values in some way.
The text was updated successfully, but these errors were encountered: