
return in any cases/platforms float32 #35

Open
daniel-mohr opened this issue Aug 11, 2021 · 1 comment

Comments

@daniel-mohr
Contributor

Do we really want to return float32 in all cases/on all platforms? (Changing this could break existing usage of cfunits.)

I believe that every 4 byte integer can be stored in an 8 byte float (double) without losing precision. Therefore it would be more accurate to use numpy.float64 in all cases. Furthermore, for an integer with more than 4 bytes a float32 is not sufficient. What do you think?
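As a quick illustration (a minimal sketch using numpy, not cfunits code): an int32 value above 2**24 already loses precision when cast to float32, but not when cast to float64:

```python
import numpy as np

# 2**24 + 1 = 16777217 fits easily into an int32, but its mantissa needs 25 bits
i = np.int32(2**24 + 1)

print(int(np.float32(i)))  # 16777216 -> precision lost in float32
print(int(np.float64(i)))  # 16777217 -> exact in float64
```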

@daniel-mohr
Contributor Author

Recently I was talking with a friend about floating-point numbers, and suddenly this issue became clear to me.

If we assume IEEE 754, a 4 byte float (float32) consists of:

  • s: 1 bit for the sign
  • m_i: 23 bits for the mantissa
  • e_i: 8 bits for the exponent

From these bits we get the represented decimal number with a few formulas:

sign: $S = (-1)^s$
mantissa: $M = 1 + \frac{\sum_{i=0}^{22} m_i \cdot 2^i}{2^{23}}$
exponent: $E = \sum_{i=0}^{7} e_i \cdot 2^i - (2^7 - 1) = \sum_{i=0}^{7} e_i \cdot 2^i - 127$
decimal number: $S \cdot M \cdot 2^E$

And similarly for a double (float64):

sign: $S = (-1)^s$
mantissa: $M = 1 + \frac{\sum_{i=0}^{51} m_i \cdot 2^i}{2^{52}}$
exponent: $E = \sum_{i=0}^{10} e_i \cdot 2^i - (2^{10} - 1) = \sum_{i=0}^{10} e_i \cdot 2^i - 1023$
decimal number: $S \cdot M \cdot 2^E$
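Just to make the formulas concrete, here is a small sketch (plain Python, not cfunits code) that pulls the sign, exponent and mantissa bits out of a float64 and reconstructs the value with exactly these formulas; float32 works the same way with widths 1/8/23 and bias 127:

```python
import struct

x = -6.25
bits = struct.unpack("<Q", struct.pack("<d", x))[0]  # raw 64-bit pattern of the float64

s = bits >> 63                    # 1 sign bit
e_bits = (bits >> 52) & 0x7FF     # 11 exponent bits
m_bits = bits & ((1 << 52) - 1)   # 52 mantissa bits

S = (-1) ** s
M = 1 + m_bits / 2**52            # implicit leading 1 (normal numbers only)
E = e_bits - 1023                 # subtract the bias 2**10 - 1

print(S * M * 2**E)               # -6.25, matches x
```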

So this means that for a float32 we can only keep 23 bits (plus the implicit leading bit) of the original mantissa of an integer. Converting an int32 to float32 therefore leads to a relative error of up to $\left|\frac{(2^{24}+1) - \mathrm{float32}(2^{24}+1)}{2^{24}+1}\right| \approx 6 \cdot 10^{-8}$.

Using a float64, on the other hand, we can store an int32 at full precision.

But converting an int64 to float64 leads again to a relative error of up to $\left|\frac{(2^{53}+1) - \mathrm{float64}(2^{53}+1)}{2^{53}+1}\right| \approx 1 \cdot 10^{-16}$.

So in both cases (int32 -> float32, int64 -> float64) we have a relative error on the order of the machine epsilon.
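A quick numerical check of both error bounds (again a numpy sketch, not cfunits code):

```python
import numpy as np

x32 = 2**24 + 1
err32 = abs(x32 - int(np.float32(x32))) / x32
print(err32, np.finfo(np.float32).eps)  # ~5.96e-08 vs float32 eps ~1.19e-07

x64 = 2**53 + 1
err64 = abs(x64 - int(np.float64(x64))) / x64
print(err64, np.finfo(np.float64).eps)  # ~1.11e-16 vs float64 eps ~2.22e-16
```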

There are some tools where you can play around (and maybe more and better ones):
