Understanding latitudinal bounds of bitrounding absolute error #38
Very nice plots! I assume you analyse the bitinformation of Kelvin instead of ˚C?
Quickly ran a check. So it's just the temperature gradient causing this, right? Analysis and bitrounding were done in ˚C, see ds["tos"].attrs:

{'standard_name': 'sea_surface_temperature',
'long_name': 'Sea Surface Temperature',
'comment': 'Temperature of upper boundary of the liquid ocean, including temperatures below sea-ice and floating ice shelves.',
'units': 'degC',
'original_name': 'tos'}
No, you are using ˚C 😄 What you can see are the 4, 8, 16 ˚C isolines of sea surface temperature. With bitrounding, the rounding error that's introduced is an absolute error within every power of 2, but approximately a constant relative error across powers of 2. Hence, for a constant number of keepbits, values between 8 and 16 have a higher absolute error under rounding than values between 4 and 8. This is just a consequence of floating-point numbers.

So the underlying question is more whether floating-point numbers are the best number format to represent sea surface temperature. Given that sea surface temperature is a rather linearly distributed variable (as in, its histogram doesn't span orders of magnitude), one could also use linear quantization to encode these values (similar to the scale/offset-integer options in netCDF). This would give you a constant magnitude of the rounding errors across the globe.

Alternatively, you can also use Kelvin instead of ˚C. For Kelvin, the relevant powers of 2 are 256 and 512, and all sea surface temperature values should be between the two. Hence all floats used to describe these temperatures are linearly distributed in the required range. However, with Kelvin your bitinformation analysis should tell you to keep more mantissa bits, as the information is shifted towards the tail of the mantissa bits. Then your lossless compression algorithm has to take care of the redundant exponent bits (the sign and all exponent bits are identical when all values are in [256, 512)).
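To make these two points concrete, here is a small plain-Julia sketch (made-up numbers, not the CMIP6 data) of (a) how, for a fixed keepbits, the worst-case absolute rounding error doubles at every power of 2, which is exactly what draws the 4, 8, 16 ˚C isolines into the error map, and (b) a scale/offset-style linear quantization whose absolute error is constant over the whole range instead:

```julia
# (a) For a fixed number of keepbits, the spacing of values that survive
#     bitrounding is 2^(exponent - keepbits), so the worst-case absolute
#     error doubles whenever a value crosses a power of 2 (4, 8, 16 ˚C).
keepbits = 7
for T in Float32[3.9, 4.1, 7.9, 8.1, 15.9, 16.1]
    spacing = 2.0^(exponent(T) - keepbits)   # distance between kept values in T's binade
    println("T = $T ˚C: max abs rounding error ≈ $(spacing/2)")
end

# (b) Linear quantization (scale/offset to integers, as in netCDF packing)
#     gives a constant absolute error across the whole range instead.
sst   = 30f0 .* rand(Float32, 100_000)    # SST-like values in ˚C, for illustration only
scale = 30f0 / (2^16 - 1)                 # 16-bit quantization of the range 0..30 ˚C
codes = round.(UInt16, sst ./ scale)      # encode as integers
sst_q = Float32.(codes) .* scale          # decode
println(maximum(abs.(sst_q .- sst)))      # ≤ scale/2 ≈ 0.00023 ˚C, everywhere
```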
So it is a bit dangerous to run this on global data then. For global data, I could assume that there might be a gradient along latitudes, and I find a similar analysis challenge when bitrounding multi-level/depth data, where the surface ocean layer, TOA or atmospheric surface layer has very different values compared to non-boundary layers. Here, analysing and bitrounding 2D fields separately seems safer IMO.
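A minimal sketch of that per-2D-field idea (a made-up random field and BitInformation.jl, just to show the structure, not anyone's actual pipeline):

```julia
using BitInformation

field = randn(Float32, 360, 180, 40)       # placeholder lon × lat × depth data

for k in axes(field, 3)                    # analyse every 2D level on its own
    slice = field[:, :, k]
    info  = bitinformation(slice)          # information per bit position, this level only
    # ...choose keepbits for this level from info (e.g. 99% of the real information),
    # then bitround only this slice; 7 below is just a stand-in for that per-level choice
    slice_rounded = round(slice, 7)
end
```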
I wouldn't describe it as dangerous, as it still does exactly what it is supposed to. Apparently you are in a situation where you'd like rigid absolute error bounds, not relative ones. Floats guarantee you a hard cap on the relative error, but the absolute errors double with increasing powers of 2. So your choice of absolute errors already conflicts with the choice of the binary encoding, floats. Remember that bitinformation does not force you to any binary encoding, so that choice is on you.

Sure, you can use loads of mantissa bits to overcome the problem, but the point is rather that one shouldn't round to a globally constant keepbits in the situation of absolute error requirements plus values spanning several powers of two. The underlying problem is that ˚C is, from the perspective of the data, an arbitrary offset from Kelvin. Sure, (fresh) water freezes at 0 ˚C, but that's about it.

So today I learned that it's important to remind people that premises have to be checked / questions have to be answered before doing the bitinformation+round business.
The problem is obviously that most people want to use floats regardless of what data they are handling. Sure, that makes sense. So for linearly distributed data (where you want absolute errors to have rigid bounds) we have to come up with a workaround to better adhere to 1.-4. while using floats. Let's take your sea surface temperature example and say you have the high precision data in ˚C and definitely want to store the data as ˚C, so what can one do?
A quick illustrative example:

julia> sst = 30*rand(Float32,100_000) # some high precision data in ˚C
100000-element Vector{Float32}:
11.733397
13.493755
14.13639
12.861835
25.827908
7.3809776
0.36545455
16.824009
4.925149
17.725273
⋮
julia> offset = 256
julia> keepbits = 11 # get that from bitinformation(sst .+ offset)
julia> sst_rounded = sort(unique(round(sst.+offset,keepbits)))
241-element Vector{Float32}:
256.0
256.125
256.25
256.375
256.5
256.625
256.75
256.875
257.0
257.125
⋮
285.0
285.125
285.25
285.375
285.5
285.625
285.75
285.875
286.0

Now you have your data rounded to 0.125 K accuracy:

julia> bitstring.(sst_rounded .- offset, :split)
241-element Vector{String}:
"0 00000000 00000000000000000000000" # zero or one keepbit here
"0 01111100 00000000000000000000000"
"0 01111101 00000000000000000000000"
"0 01111101 10000000000000000000000"
"0 01111110 00000000000000000000000"
"0 01111110 01000000000000000000000" # at least two here
"0 01111110 10000000000000000000000"
"0 01111110 11000000000000000000000"
"0 01111111 00000000000000000000000"
"0 01111111 00100000000000000000000" # > three
⋮
"0 10000011 11010000000000000000000"
"0 10000011 11010010000000000000000" # > seven
"0 10000011 11010100000000000000000"
"0 10000011 11010110000000000000000"
"0 10000011 11011000000000000000000"
"0 10000011 11011010000000000000000"
"0 10000011 11011100000000000000000"
"0 10000011 11011110000000000000000"
"0 10000011 11100000000000000000000" |
Yes, that seems like good practical advice.
True. There is still so much for me to learn and understand about bits... I never thought about this before...
For me too! Thanks for posting these issues, I think together we are creating a great resource of do's and don'ts for data compression. Maybe we can even write them down somewhere in a more concise form!?
The first idea @observingClouds and I had was writing this python wrapper.
Definitely. Do you mean something like a jupyter notebook where code and explanation are close to each other? Maybe even a collection of how to handle all the different kinds of variables?
Yeah, or just a chapter in the documentation which can illustrate these issues with code and figures. Sure, it won't be interactive like a notebook, but I don't know where to upload all that data for an interactive notebook, which might just be overkill anyway.
I reckon that many of these issues can be abstracted to similar problems as described above, but it would be good to discuss them from the user/data perspective and not just abstractly from the bit-perspective!
I try and fail to understand the error introduced by bitrounding when comparing poles and tropics.

I take surface temperature tos data from the CMIP6 historical experiment of MPI-ESM1.2-HR; the analysis says I should keep 7 mantissa bits, and I bitround. Finally, I plot the time anomaly (to get rid of the pole-tropics gradient). Similar to Klöwer et al. Fig. 3, I enforce decreasing levels of real bitwise information from left to right. (Decreasing by keepbits would be more useful.) Below is the absolute error due to bitrounding.

Note: keepbits=6 might have been too strong here, as the absolute error shows spatial features and not pure random noise.

What I cannot explain is the very low absolute errors in the high latitudes compared to the higher rounding errors in the tropics, and the quite abrupt cutoff at frontal lines. Anyone an idea? The temperature values at the poles are lower than in the tropics, but that doesn't explain this IMO.
data: CMIP6 tos historical last 10 years
all slides: https://docs.google.com/presentation/d/1VeFqeCTqsFc68dN4KuKsqMXVpieFpQblu6kLioijf_U/edit#slide=id.g1257457c99f_0_78
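For reference, a rough Julia sketch of the workflow described above (a placeholder random field instead of the actual CMIP6 tos data, and BitInformation.jl standing in for whatever tooling was actually used):

```julia
using BitInformation

tos = 30f0 .* rand(Float32, 360, 180)   # placeholder 2D field in ˚C, not the real tos

info      = bitinformation(tos)         # real information content per bit position
keepbits  = 7                           # the number of mantissa bits suggested by the analysis
tos_round = round(tos, keepbits)        # bitround to keepbits mantissa bits
abs_error = abs.(tos_round .- tos)      # absolute error per grid cell, as in the plots
```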