Fix ABI readers using wrong dtype for resolution-based chunks #2627
In-file data is 16-bit, so our size has to be based on that. This is a continuation of #2621, where a user on Slack (Matthew Scutter) discovered that his performance got worse after that PR. This was because he was already using large chunk sizes, which made this bug (fixed in this PR) more apparent. As a summary of what's going on and how we compute chunks:
This 4 in the second-to-last step is the problem, because the data in the files is actually 16-bit integers. If we do all our calculations assuming 32-bit floats and hand the result to dask when xarray's open_dataset is called, but dask sees 16-bit integers, we end up with chunks that are about 2x the size we want once those 16-bit integers are scaled to 32-bit floats with the scale factor and add offset in the file.
This PR fixes this by changing that second-to-last step to multiply by 2 (2 bytes per 16-bit integer). This should also prevent the issue Matthew Scutter was seeing, where the number of chunks would end up unaligned depending on the rounding and dividing dask ends up doing. Here's what you see with dask's default of 128MB:
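To make the effect of the dtype mismatch concrete, here's a rough sketch of the arithmetic (not the actual reader code; the element counts and names are purely illustrative):

```python
# Illustrative sketch (not the actual satpy code) of why sizing the chunk byte
# budget for float32 roughly doubles the final chunk size. All numbers here are
# hypothetical.
FLOAT32_BYTES = 4
INT16_BYTES = 2

desired_elements = 7000 * 7000  # elements we actually want per dask chunk

# Old behaviour: the byte size handed to dask was computed for float32...
old_byte_spec = desired_elements * FLOAT32_BYTES
# ...but dask divides that budget by the on-disk itemsize (int16), so each
# chunk ends up with twice as many elements as intended:
old_elements = old_byte_spec // INT16_BYTES       # == 2 * desired_elements

# After scale_factor/add_offset turn the int16s into float32s, the in-memory
# chunk is ~2x what we asked for:
old_in_memory = old_elements * FLOAT32_BYTES

# Fixed behaviour: compute the byte size with the on-disk itemsize (2 bytes):
new_byte_spec = desired_elements * INT16_BYTES
new_elements = new_byte_spec // INT16_BYTES       # == desired_elements
new_in_memory = new_elements * FLOAT32_BYTES      # matches the intended size

print(old_in_memory / new_in_memory)              # -> 2.0
```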
sqrt(117679104 / 2)
for 16-bit integers gets you 7670.6944 elements per dimension (not a whole number of elements), which works out to 33.94113 on-disk chunks per dask chunk. Dask therefore has to choose 33 or 34 on-disk chunks (it actually chooses 33). The point is that because dask isn't computing a nice round number of elements, we get inconsistent, non-resolution-aligned chunk sizes (33 on-disk chunks for 500m, 16 for 1km, 8 for 2km).
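For reference, a small sketch reproducing that arithmetic; note that the 226-element on-disk chunk edge is not stated above, it's only inferred from 7670.6944 / 33.94113, so treat it as an assumption (the real value comes from the netCDF file's own chunking):

```python
import math

byte_budget = 117_679_104                 # chunk byte budget dask works with here
itemsize = 2                              # on-disk dtype is 16-bit integer

edge = math.sqrt(byte_budget / itemsize)  # ~7670.6944 elements per dimension
on_disk_edge = 226                        # assumed on-disk chunk edge (see note)

print(edge)                               # 7670.69...
print(edge / on_disk_edge)                # ~33.94 on-disk chunks -> dask picks 33
```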
If after this fix we still don't have consistent chunking, someone please tell me as soon as possible.