Fix ABI readers using wrong dtype for resolution-based chunks #2627
In-file data is 16-bit, so our size has to be based on that. This is a continuation of #2621, where a user on Slack (Matthew Scutter) discovered that his performance got worse after that PR. This was because he was already using large chunk sizes, which made this bug (fixed in this PR) more apparent. As a summary of what's going on and how we compute chunks:
This 4 in the second-to-last step is the problem, because the data in the files is actually 16-bit integers. If we do all our calculations assuming 32-bit floats and hand the result to dask when xarray's open_dataset is called, but dask sees 16-bit integers, we end up with chunks that are about 2x the size we want once those 16-bit integers are scaled to 32-bit floats with the scale factor and add offset in the file.
This PR fixes this by changing that second-to-last step to multiply by 2 (2 bytes per 16-bit integer). This should also prevent the issue Matthew Scutter was seeing, where the number of chunks would end up unaligned depending on the rounding and dividing dask ends up doing. Here's what you see with dask's default of 128MB:
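To make the effect of the dtype mismatch concrete, here's a rough sketch of the arithmetic (not the actual reader code; the element counts and names are purely illustrative):

```python
# Illustrative sketch (not the actual satpy code) of why sizing the chunk byte
# budget for float32 roughly doubles the final chunk size. All numbers here are
# hypothetical.
FLOAT32_BYTES = 4
INT16_BYTES = 2

desired_elements = 7000 * 7000  # elements we actually want per dask chunk

# Old behaviour: the byte size handed to dask was computed for float32...
old_byte_spec = desired_elements * FLOAT32_BYTES
# ...but dask divides that budget by the on-disk itemsize (int16), so each
# chunk ends up with twice as many elements as intended:
old_elements = old_byte_spec // INT16_BYTES       # == 2 * desired_elements

# After scale_factor/add_offset turn the int16s into float32s, the in-memory
# chunk is ~2x what we asked for:
old_in_memory = old_elements * FLOAT32_BYTES

# Fixed behaviour: compute the byte size with the on-disk itemsize (2 bytes):
new_byte_spec = desired_elements * INT16_BYTES
new_elements = new_byte_spec // INT16_BYTES       # == desired_elements
new_in_memory = new_elements * FLOAT32_BYTES      # matches the intended size

print(old_in_memory / new_in_memory)              # -> 2.0
```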
sqrt(117679104 / 2)
for 16-bit integers gets you 7670.6944 elements per dimension (not a whole number of elements), which works out to 33.94113 on-disk chunks per dask chunk. Dask therefore has to choose 33 or 34 on-disk chunks (it actually chooses 33). The point is that because dask isn't computing a nice round number of elements, we get inconsistent, non-resolution-aligned chunk sizes (33 on-disk chunks for 500m, 16 for 1km, 8 for 2km).
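For reference, a small sketch reproducing that arithmetic; note that the 226-element on-disk chunk edge is not stated above, it's only inferred from 7670.6944 / 33.94113, so treat it as an assumption (the real value comes from the netCDF file's own chunking):

```python
import math

byte_budget = 117_679_104                 # chunk byte budget dask works with here
itemsize = 2                              # on-disk dtype is 16-bit integer

edge = math.sqrt(byte_budget / itemsize)  # ~7670.6944 elements per dimension
on_disk_edge = 226                        # assumed on-disk chunk edge (see note)

print(edge)                               # 7670.69...
print(edge / on_disk_edge)                # ~33.94 on-disk chunks -> dask picks 33
```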
If after this fix we still don't have consistent chunking, someone please tell me as soon as possible.