Unexpected behavior of "chunks" argument in open_mfdataset() #9119

arthur-e · 2024-06-13T20:38:18Z

I'm confused as to why using xr.open_mfdataset(..., chunks = {'time': N}) is not producing chunks with N elements along the time dimension.

An example dataset can be obtained using earthaccess, with NASA EarthData Search credentials.

import earthaccess
import xarray as xr
auth = earthaccess.login()

results = earthaccess.search_data(
    short_name = 'M2SDNXSLV',
    temporal = ("2024-01-01", "2024-05-31"))

# Could take about 1 minute on a broadband connection
earthaccess.download(results, 'data_raw/MERRA2')

This dataset easily fits into memory; I'm using it as an example for a tutorial I'm working on. A more realistic reason to want a specific chunking would be, e.g., calculating long-term trends, where all the elements along the time dimension should be in the same chunk.

Specifically, when xarray uses automatic chunking...

ds = xr.open_mfdataset('./data_raw/MERRA2/*.nc4', chunks = 'auto')
ds['T2MMEAN'].chunksizes

Frozen({'time': (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), 'lat': (361,), 'lon': (576,)})

It creates a separate chunk for each file (each time step), which is not ideal. If I instead try:

# This doesn't give the desired result
ds = xr.open_mfdataset('./data_raw/MERRA2/*.nc4', chunks = {'time': 122})

# Nor does this
ds = xr.open_mfdataset('./data_raw/MERRA2/*.nc4', chunks = {'lat': 91, 'lon': 144, 'time': 122})

I still get chunks that do not have 122 elements along the time dimension. The only way I can make this work is to rechunk the data after loading, which is specifically called out as a bad practice.

# This finally does it (chunk() returns a new Dataset, so keep the result)
ds = ds.chunk({'time': 122})
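For a sense of scale, the sizes involved can be worked out from the dimension lengths reported by chunksizes above (122 time steps, 361 latitudes, 576 longitudes). This is only a back-of-the-envelope sketch: it assumes a single variable such as T2MMEAN stored as 32-bit floats, which the post does not state, and the helper name nbytes is made up for illustration.

```python
import math

# Dimension lengths taken from the chunksizes output above.
N_TIME, N_LAT, N_LON = 122, 361, 576

# Assumption: each value is a 32-bit float (4 bytes). Adjust if the dtype differs.
BYTES_PER_VALUE = 4


def nbytes(*shape: int) -> int:
    """Return the in-memory size in bytes of an array with the given shape."""
    return math.prod(shape) * BYTES_PER_VALUE


# One variable over the whole January-May period.
whole_variable = nbytes(N_TIME, N_LAT, N_LON)

# One chunk holding every time step for a 91 x 144 spatial tile,
# as requested with chunks = {'lat': 91, 'lon': 144, 'time': 122}.
full_time_tile = nbytes(N_TIME, 91, 144)

print(f"whole variable: {whole_variable / 2**20:.1f} MiB")
print(f"one full-time tile: {full_time_tile / 2**20:.1f} MiB")
```

Under that assumption the whole variable is on the order of 100 MiB, which is consistent with the dataset fitting comfortably in memory, and a chunk spanning the full time axis for one spatial tile is a few MiB.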

dcherian (Maintainer) · 2024-06-13T20:50:03Z

The chunks argument is applied on a per-file basis so this is expected.

We'd happily merge a PR making this point clear in the documentation:

xarray/xarray/backends/api.py, lines 861 to 866 in 9237f90:

    chunks : int, dict, 'auto' or None, optional
        Dictionary with keys given by dimension names and values given by chunk sizes.
        In general, these should divide the dimensions of each dataset. If int, chunk
        each dimension by ``chunks``. By default, chunks will be chosen to load entire
        input files into memory at once. This has a major impact on performance: please
        see the full documentation for more details [2]_.
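To make the per-file behavior concrete, here is a minimal pure-Python model of what "applied on a per-file basis" implies for chunk sizes along one dimension. It is a sketch of the idea only, not xarray's or dask's actual implementation; the function names split_length and per_file_chunks are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def split_length(length: int, chunk: Optional[int]) -> Tuple[int, ...]:
    """Split a dimension of `length` elements into chunks of at most `chunk`.

    `chunk=None` means "one chunk for the whole length".
    """
    if chunk is not None and chunk <= 0:
        raise ValueError("chunk must be a positive integer or None")
    if chunk is None or chunk >= length:
        return (length,)
    full, rest = divmod(length, chunk)
    return (chunk,) * full + ((rest,) if rest else ())


def per_file_chunks(file_lengths: Sequence[int], chunk: Optional[int]) -> Tuple[int, ...]:
    """Chunk each file on its own, then concatenate the chunk tuples.

    A chunk never spans two files in this model, so the requested size
    acts as an upper bound within each file rather than across files.
    """
    chunks: Tuple[int, ...] = ()
    for length in file_lengths:
        chunks += split_length(length, chunk)
    return chunks


# 122 daily files with one time step each, as in the question.
daily_files = [1] * 122

# Requesting 122 along time still yields one chunk per file.
at_open = per_file_chunks(daily_files, 122)
print(len(at_open), set(at_open))

# Rechunking the concatenated result is what merges chunks across files.
after_rechunk = split_length(sum(daily_files), 122)
print(after_rechunk)
```

In this model, asking for 122 elements along time cannot produce a chunk larger than any single file's time length, which matches the chunk sizes reported in the question; only a rechunk of the combined dataset, like the ds.chunk call shown there, merges chunks across file boundaries.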
