Add functionality to allow appending cubes to existing netcdf file #565

Closed
rsignell-usgs opened this issue Jun 19, 2013 · 7 comments
Labels: Peloton 🚴‍♂️ Target a breakaway issue to be caught and closed by the peloton

@rsignell-usgs

Currently, writing Iris data to netCDF creates a new file. In the case where we are using Iris to process a large amount of data (in my case a 30-year global hindcast), we need to be able to process timestep-by-timestep and append the processed result at each timestep to the existing netCDF file, as we don't have enough computer memory to hold all the timesteps at once. An example of what we would like to do is here, using netCDF4 to append, but we would like to be able to accomplish this with Iris.

http://nbviewer.ipython.org/5777643

The relevant code snippet is:


import iris
import netCDF4

nc = netCDF4.Dataset('cfsr.nc', 'r+')
for i in range(10):
    url = ('http://nomads.ncdc.noaa.gov/thredds/dodsC/modeldata/'
           'cmd_ocnh/2009/200905/200905%2.2d/ocnh01.gdas.200905%2.2d00.grb2'
           % (i + 1, i + 1))
    print(url)
    cubes = iris.load(url)
    t = cubes[4]
    # Extract the region of interest ("slice" shadows the Python built-in,
    # so use another name).
    subset = t.extract(iris.Constraint(
        longitude=lambda cell: -77. + 360. < cell < -63. + 360.,
        latitude=lambda cell: 34. < cell < 46.))
    nc.variables['Potential_temperature'][i, :, :, :] = subset.data
@bjlittle
Member

Thanks @rsignell-usgs for posting this issue!

Yes, I completely agree ... being able to append to a NetCDF file as you process a streamed input would be very useful. It's the natural extension to the current NetCDF saving capability.

Let's see if we can get this addressed for you ... would @rhattersley or @esc24 care to comment please?

@rhattersley
Member

> Thanks @rsignell-usgs for posting this issue!

I'll second that - it's a very interesting problem. 😀

@esc24 and I have had a quick chat about it and a couple of options came to mind.

The first, most well-defined, and simplest option is to provide a way to append a Cube to an existing netCDF file. This would check the metadata of the Cube against the metadata in the file and extend an existing variable where appropriate. For example (apologies for the boolean argument in this mock-up 😉):

for i in range(10):
    url = '...'
    my_2d_cube = iris.load(url)[4].extract(...)
    iris.save(my_2d_cube, 'cfsr.nc', append=True)

The second option (which is just an exploration at this stage, and not an alternative to the first) is to create a single, all-encompassing empty result Cube where the data is defined by a function instead of a numpy array, just as it is for deferred loading. iris.save() would then use this function to create data in chunks and write it to disk. For example, with lots of hand waving...:

def derive_data(...):
    url = '...'
    my_2d_cube = iris.load(url)[4].extract(...)
    return my_2d_cube.data
data = biggus.derived_data((55000, 768, 1024), derive_data, (768, 1024))
my_big_cube = Cube(data, ...)
iris.save(my_big_cube, 'cfsr.nc')

@esc24
Member

esc24 commented Jun 21, 2013

There may be situations where one would want to append a cube to a file but not do an implicit merge even if it were possible. For example:

cubes = [day_one, day_two, day_three]
for cube in cubes:
    iris.save(cube, 'myfile.nc', append=True, merge=False)

should result in three data variables, so that when loaded you get back what you saved (three cubes). (I'm not proposing another boolean keyword; it should just illustrate what I mean.)
I'd also like this to handle the case where the cubes cannot be merged, such that:

cubes = [temperature, pressure, humidity]
for cube in cubes:
    iris.save(cube, 'myfile.nc', append=True)

is equivalent to iris.save(cubes, 'myfile.nc').

@arulalant

Is this #565 still open?

cubes = [temperature, pressure, humidity]
for cube in cubes:
    iris.save(cube, 'myfile.nc', append=True)

I still can't append cubes (a list of different variables) to a netCDF file.

Thanks,
Arulalan.T

@rhattersley
Member

> Is this #565 still open?

Yes, but I'm not aware of anyone working on a fix. Sorry!

@bjlittle self-assigned this Oct 1, 2020
@bjlittle added the Peloton 🚴‍♂️ label Oct 1, 2020
@pp-mo
Member

pp-mo commented Jan 5, 2021

I've been meaning for ages to respond to this.

I spent quite a while a few months back trying to implement an append, mostly focused on @rhattersley's first option,
i.e. specifically appending to all the relevant variables along an unlimited dimension.

It's a natural usecase, e.g. to open a file and add "today's data" to it, especially as the necessary operation exists in netcdf (without requiring you to read+re-write all the existing, even by streaming it).

I spent quite a while climbing the mountain of "decoupling the saving code from actual file operations", so that it instead produces an abstract representation of the required (new) data.
With this, I could then either create new variables or extend existing ones.
Working code ideas here : pp-mo#51

However, when I then sought to use this to define a file-append function, I hit big problems in unambiguously relating the "new data" to the existing content of a file, and especially in guaranteeing that the proposed append operation will be correct and safe
... which you really need: the existing dataset is typically large and old, so when modifying it without first taking a copy, you really can't afford to fail halfway through !!

Sample of WIP here : pp-mo#52
So the append operation itself is really pretty trivial, but the resulting checking routine got dreadfully intricate, and I never completed this work.

A key problem is that Iris identifies data by its CF identity, but we need to work with the actual variable + dimension names in the file (most likely generated by a previous Iris save, but not necessarily with the same results, notably var- and dim-names and attributes), at least if you allow it to potentially add new variables + dims to an existing file.

So, I concluded that, if we really want this, it would far better be based on a lower-level append operation acting on general netcdf files, with no CF concepts involved.

  • E.G. I think there is append functionality in the CDO code.
  • .. or how about in xarray ?
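For what it's worth, xarray does offer a limited form of append: `Dataset.to_netcdf(mode="a")` adds new data variables to an existing file (it does not extend variables along an unlimited dimension). A minimal sketch with illustrative file and variable names:

```python
# xarray's mode="a" adds *new* data variables to an existing netCDF file;
# it does not extend existing variables along an unlimited dimension.
import numpy as np
import xarray as xr

xr.Dataset({"temperature": ("x", np.arange(3.0))}).to_netcdf("state.nc")
xr.Dataset({"pressure": ("x", np.ones(3))}).to_netcdf("state.nc", mode="a")
```

So xarray covers the "add another variable" case from the earlier comments, but not the timestep-by-timestep extension this issue originally asked for.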

But equally, I have come to really doubt the wisdom of the whole idea ...
If it's really so hard to be sure you are not about to trash the existing precious file, maybe taking a copy just isn't so terrible ??

@pp-mo
Member

pp-mo commented Aug 25, 2022

I'm going to finally close this, as I don't think we are going to do it.
We are currently intending to improve Iris/Xarray interoperability, and we are planning to trial a much better lazy-data-preserving data exchange, so I expect that in future that will be the appropriate way to support incremental writes.
See for context : #4835 (comment)

@pp-mo closed this as completed Aug 25, 2022