Add functionality to allow appending cubes to existing netcdf file #565

Closed
rsignell-usgs opened this issue Jun 19, 2013 · 7 comments
Labels: Peloton 🚴‍♂️ Target a breakaway issue to be caught and closed by the peloton

@rsignell-usgs

Currently, writing Iris data to netCDF creates a new file. In the case where we are using Iris to process a large amount of data (in my case a 30-year global hindcast), we need to be able to process timestep-by-timestep and append the processed result at each timestep to the existing netCDF file, as we don't have enough computer memory to hold all the timesteps at once. An example of what we would like to do is here, using netCDF4 to append, but we would like to be able to accomplish this with Iris.

http://nbviewer.ipython.org/5777643

The relevant code snippet is:


import iris
import netCDF4

nc = netCDF4.Dataset('cfsr.nc', 'r+')
for i in range(10):
    url = ('http://nomads.ncdc.noaa.gov/thredds/dodsC/modeldata/'
           'cmd_ocnh/2009/200905/200905%2.2d/ocnh01.gdas.200905%2.2d00.grb2'
           % (i + 1, i + 1))
    print(url)
    cubes = iris.load(url)
    t = cubes[4]
    # Extract the region of interest ("slice" shadows the Python built-in,
    # so use another name).
    subset = t.extract(iris.Constraint(
        longitude=lambda cell: -77. + 360. < cell < -63. + 360.,
        latitude=lambda cell: 34. < cell < 46.))
    nc.variables['Potential_temperature'][i, :, :, :] = subset.data
@bjlittle
Member

Thanks @rsignell-usgs for posting this issue!

Yes, I completely agree ... being able to append to a NetCDF file as you process a streamed input would be very useful. It's the natural extension to the current NetCDF saving capability.

Let's see if we can get this addressed for you ... would @rhattersley or @esc24 care to comment please?

@rhattersley
Member

> Thanks @rsignell-usgs for posting this issue!

I'll second that - it's a very interesting problem. 😀

@esc24 and I have had a quick chat about it and a couple of options came to mind.

The first, most well-defined, and simplest option is to provide a way to append a Cube to an existing netCDF file. This would check the metadata of the Cube against the metadata in the file and extend an existing variable where appropriate. For example (apologies for the boolean argument in this mock-up 😉):

for i in range(10):
    url = '...'
    my_2d_cube = iris.load(url)[4].extract(...)
    iris.save(my_2d_cube, 'cfsr.nc', append=True)

The second option (which is just an exploration at this stage, and not an alternative to the first) is to create a single, all-encompassing empty result Cube where the data is defined by a function instead of a numpy array, just as it is for deferred loading. iris.save() would then use this function to create data in chunks and write it to disk. For example, with lots of hand waving...:

def derive_data(...):
    url = '...'
    my_2d_cube = iris.load(url)[4].extract(...)
    return my_2d_cube.data
data = biggus.derived_data((55000, 768, 1024), derive_data, (768, 1024))
my_big_cube = Cube(data, ...)
iris.save(my_big_cube, 'cfsr.nc')

@esc24
Member

esc24 commented Jun 21, 2013

There may be situations where one would want to append a cube to a file but not do an implicit merge even if it were possible. For example:

cubes = [day_one, day_two, day_three]
for cube in cubes:
    iris.save(cube, 'myfile.nc', append=True, merge=False)

should result in three data variables, so that when loaded you get back what you saved (three cubes). (I'm not proposing another boolean keyword; it should just illustrate what I mean.)
I'd also like this to handle the case where the cubes cannot be merged, such that:

cubes = [temperature, pressure, humidity]
for cube in cubes:
    iris.save(cube, 'myfile.nc', append=True)

is equivalent to iris.save(cubes, 'myfile.nc').

@arulalant

Is this #565 still open?

cubes = [temperature, pressure, humidity]
for cube in cubes:
    iris.save(cube, 'myfile.nc', append=True)

I still can't append cubes (a list of different variables) to a netCDF file.

Thanks,
Arulalan.T

@rhattersley
Member

> Is this #565 still open?

Yes, but I'm not aware of anyone working on a fix. Sorry!

@bjlittle self-assigned this Oct 1, 2020
@bjlittle added the Peloton 🚴‍♂️ label Oct 1, 2020
@pp-mo
Member

pp-mo commented Jan 5, 2021

I've been meaning for ages to respond to this.

I spent quite a while a few months back trying to implement an append, mostly focused on @rhattersley's first option,
i.e. specifically appending to all the relevant variables along an unlimited dimension.

It's a natural usecase, e.g. to open a file and add "today's data" to it, especially as the necessary operation exists in netcdf (without requiring you to read+re-write all the existing, even by streaming it).

I spent quite a while climbing the mountain of "decoupling the saving code from actual file operations", so that it instead produces an abstract representation of the required (new) data.
With this, I could then either create new variables or extend existing ones.
Working code ideas here : pp-mo#51

However, when I then sought to use this to define a file-append function, I hit big problems in unambiguously relating the "new data" to the existing content of a file, and especially in guaranteeing that the proposed append operation will be correct and safe
... which you really need: the existing dataset is typically large and old, so when modifying it without first taking a copy, you really can't afford to fail halfway through !!

Sample of WIP here : pp-mo#52
So the append operation itself is really pretty trivial, but the resulting checking routine got dreadfully intricate, and I never completed this work.

A key problem is that Iris identifies data by its CF identity, but we need to work with the actual variable + dimension names in the file (most likely generated by a previous Iris save, but not necessarily with the same results, notably var- and dim-names and attributes), at least if you allow it to potentially add new variables + dims to an existing file.

So, I concluded that, if we really want this, it would far better be based on a lower-level append operation acting on general netcdf files, with no CF concepts involved.

  • E.G. I think there is append functionality in the CDO code.
  • .. or how about in xarray ?
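For what it's worth, xarray does offer a limited form of append: `Dataset.to_netcdf(mode="a")` adds new data variables to an existing file (it does not extend variables along an unlimited dimension). A minimal sketch with illustrative file and variable names:

```python
# xarray's mode="a" adds *new* data variables to an existing netCDF file;
# it does not extend existing variables along an unlimited dimension.
import numpy as np
import xarray as xr

xr.Dataset({"temperature": ("x", np.arange(3.0))}).to_netcdf("state.nc")
xr.Dataset({"pressure": ("x", np.ones(3))}).to_netcdf("state.nc", mode="a")
```

So xarray covers the "add another variable" case from the earlier comments, but not the timestep-by-timestep extension this issue originally asked for.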

But equally, I have come to really doubt the wisdom of the whole idea ...
If it's really so hard to be sure you are not about to trash the existing precious file, maybe taking a copy just isn't so terrible ??

@pp-mo
Member

pp-mo commented Aug 25, 2022

I'm going to finally close this, as I don't think we are going to do it.
We are currently intending to improve Iris/Xarray interoperability, and we are planning to trial a much better lazy-data-preserving data exchange, so I expect that in future that will be the appropriate way to support incremental writes.
See for context : #4835 (comment)

@pp-mo closed this as completed Aug 25, 2022