Passing file objects to netCDF4.Dataset doesn't work #295

rabernat · 2014-10-05T16:59:38Z

I am trying to port some code from scipy.io.netcdf_file to netCDF4.Dataset. I have encountered an issue which is pretty significant for me. netCDF4.Dataset expects a string as its argument and cannot accept an open file object. The following code shows the issue:

import netCDF4
from scipy.io import netcdf_file

fobj = open('MODIS.nc', 'rb')
nc3 = netcdf_file(fobj)
fobj.close()

fobj = open('MODIS.nc', 'rb')
nc4 = netCDF4.Dataset(fobj) # this fails
fobj.close()

The second-to-last line raises

TypeError: expected string or Unicode object, file found

This may seem like an unnecessary feature (why not just pass the filename directly), but the problem is that I have a large archive of bzipped netcdf files on disk. The way I usually read them is

import bz2
bz2_fobj = bz2.BZ2File('MODIS.nc.bz2')
nc3 = netcdf_file(bz2_fobj)

If I can't do this with netCDF4, I will have to design a clumsy workaround involving system commands to manually unzip the files.

I considered trying to add this feature myself, but then I realized that the whole library was written in C. Hopefully you will consider adding support for reading file objects.


shoyer · 2014-10-05T17:11:41Z

We would also love to have this, but unfortunately, I don't think it's an easy fix (as it would require delving into the netcdf C library). Hopefully @jswhit can elaborate.

jswhit · 2014-10-06T13:37:00Z

scipy.io.netcdf is a pure python module that reads and writes netcdf-3 formatted files directly. netcdf4-python is a python interface to the netcdf C library, and can handle the netcdf HDF5-based file format. There's no way for the C library to utilize a python file object.

shoyer · 2014-10-06T19:13:34Z

I doubt there's "no way", but I don't have a well-defined sense of how difficult it would be (or who actually knows enough to do it). There is, for example, a C side API for working with Python file objects: https://docs.python.org/2/c-api/file.html

jswhit · 2014-10-06T19:40:12Z

I should have said "impossible without extensive modifications to the HDF5 and netCDF C libs".

I suppose as a workaround we could dump the bytes from the open file object to a temp file, and then pass the name of that temp file to the netCDF C lib.

shoyer · 2014-10-06T19:57:27Z

> I suppose as a workaround we could dump the bytes from the open file object to a temp file, and then pass the name of that temp file to the netCDF C lib.

Indeed, this is the simplest way to solve this problem. But I would say that sort of solution belongs in user code, not this library.
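
The temp-file workaround discussed above could be sketched in user code roughly as follows. This is a minimal sketch, not part of netCDF4: the helper name spool_to_tempfile and the ".nc" suffix are made up for illustration, and it assumes the file object is opened in binary mode.

```python
import shutil
import tempfile


def spool_to_tempfile(fobj, suffix=".nc"):
    """Copy a readable binary file-like object to a named temp file and return its path.

    Hypothetical helper for illustration; not part of netCDF4.
    """
    # delete=False keeps the file on disk after the handle is closed, so a
    # C library can reopen it by name. The caller must remove it afterwards.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        # copyfileobj streams in chunks instead of loading everything at once
        shutil.copyfileobj(fobj, tmp)
        return tmp.name


# Usage (assumes netCDF4 is installed and 'MODIS.nc.bz2' exists):
#   import bz2, os, netCDF4
#   path = spool_to_tempfile(bz2.BZ2File('MODIS.nc.bz2'))
#   try:
#       nc4 = netCDF4.Dataset(path)
#       ...  # work with the dataset
#       nc4.close()
#   finally:
#       os.remove(path)
```

The cost is one full write to disk plus cleanup, which is why it only counts as a workaround.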

marqh · 2014-10-29T13:55:05Z

Maybe I'm missing something here, but a file handle has a .name attribute, so could the workaround (and, by extension, a fix in the Python layer) look like this:

fobj = open('MODIS.nc', 'rb')
nc4 = netCDF4.Dataset(fobj.name)
fobj.close()

?

It's not especially pretty, but at least it enables expected behaviour to be preserved.

shoyer · 2014-10-29T16:55:47Z

@marqh that would work for this example, but in general a python file object only needs to adhere to the file API -- it need not be an actual file on disk (e.g., it could be a BytesIO object).
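
To illustrate the point about .name, here is a small sketch of a check that only trusts the attribute when it really is a path to a file on disk. The helper name path_if_real_file is hypothetical and only meant to show why the attribute cannot be relied on in general.

```python
import io
import os


def path_if_real_file(fobj):
    """Return fobj.name when it is a path to an existing file, else None.

    Hypothetical helper for illustration; not part of netCDF4.
    """
    # io.BytesIO has no .name attribute at all, so fall back to None
    name = getattr(fobj, "name", None)
    # open() on a raw file descriptor sets .name to an int, so check the type
    if isinstance(name, str) and os.path.isfile(name):
        return name
    return None


print(path_if_real_file(io.BytesIO(b"data")))  # prints None
```

A caller would still need a different strategy, such as a temp file or an in-memory buffer, whenever this returns None.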

niallrobinson · 2016-07-28T08:40:50Z

Hi everyone - I'd love to see a fix for this. It's coming up quite regularly in the "map reduce" world (i.e. Hadoop, Spark, Dask), where we want to be able to pass file objects around and read them quickly, that is, without dumping to disk. Is there anything on the horizon that might help out with this?

dopplershift · 2016-07-28T17:02:29Z

netCDF-C last fall gained support for reading directly from an in-memory buffer that contains the bytes of a netCDF file. It's been on my TODO list to expose this in the Cython wrappers here, but I haven't gotten to it. That's probably your best bet--it's not a file-like object, but at least you wouldn't have to have a file on disk any more.

niallrobinson · 2016-08-02T13:57:37Z

great - thanks for the update

jgerardsimcock · 2016-11-03T00:10:49Z

@dopplershift any update on this?

rabernat · 2016-11-03T03:58:49Z

As the original creator of this issue, I am pleased to see it is still alive. I am still very interested, although more for the reasons described by @niallrobinson. I believe the in-memory buffer solution could solve things. To clarify, would we be able to pass a BytesIO object?

niallrobinson · 2016-11-03T15:34:36Z

yup - still actively thinking/worrying about this ;)

dopplershift · 2016-11-03T17:42:36Z

It's still on my todo list, but it hasn't bubbled to the top. I'll try to squeeze it in sooner rather than later (since I don't think it's that hard), but can't make any promises (especially before AMS annual meeting in January).

I don't see BytesIO being supported, since the core functionality would be to read the entire contents of a file into memory and point netCDF at it. BytesIO is about wrapping such a buffer so you can access it like a file. So in my mind it would work like this (borrowing from above):

import bz2
from netCDF4 import Dataset
bz2_fobj = bz2.BZ2File('MODIS.nc4.bz2')
nc4 = Dataset(bz2_fobj.read())

Would that serve the use cases mentioned here?

shoyer · 2016-11-03T18:05:02Z

An interface that accepts file images in the form of bytes would be a big improvement over what we have now.

The driver of performance is the number of memory copies. With scipy.io.netcdf and BytesIO, you can actually pull out np.memmap arrays from an in-memory file image with zero copies. In general, this is impossible for netCDF4, due to the fact that HDF5's memory layout is (often) incompatible with NumPy. But, if we can avoid making a copy in netCDF and simply reuse the raw bytes from Python, that would be very nice. If that's not possible, a memory copy is still an improvement over needing to read from disk.

jswhit · 2016-11-03T19:02:27Z

Here's the documentation for the netCDF-C routine (nc_open_mem) that we could wrap in cython:

http://www.unidata.ucar.edu/software/netcdf/docs/group__datasets.html#gac12fdf7579a2619b2aeb238cea2e7377

thehesiod · 2017-04-28T01:00:30Z

@jswhit nice!!! I'm going to see if I can get this to work in a fork. Update: I created the linked PR; unfortunately, nc_open_mem is broken :(

thehesiod · 2017-05-22T21:01:59Z

Update for others on this thread: in master you can now open a file from memory (not released to PyPI yet, unfortunately).

ReimarBauer · 2017-10-23T07:14:42Z

@thehesiod can you show an example, please? I am interested in using this with pyfilesystem2, e.g. for direct WebDAV or FTP access.

dopplershift · 2017-10-23T21:25:20Z

You should be able to use:

Dataset('myname', memory=fobj.read())

There was a problem with myname needing to point to an existing (and valid) netCDF file, but that should be fixed in netCDF 4.5.0, which was just released.

thehesiod · 2017-10-23T23:48:06Z

The name must still be non-empty, I believe.
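
Applied to the bzipped archive from the original report, the memory= approach might look like the sketch below. The helper names read_bz2_bytes and open_bz2_dataset and the placeholder name "in-memory.nc" are assumptions for illustration. The Dataset call follows the form shown in the comments above, and it needs a netCDF4 build that supports the memory keyword.

```python
import bz2


def read_bz2_bytes(path):
    """Decompress a .bz2 file fully into memory and return the raw bytes."""
    with bz2.BZ2File(path) as fobj:
        return fobj.read()


def open_bz2_dataset(path, name="in-memory.nc"):
    """Open a bzipped netCDF file without writing decompressed data to disk.

    Hypothetical helper for illustration; not part of netCDF4.
    """
    # Imported lazily so read_bz2_bytes can be used without netCDF4 installed
    import netCDF4

    # 'name' is only a label here, but per the comments above it should be
    # a non-empty string
    return netCDF4.Dataset(name, mode="r", memory=read_bz2_bytes(path))


# Usage (assumes 'MODIS.nc.bz2' exists):
#   nc4 = open_bz2_dataset('MODIS.nc.bz2')
```

The whole decompressed file is held in memory, so this suits files that fit comfortably in RAM.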

kuchaale · 2018-04-10T21:33:09Z

@ReimarBauer if still interested, here is the solution where I used pyfilesystem2 to read zipped netcdf files:

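from fs.zipfs import ZipFS
import xarray as xr
import netCDF4

# Read one member of the zip archive into memory as bytes
# (newer pyfilesystem2 releases call this method readbytes)
new_zip = ZipFS("results.zip")
nc_bytes = new_zip.getbytes(u'one_file_within_zip.nc')
# Open the in-memory file image, then wrap it for xarray
nc4_ds = netCDF4.Dataset('name', mode='r', memory=nc_bytes)
store = xr.backends.NetCDF4DataStore(nc4_ds)
ds = xr.open_dataset(store)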

mir-una · 2018-04-23T10:23:37Z

Hello, I am retrieving a BytesIO object from a REST API response and I would like to read the Dataset directly from it without first having to write the object to disk. Is there a way to do this?

dopplershift · 2018-04-23T18:13:50Z

@mir-una Just like the other ones above:

data_bytes = response.read()
nc4_ds = netCDF4.Dataset('name', mode='r', memory=data_bytes)

mir-una · 2018-04-23T22:40:35Z

@dopplershift thank you, but I cannot figure it out. I am using the requests package, and read() does not seem to be a supported method. I am doing the following:

response = requests.get(my_url, params=token, stream=True)
x = BytesIO(response.content)
y = Dataset('name', mode='r', memory=x)

I get an error:

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__init__()
ValueError: memory mode only works with 'r' modes and must be bytes

I am using Python 2.7. Is this because Python 2.7 does not have a definition of bytes, or is it something else?

dopplershift · 2018-04-23T22:43:34Z

@mir-una Using BytesIO is unnecessary, try:

response = requests.get(my_url, params=token, stream=True)
y = Dataset('name', mode='r', memory=response.content)
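
Since the error above comes from passing a BytesIO wrapper where raw bytes are expected, a small normalising helper can make the call less fragile. This is a sketch only; the name as_bytes is made up, and it assumes any object with a read() method was opened in binary mode.

```python
import io


def as_bytes(obj):
    """Return the payload of bytes-like or readable binary objects as bytes.

    Hypothetical helper for illustration; not part of netCDF4.
    """
    if isinstance(obj, bytes):
        return obj
    if isinstance(obj, (bytearray, memoryview)):
        return bytes(obj)
    if isinstance(obj, io.BytesIO):
        # getvalue() returns the whole buffer regardless of the read position
        return obj.getvalue()
    if hasattr(obj, "read"):
        # Generic file-like object: reads from the current position to the end
        return obj.read()
    raise TypeError("cannot convert %s to bytes" % type(obj).__name__)


# Usage (assumes netCDF4 is installed and x holds a netCDF file image):
#   y = Dataset('name', mode='r', memory=as_bytes(x))
```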

tam203 · 2019-03-29T16:54:12Z

I'm trying to read into a Dataset from memory as per the docs, but it's not working. I tried 2.7 and 3.7 and get the same error:

[ec2-user@ip-172-31-12-20 project]$ python3 inmem.py
Traceback (most recent call last):
  File "inmem.py", line 5, in <module>
    netCDF4.Dataset("in-mem-file", mode='r', memory=data)
  File "netCDF4/_netCDF4.pyx", line 2285, in netCDF4._netCDF4.Dataset.__init__
  File "netCDF4/_netCDF4.pyx", line 1855, in netCDF4._netCDF4._ensure_nc_success
FileNotFoundError: [Errno 2] No such file or directory: b'in-mem-file'

code:

import netCDF4
with open('./db8d6757c80a3fa51779a325ba76336451ea0344.nc','rb') as fp:
    data = fp.read()
ds = netCDF4.Dataset("in-mem-file", mode='r', memory=data)
print(ds)

netCDF4 version '1.5.0'

A FileNotFoundError seems irrelevant since I'm trying to read from memory. Help much appreciated.

jswhit · 2019-04-13T12:43:30Z

Can you post the file here? (attach to ticket as a gzipped tar file?)

jswhit · 2019-04-13T15:38:11Z

Also, what version of netcdf-c are you using? (you can check by looking at the __netcdf4libversion__ module variable).

tam203 · 2019-04-15T12:34:34Z

Thanks @jswhit you solved my issue over here

It was version 4.4.1.1 of the lib.
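
A guard like the following could catch this situation before the confusing FileNotFoundError appears. It is a sketch under assumptions: the helper name netcdf_c_at_least is made up, and the 4.5.0 threshold is taken from the comment above that the problem "should be fixed in netCDF 4.5.0", not from a verified compatibility table.

```python
import re


def netcdf_c_at_least(libversion, minimum=(4, 5, 0)):
    """Return True if a netCDF-C version string is at least `minimum`.

    Hypothetical helper for illustration; the default threshold is an
    assumption based on the discussion in this thread.
    """
    # Keep only the leading dotted numeric part, e.g. "4.4.1.1" from
    # "4.4.1.1" or "4.7.4" from "4.7.4-development"
    match = re.match(r"\d+(?:\.\d+)*", libversion)
    if match is None:
        raise ValueError("unrecognised version string: %r" % libversion)
    parts = tuple(int(p) for p in match.group(0).split("."))
    # Pad short versions with zeros so "4.5" compares equal to (4, 5, 0)
    parts += (0,) * (len(minimum) - len(parts))
    return parts >= minimum


# Usage (assumes netCDF4 is installed):
#   import netCDF4
#   if not netcdf_c_at_least(netCDF4.__netcdf4libversion__):
#       raise RuntimeError("netCDF-C is too old for memory= reads")
```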

shoyer mentioned this issue Nov 2, 2016

Support creating DataSet from streaming object pydata/xarray#1075

Closed

This was referenced Apr 28, 2017

add support for opening file from memory #652

Merged

Reading Dataset from memory #406

Closed

NotSqrt mentioned this issue Jun 8, 2018

netcdf-c 4.5 MacPython/netcdf4-python-wheels#4

Open

carloshorn mentioned this issue Aug 5, 2020

Allow reading files passing file objects pytroll/satpy#1299

Open

Huite mentioned this issue Aug 7, 2020

SanDiego.nc file throws error in load_arbitrary_ugrid.py example script NOAA-ORR-ERD/gridded#56

Open
