Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault when using MemoryFile objects #550

Closed
mraspaud opened this issue Jul 18, 2022 · 14 comments
Closed

Segfault when using MemoryFile objects #550

mraspaud opened this issue Jul 18, 2022 · 14 comments
Labels
bug Something isn't working

Comments

@mraspaud
Copy link
Contributor

Code Sample, a copy-pastable example if possible

from matplotlib import pyplot
import rioxarray
from rasterio.io import MemoryFile

filename = "test.jp2"

def read_file(filename):
    with open(filename, "rb") as of:
        data = of.read()
        memfile = MemoryFile(data)
        src = memfile.open()
        proj = rioxarray.open_rasterio(src, chunks=2048)
        return proj.squeeze("band")

proj = read_file(filename)
proj.plot.imshow()
pyplot.show()

Problem description

The above results in a segmentation fault every time (at least on my laptop)

Expected Output

I expect the display of my data.

Environment Information

rioxarray (0.11.2.dev0) deps:
rasterio: 1.2.10
xarray: 2022.3.0
GDAL: 3.4.2
GEOS: None
PROJ: None
PROJ DATA: None
GDAL DATA: None

Other python deps:
scipy: 1.8.0
pyproj: 3.3.1

System:
python: 3.10.2 | packaged by conda-forge | (main, Feb 1 2022, 19:28:35) [GCC 9.4.0]
executable: /home/a001673/mambaforge/envs/satpy/bin/python
machine: Linux-4.18.0-372.16.1.el8_6.x86_64-x86_64-with-glibc2.28

Installation method

  • from a fresh git clone, with pip install -e .
@mraspaud mraspaud added the bug Something isn't working label Jul 18, 2022
@mraspaud
Copy link
Contributor Author

I suspect this is due to gdal's memory file object being disposed off before dask can make use of it...

@mraspaud
Copy link
Contributor Author

Just for clarification and some context: The opening of the file and reading to put the data in the MemoryFile is of course contrived, and just comes from the fact that I need to be able to pass open files to rioxarray. In my case, I want to use fsspec in the background (https://filesystem-spec.readthedocs.io/en/latest/) so as to be able to read remote files on the fly.

@snowman2
Copy link
Member

snowman2 commented Jul 18, 2022

See: rasterio/rasterio#2517 (comment) for an example of using MemoryFile using with.

@mraspaud, with this addition to rasterio 1.3 (recently released), I believe you should be able to use fsspec files directly (related #246):

def read_file(filename):
    with open(filename, "rb") as src:
        proj = rioxarray.open_rasterio(src, chunks=2048)
        return proj.squeeze("band")

@mraspaud
Copy link
Contributor Author

I tried using the context, but that resulted in an empty data array being returned (which I think is quite logical this the context will be closed at the time of plotting).

The example you provide does not seem to work (so passing a regular open file to open_rasterio):

...
  File "/home/a001673/usr/src/rioxarray/rioxarray/_io.py", line 959, in open_rasterio
    result = _prepare_dask(result, riods, filename, chunks)
  File "/home/a001673/usr/src/rioxarray/rioxarray/_io.py", line 664, in _prepare_dask
    mtime = os.path.getmtime(filename)
  File "/home/a001673/mambaforge/envs/satpy/lib/python3.10/genericpath.py", line 55, in getmtime
    return os.stat(filename).st_mtime
TypeError: stat: path should be string, bytes, os.PathLike or integer, not BufferedReader

Also rasterio seems to be having problems with opened jp2 files (tif works fine), so I'll be reporting an issue for that on rasterio when I get a minimal case.

@mraspaud
Copy link
Contributor Author

But thanks a lot for the heads up on @djhoese 's amazing contribution :)

@snowman2
Copy link
Member

The example you provide does not seem to work (so passing a regular open file to open_rasterio):

Haven't tested that feature with dask ... looks like we need a patch to handle that.

@snowman2
Copy link
Member

I tried using the context, but that resulted in an empty data array being returned (which I think is quite logical this the context will be closed at the time of plotting).

Sounds like you may need to add a .load() call after open_rasterio.

@djhoese
Copy link
Contributor

djhoese commented Jul 18, 2022

I think using .load() would go against what we generally try to do in Satpy which is to delay loading data until it is actually being computed.

@mraspaud
Copy link
Contributor Author

I think using .load() would go against what we generally try to do in Satpy which is to delay loading data until it is actually being computed.

Yes exactly.

@jamespolly
Copy link

jamespolly commented Jul 30, 2022

Edited for brevity, "reproducible" code, and a possible solution:

Regarding @snowman2's suggestion to use with and fsspec, I've had trouble with rioxarray while using this pattern when reading from private Google Cloud Storage buckets. Interestingly it does not seem to happen for xarray.open_dataset(), and when using rioxarray.open_rasterio() it seems like file-like objects are being closed before subsequent operations can go back to retrieve data from them when needed. See below.

Using rioxarray to open geotiffs from a private Google Cloud Storage bucket, the resulting DataArray opens, and I can even access the coordinate values, but accessing the information that would show up in a plot is not possible:

gcp_creds = {'token': '/path/to/token.json'}
fs_args = {'project': 'my-gcs-project'}
storage_options = {**gcp_creds, **fs_args}
protocol= 'gcs'

load_args = {}
fs = fsspec.filesystem(protocol, **storage_options)
load_path = 'my_bucket_name/path/to/file.tif'
with fs.open(load_path) as fs_file:
    da = rioxarray.open_rasterio(fs_file)

Note the file successfully opens but when I try to do da.plot() the following error is generated:

Exception ignored in: 'rasterio._filepath.filepath_read'
Traceback (most recent call last):
  File "/home/jpolly/miniconda3/envs/windninja/lib/python3.9/site-packages/fsspec/spec.py", line 1539, in read
    raise ValueError("I/O operation on closed file.")
ValueError: I/O operation on closed file.
ERROR 1: _TIFFPartialReadStripArray:Cannot read offset/size for strile around ~0

As a work around to this I've found the following pattern does what I want:

gcp_creds = {'token': '/path/to/token.json'}
fs_args = {'project': 'my-gcs-project'}
storage_options = {**gcp_creds, **fs_args}
protocol= 'gcs'

load_args = {}
fs = fsspec.filesystem(protocol, **storage_options)
load_path = 'my_bucket_name/path/to/file.tif'
da = rioxarray.open_rasterio(fs.open(load_path, **fs_open_args_load), **load_args)

Note that this solution fails when trying to pass any size for 'chunks' as a load argument, e.g., load_args = {'chunks': 2048}.

I can understand how the use of with leads to a file being closed and the ValueError("I/O operation on closed file.") error. It is strange to me that the same pattern works with xarray but not rioxarray, and that chunking does not seem to work.

Is this just something rioxarray currently does not support?

@snowman2
Copy link
Member

#558 should address this: #550 (comment)

@snowman2
Copy link
Member

I think using .load() would go against what we generally try to do in Satpy which is to delay loading data until it is actually being computed.

From what I can tell, calling .load() is equivalent to the initial code snippet above:

    with open(filename, "rb") as of:
        data = of.read()

Since the data is lazily loaded by rioxarray.open_rasterio, calling .load() is the only way to load in the data before closing the file handle in the current state of the code.

Ideas to improve this behavior are definitely welcome. It may be worth looking into how xarray handles this scenario as they may have found a solution for this.

@mraspaud
Copy link
Contributor Author

I think using .load() would go against what we generally try to do in Satpy which is to delay loading data until it is actually being computed.

From what I can tell, calling .load() is equivalent to the initial code snippet above:

    with open(filename, "rb") as of:
        data = of.read()

Yes, it is, however, the of.read() was just a means to get data into the MemoryFile to illustrate the problem.

In the real case, we want to keep things lazy as much as possible. I believe with your example here: #550 (comment) and the fix in #550 I'm good.

Ideas to improve this behavior are definitely welcome. It may be worth looking into how xarray handles this scenario as they may have found a solution for this.

Yes, I think they do, but I don't remember how.

@snowman2
Copy link
Member

I think this is resolved. If something is missing still, please comment.

weiji14 added a commit to GenericMappingTools/pygmt that referenced this issue Aug 30, 2023
Opening the GeoTIFF in a with statement, and loading the RGB image data fully into memory using `image.load()` as suggested in corteva/rioxarray#550 (comment), which should fix the segfault hopefully.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants