Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix reading from zstd decompression stream #443

Merged
merged 2 commits into from
Jul 8, 2024

Conversation

ChristofKaufmann
Copy link
Contributor

Since 0.8.0 reading a zstandard compressed file does not work anymore. The following code works in 0.7.2, but not in 0.8.0 and 0.9.0:

import geopandas as gpd
import zstandard
import io

# make an uncompressed.geojson and a compressed.geojson.zst (fixture)
gdf = gpd.GeoDataFrame(geometry=[Point(4, 2), Point(4, 2)])

gdf.to_file('uncompressed.geojson', driver='GeoJSON')

with open('uncompressed.geojson', 'rb') as ro:
    content = ro.read()

compressed_content = zstandard.compress(content)
with open('compressed.geojson.zst', 'wb') as w:
    w.write(compressed_content)

# try to read geojson from BytesIO
# works in 0.7.2 and 0.9.0
with open('uncompressed.geojson', 'rb') as fh:
    mem = io.BytesIO(fh.read())

gdf = gpd.read_file(mem)
print(gdf)

# try to read geojson from file handle
# works in 0.7.2 and 0.9.0
with open('uncompressed.geojson', 'rb') as fh:
    gdf = gpd.read_file(fh)

print(gdf)


# try to read geojson from zstandard decompression stream
# works in 0.7.2, but not in 0.9.0
with open('compressed.geojson.zst', 'rb') as fh:
    dctx = zstandard.ZstdDecompressor()
    with dctx.stream_reader(fh) as reader:
        gdf = gpd.read_file(reader)

print(gdf)

The last test results in the stack trace:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/usr/local/lib/python3.11/dist-packages/geopandas/io/file.py", line 294, in _read_file
    return _read_file_pyogrio(
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/geopandas/io/file.py", line 547, in _read_file_pyogrio
    return pyogrio.read_dataframe(path_or_bytes, bbox=bbox, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyogrio/geopandas.py", line 252, in read_dataframe
    result = read_func(
             ^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyogrio/raw.py", line 197, in read
    get_vsi_path_or_buffer(path_or_buffer),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pyogrio/util.py", line 43, in get_vsi_path_or_buffer
    path_or_buffer.seek(0)
OSError: cannot seek zstd decompression stream backwards

A zstd decompression stream can be seeked only forwards, but not backwards. Hence its seekable method returns False. The issue has been introduced in #397.

This PR fixes it. Instead of only checking, whether the seek method exists, this PR checks whether the seekable method exists and returns True. Only then seek(0) is used.

Do you want me to add any test cases for this? If so, based on zstandard?

@brendan-ward
Copy link
Member

Good catch, and sorry for introducing the bug.

I don't want us to add zstandard as a dependency, even for testing, so I created a very lightweight mock class that provides a non-seekable byte stream, and added it to the tests I'd want to see.

Subsequent read operations against the non-seekable bytes (also for the zstandard stream) will fail, but I'm assuming that is well expected for this use case.

@ChristofKaufmann
Copy link
Contributor Author

Very nice unit tests! It would have been difficult for me to write these. Thank you!

I agree. It is expected that streams can be read only once.

Before merging, do you want to have the Docker GDAL Test with the latest ubuntu fixed (probably just add --break-system-packages to pip install)? Or is there anything else left to do here where I can help with?

@brendan-ward
Copy link
Member

Fix for latest GDAL is already in #438, no need for that to be here.

@brendan-ward brendan-ward requested a review from theroggy July 8, 2024 12:18
Copy link
Member

@theroggy theroggy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me!

@brendan-ward brendan-ward merged commit a99f3f6 into geopandas:main Jul 8, 2024
19 of 20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants