Inconsistent behavior between Parquet and JSON when chunks are missing #493
---

Thanks for the report, I'll look into it. What version of fsspec and kerchunk do you have?

---
Thanks!
In case it's useful, the full pixi lockfile is in the details below:

---
It may be worth trying with the most recent fsspec, if you get to it before I have a chance to look.

---
Confirming that I hit the same error with the latest (GitHub) versions of both `fsspec` and `kerchunk`.
For reference, here's the full script I'm using to test this out:

```python
from kerchunk import hdf, df
import fsspec.implementations.reference
from fsspec.implementations.reference import LazyReferenceMapper
import json

fname = "/css/geostationary/BackStage/GOES-17-ABI-L1B-FULLD/2022/001/00/OR_ABI-L1b-RadF-M6C01_G17_s20220010000320_e20220010009386_c20220010009424.nc"
h5chunks = hdf.SingleHdf5ToZarr(fname)
refs = h5chunks.translate()

# Write to JSON
with open("test.json", "w") as f:
    f.write(json.dumps(refs, indent=2))

# Write to parquet
fs = fsspec.filesystem("file")
out = LazyReferenceMapper.create(record_size=10_000, root="test.parq", fs=fs)
out.flush()
df.refs_to_dataframe(refs, "test.parq")

# Test the read
import xarray as xr
tjson = xr.open_dataset("test.json", engine="kerchunk")
tparq = xr.open_dataset("test.parq", engine="kerchunk")
tjson.Rad.mean().values  # This works fine
tparq.Rad.mean().values  # This fails
```
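In case it helps with debugging, here is one way to peek at the generated reference table itself (a sketch; the `test.parq/Rad/refs.0.parq` path assumes the per-variable `refs.<n>.parq` layout that `LazyReferenceMapper` writes):

```python
# Sketch: inspect the raw reference rows for the Rad variable.
import pandas as pd

tbl = pd.read_parquet("test.parq/Rad/refs.0.parq")
print(tbl.head(20))  # expect NaN paths for chunks that have no references
```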
---

Please try with fsspec/filesystem_spec#1663

---
Thanks for the quick fix! That partially worked. It did resolve the specific issue of having an array with missing keys (i.e., the absent `Rad/0.{0--15}` references). However, in my tests, I discovered a related issue: in some situations, we will identify variables with attributes but no chunk data. (Since this is also an inconsistency between parquet and JSON behavior, I think it makes sense to keep it as part of the same issue.) For the same file, here's an excerpt of the output kerchunk JSON reference file:

```json
"..."
"Rad/47.33": "base64:eAHt0DENAAAAA...",
"Rad/47.34": "base64:eAHt0DENAAAAA...",
"Rad/47.35": "base64:eAHt0DENAAAAA...",
"algorithm_dynamic_input_data_container/.zarray": "{\"chunks\":[],\"compressor\":null,\"dtype\":\"<i4\",\"fill_value\":-2147483647,\"filters\":null,\"order\":\"C\",\"shape\":[],\"zarr_format\":2}",
"algorithm_dynamic_input_data_container/.zattrs": "{\"_ARRAY_DIMENSIONS\":[],\"input_ABI_L0_data\":\"OR_ABI-L0-F-M6_G17_s20220010000320_e20220010009386_c*.nc\",\"long_name\":\"container for filenames of dynamic algorithm input data\"}",
"algorithm_product_version_container/.zarray": "{\"chunks\":[],\"compressor\":null,\"dtype\":\"<i4\",\"fill_value\":-2147483647,\"filters\":null,\"order\":\"C\",\"shape\":[],\"zarr_format\":2}",
"algorithm_product_version_container/.zattrs": "{\"_ARRAY_DIMENSIONS\":[],\"algorithm_version\":\"OR_ABI-L1b-ALG-RAD_v01r00.zip\",\"long_name\":\"container for algorithm package filename and product version\",\"product_version\":\"v01r00\"}",
"band_id/.zarray": "{\"chunks\":[1],\"compressor\":null,\"dtype\":\"|i1\",\"fill_value\":null,\"filters\":null,\"order\":\"C\",\"shape\":[1],\"zarr_format\":2}",
"band_id/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"band\"],\"long_name\":\"ABI band number\",\"standard_name\":\"sensor_band_identifier\",\"units\":\"1\"}",
"band_id/0": "\u0001",
"..." Note that Again, The corresponding
Note: When I try to open this with
The corresponding error backtrace is in the details below:
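For illustration, one way to list the variables that have array metadata but no chunk keys (a sketch; it assumes the `{"version": ..., "refs": {...}}` layout of the JSON written above):

```python
# Sketch: find variables with .zarray metadata but no chunk references.
import json

with open("test.json") as f:
    refs = json.load(f)["refs"]

meta_vars = {k.split("/")[0] for k in refs if k.endswith("/.zarray")}
data_vars = {k.split("/")[0] for k in refs
             if "/" in k and not k.split("/", 1)[1].startswith(".z")}
print(meta_vars - data_vars)  # e.g. the *_container variables above
```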
I can fix this by manually removing the variables with no data from the `.zmetadata`:

```python
import json
from pathlib import Path

# Clean up parquet zmetadata
with open("test.parq/.zmetadata", "r") as f:
    zm = json.load(f)

# Create a copy of the original metadata
with open("test.parq/.zmetadata.orig", "w") as f:
    f.write(json.dumps(zm, indent=2))

# Remove keys that don't have any array information
varnames = {k.split("/")[0] for k in zm["metadata"].keys() if "/" in k}
haskeys = {x.name for x in Path("test.parq").glob("*/")}
keepvars = varnames.difference(haskeys)  # variables with metadata but no reference files
zm2 = zm.copy()
zm2_meta = zm2["metadata"]
for key in list(zm2_meta.keys()):
    for v in keepvars:
        if v in key:
            zm2_meta.pop(key, None)  # pop(..., None) guards against removing a key twice
with open("test.parq/.zmetadata", "w") as f:
    f.write(json.dumps(zm2))
```

Once I remove the offending variables from `.zmetadata`, the parquet dataset opens and reads without error.
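And a quick check that the workaround takes effect (a sketch, reusing the same session and file names as the script above):

```python
# Sketch: re-open the cleaned parquet store and read the data.
import xarray as xr

tparq = xr.open_dataset("test.parq", engine="kerchunk")
print(tparq.Rad.mean().values)  # succeeds once the empty variables are gone
```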
---

OK, arrays with no references at all are another edge case that wasn't covered...

---
I wrote the following test as a minimal reproducer - but it doesn't show the problem at all. Can you see a difference?
---
Sorry for not getting back to this sooner. I'm not really sure how to create a minimal reproducible example of this, but I have identified the problem: the variables in question are scalar variables with attributes but no values (and therefore, no offsets). See the details here for an example `h5dump`; note that the dataset has attributes but no stored data.
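As one possible route to a minimal reproducer (a sketch; it relies on HDF5's lazy storage allocation, so a dataset that is never written has no file offset; the file and variable names here are made up):

```python
# Sketch: create a scalar HDF5 variable that has attributes but no stored data.
import h5py

with h5py.File("empty_scalar.h5", "w") as f:
    d = f.create_dataset("container", shape=(), dtype="<i4",
                         fillvalue=-2147483647)
    d.attrs["long_name"] = "attributes only; nothing ever written"
    # Nothing was written, so HDF5 never allocates storage for the dataset:
    print(d.id.get_offset())  # None -> kerchunk has no offset to reference
```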
I.e., the problem isn't so much that some chunks are missing as that the variable has no chunks at all. The relevant kerchunk code is here: https://github.com/fsspec/kerchunk/blob/main/kerchunk/hdf.py#L613-L617. This returns an empty dict, because the dataset has no storage offset. But there are still `.zarray` and `.zattrs` entries for the variable.
Digging deeper, it looks like the fsspec JSON reader "succeeds" here because it just loops over all the references in the JSON, without worrying about whether every variable has chunk references. However, the parquet loader fails here because it explicitly expects a references file to exist. Specifically, it seems that, if the length of the object is zero, the implementation automatically assumes the scalar is stored in a chunk called `0`. This assumption is incorrect for the (unusual) case where the array is completely empty and has no offsets.
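To make the described assumption concrete, here is a minimal sketch (not fsspec's actual code):

```python
# Sketch: a reader that assumes a zero-dimensional array stores its single
# chunk under "<var>/0" breaks when no such reference was ever written.
def expected_chunk_key(var: str, shape: tuple) -> str:
    if len(shape) == 0:  # scalar: zarr names its one chunk "0"
        return f"{var}/0"
    return var + "/" + ".".join("0" for _ in shape)  # first chunk otherwise

refs = {  # hypothetical reference set: metadata only, no chunk data
    "container/.zarray": "{...}",
    "container/.zattrs": "{...}",
}
key = expected_chunk_key("container", ())
print(key, key in refs)  # "container/0 False" -- the expected reference is absent
```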
---

OK, I think I understand. I'm short of time this week and next, but please ping me again, and I will sort it out.

---
Hi! Just checking in to see if you've been able to make any progress on this. Let me know if there are any prototype implementations, PRs, etc. that you'd like me to test out. Also, note that fsspec/filesystem_spec#1663 did actually solve the first part of this problem, which is arguably a bigger issue because I can't hack around that, whereas the empty-variables issue can be hacked around by just removing the offending variables from the `.zmetadata`.

---
Sorry, I've not been working on this. I'll try to find some time. So there were two problems:
---
No worries! To clarify, the two issues are:

1. Arrays whose chunk references don't start at index 0 (e.g., the `Rad` keys begin at `Rad/0.16`); this was fixed by fsspec/filesystem_spec#1663.
2. Scalar variables with `.zarray`/`.zattrs` metadata but no chunk references at all; these still fail when the references are stored as parquet.
---
Agreed.

---
I am looking back at this, and it appears to me that the second case was also already solved (test function below). In this case, the data is a scalar value, and if there is no refs parquet file, then an access to the data looks like the fill value.
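As a rough illustration of that expected behavior (a sketch, not the test function referenced above; the variable name comes from the JSON excerpt earlier in the thread):

```python
# Sketch: a scalar variable with metadata but no refs parquet file should now
# read back as its fill value instead of raising.
import xarray as xr

ds = xr.open_dataset("test.parq", engine="kerchunk")
print(ds["algorithm_dynamic_input_data_container"].values)
```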
---
(I reinstated fsspec/filesystem_spec#1738.)

---
You can use (I tested all of that on

---
Taking the first file from here (https://noaa-goes17.s3.amazonaws.com/index.html#ABI-L1b-RadF/2022/001/00/) as an example:

The following code:

Produces the following JSON output (excerpt; slightly clipped):

Note that the radiance chunks begin at `0.16` --- there is no `Rad/0.{0--15}`. That's weird --- I'm assuming this is some HDF5 sparse data cleverness. But in any case, `xarray.open_dataset("test.json", engine="kerchunk")` and subsequent summarizing of the entire `Rad` array (`dat.Rad.mean().values`) works fine here.

However, if you spit this out as a Parquet dataset, then it produces a file with rows 0-15 containing `nan` paths and 0 values, and then the real data start at row 16. That's fine... except that reading that Parquet file fails with an error like this (full backtrace in details):

I've traced this back to a `references.get("Rad/0.0")` call that returns a `nan` "url" that can't be parsed by subsequent code. Here's some relevant `pdb` traces:
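For what it's worth, here is a minimal sketch of that failure mode (hypothetical values; the real lookup happens inside fsspec's reference filesystem, not in user code):

```python
# Sketch: a reference row read back from parquet for a missing chunk has a
# NaN path, while downstream code expects a string URL.
import math

references = {"Rad/0.0": [float("nan"), 0, 0]}  # [url, offset, size]
url, offset, size = references.get("Rad/0.0")
print(isinstance(url, str))  # False
print(math.isnan(url))       # True: the "nan url" that later parsing chokes on
```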