-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
open_virtual_dataset with dmr++ #113
Conversation
for more information, see https://pre-commit.ci
chunk key parsing speedup
for more information, see https://pre-commit.ci
Thanks for taking a look and giving my suggested changes to the chunk key parsing a try @ayushnag ! Continuing the discussion on performance I think the remaining bottlenecks (aside from your point about I/O in the cloud maybe) with this now lie primarily outside the scope of this work, and I don't expect changing XML readers to make a significant improvement. |
virtualizarr/xarray.py
Outdated
group : str, default None | ||
Group path within the dataset to open. For example netcdf4 and hdf5 groups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to separate out the addition of this kwarg into a separate pull request, and implement it for the existing HDF5 reader. Then this PR wouldn't need to change the API of open_virtual_dataset
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see #165
virtualizarr/manifests/manifest.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The int32 to int64 change had to be made since I ran into some large byte offsets with the Atlas ICE-SAT dataset. Here is an example error: OverflowError: Python integer 6751178683 out of bounds for int32
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well that justifies #177!
Some questions about writing unit tests:
|
@TomNicholas this PR should be ready to go now. You can take a look at the code again if you wish and I can add updates. I have also added docs and release notes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, but I'm still unclear if you prefer to merge into EarthAccess or here.
virtualizarr/readers/dmrpp.py
Outdated
attrs.update(self._parse_attribute(attr_tag)) | ||
return xr.Dataset( | ||
data_vars=data_vars, | ||
coords=xr.Coordinates(coords=coord_vars, indexes={}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the indexes
variable here technically be passed through from open_virtual_dataset
? Right now it doesn't matter, but it would in order to support #18.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you're right, I will fix that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and #113 (comment) are the only comment that I think actually need to be addressed before merging, because they affect public API. Everything else can be left until later if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to get this fixed however I think there is an issue which is that the indexes cannot be created since we don't have the path to the actual data file. Technically the dmrpp file does have a data path but it is only the file name and not the full path. E.g. "data.nc" instead of "s3://.../data.nc". So in that case if I add support for the indexes param it will fail when indexes=None
when usually in virtualizarr indexes=None
indicates auto-create indexes.
What I could do instead is to treat indexes=None
and indexes={}
as the same but raise a warning if indexes=None
that the behavior is not as expected. However I can still accept manually created indexes (e.g. {"time": Index})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just raise a NotImplementedError
on indexes={}
. But at least something that isn't silently doing an inconsistent behaviour to the other readers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought indexes={}
would be acceptable since that is how most people indicate that indexes should not be created?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant indexes=None sorry
attr[attr_tag.attrib["name"]] = values[0] if len(values) == 1 else values | ||
return attr | ||
|
||
def _parse_filters( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of these methods don't technically need to be methods, because they don't use self
. It could all be functional instead. But this isn't very important, just a note.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you're right these can be standalone. Even many of the functions that use self are just using the class constants. I will update after the initial merge
"offset": int(chunk_tag.attrib["offset"]), | ||
"length": int(chunk_tag.attrib["nBytes"]), | ||
} | ||
return ChunkManifest(entries=chunkmanifest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think you could rewrite this to use ChunkManifest.from_arrays()
? It would potentially be a lot more performant. (But this could also be left to a later PR.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh nice didn't know about that function. I will give it a try
@TomNicholas do you know why the test is failing? I have added dmrpp as a valid file type so it should be recognized in the unit test. It also passes when I run network tests locally |
I'm just looking at that. I suspect it's just a bad merge from |
Amazing piece of work here @ayushnag !! |
group=
param