When and where to get missing information for active storage #31

Open
bnlawrence opened this issue Oct 24, 2022 · 9 comments

Comments

@bnlawrence
Collaborator

bnlawrence commented Oct 24, 2022

At one point it was said

I don't think that it makes sense to pass in a netCDF.Dataset instance, as on principle we don't want those hanging around as open file handles, but ...

But we need access to netcdf attributes inside the active storage (to get all the information about missing values, compression etc).

Where and when do we think we should open the dataset and get that info?

@bnlawrence
Collaborator Author

It appears that we can do it in one of three places:

  1. outside Active Storage, and initialise with a Dataset instance
  2. when instantiating
  3. or during the operation

Given the objection above, it looks like 2 is the right answer?

@bnlawrence
Collaborator Author

(I think doing it at instantiation would make ncvar a required attribute, not a keyword, since these are per-variable properties.)
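A minimal sketch of option 2, assuming a hypothetical `ActiveSketch` class (not the project's actual API). The file is opened only inside the constructor, the per-variable attributes are copied out, and the handle is closed again, so no `Dataset` instance is kept around. The `read_attributes` parameter is an illustrative hook that makes the reader swappable; the default uses `netCDF4`, imported lazily.

```python
from typing import Any, Callable, Dict

# netCDF attribute names that describe missing data.
MISSING_KEYS = ("_FillValue", "missing_value", "valid_min", "valid_max", "valid_range")


def read_attributes_with_netcdf4(uri: str, ncvar: str) -> Dict[str, Any]:
    """Open the file, copy one variable's attributes, and close it again."""
    import netCDF4  # imported lazily so the sketch loads without the library

    with netCDF4.Dataset(uri) as ds:  # the file handle is closed on exit
        var = ds.variables[ncvar]
        return {name: var.getncattr(name) for name in var.ncattrs()}


class ActiveSketch:
    """Hypothetical stand-in for the Active Storage class (option 2)."""

    def __init__(
        self,
        uri: str,
        ncvar: str,  # required: missing-data attributes are per variable
        read_attributes: Callable[[str, str], Dict[str, Any]] = read_attributes_with_netcdf4,
    ):
        self.uri = uri
        self.ncvar = ncvar
        attrs = read_attributes(uri, ncvar)
        # Keep only the missing-data attributes; no Dataset is retained.
        self.missing = {k: attrs[k] for k in MISSING_KEYS if k in attrs}
```

Because the attributes are looked up by variable name, `ncvar` has to be a positional argument here rather than an optional keyword.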

@valeriupredoi
Collaborator

valeriupredoi commented Oct 24, 2022

Unless the Active Storage device makes the metadata available for reading locally (in some way), I reckon the best way is to have just the metadata passed to the client and loaded outside the active call, since it's needed for both the active and passive cases. Note that we will need the metadata not only for missing/fill values, but also for such magnificent things as cell measures, various other attributes, fixing units, fixing units of coordinates etc. That's a whole lot of metadata, so we should think of a mechanism for passing/loading/using it that is general enough to accommodate all of those.

@davidhassell
Collaborator

I would go for "2. when instantiating", and agree that ncvar would then be a required attribute. However, I would also allow missing data info to be optionally set at instantiation time - thereby saving opening and parsing the file if that information is already to hand (which it will be in cf-python)

V - what's the use case for passing other metadata (like cell measures) to the active storage? Perhaps I have misunderstood!
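The optional override suggested above could look roughly like the following. This is a sketch only: the class name `ActiveWithOverride`, the `missing` keyword, the `read_attributes` hook, and the `opened_file` flag are all hypothetical names chosen for illustration, not the project's API.

```python
from typing import Any, Callable, Dict, Optional


class ActiveWithOverride:
    """Hypothetical sketch: read missing-data info only if it was not supplied."""

    def __init__(
        self,
        uri: str,
        ncvar: str,
        read_attributes: Callable[[str, str], Dict[str, Any]],
        missing: Optional[Dict[str, Any]] = None,
    ):
        self.uri = uri
        self.ncvar = ncvar
        if missing is not None:
            # Caller (e.g. a library that has already parsed the file)
            # supplied the information, so the file is never opened here.
            self.missing = dict(missing)
            self.opened_file = False
        else:
            # Fall back to opening and parsing the file at instantiation.
            self.missing = read_attributes(uri, ncvar)
            self.opened_file = True
```

Leaving `missing` as `None` by default keeps the file-reading behaviour unchanged for callers that do not pass it.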

@valeriupredoi
Collaborator

valeriupredoi commented Oct 24, 2022

@davidhassell a use case: we need to compute a mean of a variable that is masked with a cell measure (eg areacella or areacello). We can't really get a reliable mean unless the data is masked first and the statistic computed afterwards, since otherwise the info the mask carries is destroyed. In the same vein, data that has incorrect units first needs to be fixed (eg apply a fixed factor to it to bring it to correct units), and only then can a computation be done on it

@davidhassell
Collaborator

Hi V - I think that use case is out of scope, as we can't use active storage to do the work unless it's the first operation in the stack, and something like x = where(cell_measure < 1e6, np.ma.masked, x) is definitely an operation ...
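A small worked example of why the ordering matters, using made-up numbers and plain Python lists rather than any real data. The mask condition mirrors the `cell_measure < 1e6` test quoted above. The mean of the unmasked data cannot be turned into the masked mean after the fact, so the masking step has to come before the reduction.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


# Illustrative values only.
x = [1.0, 2.0, 3.0, 10.0]
cell_measure = [2e6, 2e6, 2e6, 5e5]

# Mask (drop) every element whose cell measure is below the threshold.
kept = [v for v, c in zip(x, cell_measure) if not c < 1e6]

unmasked_mean = mean(x)   # 4.0, includes the element that should be masked
masked_mean = mean(kept)  # 2.0, computed after masking
```

If the reduction is pushed down to storage, only `unmasked_mean` would come back, which is why a mask-then-reduce stack does not have the reduction as its first operation.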

@bnlawrence
Collaborator Author

Ok, so we have a consensus on 2., but I am not sure how to handle the "allow missing data" option, as there are a lot of options just for missing data alone, let alone filters and compression. Would we assume that if any keyword attributes were present then all the keywords had been set appropriately? For the moment I'm going to ignore this, we can put that in a future version, since by default that'd preserve backwards compatibility.
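To give a sense of the options for missing data alone, here is a sketch of a scalar test built on the standard netCDF attribute names `_FillValue`, `missing_value`, `valid_min` and `valid_max`. The function name and signature are hypothetical, and it assumes scalar attribute values (it does not handle `valid_range` or vector-valued `missing_value`).

```python
def is_missing(value, fill_value=None, missing_value=None,
               valid_min=None, valid_max=None):
    """Return True if `value` counts as missing under any supplied attribute.

    Every keyword defaults to None, meaning "attribute not present".
    """
    if fill_value is not None and value == fill_value:
        return True
    if missing_value is not None and value == missing_value:
        return True
    if valid_min is not None and value < valid_min:
        return True
    if valid_max is not None and value > valid_max:
        return True
    return False
```

With all keywords defaulting to `None`, a caller that passes none of them gets the "nothing is missing" behaviour.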

@bnlawrence
Collaborator Author

Actually, I'm wrong: since we get the compression and filter info from the zarr metadata, we're just left with the missing stuff, which is well posed ... so I'll put that in now.
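For context, a sketch of where compression and filter settings sit in Zarr metadata, assuming the Zarr v2 `.zarray` JSON layout with its `compressor` and `filters` keys. The example document and the helper name are made up for illustration, and the project may well read this information through a library rather than parsing JSON directly.

```python
import json

# Illustrative .zarray content in the Zarr v2 layout (values are invented).
ZARRAY_EXAMPLE = """
{
  "chunks": [10, 10],
  "compressor": {"id": "zlib", "level": 4},
  "dtype": "<f8",
  "fill_value": null,
  "filters": [{"id": "shuffle", "elementsize": 8}],
  "order": "C",
  "shape": [100, 100],
  "zarr_format": 2
}
"""


def compression_and_filters(zarray_text):
    """Return the (compressor, filters) entries from .zarray JSON text."""
    meta = json.loads(zarray_text)
    return meta.get("compressor"), meta.get("filters")


compressor, filters = compression_and_filters(ZARRAY_EXAMPLE)
```

Neither entry says anything about `missing_value` or the valid range, which matches the point above that only the missing-data information still has to come from the netCDF attributes.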

@valeriupredoi
Collaborator

For the moment I'm going to ignore this, we can put that in a future version, since by default that'd preserve backwards compatibility.

yeah my thought too about backwards compatibility - hence me wondering about a design scheme to have all this mostly preserved when we start doing more complex stuff 🍺

@bnlawrence bnlawrence self-assigned this Oct 27, 2022
@bnlawrence bnlawrence modified the milestones: Post-Prototype, Prototype Oct 27, 2022
3 participants