Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update metadata API #53

Merged
merged 10 commits into from
Oct 3, 2022
Merged

Update metadata API #53

merged 10 commits into from
Oct 3, 2022

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Sep 28, 2022

Fixes #52

In this PR I implement conclusions from #52 discussion.

Currently I implemented dropping of default methods. The question is - do we also want to drop metadatakeys and colmetadatakeys. I kept them returning () by default for now. This is to make it easier to write generic functions (i.e. i check if there are some metadata keys, and if some type does not support metadata I get an empty tuple so I know there are no metadata without having to catch an error)

@codecov
Copy link

codecov bot commented Sep 28, 2022

Codecov Report

Base: 96.55% // Head: 95.34% // Decreases project coverage by -1.20% ⚠️

Coverage data is based on head (1dd4887) compared to base (32ef840).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #53      +/-   ##
==========================================
- Coverage   96.55%   95.34%   -1.21%     
==========================================
  Files           1        1              
  Lines          58       43      -15     
==========================================
- Hits           56       41      -15     
  Misses          2        2              
Impacted Files Coverage Δ
src/DataAPI.jl 95.34% <100.00%> (-1.21%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@bkamins
Copy link
Member Author

bkamins commented Sep 28, 2022

I also changed the requirement from specific exception types to just throwing errors in the docstrings.

@bkamins
Copy link
Member Author

bkamins commented Sep 28, 2022

@ExpandingMan + @nalimilan

I would propose to try to keep the momentum in this discussion (of course respecting your other duties 😄). I think that it is pretty clear what we need to decide:

  • do we additionally add default kwarg now or wait till users need it (I am OK with either approach - but I agree with what @nalimilan proposed that we can define it later if users need it after we learn how people use metadata; I already thought how to do it with kwarg)
  • do we additionally add metadata and colmetadata methods that would return full information on metadata now or wait till users need it (I am OK with either approach; the comment is the same as above)
  • do we want to define metadatakeys and colmetadatakeys by default (with returning ()) or we want to throw MethodError if metadata is not defined (I would keep returning ())

The issue of column consisting of several chunks should be resolved in Parquet2.jl I think (as I have commented in #52).

Thank you!

@ExpandingMan
Copy link

I also commented in #52 .

In response to your questions above (respectively):

  • I don't really want to remove the default methods in current Parquet2.jl master, but otherwise I don't feel any great urgency on this.
  • These seem somewhat more important, because it is likely to commonly be the case that there are underlying metadata AbstractDict objects and without a method to return the whole thing, we are mandating that these be copied rather than referenced. It seems silly to have the objects with all the behavior we want just sitting there and not expose them.
  • Yes, the more I think about it the more convinced I become that we should remove all methods that don't have any meaningful fallback behavior. In particular I think we should remove the metadatakeys and colmetadatakeys methods since it is ambiguous whether object metadata is invalid (should be an error) or simply empty in a particular case.

So, in summary, yes, this PR looks good to me.

@bkamins
Copy link
Member Author

bkamins commented Sep 29, 2022

I don't really want to remove the default methods in current Parquet2.jl master

Parquet2.jl can have more definitions than DataAPI.jl - this is OK. Let us just decide if default should be positional or kwarg as commended in #52, so that when we later add something to DataAPI.jl there is less friction.

It seems silly to have the objects with all the behavior we want just sitting there and not expose them.

Then I would for now define metadata(obj) and colmetadata(obj, col) for now (without style kwarg) and require them to return an AbstractDict with values that are (value, style) and add the comment in the docstring that they are indented as low-level API. The reason for not supporting the style kwarg is simplicity. Note that in DataFrames.jl we would need to always return a copy for performance reasons (so this is a different case than Parquet2.jl).

I think we should remove the metadatakeys and colmetadatakeys methods since it is ambiguous whether object metadata is invalid (should be an error) or simply empty

@nalimilan - what do you think. My original thinking was that every object supports metadata, but just for some objects there is no way to set metadata with e.g. metadata! so such objects just always have an empty set of metadata keys. But we can switch to the definition proposed by @ExpandingMan that some objects do not support metadata (and that this is different from having no metadata).

@Tokazama - can you please also comment on this point, as you are creating a more general metadata mechanisms.

@Tokazama
Copy link

It would be preferable to have metadata(obj, key, default) instead of a keyword argument. In my experience with LoopVectorization, we found some instances where managing keywords would do some funny stuff to performance sensitive code.

Then I would for now define metadata(obj) and colmetadata(obj, col) for now (without style kwarg) and require them to return an AbstractDict with values that are (value, style) and add the comment in the docstring that they are indented as low-level API. The reason for not supporting the style kwarg is simplicity. Note that in DataFrames.jl we would need to always return a copy for performance reasons (so this is a different case than Parquet2.jl).

This is at the heart of why I'm hesitant to just return the underlying metadata storage type. There are going to be different approaches for how style is stored. We currently only require style can be returned as part of a tuple and that :default is somehow supported. There are instances where the style is the same for each value and you may just want to have something that stores that alongside your dictionary. I'm also approaching some of this with metadata backed by NamedTuple and subtypes of MetaStyle so that the current API is met doing (md[key], MetaStyle(md[key]).

@bkamins
Copy link
Member Author

bkamins commented Sep 29, 2022

@Tokazama - so what would be your recommendation? I understand that:

  • make default a positional argument (like in Parquet2.jl now) - I accept your reasoning and we can agree to have it this way (we can add it to DataAPI.jl at any time, but Parquet2.jl can already add it now)
  • do not provide metadata(obj) and colmetadata(obj, col) in DataAPI.jl as different types can internally support storing of metadata in different ways (Parquet2.jl can define it for its own purposes as it does not conflict with DataAPI.jl but probably should document this as an extension that potentially might change in the future - though this is unlikely that we will have a conflict there)

@Tokazama - and what do you think about the default behavior for metadatakeys? should it be MethodError or () should be returned?

@Tokazama
Copy link

Tokazama commented Sep 29, 2022

  • do not provide metadata(obj) and colmetadata(obj, col) in DataAPI.jl as different types can internally support storing of metadata in different ways (Parquet2.jl can define it for its own purposes as it does not conflict with DataAPI.jl but probably should document this as an extension that potentially might change in the future - though this is unlikely that we will have a conflict there)

I can make metadata(obj) work (with a little work on top of PropertyDicts.jl). I just wanted to point out some reasons why we may not have decided to do this originally. Even this seemingly innocuous solution could have some sharp edges.

@Tokazama - and what do you think about the default behavior for metadatakeys? should it be MethodError or () should be returned?

I can see how there's a bit of awkwardness around defining default keys, especially when we don't have default metadata elsewhere. It does leave us without any generic way of checking if an object has metadata though.

@Tokazama
Copy link

Having something like metadata(obj) would solve #51, so perhaps that's the way to go.

I think it would be worth having MetadataStyle and MetadataDefault <: MetadataStyle here at some point, but I've been trying to work on some compelling examples before fully proposing that.

@nalimilan
Copy link
Member

I think we should remove the metadatakeys and colmetadatakeys methods since it is ambiguous whether object metadata is invalid (should be an error) or simply empty

@nalimilan - what do you think. My original thinking was that every object supports metadata, but just for some objects there is no way to set metadata with e.g. metadata! so such objects just always have an empty set of metadata keys. But we can switch to the definition proposed by @ExpandingMan that some objects do not support metadata (and that this is different from having no metadata).

I agree it's useful to have fallbacks returning (), as otherwise as @Tokazama noted there's no way to check whether an object contains metadata (other than try... catch, which shouldn't be used for this). If we dropped these we would have to add hasmetadata returning a Boolean. We discussed this when adding these functions. Another option I mentioned was to have the fallbacks return nothing. But it's not clear what's the point of being able to differentiate objects that don't support metadata from objects that support it but don't have any.

BTW I realize we don't provide a way to check whether an object supports metadata!. Maybe there should be one?

I also agree that we'd better add the default argument now (either positional or keyword) if Parquet implements it, to ensure API consistency.

src/DataAPI.jl Outdated Show resolved Hide resolved
src/DataAPI.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Sep 29, 2022

BTW I realize we don't provide a way to check whether an object supports metadata!.

This was intentional. I did not see a use case of querying it. I.e. in what situation would you want to add metadata without knowing if some object supports it?

I will add default as a positional argument as @Tokazama proposed.

Co-authored-by: Milan Bouchet-Valat <[email protected]>
@nalimilan
Copy link
Member

This was intentional. I did not see a use case of querying it. I.e. in what situation would you want to add metadata without knowing if some object supports it?

I was thinking about some package which would accept any Tables.jl table and attach metadata to it if possible.

@bkamins
Copy link
Member Author

bkamins commented Sep 29, 2022

I was thinking about some package which would accept any Tables.jl table and attach metadata to it if possible.

But what would be the use of it if after the operation you would not be sure if metadata was added?

But maybe it makes sense. How would you call these functions ismetadatamutable and iscolmetadatamutable?

PS. I have added default.

@nalimilan
Copy link
Member

But what would be the use of it if after the operation you would not be sure if metadata was added?

I mean that the object would be returned to the caller, which would presumably choose a type that supports metadata if they care about it. For example the package could set column labels or units, which are nice but not strictly necessary to use the data.

But maybe it makes sense. How would you call these functions ismetadatamutable and iscolmetadatamutable?

Yeah, that's a reasonable choice.

@bkamins
Copy link
Member Author

bkamins commented Sep 30, 2022

What I did:

  1. add metadatasuport and colmetadatasupport that return a named tuple indicating if metadata is supported for a given type for reading and writing (two fields); this is done for type, not for value as I think we should have a requirement that this information should be possible to be resolved statically during compile time; @Tokazama - in your use case (if I understand your approach correctly this means that once you load a package that allows setting and getting metadata for some type you need to register this type by adding methods for one or both of these functions)
  2. I removed metadatakeys and colmetadatakeys implementations - by default they error now (as @ExpandingMan proposed); this is OK as we add the *support methods
  3. @ExpandingMan - I have thought if we need metadata(obj) exposing a dict for performance reasons. I think we do not need it for performance reasons as metadatakeys(obj) should have almost zero cost, and later users should query metadata with metadata(obj, key). (of course you can add these to Parque2.jl if you want but I think using the API without this convenience function should not lead to performance regressions, and as @Tokazama noted sometimes requiring to return such an object might be problematic)

I think that after this change (and comments) all that was expected is changed (but please comment of course, as usual maybe function naming will need discussion).

The implemented changes are slightly breaking (as we stop returning () for metadatakeys and colmetadatakeys and instead throw an error, but I think it is acceptable; all people using metadata are in this thread + I will add an appropriate note in release notes)

@Tokazama
Copy link

if I understand your approach correctly this means that once you load a package that allows setting and getting metadata for some type you need to register this type by adding methods for one or both of these functions

Yeah, I think that should suffice.

I've looked through the changes and they seem okay, but I haven't had time to meditate on them much. It's difficult to predict the implications of each decision, so the work put in on your end is definitely appreciated.

@bkamins
Copy link
Member Author

bkamins commented Sep 30, 2022

It's difficult to predict the implications of each decision

Agreed. That is why initially we wanted with @nalimilan to define a minimal set of functions to give us most flexibility later.
Maybe if @ExpandingMan comments that the changes are also OK then this is a good sign (as we have a real test in Parquet2.jl where it is tested).

src/DataAPI.jl Outdated Show resolved Hide resolved
src/DataAPI.jl Outdated Show resolved Hide resolved
src/DataAPI.jl Outdated Show resolved Hide resolved
src/DataAPI.jl Show resolved Hide resolved
src/DataAPI.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <[email protected]>
src/DataAPI.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <[email protected]>
@bkamins
Copy link
Member Author

bkamins commented Sep 30, 2022

@Tokazama + @ExpandingMan - is it OK to merge this PR as-is?
(an this would be a released API that hopefully we would not change in DataAPI.jl soon)

@ExpandingMan
Copy link

This looks good to me, it seems to improve on most of what I commented on in my original issue.. In general, the only thing I'd ever get worried about are changes that for whatever reason lock us out of adding methods in the future, and I don't think any of this does that.

Note also that I have not tagged Parquet2.jl yet so that'll give us a chance to run tests on that again.

@Tokazama
Copy link

Tokazama commented Oct 1, 2022

Is the default MethodError the correct way to handle errors when we have an object without any metadata that could otherwise have some? In other words, if we call metadata(df::DataFrame, key) and df has nothing for its metadata, should we be throwing a KeyError(key) or MethodError(metadata, (df, key))?

@ExpandingMan
Copy link

The method error is not explicitly thrown, it is only thrown if you call a function on an object it doesn't have a method for. Therefore you won't get this with DataFrame, you'll get a KeyError. The problem with before this PR was that you would get ArgumentError even on objects that did not implement any metadata methods which not only makes no sense but can lead to method ambiguities.

@Tokazama
Copy link

Tokazama commented Oct 1, 2022

The method error is not explicitly thrown, it is only thrown if you call a function on an object it doesn't have a method for. Therefore you won't get this with DataFrame, you'll get a KeyError. The problem with before this PR was that you would get ArgumentError even on objects that did not implement any metadata methods which not only makes no sense but can lead to method ambiguities.

I understand functionally what is happening here. I'm just wondering if there's a correct error to produce on my end. If there is no metadata collection attached (but it could have metadata attached), should it still be a KeyError? It's not terribly important. I'm just trying to do this correctly.

@bkamins
Copy link
Member Author

bkamins commented Oct 1, 2022

I'm just wondering if there's a correct error to produce on my end.

The current state of this PR says that we should throw an error in this case without specifying error type.
Different packages can use different strategies what exact error type to return if metadata is missing. KeyError is most natural most of the cases. MethodError should only be returned if some object does not support metadata.

However, for example in DataFrames.jl with @nalimilan we decided to throw ArgumentError instead of KeyError if metadata key is missing. (although my initial implementation was throwing KeyError). The reason is easier error handling as we return ArgumentError consistently in other places in DataFrames.jl.

In general AFAICT we usually in Julia follow https://www.tensorflow.org/guide/versions rules for exceptions which, in particular state that changing the type of exception thrown is not breaking unless you documented the type of exception thrown (this is exactly now the docs do not specify exception type, but just state that exception is thrown).

@bkamins
Copy link
Member Author

bkamins commented Oct 1, 2022

Note also that I have not tagged Parquet2.jl yet so that'll give us a chance to run tests on that again.

Thank you. I think the only thing that needs adding is metadatasupport and colmetadatasupport functions in Parquet2.jl.
You define the convenience metadata(obj) and colmetadata(obj, col) that expose the dict, but I think it is not a problem (maybe just add a note in Parquet2.jl that this is an extension to the standard API from DataAPI.jl?)

@bkamins
Copy link
Member Author

bkamins commented Oct 1, 2022

@ExpandingMan - I have created an account 😄 and commented in https://gitlab.com/ExpandingMan/Parquet2.jl/-/issues/17 to keep track of todo things.

If there will be no additional comments on this PR I will merge it on Monday, Oct 3, 2022 and make a 1.12.0 release. OK?

@ExpandingMan
Copy link

I think the only thing that needs adding is metadatasupport and colmetadatasupport functions in Parquet2.jl.
You define the convenience metadata(obj) and colmetadata(obj, col) that expose the dict, but I think it is not a problem (maybe just add a note in Parquet2.jl that this is an extension to the standard API from DataAPI.jl?)

I should comment on that as that may be the one thing that I don't feel is addressed in this PR. In the case of Parquet2.jl, exposing those methods is the only way to get the full metadata dict without copying it. I'm not sure I followed all of the above, but I still think there are likely to be a large number of cases like Parquet2.jl in which the metadata dict object we want is already "just sitting around" and not exposing a method to get it forces it to be copied.

I don't think this needs to be addressed now as long as the door is open, but thought I'd mention it.

@Tokazama
Copy link

Tokazama commented Oct 1, 2022

I don't think this needs to be addressed now as long as the door is open, but thought I'd mention it.

Yeah, I don't think it's something set in stone. We're just trying to move forward conservatively so that we get work done now without requiring huge breaking changes down the road.

@bkamins
Copy link
Member Author

bkamins commented Oct 1, 2022

Exactly - I think these methods in the future can be added to the main DataAPI.jl specification, but since this is only a performance optimization issue it can be left for later. I would just recommend to add a documentation of these two extra methods in Parquet2.jl.

src/DataAPI.jl Show resolved Hide resolved
@bkamins bkamins merged commit ca91305 into main Oct 3, 2022
@bkamins bkamins deleted the bk/update_metadata branch October 3, 2022 07:04
@bkamins
Copy link
Member Author

bkamins commented Oct 3, 2022

Thank you! I will release DataAPI.jl 1.12.0 soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

a few concerns about metadata methods
4 participants