-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define the list of codecs in the v3 spec #312
Conversation
Does |
I cannot answer this question. For context, my personal opinion is that it's not a great idea to have a mandatory list of codecs, for exactly the reasons you list. But in #293 I was in the clear minority -- everyone else participating in the discussion preferred a strictly mandated list of codecs. The core problem is that the current v3 spec is incoherent w.r.t codecs, and we need to make it coherent. This PR represents the solution that the majority of people supported in #293. |
I see. I wasn't aware of that discussion. I would like to also caution against defining a spec based on the features offered by the python implementations of a particular codec. For example, I noted here that the |
(I think we could salvage this PR by changing some "MUST support" statements to "SHOULD support") |
This is in line with my thoughts in #293:
and @normanrz:
|
I think this is ready for review. I removed the requirement that zarr implementations MUST support the core codecs. instead, they SHOULD support them. |
@zarr-developers/steering-council could someone have a look at this? |
@zarr-developers/steering-council I really would appreciate some feedback on this PR. It has been ready for review for 3 weeks with no response. What is the blocker here? For context, in |
docs/v3/core/v3.0.rst
Outdated
developers and Zarr users deem the codec worth removing, e.g. because of a technical flaw in the | ||
algorithm underlying the codec. | ||
|
||
Community codecs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "Community" used elsewhere? Maybe "Extension", which I think is used in other places ("extension data types" at least).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forget how I picked the word "community"... I don't think it's a term of art in the spec. If "extension" is a better fit I'm happy to switch to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I favor "extension" over community because it fits with the existing Zarr extension framework.
docs/v3/core/v3.0.rst
Outdated
---------------- | ||
|
||
Zarr implementations MAY support a codec that is not in the list of core codecs | ||
(hereafter termed a "community codec"), provided the community codec does not use an identifier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth namespacing codec identifiers somehow? I'm slightly against, since ideally codecs can graduate from "community" (or "extension") to "core" (after careful consideration). It'd be nice to avoid needing to rewrite metadata files when that happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do any namespacing of the identifiers, we should not use any information about whether the codec is core or community / extension, for the reasons you note. Otherwise I don't have a preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what is written here is fine. Namespacing codecs sounds complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this great. I support the change "community codec" -> "extension codec"
docs/v3/core/v3.0.rst
Outdated
developers and Zarr users deem the codec worth removing, e.g. because of a technical flaw in the | ||
algorithm underlying the codec. | ||
|
||
Community codecs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I favor "extension" over community because it fits with the existing Zarr extension framework.
docs/v3/core/v3.0.rst
Outdated
---------------- | ||
|
||
Zarr implementations MAY support a codec that is not in the list of core codecs | ||
(hereafter termed a "community codec"), provided the community codec does not use an identifier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what is written here is fine. Namespacing codecs sounds complicated.
Co-authored-by: Ryan Abernathey <[email protected]>
@d-v-b did you decide to stick with "community" over "extension"? I weakly favor "extension," but I don't want to block this PR on that point. |
i'm happy to switch over to "extension" but i haven't changed the text yet |
"community" is out, "extension" is in. |
Few comments:
|
This PR is deliberately non-normative about the content of the extension codecs. The intention is that the maintenance of the extension codecs should fall to their authors. If spec maintainers must enforce a recommendation that extension codec authors avoid breaking changes to their codecs, then we are making spec maintainers assume a maintenance role for those codecs, which is an outcome we want to avoid. In fact, the entire point of this PR is to provide a path for members of the community to publish extension codecs with as little work from spec maintainers as possible. This is the spirit of the following text from the PR:
Also, your requested recommendation is not so simple. We cannot simply recommend that codec developers avoid breaking changes to their codecs -- a codec developer can use semver in the configuration for their codec, and they can define in their codec specification how implementations of that codec should handle breaking changes. It is exactly because this matter is complicated that I do not wish to comment on it in this PR, hence my use of limited language above. If we are to make recommendations to codec developers (and managing versioning is just one of many possible recommendations), it should not be in the zarr v3 specification but rather a "best practices for codec development" document hosted elsewhere. I would be happy to review such a document if anyone wants to open a PR to add one. |
Co-authored-by: David Stansby <[email protected]>
The goal of this PR is not to carefully map the space of codecs. We should do that somewhere else. The goal of this PR is to clarify the language of the spec in a useful way for codec development, in particular solving two acute, practical problems:
So, for the purposes of this PR, there are only two classes of codecs under consideration: those listed in the spec ("core codecs"), and those not listed in the spec ("extension codecs"). For the codecs currently listed in the spec (core codecs), this PR addresses the following question: SHOULD implementations support those codecs, or MUST they support those codecs? The answer I have reached in this PR is a decisive SHOULD. Suppose someone wants to write a Zarr implementation in a language X that doesn't have any bindings to blosc. If the spec requires that all Zarr implementations MUST support the core codecs, then language X cannot have a valid Zarr implementation until blosc bindings are added, and this is an unfair burden on members of the language X ecosystem. Crucially, this is not a hypothetical scenario. Because the alternative would cause serious problems, this PR alters the spec to state that the core codecs SHOULD be supported by Zarr implementations. For codecs not listed in the spec (extension codecs), this PR addresses the following question: what is the relationship between extension codecs and core codecs? (answer: they have to use a different name from the core codecs). This PR also defines a service for zarr community members, wherein the zarr spec developers will host their codec. But, as stated in the PR, being listed doesn't confer any special status to an extension codec. This PR represents my attempt to make the minimal amount of changes to resolve serious ambiguity in the spec document which is a barrier for codec development. If there are proposed changes to this PR that can better solve the problems I am trying to solve then please suggest them.
I think the changes I am proposing in this PR could work for data types as well. I would use the same two classes, with the same rules for members of each class. |
Aside from my comments above, I wanted to say that I am 👍 this change overall, and thanks a bunch for opening a concrete proposal to push this forward. |
@d-v-b, though I appreciate your immediate goals, since this touches on our shared understanding of terminology it's unfortunately bigger. I was hoping the table would help since I believe a source of conflict here is that the definitions have shifted for some of us: we're not speaking the same language. How would you propose to work towards consensus here? |
I don't know what the disagreement or conflict is -- what changes do you specifically want to see in this PR? We have discussed adding recommendations to codec developers, but since we disagree on the content of that recommendation I think it's out of scope for this PR, and should probably be contained in a separate PR; I'd be happy to continue that particular conversation in such a PR. |
The fact that we disagree means that we will likely be having this conversation in any follow up PR, so unfortunately, I'd say it's in scope until we at least agree on the definition of what's in this PR. Questions I have are:
|
no, pending feedback of course.
These questions are important but they are out of scope for this PR. As stated in the text of this PR, an extension codec is any codec that is not listed in the list of core codecs, and any extension codec may be submitted for publication on the zarr-specs repo. I have labored to keep this PR simple, but it feels to me that you are broadly suggesting that I make this PR more complex. Given that nothing in this PR precludes additional efforts to refine our definition of extension codecs, I would ask that we put those efforts and that conversation in a separate venue, e.g. an issue or a PR. |
I'm sorry, I disagree. Since this PR is defining these terms, it makes agreeing on that definition important, and will impact those future PRs.
No. Apologies. I'm striving to find an agreed understanding of these things you are defining in admittedly nicely simple terms. That's critical since as I described to you on the call and above I feel less than comfortable with the statement "any extension codec maybe be submitted" when the definition of extension is class D above (as an example). |
"codecs in the spec" and "codecs not in the spec" are concepts that exist in the spec already -- I am merely giving these concepts names in my PR.
Why does it make you uncomfortable? |
Rather than reiterate my points from above, I'll add that @rabernat's suggestion of a maturity level from https://stac-extensions.github.io/ lessened the nervous. So perhaps the change I'd consider here is how do we not close the door on such an addition in a follow-up PR? Perhaps as with @dstansby's suggestion, we try to say less rather than trying to work out the definitions at this point. |
What specifically in this PR closes the door on follow-up PRs? |
In my reading, it's defining extension points in the most permissive way when I would define them more restrictively. If we can agree on text to basically say "there are codecs not in the spec" then we can decide on this maturity categorization later. |
This is exactly the kind of thing that can be changed in a follow-up PR, no? |
Maybe it's best if we focus on a real example, because that's what this PR is trying to solve: we are actively porting zarr-python v2 codecs into zarr-python v3. To avoid fragmentation of the zarr ecosystem, we should put the specifications for those codecs somewhere outside of zarr-python to ensure that non-python implementations can support them. This PR is an attempt to solve that problem. Do you have a better solution? Do you think this PR fails at solving that problem? |
This may be too far in the other direction but this is something I feel more comfortable with: diff --git a/docs/v3/core/v3.0.rst b/docs/v3/core/v3.0.rst
index 0ab3cf2c..e944d63e 100644
--- a/docs/v3/core/v3.0.rst
+++ b/docs/v3/core/v3.0.rst
@@ -1233,19 +1233,18 @@ An existing core codec SHOULD be removed from the list when a sufficient number
developers and Zarr users deem the codec worth removing, e.g. because of a technical flaw in the
algorithm underlying the codec.
-Extension codecs
-----------------
+Other codecs
+------------
Zarr implementations MAY support a codec that is not in the list of core codecs
-(hereafter termed an "extension codec"), provided the extension codec does not use an identifier
+provided the extension codec does not use an identifier
that is already used by a core codec; the identifiers of core codecs are reserved.
-
-This specification places no other constraints on extension codecs. It is possible, through discouraged,
+This specification places no other constraints on these codecs. It is possible, through discouraged,
that separate developers may define distinct codecs that use the same identifier.
To minimize the impact of such name collisions, codec developers are strongly encouraged
-to publish their codec specifications as additions to the "extension codecs" section of Zarr v3 specification.
-Publication in the "extension codecs" section does not confer primacy or an official designation to a codec.
-The list of extension codecs exists expressly as a tool to enable coordinated codec development.
+to register their codec names in the XXX section of Zarr v3 specification.
+Listing a codec in this way exists expressly as a tool to enable coordinated codec development
+and does not confer primacy or an official designation to a codec.
Stores
======
I note that I was also not sure where these codecs should be listed and so left a XXX. One thing that occurs to me is how to prevent future collisions:
|
thanks for the suggestions. I think the name "other codecs" is a step down from "extension codecs", but I'm happy to listen to what other people think, in particular those who preferred "extension codecs" over "community codecs" (@TomAugspurger @rabernat).
I think whenever we decide on a concrete location for the non-core codecs we can use a hyperlink to point to that location. So IMO it's fine if the text says "the section for the spec".
It would be helpful if you could explain in more detail what collisions you are referring to. Name collisions between core codecs and non-core codecs are impossible because the names of core codecs are reserved. Name collisions between non-core codecs are totally possible but discouraged (as per common sense).
Beyond excluding names that violate github TOS, managing incoming names seems too open-ended. if we attempt to curate the list of non-core codecs in such a way that we try to avoid name collisions between non-core codecs, then we encourage name squatting, e.g. where one implementation of
I am guessing here that you are describing a scenario in which a non-core codec is made a core codec, and this results in the name of that codec becoming exclusive, which thereby invalidates any other non-core codecs that were currently using that name. I see a few options:
But before we get into these issues, I want to ask: does this PR prevent us from solving these problems in a follow-up PR?. If so, what changes should we make to this PR so that we can deal with these more complex issues in a follow-up PR? If not, can we have this discussion in a follow-up PR? Because the name collision stuff is entirely hypothetical at this point. For the problems we have today, I think the content of this PR is necessary and sufficient. |
Agreed, but I was trying to leave us open to introduce either or both in the follow up.
Just that we might prepare that location in this PR if we're saying to use it.
One collision scenario:
Trying to do that would worry me, too. (This is likely the reason for the URI rule in the previous version.)
Yes exactly, thanks.
💯
This could invalidate data, no?
i.e. pinned to a major version. Certainly lowers the benefit of the list, and might put the SHOULD in question.
This would work for the list, but there can still be codecs not in the list (back to data invalidation).
Ha! I imagine there will be reasons not to, but I like this as a thought experiment.
I think this diff introduces the issue -Each codec must be defined via a separate
-specification. In order to refer to codecs in array metadata
-documents, each codec must have a unique identifier, which is a URI
-that dereferences to a human-readable specification of the codec.
If you want to roll back the URI change, then we can deal with that later as well. Another option would be to go to an integration branch and we can collaborate on the overall fix.
I sincerely and wholeheartedly disagree and I will continue to say this widely. Spec development's impacts are too great to make changes to the main version of this document without considering those hypotheticals. It is exactly the job of zarr-spec core-devs. |
-Each codec must be defined via a separate
-specification. In order to refer to codecs in array metadata
-documents, each codec must have a unique identifier, which is a URI
-that dereferences to a human-readable specification of the codec. I removed the codec URI statement because that text conflicts with the actual definitions of the codecs we have today, and the actual practice of codec development.
I am happy to consider changes to the use of URIs in the codec definition, but they have to be consistent with how we are doing things today, unless you are interested in breaking changes to all of our codecs. |
I'm confused -- if you agree that "other codecs" is a worse name than "extension codecs", then why propose it here? |
This quote omits a crucial part of my statement, which was that we should address the hypothetical issues in a separate PR. I will not say anything more about this. |
I agree, but also with the spec. That's why there's no easy answer.
It was an attempt to keep the names unused for following changes.
Except that every commit to this repository is for all intents and purposes a release, so (again a hypothetical) someone could read the spec and start reading in that interim period. |
This point of view really resonates with me. As we get closer to the Zarr Python 3 release, it's apparent to me that we currently have a choice between either:
This is forcing us to come face-to-face with problems and inconsistencies in the V3 spec that have been present for a long time. IMO, the root of these problems is the flawed notion that we could somehow write a good spec prior to actually attempting to implement it. But now that we are implementing it, the spec is already frozen and unable to accommodate changes needed to enable these features. Codecs are the prime example: where do the they belong? How do we add new ones? Does this require ZEPs? The spec is inconsistent and incomplete on this topic, and I'm grateful to @d-v-b for trying to find solutions. If we accept that the V3 spec is basically immutable, here's the alternative: rather than trying to list the codecs in the spec, we return to the original spirit of the V3 spec, which is to liberally enable extensibility. On the topic of codecs, it says:
To me, this pairs extremely well with the idea of community-developed extensions. All codecs (except The way we would move forward with this route would be as follows:
This seems especially appropriate since we clearly can't find consensus about which codecs belong in a list of "core" codecs. We need to find a way to move forward with development without such a consensus. Until something like that happens, Zarr Python will have to continue with the process of implementing things that are out of spec in order to reach feature parity with V2. (Perhaps we could issue warnings in these cases?) The hope is that we can retroactively adopt the extension approach to bring these features in line with some kind of spec. How does this sound to people? |
I am going to close this PR. After #293 I believed that there was broad consensus on two points:
Beyond the ongoing harm caused by a specification document that is half-baked, there is also an immediate need for guidance as we develop new codecs for zarr v3 (a process that is encouraged by language in the spec). Thus I attempted to implement the conclusions of #293 in this PR, and with a few exceptions this PR received broad support. I took great care to ensure that my changes were as minimal as possible, and I adapted the text to helpful feedback. With one notable exception, these changes were approved by the members of this discussion. However, it has since become clear that there is no clear path forward for this PR, and perhaps for any PR against the spec. I don't think it's worth the effort to continue here.
@joshmoore if this is something you believe, then I think you should feel great urgency to fix the incoherent parts of the spec. The spec has been broken for over a year. That is the cause of much more harm than anything in this PR. |
Davis, I'm sorry for the frustration this conversation has caused. I really appreciate your efforts, I'm committed to working with you to find our way out of this situation so we can keep development moving forward. |
I appreciate and agree with your analysis and I like framing it as a call for urgency. For what it's worth I did invest substantial time in this PR because I support what it's trying to do. I found myself, however, running into root causes which I then tried to resolve, recursively.
I hope you can appreciate that I was also frustrated at trying to unravel a year-long ball of yarn with the pressure of this concrete change. That's not your fault, but I was also only realizing it in real-time, surprised to find the state of things. My urgency, then will be directed at those root causes, largely in reverse order, and I'd encourage those who are still motivated to tackle those issues so we can more efficiently tackle the many remaining issues. |
Follow-up from #293
The way the v3 spec describes codecs today is inconsistent with the needs of a large number of users and implementers. Specifically, the spec says that there is no formal list of official codecs, but a number of people believe that a formal list of official codecs is either the true intention of the spec, or simply a much better specification than the alternative.
Changes in this PR:
MUSTSHOULD support core codecsHappy to take feedback, in particular on the styling. I'm bad at rst.
I think it's worth considering the application of the same "core" vs "community" distinction for other extension points, e.g. data types.