[DISCUSS] Reducing cadence of major arrow-rs releases introducing patch releases #5368

Closed
alamb opened this issue Feb 6, 2024 · 46 comments
Labels
arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@alamb
Contributor

alamb commented Feb 6, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

As more people use arrow, the overall burden to users from frequent major releases is increasing. Furthermore, the pace of breaking API changes is decreasing, so the burden on maintainers to avoid breaking changes is decreasing

As the arrow crate becomes more widely used in the ecosystem by projects other than DataFusion and other early adopters, the frequent major releases cause several issues:

  1. Crates must match the major arrow versions. For example, if a crate uses DataFusion, that forces everything in the entire project to use exactly that version of arrow-rs.
  2. parquet and arrow releases are coupled, so releasing a version of parquet requires releasing a new version of arrow.

The major version bumps impose non-trivial overhead on user crates. Some crates like serde_arrow have implemented clever, though complex, workarounds like having feature flags for each arrow version (see the recent discussion with @chmp on serde_arrow chmp/serde_arrow#131).

Also, from what I can see many of the recent arrow-rs changes aren't really adding new APIs, they are more like filling in feature gaps and fixing bugs, which is also reflected in the slower pace of the last few releases.

Describe the solution you'd like
I propose we set a more regular major release cadence (e.g. every 3 months) and only do minor, compatible, releases between those releases.

This would absolutely require more maintainer effort, but at this stage in the project the effort may be more manageable as the APIs are in a pretty good place I think

Describe alternatives you've considered
I think there are various alternatives to trigger releases / what cadence. I don't have a hugely strong opinion in this matter

Additional context
At some point in the past we actually had fewer major releases -- see #1120

There was non-trivial process overhead so we (well, really I) abandoned it and went YOLO on major releases as there wasn't really any maintenance bandwidth to do anything else.

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Feb 6, 2024
@tustvold
Contributor

tustvold commented Feb 6, 2024

I have been taking a somewhat brute force approach to this, by simply reducing the cadence of releases. This isn't necessarily ideal, but it does somewhat avoid this issue. I think the key to making this work, is to devise a minimally intrusive process where we can still make breaking changes, but still maintain patch releases. I don't really know the best way to achieve this.

@Xuanwo
Member

Xuanwo commented Feb 7, 2024

I once proposed an idea named MSAV (minimum supported arrow version) on icelake-io/icelake#234.

Libraries relying on arrow should declare an MSAV and use >=47, rather than sticking to a specific version. Meanwhile, binaries should opt for a precise version of arrow to utilize.

Most libraries only use a part of arrow's public API, which is relatively stable. However, if there are changes, libraries can update their MSAV, allowing downstream users to be aware of and adapt to these changes.

Just so you know, I'm not sure if it's a good idea.

@alamb
Contributor Author

alamb commented Feb 7, 2024

I think the key to making this work, is to devise a minimally intrusive process where we can still make breaking changes, but still maintain patch releases. I don't really know the best way to achieve this.

Here are some options I can think of:

  1. Release all versions from main: (merge no breaking API changes to main until we are ready to make another major version, but otherwise keep the release cadence the same)
  2. Release minor versions from a maintenance branch (merge anything to main, and backport compatible changes to maintenance branch)

I think option 1 is the lowest maintainer overhead approach, but has the drawbacks of:

  1. Limits when breaking API changes can be merged
  2. We would have to come up with some criteria of "when would we merge such breaking changes"

@alamb
Contributor Author

alamb commented Feb 7, 2024

I once proposed an idea named MSAV (minimum supported arrow version) on icelake-io/icelake#234.

Thanks for the idea @Xuanwo -- it isn't clear to me how the MSAV approach would be different from having arrow-rs release minor versions (e.g. 47.1.0, 47.2.0, 47.3.0)

@Xuanwo
Member

Xuanwo commented Feb 7, 2024

it isn't clear to me how the MSAV approach would be different from having arrow-rs release minor versions

MSAV enables arrow to make breaking changes without requiring library developers to update their MSAV, provided the library remains unaffected.

For instance, the library sun only uses parquet v47. When Arrow releases version 48, the developer reviews the changelog and discovers that only arrow-array is affected. Therefore, they can continue using version >=47 without needing to upgrade.

And, yes. I think this approach is difficult to implement across the entire ecosystem.

@aljazerzen

I like this proposal. In essence, it means decoupling versions of Arrow the project and arrow-rs the crate.

An even more robust way of versioning would be to release a major version only every X months, and only if there were any breaking changes. Otherwise, only a minor version would be published.
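As a rough sketch of that rule (not an actual arrow-rs policy), the decision could be written as below. The names `Version`, `Release` and `next_release` are made up for illustration, and "the window is open" is an assumed stand-in for "X months have passed since the last major".

```rust
/// A (major, minor, patch) version triple. Hypothetical helper type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Version {
    pub major: u64,
    pub minor: u64,
    pub patch: u64,
}

/// The kind of release to cut next.
#[derive(Debug, PartialEq, Eq)]
pub enum Release {
    /// May contain breaking changes.
    Major(Version),
    /// Must contain only compatible changes; any pending breaking
    /// changes are held back until the next major window.
    Minor(Version),
}

/// Chooses the next release: a major bump only when the periodic window is
/// open AND breaking changes are actually queued, otherwise a minor bump.
pub fn next_release(current: Version, has_breaking: bool, window_open: bool) -> Release {
    if has_breaking && window_open {
        Release::Major(Version { major: current.major + 1, minor: 0, patch: 0 })
    } else {
        Release::Minor(Version { major: current.major, minor: current.minor + 1, patch: 0 })
    }
}
```

Under this sketch, an open window with nothing breaking queued still produces a minor release, which is the case that avoids needless major bumps.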


Regarding MSAV approach: I don't think this is sound advice to users of arrow-rs - it would lead to occasional breaking builds.

That's because if, for example, library sun uses parquet >= 47 and then Arrow releases a new version, builds of existing sun could start failing, since a semver requirement of >= 47 would also match 48. At this point, sun could release a new version that would state it depends on >=48, but the old version would be broken. Which means that all dependents of sun would also break.

MSRV works only because all new versions of Rust are guaranteed to be backward compatible - they never introduce breaking changes. This is reflected in the fact that Rust itself has been on major version 1.x for a long time now.
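To make the difference concrete, here is a simplified model of the two kinds of version requirement. It is only a sketch: `matches_at_least` and `matches_caret` are illustrative names, not cargo APIs, and pre-release versions are ignored.

```rust
/// A (major, minor, patch) version triple; tuples compare lexicographically.
pub type Version = (u64, u64, u64);

/// Open-ended requirement such as `>=47`: every later version matches,
/// including future majors that may contain breaking changes.
pub fn matches_at_least(min: Version, candidate: Version) -> bool {
    candidate >= min
}

/// Simplified model of cargo's default (caret) requirement, e.g. `47` or
/// `0.20`: the candidate must be at least `min` and must share the
/// left-most non-zero component with it.
pub fn matches_caret(min: Version, candidate: Version) -> bool {
    if candidate < min {
        return false;
    }
    match min {
        // 0.0.z: only that exact version is considered compatible.
        (0, 0, patch) => candidate == (0, 0, patch),
        // 0.y.z: the minor number acts as the breaking component.
        (0, minor, _) => candidate.0 == 0 && candidate.1 == minor,
        // x.y.z with x > 0: the major number is the breaking component.
        (major, _, _) => candidate.0 == major,
    }
}
```

With these definitions, `matches_at_least((47, 0, 0), (48, 0, 0))` is true while `matches_caret((47, 0, 0), (48, 0, 0))` is false, which is why an open-ended `>=47` can silently pick up a breaking release.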

@alamb
Contributor Author

alamb commented Mar 20, 2024

I like this proposal. In essence, it means decoupling versions of Arrow the project and arrow-rs the crate.

Thanks @aljazerzen -- can you be clearer about what proposal you are referring to?

@aljazerzen

aljazerzen commented Mar 20, 2024

I propose we set a more regular major release cadence (e.g. every 3 months) and only do minor, compatible, releases between those releases.

I mean the original one.


I found this issue because I'm running into the problem of managing transitive dependencies on arrow-rs. Because I depend on duckdb-rs and because it had not been updated to use arrow-rs 50, I cannot upgrade my dependency on arrow-rs, since I want compatible versions.

Other popular crates (such as serde or regex) don't cause this problem because they only publish minor or patch releases. If this was the case with arrow-rs, then duckdb-rs could depend on arrow-rs~=1.49, which would be compatible with a new version of arrow-rs 1.50.

Obviously arrow-rs cannot go back to version 1.x, but it could stop releasing new major versions that don't contain breaking changes or at least release them with less cadence.

It would mean much less toll on downstream maintainers.

@alamb
Contributor Author

alamb commented Mar 20, 2024

Obviously arrow-rs cannot go back to version 1.x, but it could stop releasing new major versions that don't contain breaking changes or at least release them with less cadence.

Yes, I think this is the key -- the actual version number isn't really important -- what is important is not releasing breaking changes.

It would mean much less toll on downstream maintainers.

Indeed

@alamb
Contributor Author

alamb commented Mar 26, 2024

Another point that @tustvold made that is worth repeating is that in the current model, we sometimes make new APIs that are expected to be changed prior to the next breaking major release, so ensuring we don't release such an API would be an additional overhead / require some additional discipline

@tustvold
Contributor

I think we should separate two issues that appear to have gotten conflated:

  1. Less frequent major releases
  2. More frequent releases

I'm in favour of 1. and we could probably aim to hew closer to quarterly major releases (we're relatively close atm).

I think 2. is harder, and tbh I am not sure how many people are really calling for this. We could/should do patch releases when sufficient functionality has accumulated, but I'm less keen on committing to a regular cadence

@mbrobbel
Contributor

we sometimes make new APIs that are expected to be changed prior to the next breaking major release

What about doing something like tokio: https://docs.rs/tokio/latest/tokio/#unstable-features?
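For reference, a minimal sketch of how that style of gating looks in code. tokio gates its unstable APIs behind a `--cfg tokio_unstable` compiler flag; the `arrow_unstable` cfg name and both functions below are hypothetical and not part of arrow-rs.

```rust
/// Stable API: covered by the usual semver guarantees.
pub fn stable_len(values: &[i32]) -> usize {
    values.len()
}

/// Experimental API: only compiled when the build passes the hypothetical
/// `--cfg arrow_unstable` flag, so its signature could change in a minor
/// release without affecting anyone who has not opted in.
#[cfg(arrow_unstable)]
pub fn experimental_sum(values: &[i32]) -> i64 {
    values.iter().map(|v| i64::from(*v)).sum()
}
```

Because the flag is set by the final binary rather than by a library's feature list, a dependency cannot quietly opt its users into the unstable surface.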

@alamb
Contributor Author

alamb commented Mar 26, 2024

I think 2. is harder, and tbh I am not sure how many people are really calling for this.

With my arrow-rs user (not maintainer) hat on, I would like to call for this:

Examples: In DataFusion we had several features sit for 2+ months (for example apache/datafusion#8693 from @Jefffrey) waiting on a release that contained a non-breaking API change.

InfluxDB: @erratic-pattern is working on a feature internally that is waiting on #5433 (though I think that technically is a breaking API change)

@wjones127
Member

With my arrow-rs user (not maintainer) hat on, I would like to call for this:

Similarly, in Lance we are often waiting for Arrow to be released and then DataFusion to be released with reference to that Arrow version. However, whenever possible, we will implement some workaround in Lance. For example, we have a custom cast function that handles FSL. But there are some cases where we can't easily implement a workaround. For example, we added S3 encryption support in object-store.

It's hard to say how often it would be challenging for us. But if it does become challenging, I think I would volunteer to work on putting together the minor or patch releases.

@xxchan
Contributor

xxchan commented Apr 12, 2024

Crates must match the major arrow versions. For example, if a crate uses DataFusion, that forces everything in the entire project to use exactly that version of arrow-rs.

Precisely speaking, this only applies if DataFusion exposes arrow-rs's types in public APIs, AND the end-user of DataFusion needs to use those types to talk with other crates using arrow-rs. It's totally fine to have multiple versions of arrow-rs in the project if arrow-rs is only used as an internal implementation detail.

Therefore, this makes me think that perhaps decoupled version can help at this stage. Specifically, we can have a set of "core" crates, which defines the types used in public APIs (e.g., arrow-array, arrow-buffer, arrow-schema ...), they do not do breaking changes. When other crates like arrow-arith, arrow-json have breaking changes, it won't cause conflict in the whole ecosystem. I believe this won't cause much increase in maintenance burden (if the core APIs are stable). But I'm not sure I understand the situations of the arrow-rs crates correctly.

@alamb
Contributor Author

alamb commented Apr 13, 2024

Thanks @xxchan -- I think you understand the issue and structure.

The additional maintenance burden comes from handling breaking changes to the public APIs (and since there are a lot of public APIs there are a lot of potentially breaking changes)

@xmakro
Contributor

xmakro commented Apr 14, 2024

pyo3 was recently upgraded, but updating any package that depends on arrow and pyo3 is blocked until the next arrow release. For these cases it would be nice to see a major arrow release sooner.

@aljazerzen

For these cases it would be nice to see a major arrow release sooner.

Or, as it was discussed above, not have new releases of major arrow versions in the first place.

@tustvold
Contributor

tustvold commented Apr 15, 2024

For context #5623 is proposing a breaking change to arrow-array.

We have had similar breaking changes to arrow-schema as part of adding support for view types

I therefore think even if we did separate the versioning of the individual crates, we would still need the ability to create breaking changes.

R.e. pyo3 I would like someone to please clarify if #5566 is a breaking change, as if so that has already been merged to main...

@tustvold
Contributor

tustvold commented Apr 24, 2024

The next release is going to have to be breaking because of the PyO3 and object_store upgrades. Whilst not breaking in and of themselves they introduce a version upgrade hazard due to the way cargo handles dependency resolution across compatibility ranges.

I anticipate cutting this in the next few weeks.

We may also want to bring forward a fix for incorrect interval ordering, although I'm still unsure how best to solve that one.

@xxchan
Contributor

xxchan commented Apr 24, 2024

reply #5566 (comment) here

I think it may be classified as a breaking change in that it'll require consumers to upgrade their pyo3 version

This sounds like bumping MSRV. Consumers are required to bump their Rust version, but there are strong arguments that this should not be a semver breaking change. rust-lang/api-guidelines#231 (comment)

I’m not familiar with pyo3, and not sure whether the analogy is precise though.

From a more practical point of view, I think it depends on whether bumping the major version brings benefits to users. For users who don’t use pyo3, it definitely brings disadvantages. For users using pyo3, the workload seems to be the same. They can just pin to an older arrow version if they don’t want to upgrade pyo3 for a while (similar to the solution for MSRV). There seems to be no large difference whether they pin to an older major or minor arrow version.

I now roughly feel that bumping major version unnecessarily might bring more harm (in productivity) in the ecosystem than including some “little” breaking changes in minor versions (like tokio unstable). Although the latter might be more “correct”.

Just random personal feeling, correct me if I’m wrong.

@tustvold
Contributor

The issue is defined in more detail here - https://doc.rust-lang.org/cargo/reference/resolver.html#version-incompatibility-hazards

And further expanded upon here - https://github.com/dtolnay/semver-trick?tab=readme-ov-file#coordinated-upgrades

Basically, say a user has a dependency on pyo3 0.20 in their project. As soon as we publish a minor release that upgrades pyo3, their project will start failing to build (assuming no lockfile) with a thoroughly opaque error about two identically named types not being equal. Rustc does actually hint at what the issue might be, but unless people happen to know cargo's somewhat peculiar versioning behaviour, it is not very obvious. In the past people have filed issues on this repo or pinged maintainers on discord/slack.
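The failure mode can be imitated inside a single file. In this sketch two modules stand in for two semver-incompatible copies of one dependency that cargo has pulled into the same build; the module names and the `Handle` type are hypothetical and are not pyo3 APIs.

```rust
// Stand-in for the copy of the dependency the application was built against.
pub mod dep_v0_20 {
    #[derive(Debug, PartialEq)]
    pub struct Handle(pub i64);
}

// Stand-in for the copy the upgraded library now uses.
pub mod dep_v0_21 {
    #[derive(Debug, PartialEq)]
    pub struct Handle(pub i64);
}

/// Stands in for a library function compiled against the newer copy.
pub fn library_takes_new(handle: dep_v0_21::Handle) -> i64 {
    handle.0
}

/// The application still builds its values with the older copy.
pub fn app_makes_old() -> dep_v0_20::Handle {
    dep_v0_20::Handle(7)
}

// library_takes_new(app_makes_old());
// ^ Uncommenting this is rejected with a "mismatched types" error, even
//   though both structs are named `Handle` and have identical fields.

/// An explicit conversion is the only way across the boundary.
pub fn bridge(old: dep_v0_20::Handle) -> dep_v0_21::Handle {
    dep_v0_21::Handle(old.0)
}
```

In a real build there is usually no such `bridge`, which is why the mismatch surfaces as a build failure in the downstream project.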

The rust docs are fairly unambiguous that the correct response to this is to yank the release - https://doc.rust-lang.org/cargo/reference/resolver.html#semver-breaking-patch-release-breaks-the-build

Whilst I agree it is unfortunate, and I had really hoped to avoid this release being breaking, I'm not really sure we can just pretend it isn't a breaking change...

Ultimately we should still be 0.x.x as we expose 0.x.x crates in our public APIs, but we are not, because the release cycle used to be synced with arrow proper, which does quarterly breaking releases.

@aljazerzen

Unfortunately, I agree that because of the bump of pyo3 (and maybe object_store, didn't check), arrow needs a major version bump as well.

This will take a heavy toll downstream, because crates use arrow types in their public interfaces, so they will face the same problems as arrow is facing with pyo3. It is a shame that the whole arrow crate needs to be bumped only for pyo3 (which might not even be enabled in some dependencies).


If I understand the "semver-trick" this would be the release process:

  • publish arrow 52.0 as normal,
  • publish arrow 51.1, with dependency on arrow 52.0 and lib.rs looking something like this:
    pub use arrow::buffer;  // reexport from arrow 52.0
    
    mod pyarrow;  // this would still be using the old version of pyo3

This would also mean that most of the code in arrow 51.1 could be removed and just replaced with dependencies on arrow 52.
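A single-file sketch of why the re-export helps, with modules standing in for the two published versions. The names `arrow_52`, `arrow_51_1` and the `Buffer` struct are illustrative stand-ins, not the real crate layout.

```rust
// Stand-in for the new major release, which owns the type definition.
pub mod arrow_52 {
    pub mod buffer {
        #[derive(Debug, PartialEq)]
        pub struct Buffer(pub Vec<u8>);
    }
}

// Stand-in for the compatibility release of the old major: it re-exports
// the new module instead of defining its own copy of the type.
pub mod arrow_51_1 {
    pub use super::arrow_52::buffer;
}

/// Downstream code written against the old major.
pub fn old_api_len(buf: &arrow_51_1::buffer::Buffer) -> usize {
    buf.0.len()
}

/// Downstream code written against the new major.
pub fn new_api_make() -> arrow_52::buffer::Buffer {
    arrow_52::buffer::Buffer(vec![1, 2, 3])
}
```

Since both paths name the same type, `old_api_len(&new_api_make())` type-checks, so crates on either major can exchange buffers.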

This is all way too much work, while also not solving the problem of publishing the new major versions. It would just make lagging behind the latest arrow version less of a problem because old versions would contain most of the new changes.


Another solution would be to move pyarrow module into a separate crate that can be versioned independently.

@plewis110

As a big user of the individual arrow-* sub-crates, it would make my life a lot easier if each sub-crate was versioned independently (similar to how any of the mainstream Rust projects do it).

After creating a single library with public APIs exposing the Arrow traits, it only took a week before I ran into this issue with some of my users who wanted to use versions 47 and 51 in their projects.

From my perspective, there was zero difference between the two versions and I didn't understand why the major version was bumped for crates like arrow-array, arrow-schema, and arrow-ipc.

It's a maintenance headache for a crate's users if the maintainers aren't following SemVer, since that's such a huge part of the expectations of the ecosystem built around Rust at the crates level.

@xxchan
Contributor

xxchan commented Apr 26, 2024

it would make my life a lot easier if each sub-crate was versioned independently

I agree and also proposed this before #5368 (comment). I think this is the ultimate solution and we have to follow this in “1.0” status for the main libraries.

IMO the main reasons why single version is used are:

  • It’s fast moving, and does have a lot of breaking changes. Single version simplifies the release process and it’s reasonable at an early stage. But this is largely changed, and it’s why this discussion is raised again.
  • (The larger reason, I guess) For ASF projects, we have to go through a voting process to release a new version. And “release” mainly refers to the source tarball. Therefore releasing everything together can save work for the maintainers. To work around this issue, arrow-rs was separated from the main arrow repository for more frequent releases. But perhaps releasing together doesn’t prevent us from using separate versions? 🤔

@tustvold
Contributor

Right, I think the ship has sailed on arrow 52.0.0, but we can try to hew to this going forward

@xxchan
Contributor

xxchan commented Apr 26, 2024

I would agree: if ASF release process is fine with arrow-rs having separate releases, it should not have a problem with each crate having its own versioning.

My understanding is we would then need to hold separate votes and run separate release processes for each individual crate, I am not sure this is practical.

I’m a little confused: what prevents us from holding one vote together for arrow-array v1 and pyarrow v2?

@alamb
Contributor Author

alamb commented Apr 26, 2024

I’m a little confused: what prevents us from holding one vote together for arrow-array v1 and pyarrow v2?

Nothing in theory -- I think the limiting factor for this is maintainer bandwidth which is a scarce resource

Insofar as those on this thread who are interested in this topic can lend a hand (preparing / reviewing PRs, branches, etc) it would add to the available bandwidth and make some of the other proposals more feasible.

I am personally willing to run the actual voting / release process, but I don't have the bandwidth to create the PRs / manage the required branches for this process.

@plewis110

It's probably worth mentioning that making breaking changes and updating the major version of these crates is not that big of a deal to most downstream users if the individual sub-crates are versioned independently. Just compare the number of downloads for arrow-array vs arrow on crates.io.


Breaking changes usually come with improvements and feature additions for the clients of the individual crates; however, when there are no changes to the crates which are being used it becomes extremely problematic because there is no motivation among all dependents to coalesce on the most recent version. This is as @Xuanwo said:

For instance, the library sun only uses parquet v47. When Arrow releases version 48, the developer reviews the changelog and discovers that only arrow-array is affected. Therefore, they can continue using version >=47 without needing to upgrade.

This also has consequences for security updates and vulnerability resolution. If your library clients are conditioned to assume that new versions of their dependencies are not relevant to them, they will be less likely to pull in new security/vulnerability fixes you've pushed to the project.


Having separate version numbers for all of the sub-crates does not need to be any more of a burden on the maintainers of this crate than the current versioning approach. It should even be less work for maintainers as only a subset of the crates should have version updates due to a single change.

Compared to using separate versions for sub-crates, what is the benefit of the current scheme for anyone?

@tustvold
Contributor

tustvold commented May 3, 2024

Compared to using separate versions for sub-crates, what is the benefit of the current scheme for anyone

Simplicity, it will be frustrating and hard for downstreams to reason about what combination of crate versions are compatible with one another if they are versioned independently. As most downstreams will need to use a combination of arrow crates, this will turn upgrades into a labour intensive mess, especially as cargo's behaviour of using multiple versions concurrently would not lead to helpful errors if you got it wrong.

Currently we try hard to keep breaking changes small, at worst requiring updating a few call sites, we're additionally going to try delaying breaking changes to a quarterly schedule after the next release. I suggest we proceed with that as the plan and circle back in 6 months and assess.

@plewis110

plewis110 commented May 4, 2024

It will be frustrating and hard for downstreams to reason about what combination of crate versions are compatible with one another if they are versioned independently.

Usually I just look at crates.io if I'm using an older version of crates; however, I'm most likely to use the most recent versions if the project is adhering to SemVer like the majority of the Rust community.

Currently we try hard to keep breaking changes small, at worst requiring updating a few call sites, we're additionally going to try delaying breaking changes to a quarterly schedule after the next release. I suggest we proceed with that as the plan and circle back in 6 months and assess.

This doesn't help users who are currently dealing with supporting multiple major versions for the low-level crates which independently don't have a reason to have multiple major versions...

addendum: I'd also like to note that with the current state there is no good solution which can be "figured out" - it's literally impossible. If the sub-crates are versioned separately, at least downstream users have a chance to get a reasonable setup going.

@tustvold
Contributor

tustvold commented May 4, 2024

which independently don't have a reason to have multiple major versions

They do have a reason, I articulated it above, you may not like the reason but there is a reason 😆

Regardless, my expectation is that the cadence of breaking changes to arrow-array and similar core crates will align with that of the 3 monthly releases, and so this is a moot point anyway. I appreciate your frustration, if perhaps not your tone, but we're doing the best we can.

@plewis110

Apologies for the tone. As you said, I am frustrated and being forced to use this crate has not been a good experience because of the versioning approach.

@xxchan
Contributor

xxchan commented May 5, 2024

Having separate version numbers for all of the sub-crates does not need to be any more of a burden on the maintainers of this crate than the current versioning approach. It should even be less work for maintainers as only a subset of the crates should have version updates due to a single change.

I also agree with this to some degree. But it at least requires some nontrivial work, e.g., updating release CI workflow or scripts, changing crates' version = { workspace = true } to separated versions, and perhaps most importantly, a SOP doc about how to do it.

And as @alamb mentioned above, this option is not excluded from consideration. If we want to help, we can contribute what the actual changes need to be, and then the maintainers may consider adopting the new workflow.

@alamb
Contributor Author

alamb commented May 6, 2024

I would love to help review PRs that made it easier to manage / track versions and do non breaking releases. Thank you @xxchan

@alamb
Contributor Author

alamb commented May 9, 2024

I tried to capture the outcome of this discussion in #5737 and document what the updated plan for releases is. Feedback welcome.

@alamb alamb closed this as completed May 9, 2024
@tustvold tustvold added the arrow Changes to the arrow crate label May 10, 2024
@tustvold
Contributor

label_issue.py automatically added labels {'arrow'} from #5554

@tustvold tustvold added the arrow-flight Changes to the arrow-flight crate label May 10, 2024
@tustvold
Contributor

label_issue.py automatically added labels {'arrow-flight'} from #5554

@tustvold tustvold added the parquet Changes to the parquet crate label May 10, 2024
@tustvold
Contributor

label_issue.py automatically added labels {'parquet'} from #5737

@amoeba
Member

amoeba commented May 21, 2024

This has been a really interesting and productive discussion. I wanted to add a few notes/questions:

  1. As some of the people on this thread are probably aware, Apache Arrow is discussing component versioning and releases on its developer mailing list and, as a result of that discussion, ADBC has started versioning its components independently while releasing a single source tarball for voting. It seems that approach could work here and may be a small piece of the puzzle for independently versioning sub-crates if that's still of interest at some point in the future.
  2. A recurring theme in this discussion has been considering the impact of any changes on maintainer bandwidth. It seems like adopting something like Conventional Commits could address a number of issues here and could be paired with automations that reduce maintainer burden. @wjones127 has been doing some work in this area recently for Lance(DB) and the notion has come up again for the Arrow monorepo too. Is there any interest in opening an issue to discuss this and related automations?
  3. @mbrobbel brought up https://docs.rs/tokio/latest/tokio/#unstable-features which I see as a way to make it easier to reason about what types of changes constitute breaking changes, which could reduce maintainer burden. Is this worth pulling out into its own issue?

@tustvold
Contributor

  1. [DISCUSS] Reducing cadence of major arrow-rs releases introducing patch releases #5368 (comment)
  2. The release process is already largely automated other than the voting
  3. We do something similar in parquet, but the issue is the core APIs, including fundamental things like the array abstractions, can and do receive breaking changes

I'm sorry to be curt, but I'm growing a bit frustrated that this conversation appears to be going in circles, the core abstractions in arrow-rs are not stable, can't be treated as such, and no amount of release hackery will change this.

@alamb
Contributor Author

alamb commented May 21, 2024

I'm sorry to be curt, but I'm growing a bit frustrated that this conversation appears to be going in circles, the core abstractions in arrow-rs are not stable, can't be treated as such, and no amount of release hackery will change this.

I agree with this point.

Unless/until we have more discipline about keeping the APIs stable, major version bumps are required.

I see conventional commits, unstable feature flags, etc. as ways to improve discipline around keeping APIs stable.

Maybe one way to reduce maintainer bandwidth requirements would be to implement a CI check for breaking API. I filed #5791 to track this idea. Anyone interested in helping ease the maintenance burden maybe could help figure out how to automate some of it
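To illustrate what such a check might compute, here is a toy sketch that compares two snapshots of a public API and reports the smallest acceptable version bump. Everything here is assumed for illustration: the snapshot format (item path mapped to signature text), the `Bump` enum and `required_bump` are hypothetical, and the rule is deliberately simplified, since some additions (for example a new required trait method) are breaking in practice.

```rust
use std::collections::BTreeMap;

/// The smallest version bump that a change set is allowed to ship in.
#[derive(Debug, PartialEq, Eq)]
pub enum Bump {
    Major,
    Minor,
    Patch,
}

/// Compares two API snapshots (item path -> signature text).
pub fn required_bump(old: &BTreeMap<&str, &str>, new: &BTreeMap<&str, &str>) -> Bump {
    // Any item that was removed or whose signature changed can break
    // downstream code, so it needs a major release.
    let breaking = old.iter().any(|(path, sig)| new.get(path) != Some(sig));
    if breaking {
        return Bump::Major;
    }
    // Every old item is still present and unchanged, so a larger snapshot
    // means only additions were made.
    if new.len() > old.len() {
        Bump::Minor
    } else {
        Bump::Patch
    }
}
```

A CI job could fail a pull request whenever the computed bump exceeds what the next planned release allows.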

@amoeba
Member

amoeba commented May 21, 2024

Thanks @tustvold, my intent there wasn't to frustrate and I do appreciate you taking the time to respond. Your point is totally fair IMO.

+1 to your points @alamb.


10 participants