-
I'm glad the project is maturing to a point where it is considering a dependency policy. I believe your intentions are good, but the solutions proposed could become very problematic if implemented incorrectly. The most comprehensive write-up I've seen on the topic is this one:

**Problems with option 1 (strict or exact `==` requirements)**

It's described better in the document that I linked, but in summary: do not cap by default. Capping dependencies makes your software incompatible with other libraries that also have strict lower limits on their dependencies, and it limits future fixes. Anyone can add a missing cap, but users cannot fix an over-restrictive cap that causes solver errors. Any user or consumer of your package can add a missing upper bound if there is a breakage, but if you remove this capability (by publishing the constraint inside the package), you are disempowering users. By publishing a package with strictly pinned dependencies, you are effectively forcing your dependency closure onto your users, and dbt has a lot of dependencies, so this is quite user-hostile. So I hope your proposal isn't considering doing this.

**Problems with option 2 (asking users to install from source?)**

This framing is quite user-hostile. It requires a user to know about git and how to check out a particular revision, or it requires users to perform obscure configuration with pip or their dependency manager of choice. It's also against Python packaging best practice, which is to publish prebuilt wheels (with loose version constraints).

**Other items**
This has ended up being a much larger post than I intended. My main message is this: please, please, please don't publish distributions (sdist or wheel) to PyPI with strict dependencies. Let users pin upper bounds as necessary, and document how to do so. Document how users can "pin" their environments with pip-tools or poetry or whatever environment manager they feel they need to use.
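For what it's worth, the user-side fix described above is mechanically simple with pip's constraints files. A rough sketch, with an illustrative package name and cap rather than a real known breakage:

```console
$ # The user, not the package, decides to cap a dependency after a breakage:
$ echo 'agate<1.7' > constraints.txt
$ pip install dbt-core -c constraints.txt
```

Because the cap lives in the user's environment instead of inside a published package, it can be relaxed the moment the underlying breakage is fixed.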
-
Playing devil's advocate here: is there any risk to waiting for the Python community to figure this out and then adopting their proposed solution? There's a lot of chatter around Python packaging, so maybe relief is around the corner?
-
I was asked via Twitter if I'd like to contribute to the discussion. Here are some notes / thoughts on loosening the pinning.

My environment: I run dbt-core and dbt-snowflake in Airflow (with AWS MWAA), and I ran into issues updating my pinned versions. The final issue I ran into was lack of support for hologram; my penultimate issue was with another pinned dependency. Also, for anyone reading this who is also running Airflow 2.5 and struggling to set up dbt-core: after some trial and error I found a set of requirements that worked.

**Reasonable expectations about breaking changes**

Well-maintained packages adhere to semantic versioning; poorly maintained packages may not, but you should avoid using such packages anyway. What this means is that you can reasonably foresee breaking changes to the API. So, for example, it's unclear why a dependency that follows semver needs to be pinned to an exact patch version rather than to a compatible range.
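To make that concrete, here is what semver-aware bounds could look like in a `setup.py`. The package name and versions are illustrative, not dbt's actual pins:

```python
# Hypothetical excerpt: trust semver instead of exact pins.
from setuptools import setup

setup(
    name="example-package",
    install_requires=[
        "Jinja2>=3.1,<4",  # any 3.x release from 3.1 on; only 4.0 may break the API
        "mashumaro~=3.6",  # "compatible release" operator: equivalent to >=3.6,<4.0
    ],
)
```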
**Take advantage of internal packages**

dbt-labs maintains hologram itself. Ultimately, this is the issue that I ran into when updating my environment, since the current public release of dbt pins it tightly. But I guess my question is: why? It makes sense to be wary of changes to Jinja2, because that is a separate project. But dbt-labs should not find it surprising when one of its own packages releases a new version. TLDR: instead of an exact pin on an internal dependency, use an open range and coordinate releases across your own packages.

If dbt-labs is not doing this, then there is not a whole ton of benefit to maintaining separate packages in the first place. I see asynchronous updates across packages as one of the biggest advantages of having an ecosystem fragmented across multiple packages. dbt-labs should take advantage of it.

**CI with unpinned dependencies implicitly tests for common installations**

If you have the capacity to do quick patch deployments (this is a big "if", I know), then breaking changes are not a massive deal for a package like dbt-core in the rare circumstances that they occur. If you ever run into an unforeseen breaking issue with unpinned dependencies, your CI runs frequently enough that you'll catch it pretty easily. Users without frozen dependencies, or who are doing fresh installs, will run into issues for a day, but users who freeze their dependencies won't notice. The most adversely impacted group in this case would be people who freeze down to the patch version of dbt-core but who do not freeze anything else.

But really, all of this should be extraordinarily rare, rare enough that most package developers don't worry about it. I'd say once a year I run into an issue like this. (Off the top of my head, over the past 3 years, I recall having this issue with isort, black, and temporarily with tensorflow, where patch versions completely break and become obsolete as naked installations due to unpinned dependencies.)

(For dbt Cloud deployments, I would say frozen dependencies or a managed PyPI proxy should be used to avoid this issue completely. However, this is separate from the packaging ecosystem.)

**More matrices in GitHub Workflows**

Correct me if I am wrong, but Microsoft foots the bill for GitHub Actions runners for open-source projects. This means that dbt-labs can support, for example, multiple minor versions of Jinja2 with a larger test matrix. Right now, I see the unit testing workflow only tests different Python versions:

```yaml
strategy:
  fail-fast: false
  matrix:
    python-version: ["3.7", "3.8", "3.9", "3.10", "3.11"]
```

However, you could imagine doing the following to support more versions of Jinja2:
```yaml
strategy:
  fail-fast: false
  matrix:
    python-version: ["3.7", "3.8", "3.9", "3.10", "3.11"]
    jinja2-version: ["3.0.*", "3.1.*"]
```

It may be the case that Jinja2 3.0 does not work with dbt 1.5, in which case this particular matrix would be uninteresting for dbt 1.5. But imagine Jinja2 releases version 3.2, available only on Python 3.8+. You could then safely test that it works on (let's say, for example) dbt-core 1.5.3, and then enforce that both are supported with the matrix strategy:
```yaml
strategy:
  fail-fast: false
  matrix:
    python-version: ["3.7", "3.8", "3.9", "3.10", "3.11"]
    jinja2-version: ["3.1.*", "3.2.*"]
    exclude:
      - python-version: "3.7"
        jinja2-version: "3.2.*"
```

And then, once confirmed to be working, you do a patch bump from dbt-core 1.5.3 to 1.5.4, which would just be a one-line change to the version bounds. A sketch of the missing install step is below.
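For the matrix to actually change what gets tested, the workflow also needs an install step that applies the selected Jinja2 version. A sketch of what that step might look like; the step names and surrounding layout are my guesses, not dbt's real workflow:

```yaml
steps:
  - uses: actions/checkout@v3
  - uses: actions/setup-python@v4
    with:
      python-version: ${{ matrix.python-version }}
  - name: Install dbt-core, then force the matrix's Jinja2
    run: |
      pip install .
      pip install "jinja2==${{ matrix.jinja2-version }}"
  - name: Run unit tests
    run: python -m pytest tests/unit
```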
**TLDR**

Semver makes breaking changes foreseeable; internal packages don't need to be pinned like external ones; and a wider CI matrix plus quick patch releases can catch and fix the rare breakages that looser pins let through.
-
I'm not sure where to report this, but mashumaro 3.7 was released, and according to the test suite and my few tests, it isn't causing any trouble. Unfortunately, due to the aggressive pinning, an update to the most recent mashumaro breaks dbt, which refuses to start alongside a version it considers incompatible. Should we bump the pin (meh), or better, start accepting ranges that are open on top (>=3.6)? See #7534 as support to bootstrap the discussion further.
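Concretely, the "open on top" option is a one-line change in the dependency metadata. A sketch; the exact current pin may differ from what is shown:

```python
# Hypothetical setup.py excerpt for dbt-core.
from setuptools import setup

setup(
    name="dbt-core",
    install_requires=[
        # Before: "mashumaro==3.6", which refuses mashumaro 3.7 outright.
        "mashumaro>=3.6",  # After: accepts 3.7 and future releases.
    ],
)
```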
-
@Fatal1ty offered some advice here related to lower bounds, exclusions, and pinning. I've lightly edited the quote below to assume that dbt-core depends on a Python package named antigravity:
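In the spirit of that advice, here is a sketch of what a lower bound plus targeted exclusions could look like for the hypothetical `antigravity` dependency; all version numbers are invented for illustration:

```python
# Illustration only: lower bound plus known-bad exclusions, no blanket cap.
from setuptools import setup

setup(
    name="dbt-core",
    install_requires=[
        # Oldest version actually tested, excluding releases with known bugs.
        "antigravity>=1.2,!=1.4.0,!=1.4.1",
    ],
)
```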
-
Managing dependencies in Python — what could be more fun?
Adapted from #4748 (comment)
**Current approach**

We publish `dbt-core` with tight version pins in `setup.py`. The premise: if you install `dbt-core` with a set of versioned packages, then you can expect it to work with those versions of those packages. We keep relatively tight pins on any "critical" dependencies (`Jinja`, `agate`, `mashumaro`, `networkx`), where even subtle changes can have unintended or breaking consequences for end users. As `dbt-core` maintainers, we manage dependency upgrades within the larger process of preparing new dbt-core minor versions. Users try out new dependency versions as part of trying out a new minor version; there's a clear channel for feedback, and a clear next step (downgrade to the previous minor version) if something goes awry.

The downside of this approach is that it's much harder to install `dbt-core` alongside other popular Python packages, e.g. Airflow (which built a whole feature given the frequent complaints of incompatibility). To date, we've accepted that downside as part of a trade-off — it's also less likely that `pip install dbt-core` will, on any given day, stop working because of a weird behind-the-scenes change in a patch release of a critical third-party package.

**Proposed approach**
We should maintain two sets of dependencies:

- a stricter set of exact (`==`) dependency versions that are guaranteed to work
- a looser set, using `!=` exclusions for versions with known bugs/incompatibilities

I'm not proposing a specific implementation for either the former or latter option — there are several options we could take to achieve it (`setup.py`, `pyproject.toml`/`poetry`, prebuilt "snapshots" for specific OSes) — but that set of two options should be our desired result. Whatever approach we take should also be extensible for adapter plugins maintained by third parties. Then, we should encourage maintainers of adapter plugins to follow our approach, by loosening the dependencies required in `setup.py` (or equivalent), while also publishing a stricter set of guaranteed dependencies.

This mirrors the approach taken by Airflow, among other popular Python projects: Loose Pip Constraints & Specific Officially Supported Constraints.
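A minimal sketch of that two-artifact pattern, assuming a hypothetical published constraints file per release series; the file name and every version number below are invented:

```python
# setup.py: the loose set, lower bounds only, plus targeted exclusions.
from setuptools import setup

setup(
    name="dbt-core",
    install_requires=[
        "Jinja2>=3.1",
        "agate>=1.6",
        "mashumaro>=3.6,!=3.7.0",  # hypothetical known-bad release excluded
        "networkx>=2.3",
    ],
)
```

```text
# constraints-1.5.txt: the strict "guaranteed" set, one exact pin per dependency.
Jinja2==3.1.2
agate==1.6.3
mashumaro==3.6
networkx==2.8.8
```

A user who wants the guaranteed environment would run `pip install dbt-core -c constraints-1.5.txt`, which is essentially the workflow Airflow documents with its versioned constraints URLs.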
Then, users can pick between these two options:

- Install with the stricter set of guaranteed versions if you want a known-good, dbt-only environment.
- `pip install dbt-core` if you need to manage dbt-core within a more-complex Python environment, alongside other dependencies. We'd aim to set lower bounds only in `setup.py`. This carries the risk of uncovering incompatibilities with new versions of dependencies before we have a fix for those issues. We'd do our best here; the end user choosing this option would be knowingly opting into that risk.

**Work in progress**