Standardize airflow build process and switch to Hatchling build backend #36537
Conversation
The failures are expected. I'm still solving some problems, but all the big ones are solved and I successfully ran it locally both on Mac and Linux in various combinations, and I am really confident it works :)
BTW. After a good night's sleep I got an idea and I've figured out a way to simplify it even more - I think I can make it so that I do not need This way there will be literally This one will install amazon dependencies:
But these will install the amazon provider:
I will try to get it working today actually :) @uranusjr
Force-pushed from a1b654d to ccbe5a5
Mainly I was reading the docs without testing locally whether it all works - anyway, it is in DRAFT state and CI is red at the moment. But I wanted to leave early feedback. Otherwise, while reading, I am now joining the "convinced" crowd :-D
Force-pushed from f04a2a9 to 9f504b1
Thanks @jscheffl for a VERY THOROUGH review. I addressed most comments (and the new iteration is way simpler and automatically addressed some of your comments regardless). I left a few unresolved conversations where things should be fixed before "undrafting" it. Pushed a new version; the next step of this PR is to make the build green.
Force-pushed from 9f504b1 to 6fbf922
Breeze auto-detects if it should upgrade itself - based on finding the Airflow directory it is in and calculating the hash of the pyproject.toml it uses. Finding the Airflow sources to act on used to rely on Airflow's setup.cfg and checking the package name inside, but since we are about to remove setup.cfg and move all project configuration to pyproject.toml (see #36537), this mechanism will stop working. This PR changes it to simply check whether an `airflow` subdir is present and contains an `__init__.py` with "airflow" inside. That should be "good enough" and fast, and it should also be backwards compatible in case a new Breeze is used in older Airflow sources.
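A minimal sketch of the described check (helper names are illustrative, not the actual Breeze code), assuming the only signal needed is an `airflow/__init__.py` that contains the string "airflow":

```python
from __future__ import annotations

from pathlib import Path


def looks_like_airflow_sources(directory: Path) -> bool:
    # Heuristic from the description above: the directory counts as Airflow
    # sources if it has an `airflow` subdir whose __init__.py mentions "airflow".
    init_py = directory / "airflow" / "__init__.py"
    try:
        return init_py.is_file() and "airflow" in init_py.read_text()
    except OSError:
        return False


def find_airflow_sources(start: Path) -> Path | None:
    # Walk up from the starting directory until the check succeeds.
    for candidate in [start, *start.parents]:
        if looks_like_airflow_sources(candidate):
            return candidate
    return None
```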
Force-pushed from 6fbf922 to 5cba0a1
The `graphviz` dependency has been problematic as a required Airflow dependency - especially for ARM-based installations. The graphviz package requires binary graphviz libraries - which is already a limitation - but it also requires the graphviz Python bindings to be built and installed. This does not work for older Linux installations and - more importantly - when you try to install the graphviz libraries for Python 3.8 or 3.9 on ARM M1 MacBooks, the packages fail to install because Python bindings compilation for M1 can only work for Python 3.10+. There is no easy solution for that except commenting out the graphviz dependency from setup.py when you want to install Airflow for Python 3.8 or 3.9 on a MacBook M1. However Graphviz is really used in two places: * when you want to render DAGs via the airflow CLI - either to an image or directly to the terminal (for terminals/systems supporting imgcat) * when you want to render the ER diagram after you modified Airflow models. The latter is a development-only feature; the former is a production feature, but a very niche one. This PR turns rendering of the images in Airflow into an optional feature (only working when the graphviz Python bindings are installed) and effectively turns graphviz into an optional extra (and removes it from requirements). This is technically not a breaking change - the CLIs to render the DAGs are still there, and IF you already have graphviz installed, they will continue working as before. The only case where it does not work is a fresh installation without graphviz installed, and there it will raise an error and inform you that you need it. Graphviz will remain installed for most users: * the Airflow image will still contain the graphviz library, because it is added there as an extra * when a previous version of Airflow has already been installed, the graphviz library is already installed there and Airflow will continue working as it did. The only change is a new installation of the new version of Airflow from scratch, where graphviz will need to be specified as an extra or installed separately in order to enable the DAG rendering option. Taking into account this behaviour (which only requires installing a graphviz package), this should not be considered a breaking change. Extracted from: #36537 (cherry picked from commit 89f1737)
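As an illustration of the "optional feature" behaviour described here, a lazy-import pattern like the following could be used (a hedged sketch, not Airflow's actual code; the function name and error message are made up):

```python
def render_dag_image(dot_source: str, output_basename: str) -> None:
    # Import graphviz lazily so that Airflow itself does not require it;
    # only the rendering feature fails when the bindings are missing.
    try:
        import graphviz
    except ImportError:
        raise RuntimeError(
            "Rendering DAG images requires the optional 'graphviz' package. "
            "Install it, e.g. `pip install graphviz`, or via the corresponding "
            "Airflow extra."
        ) from None
    # Render the DOT source to a PNG file next to `output_basename`.
    graphviz.Source(dot_source).render(output_basename, format="png", cleanup=True)
```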
Previously we limited the grpcio minimum version to stop `pip` backtracking from happening, and we could not keep that limit within the spark provider, because some google dependencies used grpcio and conflicted with it. This problem is now gone as we have newer versions of the google dependencies, so we can not only safely move the limit to the spark provider but also bump it slightly higher to limit the amount of backtracking we need to do. Extracted from #36537 (cherry picked from commit ded01a5)
While testing #36537 I noticed cohere backtracking quite a bit with an older version. Bumping Cohere to a more recent minimum version (released in November) decreased it quite a bit. Since Cohere is a mostly standalone package, and we likely want to bump people to a later version anyway, it's safe to assume we can bump the minimum version. (cherry picked from commit 9797f92)
Airflow sdist packages have been broken by #37340 and fixed by #37388, but we have not noticed it because the CI check for sdist packages has been broken since #36537, where we standardized naming of the sdist packages to follow the modern syntax (and we silently skipped installation because no providers were found). This PR fixes it: * changes the expected naming format to follow the new standard * treats "no providers found" as an error. The "no providers" as success was useful at some point in time when we ran sdist checks as part of regular PRs and some PRs resulted in a "no providers changed" condition; however, sdist verification now only happens in the canary build (so all providers are affected), and we also have an if condition in the job itself to skip the installation step if there are no providers.
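For illustration, and assuming the "modern syntax" refers to PEP 625-style normalized sdist file names (an assumption on my part, not stated in the PR), a check that still globs for the old dash-separated pattern would silently find nothing, which is why "no providers found" now needs to be an error:

```python
from pathlib import Path

dist_dir = Path("dist")
# Old naming:  apache-airflow-providers-amazon-X.Y.Z.tar.gz
# New naming:  apache_airflow_providers_amazon-X.Y.Z.tar.gz  (normalized, underscores)
old_style = sorted(dist_dir.glob("apache-airflow-providers-*.tar.gz"))
new_style = sorted(dist_dir.glob("apache_airflow_providers_*.tar.gz"))
if not new_style:
    # Previously an empty list meant "nothing to install" and the step passed silently.
    raise SystemExit("No provider sdist packages found - treating this as an error.")
```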
This PR changes the Airflow installation and build backend to use the new standard Python ways of building Python applications.
We've been trying to do it for quite a while. Airflow traditionally has been using a complex and convoluted build process based on setuptools and an (extremely) custom setup.py file. It survived the migration to Airflow 2.0 and the splitting of the Airflow monorepo into Airflow and Providers, adding pre-installed providers and switching providers to use flit (and follow build standards).
So far the tooling in the Python ecosystem had not been able to fulfill our needs and we refrained from developing our own tooling, but finally, with the appearance of Hatch (managed by the Python Packaging Authority) and a few recent advancements there, we are finally able to switch to Python standard ways of managing project dependency configuration and project build setup (with a few customizations).
This PR makes the Airflow build process follow those standard PEPs:
Airflow has all build configuration stored in pyproject.toml following PEP 518, which allows any frontend (`pip`, `poetry`, `hatch`, `flit`, or whatever other frontend is used) to install the required build dependencies, install Airflow locally, and build distribution packages (sdist/wheel).
The Hatchling backend follows PEP 517 for a standard source tree and build backend implementation, which allows the build to be executed in a frontend-independent way.
We store all project metadata in pyproject.toml - following PEP 621, where all necessary project metadata components are defined.
We plug into Hatchling "editable build" hooks following PEP 660. Hatchling internally builds an editable wheel that is used as an ephemeral step and communication between backend and frontend (this ephemeral wheel is used to make an editable installation of the project - suitable for fast iteration on the code without reinstalling the package).
With Airflow having many provider packages in a single source tree, where we want to be able to install and develop Airflow and providers together, it is no small feat to implement the case where an editable installation has to behave quite a bit differently from an installable package when it comes to packaging and dependencies: for an editable install you want to edit the sources directly, while for an installable package you want separate Airflow and provider packages. Fortunately the standardisation efforts in the Python Packaging community and the tooling implementing them have finally made it possible.
Some of the important ways this has been achieved:
We continue using provider.yaml in providers as the single source of truth for per-provider dependencies. We added the possibility to specify "devel-dependencies" in provider.yaml, so that all per-provider dependencies in `generated/provider_dependencies.json` and `pyproject.toml` are generated from those dependencies via the update-providers-dependencies pre-commit.
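Conceptually, that pre-commit step does something like the sketch below (the real update-providers-dependencies script is more involved; the paths, keys and output layout here are approximations):

```python
import json
from pathlib import Path

import yaml  # PyYAML

aggregated = {}
for provider_yaml in sorted(Path("airflow/providers").rglob("provider.yaml")):
    provider = yaml.safe_load(provider_yaml.read_text())
    # Collect both runtime and devel-only dependencies declared per provider.
    aggregated[provider["package-name"]] = {
        "deps": provider.get("dependencies", []),
        "devel-deps": provider.get("devel-dependencies", []),
    }

# Write the aggregated view consumed when regenerating pyproject.toml.
out = Path("generated/provider_dependencies.json")
out.write_text(json.dumps(aggregated, indent=2, sort_keys=True) + "\n")
```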
Pyproject.toml is generally managed manually, but the part where provider dependencies and bundle dependencies are used is automatically updated by a pre-commit whenever provider dependencies change. Those generated provider dependencies contain just the dependencies of the providers - not the provider packages - but in the final "standard" wheel file they are replaced with "apache-airflow-providers-PROVIDER" dependencies, so that the wheel package will only install the provider and use the dependencies of the provider version it installs.
We are utilising custom Hatchling build hooks (PEP 660 standard) that allow the 'standard' wheel package to be modified on the fly while the wheel is being prepared, by adding pre-installed package dependencies (which are not needed in an editable build) and by removing all devel extras (which are not needed in the PyPI-distributed wheel package). This allows us to solve the conundrum of having different "editable" and "standard" behaviour while keeping the same project specification in pyproject.toml.
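A minimal sketch of such a Hatchling build hook (not Airflow's actual hook; the class name and provider list are illustrative): Hatchling calls `initialize()` with `version` set to "standard" for normal wheels and "editable" for PEP 660 builds, so the hook can adjust dependencies per build type.

```python
# hatch_build.py (registered via [tool.hatch.build.hooks.custom] in pyproject.toml)
from hatchling.builders.hooks.plugin.interface import BuildHookInterface

# Example value only - the real list of pre-installed providers is longer.
PREINSTALLED_PROVIDERS = ["apache-airflow-providers-common-sql"]


class CustomBuildHook(BuildHookInterface):
    def initialize(self, version, build_data):
        if version == "standard":
            # Standard (PyPI) wheel: add dependencies on the pre-installed
            # provider packages; in an editable install the provider sources
            # already live in the source tree, so nothing is added there.
            build_data.setdefault("dependencies", []).extend(PREINSTALLED_PROVIDERS)
            # Stripping devel-only extras from the final metadata would also
            # happen here for the standard build; omitted in this sketch.
```

Branching on the `version` argument is what lets one pyproject.toml serve both the editable and the standard build without duplicating configuration.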
We added a description of how `Hatch` can be employed as a build frontend in order to manage a local virtualenv and install Airflow in an editable way easily - while keeping all properties of the installed application (including a working airflow CLI and package metadata discovery) - as well as how to use PEP-standard ways of building wheel and sdist packages.
We have a custom step (following PEP standards) to inject Airflow-specific build steps - compiling www assets and generating the git commit hash version to display in the UI.
We also show how all this makes it easy to manage local virtualenvs and editable installations for Airflow contributors - without vendor lock-in of the build tools, because by following the standard PEPs Airflow can be locally and editably installed by anyone using any build frontend tool that follows the standards. Whether you use `pip`, `poetry`, `Hatch`, `flit` or any other frontend build tool, Airflow local installation and package building will work the same way for all of them, with both "editable" and "standard" package preparation managed by the `hatchling` backend in the same way.
Previously our extras contained a ".", which is not a normalized name for extras - `pip` and other tools replaced it automatically with `_`. This change updates the extra names to contain '-' rather than '.' in the name, following PEP-685. This should be fully backwards compatible: users will still be able to use ".", but it will be normalized to "-" in Airflow packages. This is also future-proof, as it is expected that all package managers and tools will eventually apply PEP-685 to extras, even if currently some of the tools (pip + setuptools) might generate warnings.
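The normalization rule is the same one applied to project names, so the effect can be checked directly with the `packaging` library (the extras shown are just example spellings):

```python
from packaging.utils import canonicalize_name

# PEP 685 normalizes extras like project names: runs of ".", "_" and "-"
# collapse to a single "-", and the result is lower-cased.
print(canonicalize_name("apache.hive"))   # -> apache-hive
print(canonicalize_name("devel_hadoop"))  # -> devel-hadoop
```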
Additionally, this change organizes the documentation around
the extras and dependencies, explaining the reasoning behind
all the different extras we have.
As a bonus (and this is what we used to test it all), we are documenting how to use the Hatch frontend to:
* manage multiple Python installations
* manage multiple Python virtualenv environments
* build Airflow packages for release management
Fixes: #30764