Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ARM image building for regular PRs #24664

Merged
merged 1 commit into from
Jun 28, 2022
Merged

Add ARM image building for regular PRs #24664

merged 1 commit into from
Jun 28, 2022

Conversation

potiuk
Copy link
Member

@potiuk potiuk commented Jun 26, 2022

The image building for ARM is currently only done in the main build
only to refresh cache, however there are sometimes cases when
new dependency (for example #24635) broke ARM image build and it
was only discovered after merge.

This PR adds extra ARM-based build that should be run after
the AMD64 build. It should not influence the depending steps,
it should just signal failure of the PR if the ARM image cannot
be build.


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragement file, named {pr_number}.significant.rst, in newsfragments.

@potiuk potiuk force-pushed the add-arm-image-building branch from 4cb85ab to 0fdf8ad Compare June 26, 2022 09:28
@potiuk
Copy link
Member Author

potiuk commented Jun 26, 2022

Testing execution of in-airlfow-repo PR.

@potiuk potiuk force-pushed the add-arm-image-building branch from 0fdf8ad to af877eb Compare June 26, 2022 10:15
@potiuk potiuk force-pushed the add-arm-image-building branch from af877eb to 2bc069d Compare June 27, 2022 10:20
@potiuk potiuk marked this pull request as ready for review June 27, 2022 10:21
@potiuk potiuk force-pushed the add-arm-image-building branch from 2bc069d to 00d827a Compare June 27, 2022 10:23
@potiuk
Copy link
Member Author

potiuk commented Jun 27, 2022

cc: @gmsantos - this will prevent the dramas you faced last week. I am running it from "apache" repo, to make sure it actually works. The nice thing is that for now we are just going to test test it and simply mark the build as failure. Nothing waits for the job so it shoudl just add few minutes of overhead/build time and single arm instance starting/stoppping to build all images (~ 2 minutes per image usually when cache is refeshed).

@potiuk
Copy link
Member Author

potiuk commented Jun 27, 2022

Seems it's going to work this time (I had to make the builds run sequentially, but it works now. just fine.

@potiuk
Copy link
Member Author

potiuk commented Jun 27, 2022

Nice. All ARM images built in under 10m :). Looks green.

@potiuk
Copy link
Member Author

potiuk commented Jun 27, 2022

BTW. This one is crucial for those of us who use M1 on a daily basis and rely on the images continue to work for them @dstandish @bbovenzi 👀

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Jun 27, 2022
@github-actions
Copy link

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@jedcunningham
Copy link
Member

@potiuk, just want to double check there is no security/resource considerations here with us spinning up ARM machines for normal, non-committer PRs, right?

@potiuk
Copy link
Member Author

potiuk commented Jun 28, 2022

@potiuk, just want to double check there is no security/resource considerations here with us spinning up ARM machines for normal, non-committer PRs, right?

Good questions. I ❤️ that you asked them :). Here are my explanations and way how it is protected, but I would love a challenge/review and confirming that my thinking is right. Security is a teamwork and I don't trust myself (ultimately), so I love someone can verify my apprach.

Security

The ARM steps are executed in 2 cases:

a) build image workfow. Those are not "directly" affected by PR changes - the command will only run the "Dockerfile RUN commands" inside the ARM docker engine - Dockerfile is the only user-provided code that is executed during that build - and it is only executed as part of "docker build" (which runs inside the docker builder). You'd have to escape the container to get into the ARM instance. This is basically the same (or lower) risk we have currently - the Dockerfile RUN commands are executed inside the docker-engine of the S3 AMD instance we run. It's actually a bit lower because the ARM instances do not run the "runner" and the runner - it's a "bare" ARM instance with docker engine running with SSH-port-forwarded connection to it.

b) if you run PR from a branch in "apache" repo (like I did in this PR) - then the ARM instance will run in the CI worklfow. This is controlled by if: needs.build-info.outputs.inWorkflowBuild == 'true'. You need to have committer status to be able to push branch to Airflow Repo and make PR from it. Non-commiters can only run this CI workflow from fork, and even if they can change the ci.yaml there and change the if: needs.build-info.outputs.inWorkflowBuild == 'true' it will not work because fork-run ci.yaml workflow does not run on our machines and only our build machines have capability to spin the ARM instances. The "build_image" workflow does not run for those (apache) PR-s because they are skipped if pull request head is "apache/airlfow" (similarly as all the other "build" steps) https://github.com/apache/airflow/pull/24664/files#diff-a6ce9b5b0a53d8dc6dbd59d40b23994c3e68a91fac73c0d7eacdf373c95476c9R386.

The reason for having those duplicate "build" steps in CI workflow (in case of PRs from apache/airflow repo) is that I can actually test those ARM builds. When they are run from "CI" workflow, they run with "my code", when they are run in "build" workflow - they are run with "main" code and I only can test them when I merge them (which is main reason some of the workflows need a few merges to finally succed - as I have literally no way to test them before merge. For example the "build" workflow Start ARM instance/Stop ARM instance had never been tested, because the only way I can test it is by merging the code to main.

Cost

Those ARM machines have auto-destruction built in - no matter if succeded or failed: 50 minues self-kill-switch and "cancel" is run always: https://github.com/apache/airflow/pull/24664/files#diff-b803fcb7f17ed9235f1e5cb1fcd2f5d3b2838429d4368ae4c57ce4436577f03fR417 - so even if someone presses cancel too many times or in case build "hangs" - we got maximum ~ 1hr of instance run. The instances have maximum bid spot price set for requesting the instance. It is set to 0.1 USD /HR. Curent spot price in Ohio: c6g.xlarge | $0.0399 per Hour - so currently 1 h of running the instance is 0.04 USD - max 0.1 USD. Building the 4x images (when cache is refreshed - which is vast majority of times) is ~ 12 minutes. This means that running a single ARM instance step is 0.04 / 5 ~ 0.008 USD (max 0,02 USD). We have ~ 100 image builds / day on average working day = 0.8 to 2 USD / day ~ 30 USD / Month (with rather generous over-counting the days). this is ~ 30/1300 ~ 2% of current spending.

I personally think it's worth it as it saves a lot of our contributors from headaches on M1s - and this is not academic - this has already happened #24635 and it was couple of hours lost for a few people. We have currently no protection mechanisms to prevent it from happening again - and taking into account that a number of deps (including mysql and mssql) do not yet have debian ARM package and fail when building, it is likely to happen again

BTW. Probably we loose way more but not merging some optimisations which are currently in the pipeline (for example this one #24666 - I have not assesed that one because it is far more difficult but I would guess it allows to save ~ 3 minutes of build time x2 for AMD instances at 0.08 USD /hr per version in ~ 80% (wild guess) of our PRs which is already more than that cost of those ARM instances. UPDATE: I think 80% here is a bit too much - it is likely 3 minutes of "time" in 80% of builds but likely 50% of those are free github runners, so likey this saving is a bit less. But this is hard to get the "precise" number - this one is more back-of-the-envelope calculation.

And I already saved quite a lot and more : #24580 (and more to come with more breeze conversion) so I am on a cost-and-build-saving spree here.

@potiuk potiuk force-pushed the add-arm-image-building branch from 00d827a to 8731b00 Compare June 28, 2022 01:08
@potiuk
Copy link
Member Author

potiuk commented Jun 28, 2022

BTW. @jedcunningham - by writing this answer and looking at the workflows I realized that there is a potential for a small optimization (of cost). I added additional "needs" for the "arm" builds - so that they are executed as late as possible.

Previously they were executed always. bu in this case if either regular CI or PROD will fail, it will not be executed. Small cost reduction, but sttill a cool one.

@potiuk
Copy link
Member Author

potiuk commented Jun 28, 2022

Actually @jedcunningham - after sleeping over it, it seems we can furter cut down the overhead and cost. It seems that in regular PRs we only need to build it when some depdnencies change. Generally there is no reason to breaj the ARM docker build with the incoming PR unless your change contains some changes to dependencies - which we already detect by "upgradeToNewerDepeendencies" flag.

So we can further reduce the overhead/cost by:

  • only running the build in incoming PRs if they change dependencies
  • always run it in "push" builds - which means that every time we merge to main or run scheduled build we run ARM build (this is double-checking for potentially automatically upgraded constraints/dependencies)

I'd guess this will be ~ 5% of the previous estimation - so some 1.5 USD/month overhead :). I think we can afford it

@potiuk potiuk force-pushed the add-arm-image-building branch from 8731b00 to 79d3f00 Compare June 28, 2022 09:09
@potiuk
Copy link
Member Author

potiuk commented Jun 28, 2022

Updated with only running stuff when we are upgrading to newer dependencies :). It should be super-optimized this way

The image building for ARM is currently only done in the main build
only to refresh cache, however there are sometimes cases when
new dependency (for example #24635) broke ARM image build and it
was only discovered after merge.

This PR adds extra ARM-based build that should be run after
the AMD64 build. It should not influence the depending steps,
it should just signal failure of the PR if the ARM image cannot
be build.
@potiuk potiuk force-pushed the add-arm-image-building branch from 79d3f00 to f7b1b03 Compare June 28, 2022 17:39
@jedcunningham
Copy link
Member

Thanks for the detailed reply @potiuk. All sounds good and makes sense to me.

@potiuk potiuk merged commit 2fbd750 into main Jun 28, 2022
potiuk added a commit to potiuk/airflow that referenced this pull request Jun 28, 2022
The ARM image build introduced in apache#24664 had problem with build
image that was additionally checking for arm images which were
moved out to a spearate step
potiuk added a commit that referenced this pull request Jun 28, 2022
The ARM image build introduced in #24664 had problem with build
image that was additionally checking for arm images which were
moved out to a spearate step
potiuk added a commit to potiuk/airflow that referenced this pull request Jun 29, 2022
The image building for ARM is currently only done in the main build
only to refresh cache, however there are sometimes cases when
new dependency (for example apache#24635) broke ARM image build and it
was only discovered after merge.

This PR adds extra ARM-based build that should be run after
the AMD64 build. It should not influence the depending steps,
it should just signal failure of the PR if the ARM image cannot
be build.

(cherry picked from commit 2fbd750)
potiuk added a commit to potiuk/airflow that referenced this pull request Jun 29, 2022
The ARM image build introduced in apache#24664 had problem with build
image that was additionally checking for arm images which were
moved out to a spearate step

(cherry picked from commit 5321577)
@jedcunningham jedcunningham deleted the add-arm-image-building branch June 29, 2022 20:40
@ephraimbuddy ephraimbuddy added this to the Airflow 2.3.3 milestone Jun 30, 2022
@ephraimbuddy ephraimbuddy added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Jul 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:dev-tools changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) full tests needed We need to run full set of tests for this PR to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants