Move setgid as the first command executed in forked task runner #20040

potiuk · 2021-12-04T20:28:55Z

The runner executed the setgid command only after several airflow
imports, which can take quite some time when they run for the first
time (possibly even a few seconds). The setgid command should run as
early as possible: if any of the imports failed, the runner would
fail before setgid was ever called.
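
A minimal sketch of the ordering described above, not the actual Airflow runner code. It assumes that "setgid" here means putting the forked child into its own process group with `os.setpgid(0, 0)`; that reading, the function name `run_in_forked_child`, and the `slow_work` callable (a stand-in for the expensive imports and the task itself) are all assumptions made for illustration.

```python
import os
import signal


def run_in_forked_child(slow_work):
    """Fork; the child joins its own process group before doing anything slow."""
    pid = os.fork()
    if pid:
        return pid  # parent: hand back the child's pid right away

    # Child process from here on.
    try:
        # First instruction: new process group (assumed meaning of "setgid").
        os.setpgid(0, 0)
        # Restore default signal handling before the slow part as well.
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        signal.signal(signal.SIGTERM, signal.SIG_DFL)
        # Only now do the potentially slow or failing work.
        slow_work()
        code = 0
    except BaseException:
        code = 1
    finally:
        # Leave the child without running the parent's cleanup handlers.
        os._exit(code)
```

With this ordering, a failure inside `slow_work` can no longer prevent the process group from being set, because that call has already happened.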

This also caused the test_start_and_terminate test to fail in CI,
because the imports could take arbitrarily long (depending on
parallel tests and on whether the imported modules were already
loaded in the process), so the gid could be set more than 0.5
seconds after the start.

This change fixes it in two ways:

* setgid is now the first instruction executed (signal handling
  was also moved before the potentially long imports)
* the test now waits actively and only fails after a timeout
  of 1s (which should not happen because of the fix above)
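
The "wait actively" part of the fix can be sketched roughly as follows. This is an illustrative helper, not the code from the test itself; the name `wait_until` and its parameters are hypothetical.

```python
import time


def wait_until(condition, timeout=1.0, interval=0.01):
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Returns True as soon as the condition holds, and False only when the
    deadline has passed without the condition ever holding.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A test could then assert something like `wait_until(lambda: os.getpgid(pid) == pid)` instead of sleeping for a fixed 0.5 seconds. The check returns as soon as the condition holds, so the common case is faster than a fixed sleep while slow CI runs still get the full timeout.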

Additionally, the test used the `task test` command rather than
`task run`. In some circumstances, when run locally with FORK
disabled (macOS), the same test could fail with a different error,
because the --error-file flag is not defined for the `task test`
command but the runner adds it automatically.

The task command has been changed to `run`.
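
The failure mode with the extra flag can be illustrated with a toy `argparse` parser. This is not Airflow's CLI: the program name, the subcommand layout, and the `accepts` helper are hypothetical, and only the `--error-file` flag name is taken from the description above.

```python
import argparse


def build_parser():
    """Toy parser: `run` knows --error-file, `test` does not."""
    parser = argparse.ArgumentParser(prog="toy-tasks")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run")
    run.add_argument("task_id")
    run.add_argument("--error-file")

    test = sub.add_parser("test")
    test.add_argument("task_id")
    return parser


def accepts(argv):
    """Return True if the toy parser accepts argv, False if it rejects it."""
    try:
        build_parser().parse_args(argv)
        return True
    except SystemExit:  # argparse exits on unrecognized arguments
        return False
```

In this sketch, `accepts(["run", "t1", "--error-file", "/tmp/err"])` succeeds, while the same flag appended to the `test` subcommand is rejected. That mirrors why a runner that always appends the flag needs the `run` command.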

Fixing this test caused occasional test_on_kill failures; that test
suffered from a similar problem and had a similar sleep
implemented.

Thanks to this change the tests will usually be faster, as no
significant delays are introduced.


github-actions · 2021-12-04T22:23:20Z

The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

The previous fix in apache#20040 improved the forked tests but also caused instability in the "on_kill" test for the standard task runner. This PR fixes the instability by signalling when the task has started rather than waiting for a fixed amount of time, and it adds better diagnostics for the test.
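
One way to signal that a task has started, instead of waiting a fixed amount of time, is a marker file that the task writes as its first action. The sketch below is a hypothetical illustration of that idea, not the mechanism the follow-up PR actually uses; `CHILD_CODE`, `start_task`, and `wait_for_start` are invented names.

```python
import subprocess
import sys
import time
from pathlib import Path

# Script run by the child: announce the start, then act like a long task.
CHILD_CODE = """
import sys, time
from pathlib import Path
Path(sys.argv[1]).write_text("started")
time.sleep(30)
"""


def start_task(marker: Path) -> subprocess.Popen:
    """Launch the stand-in task; it creates `marker` once it is running."""
    return subprocess.Popen([sys.executable, "-c", CHILD_CODE, str(marker)])


def wait_for_start(marker: Path, timeout: float = 10.0) -> bool:
    """Poll for the marker file instead of sleeping a fixed amount of time."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if marker.exists():
            return True
        time.sleep(0.05)
    return marker.exists()
```

A test would call `wait_for_start` before terminating the process, so the kill is only sent once the task is known to be running. A failed wait also yields a clearer diagnostic than a timing-dependent assertion.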

Move setgid as the first command executed in forked task runner #20040 (cherry picked from commit abe01fa)

Fix flaky on_kill #20054 (cherry picked from commit e2345ff)

boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Dec 4, 2021

potiuk requested review from ashb, kaxil, uranusjr, jmcarp, dstandish and jedcunningham December 4, 2021 20:28

potiuk force-pushed the fix-failing-test-start-and-terminate-test branch 3 times, most recently from 5ef4449 to b7b8735 Compare December 4, 2021 20:54

potiuk force-pushed the fix-failing-test-start-and-terminate-test branch from b7b8735 to 49096bb Compare December 4, 2021 21:18

mik-laj approved these changes Dec 4, 2021

github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Dec 4, 2021

potiuk merged commit abe01fa into apache:main Dec 4, 2021

potiuk deleted the fix-failing-test-start-and-terminate-test branch December 4, 2021 22:38

potiuk mentioned this pull request Dec 5, 2021

Fix flaky on_kill #20054

Merged

jedcunningham added this to the Airflow 2.2.3 milestone Dec 11, 2021

jedcunningham added the type:bug-fix Changelog: Bug Fixes label Dec 11, 2021

jedcunningham mentioned this pull request Dec 14, 2021

Status of testing of Apache Airflow 2.2.3rc2 #20208

Closed

38 tasks
