
Attaching a logGroomerSidecar to dag processor #30623

Closed
wookiist wants to merge 69 commits

Conversation

wookiist
Contributor

@wookiist wookiist commented Apr 13, 2023

Description

This PR attaches the logGroomer sidecar container when using a standalone dag processor. This is to prevent scheduler logs from growing indefinitely in the logs directory of that dag processor pod.

In fact, one of the Airflow clusters my team uses had about 3.5 TiB of scheduler logs accumulated in the emptyDir of a dag processor pod, which reduced the ephemeral-storage availability on that node to the point that the pod was evicted.

airflow@airflow-test-dag-processor-78f9bfdb88-hmckb:/opt/airflow/logs$ ls
scheduler
airflow@airflow-test-dag-processor-78f9bfdb88-hmckb:/opt/airflow/logs$ du -sh
3.5T	     .

We haven't figured out why the standalone dag processor was accumulating scheduler logs, but we think it's a good idea to attach a logGroomer sidecar, just like the scheduler and worker pods have, to prevent this from happening in the first place.

In this PR, we've modified the helm chart to attach a logGroomer to the standalone dag processor.
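
For illustration, the values enabling the new sidecar would presumably mirror the existing scheduler/workers logGroomerSidecar blocks; the key names and the default below are assumptions, written as the dict that the chart unit tests pass to render_chart rather than as the final chart interface:

# Hypothetical values for this change (key names assumed to mirror the
# existing scheduler/workers logGroomerSidecar blocks; not taken from this
# PR's diff).
values = {
    "dagProcessor": {
        "enabled": True,  # the standalone dag processor is disabled by default
        "logGroomerSidecar": {
            "enabled": True,
            # Assumed knob mirroring the other groomers: prune log files older
            # than this many days from the shared logs volume.
            "retentionDays": 15,
        },
    },
}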

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added the area:helm-chart Airflow Helm Chart label Apr 13, 2023
@boring-cyborg

boring-cyborg bot commented Apr 13, 2023

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or the final approval from Committers.
  • Please follow the ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, the Mailing list and Slack.
  • Be sure to read the Airflow Coding style.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: [email protected]
Slack: https://s.apache.org/airflow-slack

@wookiist wookiist changed the title from Feat/logremovedagprocessor to Attaching a logGroomerSidecar to Dag Processor Apr 13, 2023
@wookiist wookiist changed the title from Attaching a logGroomerSidecar to Dag Processor to Attaching a logGroomerSidecar to dag processor Apr 13, 2023
@potiuk
Member

potiuk commented Apr 14, 2023

I think the scheduler/* logs appear mainly because the standalone DAGFileProcessor has been extracted from the scheduler (normally it runs there). It might be a good idea (@mhenc?) to update the default log configuration so that when the dag file processor is run as standalone, its logs are created in a different folder. Not a big issue, but it might be worth improving.

Member

@jedcunningham jedcunningham left a comment


Can you add test coverage? Follow this example.

@potiuk
Member

potiuk commented Apr 14, 2023

Can you add test coverage? Follow this example.

Right :)

@wookiist
Contributor Author

Can you add test coverage? Follow this example.

Thanks for the example! I've added the test code.

@wookiist
Contributor Author

@potiuk
Additionally, I'm using remote logging to have it write to stdout, which is then passed to filebeat for loading into ES.
I thought that using remote logging doesn't leave any logs on pods like scheduler, but I'm wondering why the dag processor does?

@potiuk
Member

potiuk commented Apr 15, 2023

@potiuk Additionally, I'm using remote logging to have it write to stdout, which is then passed to filebeat for loading into ES. I thought that using remote logging doesn't leave any logs on pods like scheduler, but I'm wondering why the dag processor does?

I believe it leaves logs there - but possibly they are much smaller and not that annoying. The feature to remove local logs automatically when you use remote logging has just recently been merged and is being released (within the next 2 weeks) as part of 2.6.0 #29772

@potiuk
Member

potiuk commented Apr 15, 2023

Ah, my mistake. I think local file deletion was implemented before (but individually for the ES task handler only). The thing is that dag file processor logs are NOT logged remotely by default - only "task" logs are. Scheduler/dag file processor logs are stored locally by default, and you need to separately configure them using the Advanced configuration https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/logging-tasks.html#advanced-configuration if you want to send them to a remote logging system.
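
For reference, a minimal sketch of that advanced configuration (an assumption based on the copy-and-override approach from the docs, not part of this PR; logger and handler names can differ between Airflow versions):

# airflow_local_settings-style sketch: route DagFileProcessor logs to the
# console handler so a log shipper such as filebeat can pick them up from
# stdout instead of the local logs volume.
from copy import deepcopy

from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)

# By default the "airflow.processor" logger writes through a file-based
# handler; pointing it at "console" sends processor logs to stdout instead.
LOGGING_CONFIG["loggers"]["airflow.processor"]["handlers"] = ["console"]

# The module would then be referenced via [logging] logging_config_class
# in airflow.cfg.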

@wookiist
Contributor Author

wookiist commented Apr 18, 2023

After a series of unit test failures, I realized that in LogGroomerTest the render_chart function was not generating the dag processor yaml, because the dag processor is not enabled by default.
When testing the dag processor, I had to add the "dagProcessor": {"enabled": True} value. Working on fixing the code for this.
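
Something along these lines, as a sketch rather than the exact test from this PR (the template path and container name are assumptions modelled on the scheduler/worker groomer tests):

import jmespath

from tests.charts.helm_template_generator import render_chart


def test_dag_processor_log_groomer_is_attached():
    # dagProcessor is disabled by default, so it must be enabled explicitly,
    # otherwise render_chart produces no dag processor documents at all.
    docs = render_chart(
        values={"dagProcessor": {"enabled": True}},
        show_only=["templates/dag-processor/dag-processor-deployment.yaml"],
    )
    container_names = jmespath.search("spec.template.spec.containers[*].name", docs[0])
    assert "dag-processor-log-groomer" in container_names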

@potiuk
Member

potiuk commented Apr 18, 2023

Also, I moved the test file as part of a speed optimization, so you will have to move it too after rebasing (you will have a conflict).

jaewook-oh and others added 8 commits April 19, 2023 09:24
* Remove gauge scheduler.tasks.running

* Add significant.rst file

* Update newsfragments/30374.significant.rst

---------

Co-authored-by: Niko Oliveira <[email protected]>
…atest `resource_version` (apache#30425)

* Recover from `too old resource version exception` by retrieving the latest `resource_version`

* Update airflow/executors/kubernetes_executor.py

Co-authored-by: Tzu-ping Chung <[email protected]>

---------

Co-authored-by: Tzu-ping Chung <[email protected]>
* readding after borked it

* pre-commit

* finally fixing after the github issue last week

* push fix

* feedback from hussein
The new moto (4.1.7) performs additional validation on the queues
created during tests and it fails the tests when content
deduplication is not specified.

Explicitly setting the deduplication mode fixes the problem and
allows the new moto to be installed.
* fix possible race condition when refreshing DAGs

* merge the two queries into one

* Remove provide_session from internal function

Since get_latest_version_hash_and_updated_datetime is internal and we
always pass in the session anyway, the provide_session decorator is
redundant and only introduces the possibility of developer errors.

---------

Co-authored-by: Sébastien Brochet <[email protected]>
Co-authored-by: Tzu-ping Chung <[email protected]>
potiuk and others added 15 commits April 19, 2023 09:41
In some cases when the machine has been reused across builds,
pipx-installed twine might seem both installed and removed (this
happens when builds are cancelled while installing twine).

Installing twine with --force should fix the problem.
…he#30682)

When packages for the "Other" test type are calculated, the list of
all test folders is generated and compared with the packages
previously selected by the "predefined" test types. This is done via
the `find` method, which returns the folders in arbitrary order,
mostly depending on the sequence in which the folders were created.

In case the tests from some packages have side-effects that impact
tests in other packages (obviously not something that is desired),
this might mean that the tests succeed in one environment but fail
in another. This happened, for example, in the case of apache#30362,
which had a cross-package side-effect later fixed in apache#30588.
There, the results of the "Other" test type depended on where the
tests were executed.

This PR sorts the find output so it is always in a consistent order.
We are using ASCII for package names and the test types are derived
in the same Docker CI image with the same LOCALE, so the output of
packages for the "Other" test type should always be consistent.
Upgrading to the latest (released a week ago) MyPy in the hope it
will fix some more problems with attrs after upgrading new packages,
but it seems that even the latest MyPy does not know about the new
typing changes introduced in attrs (traditionally mypy has an attrs
plugin that injects appropriate typing, but apparently it needs to
catch up with those changes).
Helm unit tests use template rendering, and the rendering uses a lot
of CPU for the `helm template` command. We have a lot of these
rendering tests (>800), so even running the tests in parallel on a
multi-CPU machine does not lead to a decreased elapsed time to
execute the tests.

However, each of the tests runs entirely independently, and we
should be able to achieve a much faster elapsed time if we run a
subset of tests on a separate, multi-CPU machine. This will not
lower the job build time, however it might speed up elapsed time
and thus give faster feedback.

This PR achieves that.
…de (apache#30690)

* Add a new argument to raise a skip exception when the python callable exits with the same value

* add unit tests for skip_exit_code
…le codes (apache#30692)

* rename skip_exit_code to skip_on_exit_code and allow providing multiple codes

* replace list type by Container
This change separates out the policies we have for providers into
a separate PROVIDERS.rst file. It also clearly documents the process
and policy we have for accepting new community-managed providers,
explaining the conditions that have to be fulfilled and stating a
very strong preference for keeping providers maintained by the
3rd parties themselves when there are 3rd-party teams that manage
the providers.
@wookiist
Contributor Author

Oh, sorry for the noise. My mistake :(
I rebased incorrectly and pushed it.

@wookiist
Contributor Author

I accidentally rebased the commit, sorry.
I'm trying to continue the story in a new PR. Thank you for your understanding.

#30726

@o-nikolas
Contributor

I accidentally rebased the commit, sorry. I'm trying to continue the story in a new PR. Thank you for your understanding.

#30726

Closing this one then

@o-nikolas o-nikolas closed this Apr 19, 2023