
Attaching a logGroomerSidecar to dag processor #30623

Closed
wookiist wants to merge 69 commits

Conversation

wookiist
Contributor

@wookiist wookiist commented Apr 13, 2023

Description

This PR attaches the logGroomer sidecar container when using a standalone dag processor. This is to prevent scheduler logs from growing indefinitely in the logs directory of that dag processor pod.

In fact, one of the Airflow clusters my team uses had about 3.5 TiB of scheduler logs accumulated in the emptyDir of a dag processor pod, which reduced the ephemeral-storage availability on that node to the point that the pod was evicted.

airflow@airflow-test-dag-processor-78f9bfdb88-hmckb:/opt/airflow/logs$ ls
scheduler
airflow@airflow-test-dag-processor-78f9bfdb88-hmckb:/opt/airflow/logs$ du -sh
3.5T	     .

We haven't figured out why the standalone dag processor was accumulating scheduler logs, but we think it's a good idea to attach a logGroomer sidecar, just like the scheduler and worker pods have, to prevent this from happening in the first place.

In this PR, we've modified the helm chart to attach a logGroomer to the standalone dag processor.
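
For illustration, the values enabling the new sidecar would presumably mirror the existing scheduler/workers logGroomerSidecar blocks; the key names and the default below are assumptions, written as the dict that the chart unit tests pass to render_chart rather than as the final chart interface:

# Hypothetical values for this change (key names assumed to mirror the
# existing scheduler/workers logGroomerSidecar blocks; not taken from this
# PR's diff).
values = {
    "dagProcessor": {
        "enabled": True,  # the standalone dag processor is disabled by default
        "logGroomerSidecar": {
            "enabled": True,
            # Assumed knob mirroring the other groomers: prune log files older
            # than this many days from the shared logs volume.
            "retentionDays": 15,
        },
    },
}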

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added the area:helm-chart Airflow Helm Chart label Apr 13, 2023
@boring-cyborg

boring-cyborg bot commented Apr 13, 2023

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or the final approval from Committers.
  • Please follow the ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, the Mailing list and Slack.
  • Be sure to read the Airflow Coding style.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: [email protected]
Slack: https://s.apache.org/airflow-slack

@wookiist wookiist changed the title from Feat/logremovedagprocessor to Attaching a logGroomerSidecar to Dag Processor Apr 13, 2023
@wookiist wookiist changed the title from Attaching a logGroomerSidecar to Dag Processor to Attaching a logGroomerSidecar to dag processor Apr 13, 2023
@potiuk
Member

potiuk commented Apr 14, 2023

I think the scheduler/* logs appear mainly because the standalone DAGFileProcessor has been extracted from the scheduler (normally it runs there). It might be a good idea (@mhenc?) to update the default log configuration so that when the dag file processor is run as standalone, its logs are created in a different folder. Not a big issue, but it might be worth improving.

Member

@jedcunningham jedcunningham left a comment


Can you add test coverage? Follow this example.

@potiuk
Member

potiuk commented Apr 14, 2023

Can you add test coverage? Follow this example.

Right :)

@wookiist
Contributor Author

Can you add test coverage? Follow this example.

Thanks for the example! I've added the test code.

@wookiist
Contributor Author

@potiuk
Additionally, I'm using remote logging to have it write to stdout, which is then passed to filebeat for loading into ES.
I thought that using remote logging doesn't leave any logs on pods like scheduler, but I'm wondering why the dag processor does?

@potiuk
Member

potiuk commented Apr 15, 2023

@potiuk Additionally, I'm using remote logging to have it write to stdout, which is then passed to filebeat for loading into ES. I thought that using remote logging doesn't leave any logs on pods like scheduler, but I'm wondering why the dag processor does?

I believe it leaves logs there - but possibly they are much smaller and not that annoying. The feature to remove local logs automatically when you use remote logging has just recently been merged and is being released (within the next 2 weeks) as part of 2.6.0 #29772

@potiuk
Member

potiuk commented Apr 15, 2023

Ah, my mistake. I think local file deletion was implemented before (but individually for the ES task handler only). The thing is that dag file processor logs are NOT logged remotely by default - only "task" logs are. Scheduler/dag file processor logs are stored locally by default, and you need to separately configure them using the Advanced configuration https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/logging-tasks.html#advanced-configuration if you want to send them to a remote logging system.
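
For reference, a minimal sketch of that advanced configuration (an assumption based on the copy-and-override approach from the docs, not part of this PR; logger and handler names can differ between Airflow versions):

# airflow_local_settings-style sketch: route DagFileProcessor logs to the
# console handler so a log shipper such as filebeat can pick them up from
# stdout instead of the local logs volume.
from copy import deepcopy

from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)

# By default the "airflow.processor" logger writes through a file-based
# handler; pointing it at "console" sends processor logs to stdout instead.
LOGGING_CONFIG["loggers"]["airflow.processor"]["handlers"] = ["console"]

# The module would then be referenced via [logging] logging_config_class
# in airflow.cfg.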

@wookiist
Contributor Author

wookiist commented Apr 18, 2023

After a series of unit test failures, I realized that in LogGroomerTest the render_chart function was not generating the dag processor yaml, because the dag processor is not enabled by default.
When testing the dag processor, I had to add the "dagProcessor": {"enabled": True} value. Working on fixing the code for this.
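
Something along these lines, as a sketch rather than the exact test from this PR (the template path and container name are assumptions modelled on the scheduler/worker groomer tests):

import jmespath

from tests.charts.helm_template_generator import render_chart


def test_dag_processor_log_groomer_is_attached():
    # dagProcessor is disabled by default, so it must be enabled explicitly,
    # otherwise render_chart produces no dag processor documents at all.
    docs = render_chart(
        values={"dagProcessor": {"enabled": True}},
        show_only=["templates/dag-processor/dag-processor-deployment.yaml"],
    )
    container_names = jmespath.search("spec.template.spec.containers[*].name", docs[0])
    assert "dag-processor-log-groomer" in container_names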

@potiuk
Member

potiuk commented Apr 18, 2023

Also, I moved the test file as part of a speed optimization, so you will have to move it too after rebasing (you will have a conflict).

jaewook-oh and others added 8 commits April 19, 2023 09:24
* Remove gauge scheduler.tasks.running

* Add significant.rst file

* Update newsfragments/30374.significant.rst

---------

Co-authored-by: Niko Oliveira <[email protected]>
…atest `resource_version` (apache#30425)

* Recover from `too old resource version exception` by retrieving the latest `resource_version`

* Update airflow/executors/kubernetes_executor.py

Co-authored-by: Tzu-ping Chung <[email protected]>

---------

Co-authored-by: Tzu-ping Chung <[email protected]>
* readding after borked it

* pre-commit

* finally fixing after the github issue last week

* push fix

* feedback from hussein
The new moto (4.1.7) performs additional validation on the queues
created during tests and it fails the tests when content
deduplication is not specified.

Explicitly setting the deduplication mode fixes the problem and
allows the new moto to be installed.
* fix possible race condition when refreshing DAGs

* merge the two queries into one

* Remove provide_session from internal function

Since get_latest_version_hash_and_updated_datetime is internal and we
always pass in the session anyway, the provide_session decorator is
redundant and only introduces the possibility of developer errors.

---------

Co-authored-by: Sébastien Brochet <[email protected]>
Co-authored-by: Tzu-ping Chung <[email protected]>
potiuk and others added 15 commits April 19, 2023 09:41
In some cases when the machine has been reused across builds,
pipx-installed twine might seem both installed and removed (this
happens when builds are cancelled while installing twine).

Installing twine with --force should fix the problem.
…he#30682)

When packages for the "Other" test type are calculated, the list of
all test folders is generated and compared with the packages
previously selected by the "predefined" test types. This is done via
the `find` method, which returns the folders in arbitrary order,
mostly depending on the sequence in which the folders were created.

In case the tests from some packages have side-effects that impact
tests in other packages (obviously not something that is desired),
this might mean that the tests succeed in one environment but fail
in another. This happened, for example, in the case of apache#30362,
which had a cross-package side-effect later fixed in apache#30588.
There, the results of the "Other" test type depended on where the
tests were executed.

This PR sorts the find output so it is always in a consistent order.
We are using ASCII for package names and the test types are derived
in the same Docker CI image with the same LOCALE, so the output of
packages for the "Other" test type should always be consistent.
Upgrading to the latest (released a week ago) MyPy in the hope it
will fix some more problems with attrs after upgrading new packages,
but it seems that even the latest MyPy does not know about the new
typing changes introduced in attrs (traditionally mypy has an attrs
plugin that injects appropriate typing, but apparently it needs to
catch up with those changes).
Helm unit tests use template rendering, and the rendering uses a lot
of CPU for the `helm template` command. We have a lot of these
rendering tests (>800), so even running the tests in parallel on a
multi-CPU machine does not lead to a decreased elapsed time to
execute the tests.

However, each of the tests runs entirely independently, and we
should be able to achieve a much faster elapsed time if we run a
subset of tests on a separate, multi-CPU machine. This will not
lower the job build time, however it might speed up elapsed time
and thus give faster feedback.

This PR achieves that.
…de (apache#30690)

* Add a new argument to raise a skip exception when the python callable exits with the same value

* add unit tests for skip_exit_code
…le codes (apache#30692)

* rename skip_exit_code to skip_on_exit_code and allow providing multiple codes

* replace list type by Container
This change separates out the policies we have for providers into
a separate PROVIDERS.rst file. It also clearly documents the process
and policy we have for accepting new community-managed providers,
explaining the conditions that have to be fulfilled and stating a
very strong preference for keeping providers maintained by the
3rd parties themselves when there are 3rd-party teams that manage
the providers.
@wookiist
Contributor Author

Oh, sorry for the noise. My mistake :(
I rebased incorrectly and pushed it.

@wookiist
Contributor Author

I accidentally rebased the commit, sorry.
I'm trying to continue the story in a new PR. Thank you for your understanding.

#30726

@o-nikolas
Contributor

I accidentally rebased the commit, sorry. I'm trying to continue the story in a new PR. Thank you for your understanding.

#30726

Closing this one then

@o-nikolas o-nikolas closed this Apr 19, 2023