
Fetch served logs also when task attempt is up for retry and no remote logs available #39496

Merged · 3 commits into apache:main · May 21, 2024

Conversation

kahlstrm (Contributor) commented May 8, 2024

This PR is a continuation of #39177: while testing it, I noticed that when the task's state is TaskInstanceState.UP_FOR_RETRY, the logs for all attempts are unavailable when the source is served logs.

I also noticed that the tests previously covered only the TaskInstanceState.RUNNING case, even though TaskInstanceState.DEFERRED (and now TaskInstanceState.UP_FOR_RETRY) also applies. The tests now cover all three states; a sketch of the change follows.
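For context, here is a minimal sketch of the state-gated fallback this PR extends. It is illustrative only: fetch_logs, read_from_logs_server, and the argument shapes are stand-ins, not Airflow's actual FileTaskHandler internals.

```python
# Illustrative sketch only; names are stand-ins for Airflow's internals.
from airflow.utils.state import TaskInstanceState

# Before this PR, UP_FOR_RETRY was missing from the fallback states, so a
# task whose previous attempt had just failed could not fall back to
# worker-served logs when no local or remote copy existed.
STATES_WITH_SERVED_LOGS = (
    TaskInstanceState.RUNNING,
    TaskInstanceState.DEFERRED,
    TaskInstanceState.UP_FOR_RETRY,
)


def fetch_logs(ti, local_logs, remote_logs, read_from_logs_server):
    """Return whatever logs are available, preferring remote, then local."""
    if remote_logs:
        return remote_logs
    if local_logs:
        return local_logs
    if ti.state in STATES_WITH_SERVED_LOGS:
        # Nothing persisted yet: ask the worker's log-serving endpoint.
        return read_from_logs_server(ti)
    return []
```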


kahlstrm force-pushed the main branch 3 times, most recently from fc45184 to c29563e, on May 9, 2024 09:34
ferruzzi (Contributor) left a comment

Left one non-blocking suggestion, otherwise LGTM

tests/utils/test_log_handlers.py (review thread, outdated and resolved)
kahlstrm force-pushed the main branch 4 times, most recently from 2883512 to 33c830e, on May 11, 2024 11:24
eladkal added this to the Airflow 2.9.2 milestone on May 11, 2024
eladkal added the type:bug-fix (Changelog: Bug Fixes) label on May 11, 2024
eladkal (Contributor) commented May 20, 2024

@kahlstrm can you fix the failing tests?

kahlstrm (Contributor, Author) commented

> @kahlstrm can you fix the failing tests?

Sure, I didn't realize the failing tests were related; one sec.

kahlstrm (Contributor, Author) commented

@eladkal fixed the tests. It seems the odd behavior with TaskInstanceState.DEFERRED was fixed elsewhere, and my workaround for it in these tests broke as a result.
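For reference, coverage of all three states could look like this hypothetical pytest parametrization. It is not the contents of tests/utils/test_log_handlers.py; it reuses the fetch_logs sketch from the PR description above, and the task instance is a plain stand-in object.

```python
# Hypothetical test, not the real test_log_handlers.py; SimpleNamespace
# stands in for Airflow's TaskInstance model.
import pytest
from types import SimpleNamespace

from airflow.utils.state import TaskInstanceState


@pytest.mark.parametrize(
    "state",
    [
        TaskInstanceState.RUNNING,
        TaskInstanceState.DEFERRED,
        TaskInstanceState.UP_FOR_RETRY,
    ],
)
def test_served_logs_used_when_nothing_persisted(state):
    ti = SimpleNamespace(state=state)
    # fetch_logs is the sketch from the PR description above.
    logs = fetch_logs(
        ti,
        local_logs=[],
        remote_logs=[],
        read_from_logs_server=lambda _ti: ["served line"],
    )
    assert logs == ["served line"]
```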

eladkal requested a review from dstandish on May 21, 2024 04:26
ferruzzi merged commit ec2e245 into apache:main on May 21, 2024
42 checks passed
ephraimbuddy pushed a commit that referenced this pull request Jun 4, 2024
ephraimbuddy (Contributor) commented Jun 5, 2024

I tried to get this into 2.9.2 but the test was failing: https://github.com/apache/airflow/actions/runs/9383343196/job/25837180680?pr=40050#step:7:3547

kahlstrm (Contributor, Author) commented Jun 5, 2024

> I tried to get this into 2.9.2 but the test was failing: https://github.com/apache/airflow/actions/runs/9383343196/job/25837180680?pr=40050#step:7:3547

Hmm, yeah. I'd guess this has something to do with the flakiness of the TaskInstanceState.DEFERRED state, as I encountered something similar while writing these tests. I can take a second look at some point.

kahlstrm (Contributor, Author) commented

@ephraimbuddy Filed a PR that simply removes the inconsistent test case for TaskInstanceState.DEFERRED; it can be found at #40257.

romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Jul 26, 2024
dstandish (Contributor) commented

@kahlstrm there are some issues with this code now that I'm looking into.

But here's a question for you: why should we check served logs when the task is up for retry? I think it's reasonable to assume that when the task is in a running state, the logs are where they are going to be -- either on a shared volume or in remote storage.

kahlstrm (Contributor, Author) commented

> @kahlstrm there are some issues with this code now that I'm looking into.
>
> But here's a question for you: why should we check served logs when the task is up for retry? I think it's reasonable to assume that when the task is in a running state, the logs are where they are going to be -- either on a shared volume or in remote storage.

The issue we encountered is that when the task instance's state was up for retry, i.e. the previous attempt had failed, the logs for that attempt weren't available for the period the task instance stayed in the up-for-retry state.

Example:

Attempt 1 fails for a reason that is easiest to see in the logs, and the task instance is marked up for retry. While it is in the up-for-retry state, the logs for the previous attempts are not fetched; the only way to access them during that window is to SSH into the worker and find the log file.

Does this use case make sense? We only started observing these problems after upgrading to 2.7.0 and later.
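To make that window concrete, here is a small self-contained simulation under the same assumptions as the fetch_logs sketch above; SimpleNamespace stands in for a real task instance, and the state tuples mirror the pre- and post-fix fallback checks.

```python
# Simulates the "no logs while up for retry" window described above.
from types import SimpleNamespace

from airflow.utils.state import TaskInstanceState

ti = SimpleNamespace(state=TaskInstanceState.UP_FOR_RETRY)
served = lambda _ti: ["attempt 1 traceback ..."]  # worker still has the file

# Pre-fix fallback states: UP_FOR_RETRY is absent, so the UI shows nothing,
# even though the worker can still serve the failed attempt's log.
pre_fix = (TaskInstanceState.RUNNING, TaskInstanceState.DEFERRED)
assert (served(ti) if ti.state in pre_fix else []) == []

# Post-fix fallback states: the same task instance now yields served logs.
post_fix = pre_fix + (TaskInstanceState.UP_FOR_RETRY,)
assert (served(ti) if ti.state in post_fix else []) == ["attempt 1 traceback ..."]
```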

dstandish (Contributor) commented Aug 1, 2024

> The issue we encountered is that when the task instance's state was up for retry, i.e. the previous attempt had failed, the logs for that attempt weren't available for the period the task instance stayed in the up-for-retry state.

This doesn't sound right. When a task fails (whether or not there are remaining retries), the logs should be wherever they are going to end up. Does your webserver have access to the shared volume?

Labels: area:logging, type:bug-fix (Changelog: Bug Fixes)
5 participants