Fix "This Session's transaction has been rolled back" #25532
Conversation
What was the crash you experienced?
BTW. At first look - the crash could be caused by some DB object held by the IntegrityError and accessed by the secret masker. The difference vs. the original implementation is that our logging (which can have secret masking included) walks through all objects in the exception as well, so it could have accessed something that would be released by rollback(). That could explain why you cannot reproduce it in tests. An easy way to verify that hypothesis is to disable secret masking (just to be sure - remove it from the logging config rather than via Airflow configuration) and repeat your tests @ephraimbuddy.
However - this could also be just "logging" (no secret masking involved) - to verify this hypothesis you would have to repeat the tests without exc_info in the log.
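As a minimal sketch of what dropping exc_info changes (plain stdlib logging, not Airflow's logging config - the logger name here is made up): without exc_info only the message is formatted; with it, the handler also renders the traceback, which means the exception object (and anything it references) is walked at logging time.

```python
import io
import logging

buf = io.StringIO()
log = logging.getLogger("repro")
log.setLevel(logging.INFO)
log.addHandler(logging.StreamHandler(buf))

try:
    raise ValueError("boom")
except ValueError:
    # Without exc_info, only the message string is emitted.
    log.info("insert failed")
    # With exc_info, the handler also formats the traceback, touching the
    # exception object (and anything it references) at logging time.
    log.info("insert failed", exc_info=True)

print(buf.getvalue())
```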
Here's the log:
Nothing obvious - but it would be great to verify these two hypotheses, because maybe we can discover a pattern that we should avoid (or even forbid via pre-commit).
I removed exc_info and it still failed. It's the logging because once I move it past it, it doesn't crash. |
I have another hypothesis:
And we are trying to access it in the log:
run_id is a database-backed SQLAlchemy attribute. When it failed to be added due to the IntegrityError and you access it before rollback, MAYBE SQLAlchemy knows it just failed to add it, and the object is in a limbo state where it is not yet added but not yet detached either, so it will return the error. When you do rollback() the DagRun object itself becomes detached and you can access its run_id. It's a long shot - but maybe? You can test THAT hypothesis by logging a bogus string instead of self.run_id.
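That limbo-state behaviour can be exercised outside Airflow. A sketch with a toy model (assuming SQLAlchemy 1.4+; Run and run_id here are made-up stand-ins, not the Airflow classes): after the failed flush, touching an expired attribute needs the DB and raises the "This Session's transaction has been rolled back" error, while the same access works again after rollback().

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Run(Base):
    __tablename__ = "run"
    id = Column(Integer, primary_key=True)
    run_id = Column(String, unique=True)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

session = Session(engine)
ok = Run(run_id="r1")
session.add(ok)
session.commit()               # expire_on_commit=True: ok.run_id is now expired

session.add(Run(run_id="r1"))  # duplicate -> unique constraint violation
limbo_error = None
try:
    session.flush()
except IntegrityError:
    try:
        ok.run_id              # expired attribute needs a SELECT -> blows up
    except Exception as exc:
        limbo_error = exc      # "This Session's transaction has been rolled back ..."
    session.rollback()

print(type(limbo_error).__name__)
print(ok.run_id)               # after rollback() the attribute loads fine again
```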
Still failed. I moved the log after the other log that used the
Just to clarify (because I am not 100% sure) - did it also fail when you did that?
If that's the case, then I am afraid the root cause is somewhere deeper. Probably we see some race condition manifesting itself. Accessing self.log should not trigger it. Another option is that one of your logging handlers (do you have any?) accesses the DB in some way.
Exactly the code you have above is what I tried, and it failed. I don't have any logging handlers, and no environment variables were set either.
Hmm. Can you really reproduce it? Maybe there was some mistake when you ran it and you ran another version (or maybe a .pyc file was compiled or something like that). When I look closer at the stack trace, it DOES seem like that last hypothesis of mine.
^^ this is part of the log.info where run_id is accessed.
^^ And it seems SQLAlchemy tries to retrieve the object and load it, and it tries to execute a SQL command but it finds out that a rollback is pending and should be executed first. So it would be rather strange to see it happening with this code:
Can you produce such an exception, see that the line numbers match, and get the stack trace?
I'd only think such an error would be possible (with the last code) if the DagRun object was created in a DIFFERENT session than the one that has just got rollback() applied.
This attribute has been marked as expired, so SQLA is trying to reload the value. SQLA doesn't do this by itself, so somewhere we are expiring the object (but leaving it attached to the session). If it was detached via
Do we make any calls to
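For reference, a sketch of the expire-vs-detach distinction being discussed (toy model again, SQLAlchemy 1.4+ assumed; Item is a made-up class): an expired-but-attached object silently re-SELECTs on access, while a detached object with expired attributes raises DetachedInstanceError instead.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base
from sqlalchemy.orm.exc import DetachedInstanceError

Base = declarative_base()

class Item(Base):
    __tablename__ = "item"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

session = Session(engine)
item = Item(name="a")
session.add(item)
session.commit()

session.expire(item)       # still attached: next access re-SELECTs the row
reloaded = item.name       # loaded transparently from the DB

session.expire(item)
session.expunge(item)      # detached: expired attributes can no longer load
detached_error = None
try:
    item.name
except DetachedInstanceError as exc:
    detached_error = exc

print(reloaded, type(detached_error).__name__)
```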
I think it is the same issue as the one here: apache/superset#530
So I think the problem is due to the bulk nature of the change above. DagRun is modified by one of the "successful" changes, but then the exception makes the connection dirty. The DagRun object is not "expunged" - it's just marked as modified, because part of the bulk update actually succeeded.
If that's the case, then moving rollback() earlier is "good" - because it effectively detaches the object from the session, and it is not going to be refreshed even if it is dirty / modified.
@potiuk You are right, I had to test now by restarting breeze. With just |
force-pushed from f36636e to eefb527
So we have a VERY plausible explanation then :).
I looked at the code and this seems to be the only place (except tests) where we use
However, I think it would be good if we add a description of the problem here and link to the superset issue in a comment, just to capture it somewhere in the code (I am thinking about my future self in a year, hitting a similar issue and trying to find out what it was about when I recall that something similar happened in the past) :)
force-pushed from eefb527 to b96f4e0
Look how I did it now. I feel it's much better and more understandable.
Perfect!
You probably want to rename it/update commit description before merge :)
force-pushed from b96f4e0 to 53a46d3
force-pushed from 53a46d3 to 6e6b3af
Accessing the run_id (self.run_id) on exception leads to an error, because sessions are invalidated on exception. Here we extract the run_id before handling the exception.
force-pushed from 6e6b3af to 0409c12
(cherry picked from commit 5668888)
Accessing the run_id on exception leads to an error, because sessions are invalidated on exception. Here we extract the information we need to log the exception before handling it.
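A sketch of the pattern that commit message describes, again with a toy model rather than the real DagRun (SQLAlchemy 1.4+ assumed; Run and the message text are illustrative): the plain string is copied out while the session is healthy, so logging inside the except block never touches the broken session.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Run(Base):
    __tablename__ = "run"
    id = Column(Integer, primary_key=True)
    run_id = Column(String, unique=True)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

session = Session(engine)
dr = Run(run_id="r1")
session.add(dr)
session.commit()

run_id = dr.run_id             # plain str, captured while the session is usable
session.add(Run(run_id="r1"))  # duplicate insert
try:
    session.flush()
except IntegrityError:
    # Building the log message from the captured string is safe; touching
    # dr.run_id here would raise the "transaction has been rolled back" error.
    message = f"Hit IntegrityError while creating TIs for run {run_id}"
    session.rollback()

print(message)
```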
Here's how to reproduce the scheduler crash:
1. Run this dag in main:
2. Stop the scheduler.
3. Reduce the list in get_files to something like [1, 2, 3].
4. Change difference to symmetric_difference:
airflow/airflow/models/dagrun.py, line 1189 in 171aaf0
5. Start the scheduler and clear the above task so it can run again.
Now it'll try to create new TIs and the scheduler will crash. Switch to this PR and try the same again. You will notice that the scheduler survives the crash and the rollback is successful.
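A guess at why the symmetric_difference swap triggers the crash, using plain sets with illustrative values (not the actual dagrun.py variables): once the TIs already stored are a superset of the ones now wanted, difference yields nothing to insert, while symmetric_difference yields rows that already exist, and re-inserting them is what raises the IntegrityError.

```python
existing = {"f1", "f2", "f3", "f4", "f5"}  # TIs already in the DB
wanted = {"f1", "f2", "f3"}                # after shrinking get_files()

to_create = wanted.difference(existing)        # set(): nothing new to insert
buggy = wanted.symmetric_difference(existing)  # {"f4", "f5"}: these rows
                                               # already exist, so inserting
                                               # them again violates the
                                               # unique constraint

print(sorted(to_create), sorted(buggy))
```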