
bpo-36607: Eliminate RuntimeError raised by asyncio.all_tasks() #13971

Merged
merged 2 commits on Jun 11, 2019

Conversation

asvetlov
Contributor

@asvetlov asvetlov commented Jun 11, 2019

If internal tasks weak set is changed by another thread during iteration.

https://bugs.python.org/issue36607

If internal tasks weak set is changed by another thread during iteration.
@asvetlov asvetlov requested a review from 1st1 as a code owner June 11, 2019 10:54
@asvetlov asvetlov changed the title from "Eliminate RuntimeError raised by asyncio.all_tasks()" to "bpo-36607: Eliminate RuntimeError raised by asyncio.all_tasks()" on Jun 11, 2019
try:
    tasks = list(_all_tasks)
except RuntimeError:
    pass
Member

I don't like unchecked loops like that. There must be a retry count or something.

Contributor Author

The number of iterations is the question.
What is good enough to prevent a collision? 10? 100? 1000?
I think the number depends on the number of threads and the number of tasks inside every processed loop.

Member

I'd say 1000 should be enough. It's high enough to give it enough attempts and it's low enough to abort early if something weird is going on.
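For context, here is a sketch of roughly the shape the bounded-retry version takes, assuming the surrounding asyncio.tasks module namespace (events, futures and the module-level _all_tasks WeakSet); the retry cap of 1000 follows the number discussed above:

def all_tasks(loop=None):
    """Return a set of all tasks for the loop."""
    if loop is None:
        loop = events.get_running_loop()
    # list(_all_tasks) itself iterates the WeakSet, so a task added from
    # another thread can still make it raise RuntimeError; retry a bounded
    # number of times instead of looping forever.
    i = 0
    while True:
        try:
            tasks = list(_all_tasks)
        except RuntimeError:
            i += 1
            if i >= 1000:
                raise
        else:
            break
    return {t for t in tasks
            if futures._get_loop(t) is loop and not t.done()}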

@bedevere-bot

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@@ -42,9 +42,19 @@ def all_tasks(loop=None):
     """Return a set of all tasks for the loop."""
     if loop is None:
         loop = events.get_running_loop()
-    # NB: set(_all_tasks) is required to protect
+    # NB: list(_all_tasks) is required to protect
     # from https://bugs.python.org/issue34970 bug
Member

Can we write a proper comment here without referencing an issue on bpo? I mean, a URL is fine, but it requires the reader to navigate to it in order to understand what's going on.

     # from https://bugs.python.org/issue34970 bug
-    return {t for t in list(_all_tasks)
+    # NB: Have to repeat on RuntimeError, other thread
Member

There's no point in prefixing all comments with "NB"; it's just a regular comment. I'd suggest writing something along the following lines (as it took me some time to understand why this change is necessary at all):

"Looping over a WeakSet (_all_tasks) isn't safe as it can be updated from another thread while we do so. Therefore we cast it to list prior to filtering. The list cast itself requires iteration, so we repeat it several times ignoring RuntimeErrors (which are not very likely to occur). See issues 34970 and 36607 for details."

@miss-islington
Contributor

Thanks @asvetlov for the PR 🌮🎉.. I'm working now to backport this PR to: 3.7, 3.8.
🐍🍒⛏🤖

@asvetlov asvetlov deleted the protect-all_tasks branch June 11, 2019 15:27
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jun 11, 2019
…onGH-13971)

If internal tasks weak set is changed by another thread during iteration.

https://bugs.python.org/issue36607
(cherry picked from commit 65aa64f)

Co-authored-by: Andrew Svetlov <[email protected]>
@bedevere-bot

GH-13975 is a backport of this pull request to the 3.8 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jun 11, 2019
…onGH-13971)

If internal tasks weak set is changed by another thread during iteration.

https://bugs.python.org/issue36607
(cherry picked from commit 65aa64f)

Co-authored-by: Andrew Svetlov <[email protected]>
@bedevere-bot

GH-13976 is a backport of this pull request to the 3.7 branch.

@nickdavies

This approach feels very wrong to me as we aren't fixing the underlying problem of multiple threads mutating the same global variable without locks.

Some more concrete issues:

  • We are trusting that RuntimeError is thrown in every case where this non-threadsafe mutation occurs. A simple scenario where something is removing entries (e.g. GC) while something else is adding them (e.g. another thread) can result in no RuntimeError at all; instead you could get some weird invalid behavior.

  • WeakSet/weakrefs are not threadsafe, and we are really lucky that, given the current code, it happens to crash in a consistent way. The size-change check during iteration is not a valid way of detecting this issue, and I fear that this patch (which repeatedly retries the racy operation) could hide the issue; any change to asyncio, weakref, etc. could turn the crash into corruption, memory leaks, or other strange behavior.

  • There is no theoretical limit to the number of times this loop can run. If you have a really heavy asyncio workload (e.g. hundreds of thousands of tasks spread over many threads), a sustained task-adding rate can be comparable to how long it takes to copy the list. It also makes this function perform at a very unpredictable speed: sometimes it copies the tasks list once, other times it might copy it tens or hundreds of times.

  • If you do reach the limit defined in the code, we are back to square one. A user can still write a perfectly valid program and have this error raised at them, in which case they have no way of avoiding the problem.

In my opinion the only valid solution to this problem is to remove the possibility of two threads mutating the _all_tasks global. The cleanest way to do this is to move the _all_tasks variable to be a property on the loop itself instead of a module-level variable (see the sketch below).
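A hedged sketch of that shape, assuming the asyncio module namespace (events); the per-loop attribute and the simplified filtering are illustrative only, and a real change would also need Task to register itself with its own loop:

import weakref

class BaseEventLoop:
    def __init__(self):
        # Hypothetical: each loop owns its task set, and it is only ever
        # mutated from the thread that runs this loop, so no cross-thread race.
        self._all_tasks = weakref.WeakSet()

def all_tasks(loop=None):
    if loop is None:
        loop = events.get_running_loop()
    # No module-level WeakSet shared between threads any more.
    return {t for t in list(loop._all_tasks) if not t.done()}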

miss-islington added a commit that referenced this pull request Jun 11, 2019
…3971)

If internal tasks weak set is changed by another thread during iteration.

https://bugs.python.org/issue36607
(cherry picked from commit 65aa64f)

Co-authored-by: Andrew Svetlov <[email protected]>
@1st1
Member

1st1 commented Jun 11, 2019

Using a lock is problematic because of GC.

Making all_tasks a loop property is a good idea. Andrew?

@nickdavies

Yeah, locks would likely be either impossible or almost impossible to implement correctly to solve this. I mentioned this on the issue as a complexity of trying to solve this with locks.

... this was my concern too; the adding and removing from the WeakDict[AbstractEventLoop, WeakSet[Task]] for _all_tasks could still cause issues. Specifically, the whole WeakSet class is not threadsafe, so I would assume WeakDict is the same; there may not be a nice way of ensuring that a combination of GC + the IterationGuard doesn't come along and mess up the dict, even if I wrap it in a threading lock.

@1st1
Member

1st1 commented Jun 11, 2019

One downside to making this a loop attribute is that the attribute becomes part of the AbstractEventLoop API. Other implementations (uvloop, tokio) would need to implement it. Another option is to add an indirection: a separate mapping of loop -> all_tasks.
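A hedged sketch of that indirection with hypothetical names: a module-level registry keyed by loop, so AbstractEventLoop itself gains no new attribute. The registry itself would still need care around concurrent loop creation and GC, as noted elsewhere in this thread:

import weakref

# Hypothetical module-level registry: one WeakSet of tasks per loop.
_tasks_per_loop = weakref.WeakKeyDictionary()

def _register_task(task, loop):
    # Would be called from Task.__init__, i.e. from the loop's own thread.
    _tasks_per_loop.setdefault(loop, weakref.WeakSet()).add(task)

def all_tasks(loop=None):
    if loop is None:
        loop = events.get_running_loop()
    tasks = _tasks_per_loop.get(loop, ())
    return {t for t in list(tasks) if not t.done()}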

@asvetlov
Contributor Author

Making all_tasks a loop property is a good idea.

Ideally yes.
I have no strong objection, but two comments:

  1. It adds a new public API. All existing loop implementations are not forward-compatible with the change; e.g. uvloop would have to be released immediately to work with the new asyncio version.
  2. Since this is a new API, the change is not acceptable for 3.7 at least. Maybe you can convince @ambv, but I prefer reserving one more thing for threaded child watchers :)

To defend the current try-again approach: yes, theoretically it can still fail; the PR just reduces the chance of failure by several orders of magnitude.
WeakSet is safe in the sense that the structure never crashes the interpreter, even when used by several threads in parallel (like dict and any other Python builtin collection).
It can return wrong data (extra or missing elements), yes. But the same problem exists if you analyze the tasks list just after getting the function's return value.

Suppose we have this code:

def all_tasks(loop=None):
    if loop is None:
        loop = events.get_running_loop()
    with _lock:
        tasks = list(_all_tasks)
    return {t for t in tasks if futures._get_loop(t) is loop and not t.done()}

The local tasks variable is a snapshot, not a synchronized (lock-protected) value.
Moreover, the next line filters out tasks from foreign loops.
It is very safe if the passed loop is None (the thread that executes the function is the same thread that executes the loop), so before the next yield point the all_tasks() return value is accurate.

If loop points at a loop that is executed in a different thread, then there is a chance of getting a return value that is out of sync with the actual state. Not sure if we can (and should) handle that reliably.

@1st1
Member

1st1 commented Jun 11, 2019

Re adding new API: I think I addressed that in my last comment :)

#13971 (comment)

@asvetlov
Contributor Author

Yes. Sorry, I read your comment just after sending mine.

@nickdavies

If loop points at a loop that is executed in a different thread, then there is a chance of getting a return value that is out of sync with the actual state. Not sure if we can (and should) handle that reliably.

I agree that we shouldn't try to support this, because using loops across threads is not supported already.

It can return wrong data (extra or missing elements), yes. But the same problem exists if you analyze the tasks list just after getting the function's return value.

Correct me if I am wrong, as I am not very familiar with the C implementations, but I feel this type of invalid data is very different from the time-of-copy vs time-of-use problem you are describing:

  • thread-safe copy (a perfect copy at that point in time): you are only going to miss new things added to your loop or completed tasks being removed. This can only happen if you are using loops across threads (e.g. asking for the tasks of a loop that is running on a different thread), which isn't supported.

  • non-thread-safe copy (currently possible): depending on how the code underneath is implemented, you may return two copies of the same task or have valid tasks missing. This could mean a user who is doing the right thing ends up with tasks that are never cancelled, for example (see the repro sketch below).
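A minimal, self-contained repro of the underlying race (nothing asyncio-specific; the RuntimeError is timing-dependent and may not fire on every run):

import threading
import weakref

class Obj:
    pass

ws = weakref.WeakSet()
keep = []                      # strong references so members are not collected immediately
stop = threading.Event()

def writer():
    # Simulates another thread creating tasks while we snapshot the set.
    for _ in range(200_000):
        if stop.is_set():
            break
        o = Obj()
        keep.append(o)
        ws.add(o)

t = threading.Thread(target=writer)
t.start()
try:
    while t.is_alive():
        list(ws)               # may raise "Set changed size during iteration"
    print("no RuntimeError this run (timing-dependent)")
except RuntimeError as exc:
    print("race hit:", exc)
finally:
    stop.set()
    t.join()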

@nickdavies

For reference we solved this in our backported version with this hack: https://gist.github.com/nickdavies/4a37c6cd9dcc7041fddd2d2a81cee383

The key difference is _patch_loop, which hacks in a way to keep the list of tasks local to the thread that created the loop.

@asvetlov
Contributor Author

I agree that we shouldn't try and support this because using loops across threads is not supported already.

We do support it. asyncio.all_tasks(loop) handles the case when loop.run_until_complete() is called from a non-current thread.

you may return two copies of the same task or have valid tasks missing

Two copies are impossible IMHO. Anyway, we make a set again before returning from all_tasks() function.

Missing valid tasks is possible if all_tasks() is called explicitly with a loop that is executed by another thread. This is not the case for asyncio.run().

CPython switches threads approximately every 5 ms (see sys.getswitchinterval()). 1000 retries is a bit of overkill; the second retry should already succeed. But keeping 1000 is safe and easy, which is why we use this (much higher than needed) value.

So, another implementation could save the second (rare enough) list(_all_tasks) call for usages like asyncio.run(), which is not a hot path. We can allow a retry on loop closing/finalizing.
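For reference, the switch interval mentioned above can be inspected and tuned at runtime:

import sys

print(sys.getswitchinterval())   # 0.005 by default, i.e. roughly 5 ms between forced thread switches
sys.setswitchinterval(0.01)      # larger values mean fewer, coarser switches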

@banool
Contributor

banool commented Jun 11, 2019

I just don't understand how having code in the core CPython implementation that is correct only most of the time is at all acceptable. It doesn't matter how unlikely the failure is; I expect the language itself to be correct 100% of the time. We should really aim for a deterministic fix.

@1st1
Member

1st1 commented Jun 11, 2019

CPython switches threads approximately every 5 ms (see sys.getswitchinterval()). 1000 retries is a bit of overkill; the second retry should already succeed. But keeping 1000 is safe and easy, which is why we use this (much higher than needed) value.

I think we should revert the change and implement it via the loop->all_tasks mapping. Or just revert it and not fix it if you have no time.

@asvetlov
Contributor Author

asvetlov commented Jun 11, 2019

@1st1 and I agree that the PR is not perfect.
It is just a fire-fighting job that can be done quickly and landed in a Python 3.7 bugfix release.

https://bugs.python.org/issue36607 is reverted to the "open" state.

Please feel free to propose such a PR; that would be awesome!
The loop->all_tasks mapping sounds like a good idea.

@1st1
Member

1st1 commented Jun 11, 2019

It is just a fire-fighting job that can be done quickly and landed in a Python 3.7 bugfix release.

Yeah, I'm OK with this fix for 3.7. The main reason is that this allows us to not touch a bunch of low-level C code in a bugfix release. For 3.8 we'll do better.

miss-islington added a commit that referenced this pull request Jun 11, 2019
…3971)

If internal tasks weak set is changed by another thread during iteration.

https://bugs.python.org/issue36607
(cherry picked from commit 65aa64f)

Co-authored-by: Andrew Svetlov <[email protected]>
@asvetlov
Contributor Author

@nickdavies your patch still has a low chance of failure:
a) task objects are referenced by other Python objects;
b) those objects have cross-references and need a GC run to be destructed;
c) GC is triggered while iterating over the weak set;
d) the iteration still raises RuntimeError.

The probability of failure is very, very low, but still not zero.

DinoV pushed a commit to DinoV/cpython that referenced this pull request Jan 14, 2020