Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-117657: Fix TSAN list set failure #118260

Merged
merged 5 commits into from
May 2, 2024
Merged

Conversation

DinoV
Copy link
Contributor

@DinoV DinoV commented Apr 24, 2024

There's a failure in TSAN when we write to a list via PyList_SET_ITEM. There could be further problems using this inline function but callers need to either own the list locally or lock, but the reads can still proceed lock-free so this needs to be atomic. And it needs to be release semantics so that any values that are being published through the list will be visible.

@DinoV DinoV requested review from mpage and colesbury April 24, 2024 23:42
@colesbury colesbury changed the title [gh-117657] Fix TSAN list set failure gh-117657: Fix TSAN list set failure Apr 25, 2024
@DinoV DinoV marked this pull request as ready for review April 25, 2024 15:52
@@ -44,7 +44,11 @@ PyList_SET_ITEM(PyObject *op, Py_ssize_t index, PyObject *value) {
PyListObject *list = _PyList_CAST(op);
assert(0 <= index);
assert(index < list->allocated);
#ifdef Py_GIL_DISABLED
_Py_atomic_store_ptr_release(&list->ob_item[index], value);
Copy link
Contributor

@colesbury colesbury Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't feel right to me. Basically, PyList_SET_ITEM is only safe in certain circumstances during list initialization before the list is exposed to other threads. If the list is later shared with other threads, it seems like there should be other synchronization involved to make the whole thing thread-safe.

Is there a reproducer for the Tsan reported race?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me see if I can hit it again, I didn't save the output :(. I hit it while following the steps in #117657

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it is: https://gist.github.com/DinoV/4baac3cdfeace53a296aaee5d2a86996

So the lock is held, but the problem is that the lock doesn't get taken on the read. But I think a relaxed store would be good enough here. I'll continue running with that and see if there's any other races.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be inclined to make the change in _PyList_AppendTakeRefListResize rather than PyList_SET_ITEM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a targetted test that repros it consistently under a TSAN build and we only need a relaxed load here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but I still think it makes more sense to modify _PyList_AppendTakeRefListResize to use FT_ATOMIC_STORE_PTR_RELEASE (or relaxed?) directly rather than modifying PyList_SET_ITEM.

PyList_SET_ITEM is used in ~2 places internally (_PyList_AppendTakeRef and _PyList_AppendTakeRefListResize) to modify existing lists while holding the lock. It's also used in many places to initialize newly created lists. The atomic operations make sense in the _PyList_AppendTakeRef* calls, but not really for the other uses of PyList_SET_ITEM.

I'm also not sure "relaxed" is strong enough for _PyList_AppendTakeRef* -- we generally use "release" when assigning to the entries of ob_item.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the other places we're missing an atomic store is in list_extend_fast but there seems to be some actual real crashes when we're extending and iterating at the same time that I'm looking into.

@DinoV DinoV force-pushed the nogil_tsan_list_set branch 2 times, most recently from 5916897 to 9298103 Compare April 29, 2024 19:44
Comment on lines 47 to 51
#ifdef Py_GIL_DISABLED
_Py_atomic_store_ptr_relaxed(&list->ob_item[index], value);
#else
list->ob_item[index] = value;
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the change to PyList_SET_ITEM be reverted now that the problematic call sites use FT_ATOMIC_STORE_PTR_RELEASE?

@DinoV DinoV merged commit 1e67b92 into python:main May 2, 2024
35 checks passed
SonicField pushed a commit to SonicField/cpython that referenced this pull request May 8, 2024
* Fix TSAN list set failure

* Relaxed atomic is sufficient, add targetted test

* More list

* Remove atomic assign in list

* Fixup white space
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants