-
Hi Itamar -- the required memory orderings ought to be better explained. I'm planning on doing this in a separate document. My plan is to:
I like this approach because I think it's hard to reason correctly about memory ordering. Exhaustive verifiers like CDSChecker don't scale up to real-world programs. Non-exhaustive verifiers (C11Checker, ThreadSanitizer) are okay at catching data races, but don't do a good job of catching memory orderings that are too weak. This is only relevant for the Python internals -- not for Python C extensions. The collection thread-safety scheme (the one that avoids locks for most read operations) is only intended for some Python built-in collections. I expect Python C extensions to use standard techniques (like locks) to protect shared mutable data, or to move data to thread-local state.
-
Makes sense re: the core. So the issue is that, for Python C extensions, this is a source of potential small bottlenecks that get smeared across lots of calls. Consider a random object like the transaction object of a database adapter. The vast majority of the time it's used single-threaded, but it can't assume that's how it'll be used. So now it needs a lock on every API call, whereas before it didn't. And that can add up: every call into a C function is potentially a lock acquire. It would help if there were a super-optimized lock implementation available, like https://webkit.org/blog/6161/locking-in-webkit/ or the Rust equivalent.

As for the thread-local case: you have to use the slow kind of thread-local storage on Linux, since the faster kinds don't work with dlopen() (which extensions need). And that overhead does add up.
-
Also -- IIUC, the tricky optimization to avoid locking on reads isn't really a generic "go fast" optimization. It's mostly only useful for objects that see heavy read traffic from multiple threads simultaneously, and few writes. So like -- module and class dicts. Does that sound right? If an object is mostly accessed from a single thread, then efficient locks like futexes or
Huh, can you elaborate? I think of
-
Keeping in mind I'm just learning about lower-level synchronization, so all of this may be wrong—
The current design doc states that collections deal with synchronization requirements by having only the writer hold a lock. But that's not quite what's going on.
Locks have two purposes: prevent concurrent access, and ensure happened-before semantics across threads. Locking only on writes addresses the first concern but not the second: there is no guarantee that a reading thread will actually see the writes, or a consistent view of them, if it's running on another CPU.
The actual implementation has added enough use of atomics that I assume you've addressed this (i.e. you have readers rely on an atomic to establish the happened-before relationship). But that's not what the design doc says, so it seems like the design doc significantly understates the work needed to make Python C extensions thread-safe: you can't just do writes-only locking; you need either full-on locking of reads+writes (with corresponding costs in single-threaded mode) or a lock-free design. And beyond the work itself, there's the specialized knowledge you'd have to acquire to do this correctly.
Which is not to say this project isn't worth doing -- it's super-exciting! It's just probably worth expanding the current doc to make the costs clearer (and perhaps to motivate thinking about ways to solve this that reduce the burden for C extension maintainers).