Use OnceCell to cache Python operations #12942

Merged: 2 commits into Qiskit:main on Aug 14, 2024

Conversation

@jakelishman (Member)

Summary

We were previously using RefCell to handle the interior mutability for the Python-operation caches. This works fine, but has two problems:

  • RefCell<Py<PyAny>> is two pointers wide
  • we had to do heavier dynamic runtime checks for the borrow status on every access, including simple gets.

This commit moves the caching to instead use OnceCell. The internal tracking for a OnceCell<T> is just whether an internal Option<T> is in the Some variant, and since Option<Py<PyAny>> will use the null-pointer optimisation for space, this means that OnceCell<Py<PyAny>> is only a single pointer wide. Since it's not permissible to take out a mutable reference to the interior, there is also no dynamic runtime checking---just the null-pointer check that the cell is initialised---so access is faster as well.
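For a concrete picture of the size argument, here is a standalone check (not code from this PR) that uses NonNull<u8> as a stand-in for Py<PyAny>, since both are non-null pointers and so benefit from the null-pointer optimisation inside Option:

```rust
use std::cell::{OnceCell, RefCell};
use std::mem::size_of;
use std::ptr::NonNull;

fn main() {
    // `NonNull<u8>` stands in for `Py<PyAny>`: both are non-null pointers, so
    // `Option` can use the null pointer as its `None` representation.
    assert_eq!(size_of::<Option<NonNull<u8>>>(), size_of::<NonNull<u8>>());

    // `RefCell` keeps a separate borrow-flag word next to the value, so the
    // cached pointer costs two words.
    assert_eq!(size_of::<RefCell<NonNull<u8>>>(), 2 * size_of::<usize>());

    // `OnceCell` is just an `Option<T>` internally, so with the null-pointer
    // optimisation the cache stays a single word.
    assert_eq!(size_of::<OnceCell<NonNull<u8>>>(), size_of::<usize>());
}
```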

We can still clear the cache if we hold a mutable reference to the OnceCell, and since every method that can invalidate the cache also necessarily changes Rust-space data, those methods already require a mutable reference to the cell, making this safe.
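As a minimal sketch (hypothetical names, not the actual CircuitData methods), clearing a std OnceCell already demands exclusive access, so the compiler enforces the invariant described above:

```rust
use std::cell::OnceCell;

// Hypothetical helper: any method that mutates the Rust-space data and thereby
// invalidates the cache already holds `&mut`, which is exactly what
// `OnceCell::take` requires.
fn invalidate_cache(cache: &mut OnceCell<u64>) {
    // `take` needs `&mut self`, so only code with exclusive access to the
    // surrounding data can wipe the cache.
    let _stale = cache.take();
}

fn main() {
    let mut cache = OnceCell::new();
    cache.set(42).unwrap();
    invalidate_cache(&mut cache);
    assert!(cache.get().is_none());
}
```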

There is a risk here, though, in OnceCell::get_or_init. This function is not re-entrant; one cannot attempt the initialisation while another thread is already initialising it. Typically this is not an issue, since the type isn't Sync; however, PyO3 allows types that are only Send and not Sync to be put in pyclasses, which Python can then share between threads. This is usually safe because the GIL mediates access to Python-space objects, so there are no data races. However, if the initialiser passed to get_or_init yields control to the Python interpreter at any point (such as by calling a method on a Python object), the interpreter gets a chance to suspend the thread and let another Python thread run. The other thread could then attempt to enter the same cache initialiser, which is a violation of its contract.

PyO3's GILOnceCell can protect against this failure mode, but it is inconvenient for us to use because it requires the GIL to be held for any access to the value, which is problematic during Clone operations. Instead, we make ourselves data-race safe by manually checking whether the cell is initialised, calculating the cache value ourselves, and only then calling set or get_or_init with the already-created object. While the initialiser can still yield to the Python interpreter, the actual setting of the cell is now atomic in this sense, and thus safe. The downside is that more than one thread might do the same initialisation work, but given how comparatively rare Python threads are, it's better to prioritise memory use and single-Python-threaded access times over one-off cache population in multiple Python threads.
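Here is a sketch of that pattern with hypothetical field and method names (the real CircuitInstruction internals may differ): the Python object is built before the cell is touched, so the closure handed to get_or_init never calls back into the interpreter:

```rust
use pyo3::prelude::*;
use std::cell::OnceCell;

struct CachedOp {
    py_op: OnceCell<Py<PyAny>>,
}

impl CachedOp {
    /// Placeholder for the real construction, which may call into Python and
    /// therefore yield control to another Python thread.
    fn build_py_op(&self, py: Python<'_>) -> PyResult<Py<PyAny>> {
        Ok(py.None())
    }

    fn py_op(&self, py: Python<'_>) -> PyResult<&Py<PyAny>> {
        if let Some(op) = self.py_op.get() {
            return Ok(op);
        }
        // Build the object *before* entering the cell; this is the only part
        // that can yield to the interpreter.
        let op = self.build_py_op(py)?;
        // The closure passed here does not call into Python, so another Python
        // thread cannot re-enter the initialisation half-way through.  If a
        // racing thread populated the cell first, its value wins and ours is
        // dropped; doing the work twice is the accepted trade-off.
        Ok(self.py_op.get_or_init(|| op))
    }
}
```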

Details and comments

GILOnceCell would likely be better here, but it doesn't derive Clone, nor would that make sense, since it uses the GIL to mediate even getter access. When we're able to move to PyO3 0.22 and disable its py-clone feature, we might be able to reorganise this a little to avoid handling the GIL-atomic initialisation manually.

I measured this as saving 165MB (around 10%) in resident-set size from this benchmark:

from qiskit.circuit import QuantumCircuit

qc = QuantumCircuit(1_000)
for _ in range(3_000):
    for q in qc.qubits:
        qc.rz(0.0, q)
    for q in qc.qubits:
        qc.rx(0.0, q)
    for q in qc.qubits:
        qc.rz(0.0, q)
    for a, b in zip(qc.qubits[:-1], qc.qubits[1:]):
        qc.cx(a, b)

over its parent commit. That's the same microbenchmark we used in #12730. This should also have a (small) positive impact on copy and circuit-build performance, since there's less memory to move around, and access times on the Python-operation cache should be lower too, though that shouldn't be on the critical path.

By far the biggest memory problem remaining in the CircuitData model at the moment is the parameter storage: in this particular benchmark, we spend around 559MB storing the parameters of the 9,000,000 parametric gates, of which roughly 15% is malloc overhead (on macOS at least, though probably similar on most Unix-likes and not dissimilar on Windows), and within the actual allocations about 70% is just padding bytes.

@jakelishman jakelishman added performance Changelog: None Do not include in changelog Rust This PR or issue is related to Rust code in the repository mod: circuit Related to the core of the `QuantumCircuit` class or the circuit library labels Aug 12, 2024
@jakelishman jakelishman added this to the 1.3 beta milestone Aug 12, 2024
@jakelishman jakelishman requested a review from a team as a code owner August 12, 2024 08:44
@qiskit-bot (Collaborator)

One or more of the following people are relevant to this code:

  • @Qiskit/terra-core
  • @kevinhartman
  • @mtreinish

@coveralls commented Aug 12, 2024

Pull Request Test Coverage Report for Build 10392333478

Details

  • 38 of 40 (95.0%) changed or added relevant lines in 4 files are covered.
  • 7 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.01%) to 89.707%

Changes missing coverage (covered / changed or added lines):
  • crates/circuit/src/circuit_data.rs: 8 / 9 (88.89%)
  • crates/circuit/src/circuit_instruction.rs: 10 / 11 (90.91%)

Files with coverage reduction (new missed lines):
  • crates/qasm2/src/expr.rs: 1 (94.02%)
  • crates/accelerate/src/two_qubit_decompose.rs: 1 (90.69%)
  • crates/qasm2/src/lex.rs: 5 (92.73%)

Totals (Coverage Status):
  • Change from base Build 10391255051: 0.01%
  • Covered Lines: 67532
  • Relevant Lines: 75281

💛 - Coveralls

@raynelfss (Contributor) previously approved these changes Aug 13, 2024 and left a comment:

This is sensible and, putting aside all the memory saving benefits, it makes the code more readable as we don't need to constantly call borrow or borrow_mut. Hopefully we can bump to PyO3 0.22 and reorganize things soon.

@raynelfss raynelfss added this pull request to the merge queue Aug 13, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 13, 2024
@raynelfss (Contributor)

I had previously approved this PR, but it seems to be failing CI now. Any clue as to why, @jakelishman?

@jakelishman (Member, Author)

Ray: it's ok, it's just that #12943 merged in the meantime and had a logical (but not syntactic) conflict with this branch. I've rebased this PR to account for the additional cache interaction added in that PR.

@raynelfss (Contributor) left a comment:

LGTM, thank you for the quick fixes!

@jakelishman jakelishman added this pull request to the merge queue Aug 14, 2024
Merged via the queue into Qiskit:main with commit 48cca36 Aug 14, 2024
15 checks passed
@jakelishman jakelishman deleted the oncecell-cache branch August 14, 2024 20:35