Add Rust-based `SparsePauliOp.to_matrix` and Miri tests #11388

Conversation
One or more of the following people are requested to review this:
```rust
impl<T: Send, S: ?Sized> ParallelSliceMutExt<T> for S where S: ParallelSliceMut<T> {}
```
Why did I write `<S: ?Sized> ... where S: ParallelSliceMut<T>` and not put both either in the `<>` or in the `where`? Who knows.
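(For anyone following along: bounds written inside the angle brackets and in a `where` clause are equivalent in Rust, so either uniform style would have worked. A minimal sketch, with stand-in trait definitions rather than the real rayon ones:)

```rust
// Stand-in trait definitions so the sketch compiles on its own; the real
// `ParallelSliceMut` lives in rayon.
trait ParallelSliceMut<T> {}
trait ParallelSliceMutExt<T: Send> {}

// Everything inside the angle brackets:
impl<T: Send, S: ?Sized + ParallelSliceMut<T>> ParallelSliceMutExt<T> for S {}

// Equivalently, everything in the `where` clause (commented out because it
// would conflict with the blanket impl above):
// impl<T, S> ParallelSliceMutExt<T> for S
// where
//     T: Send,
//     S: ?Sized + ParallelSliceMut<T>,
// {}

fn main() {}
```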
Pull Request Test Coverage Report for Build 8848600959

Warning: this coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

💛 - Coveralls
This parallelises the previously serial-only cumulative sum of the `indptr` array (the per-row counts of non-zero entries) at the end. In practice, I didn't actually see any change in performance from this, but philosophically it feels like the right thing to do.
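As a rough illustration of the idea (a minimal sketch, not the code from this PR): split the counts into chunks, sum each chunk in parallel, serially scan the few chunk totals into starting offsets, then do the in-chunk cumulative sums in parallel.

```rust
use rayon::prelude::*;

/// In-place inclusive cumulative sum.  CSR's `indptr` additionally wants a
/// leading zero, which is omitted here for brevity.
fn parallel_cumsum(counts: &mut [usize], chunk_size: usize) {
    // 1. Per-chunk totals, computed in parallel.
    let mut offsets: Vec<usize> = counts
        .par_chunks(chunk_size)
        .map(|chunk| chunk.iter().sum())
        .collect();
    // 2. Serial exclusive scan over the totals; cheap, since there are only
    //    `counts.len() / chunk_size` of them.
    let mut running = 0;
    for offset in offsets.iter_mut() {
        running += std::mem::replace(offset, running);
    }
    // 3. Parallel in-chunk inclusive sums, each starting from its offset.
    counts
        .par_chunks_mut(chunk_size)
        .zip(offsets)
        .for_each(|(chunk, mut acc)| {
            for value in chunk {
                acc += *value;
                *value = acc;
            }
        });
}

fn main() {
    let mut counts = vec![1, 2, 3, 4, 5, 6];
    parallel_cumsum(&mut counts, 2);
    assert_eq!(counts, vec![1, 3, 6, 10, 15, 21]);
}
```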
Force-pushed from 22f83a2 to afa9726.
Now rebased over #11133. edit: ugh, Miri's failing after the version of …
Overall this LGTM, the underlying algorithm here makes sense and I learned a fair amount about rayon internals and plumbing by reviewing this and looking up the details in the rayon docs. For future observers, https://github.com/rayon-rs/rayon/blob/v1.10.0/src/iter/plumbing/README.md provides a useful overview.

In the back of my head while reviewing this I was wondering how much the unsafe usage around uninitialized memory actually saves us performance-wise. But the analysis, inline comments, and Miri tests give me confidence that the usage is ok, and it definitely is faster. I left a few inline questions and comments but they're mostly minor musings.

The other thing is I was looking at the coverage report and there are a few coverage gaps reported, mainly the 64-bit sparse path and then the serial paths for everything. They're definitely covered by the Rust tests, but I'm wondering if you want to add some Python-side test coverage just so we're exercising the full path (and including them in the coverage report).
```rust
let x_like = pack_bits(self.x).unwrap();
let z_like = pack_bits(self.z).unwrap();
```
Do you want to return a real error here, since `pack_bits` can error? Probably a `PyOverflowError`, since the error condition is exceeding 64 bits.
`unwrap` failing here would have been an internal logic error, because `ZXPaulis` validates that `x` and `z` have the same shape on construction from Python space, then the start of `ZXPaulisView::matrix_compress` validates that the number of qubits will fit in a `u64` after the packing, which is the error condition for this (and elevates that to a `ValueError`).

I switched these over to be `expect` with an error message explaining where the internal-logic error would come in, but can switch them to full Python exceptions if you think it's better (though it should be impossible to trigger them, even from Rust space, using the public interface).
Hmm, if you're saying that the input to `pack_bits` will always fit in 64 bits because of earlier assertions, then I would probably just remove the size check and `Result` return from the function.
Done in 6b82a7b.
Whoops sorry, I got out-of-sync with your comments. The "done in" is referring to my prior comment.

The `pack_bits` function could still fail if it were called from other valid places in Rust space using its regular public interface (and as we expand `SparsePauliOp` more in Rust space, we're more likely to use that), so I think it's right to return a `Result`. It's just that in this particular case, we already validated the inputs so we know it should always work.
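To make the trade-off concrete, here is a hypothetical sketch of the pattern being described - the real `pack_bits` in the PR operates on numpy views and will differ in detail. The function keeps its fallible signature for future Rust-space callers, while the already-validated call site documents the impossibility with `expect`:

```rust
/// Pack rows of booleans into one `u64` bitmask per row.  Fallible in
/// general: a row wider than 64 bits cannot fit in a single word.
fn pack_bits(rows: &[Vec<bool>]) -> Result<Vec<u64>, String> {
    rows.iter()
        .map(|row| {
            if row.len() > 64 {
                return Err(format!("cannot pack {} bits into a u64", row.len()));
            }
            Ok(row
                .iter()
                .enumerate()
                .fold(0u64, |acc, (i, &bit)| acc | ((bit as u64) << i)))
        })
        .collect()
}

fn main() {
    // Imagine the width was already validated upstream (as `ZXPaulis` does on
    // construction): the error is impossible, and `expect` records why.
    let x = vec![vec![true, false, true], vec![false, true, false]];
    let x_like = pack_bits(&x)
        .expect("number of qubits validated to fit in a u64 on construction");
    assert_eq!(x_like, vec![0b101, 0b010]);
}
```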
```rust
// It's also important that `$int_ty` does not implement `Drop`, since otherwise it
// will be called on uninitialised memory (all primitive int types satisfy this).
```
Do you think it's worth restricting the int type to match `PrimInt` (https://docs.rs/num-traits/latest/num_traits/int/trait.PrimInt.html), just so we don't have potential reuse of the macro with an unsafe type (I could see someone trying to use `BigInt` or something)? It's probably over-engineering since this really won't be exposed publicly.
Do you have a way in mind to do this assertion? The `macro_rules` is more like a template-match than a generic, so I'm not sure of an obvious way to put a "trait bound" on the input.

I guess I could possibly hack something together where we attempt to codegen something invalid if `$int_ty` or `$uint_ty` don't satisfy `PrimInt`, but I don't know if there's a better pattern.
Not specifically, I guess; I was thinking something like adding a dummy generic function that had the trait bound and passing `$int_ty` to it so we'd get a compile error (sketched below). I'm not sure if there is a better pattern either, TBH.

But it's not important, it was more just idle musing on my part. Especially since I don't think we'll ever make the macro use public, it's fine with just the comment.
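For the record, a sketch of that dummy-function idea, assuming the `num-traits` crate and a made-up macro name; the anonymous `const` keeps repeated invocations from colliding:

```rust
use num_traits::PrimInt;

// Hypothetical macro name; stands in for the PR's actual macro.
macro_rules! impl_num_matrix {
    ($int_ty:ty) => {
        const _: () = {
            fn assert_prim_int<T: PrimInt>() {}
            // This function is never called; merely instantiating the bounded
            // generic fails compilation unless `$int_ty: PrimInt`.
            #[allow(dead_code)]
            fn check() {
                assert_prim_int::<$int_ty>();
            }
        };
        // ... the real macro body would follow here ...
    };
}

impl_num_matrix!(u32); // fine
// impl_num_matrix!(num_bigint::BigInt); // compile error: `BigInt: PrimInt` unsatisfied

fn main() {}
```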
I wrote this PR quite some time ago now, so I've forgotten exactly what I did, but typically when I'm making these performance-improvement PRs, I make them in a series of teeny-tiny commits where I change one thing at a time (like changing from growing a …
The problem with this is that from Python space, there's no real way to force the 64-bit mode without generating a very large operator that would take an annoyingly long time to test. It does cause a bit of a testing gap for the Python-space components, though, so if you think it's still worth it I can add one.
Yeah, that's what I figured for the 64-bit case; I'm fine skipping those, as realistically the only option would be to have a slow test we only ran nightly, which wouldn't show up in our coverage report. But for the serial paths, we should at least verify that via Python, I think.
Whoops, totally missed that you'd mentioned serial stuff too. Yeah, that was an oversight, thanks. Serial vs parallel tests added in edfe2c6.
LGTM now, thanks for the quick updates.
New `SparsePauliOp.to_matrix`

This rewrites the numerical version of `SparsePauliOp.to_matrix` to be written in parallelised Rust, building up the matrices row-by-row rather than converting each contained operator to a matrix individually and summing them. The new algorithms are completely row-based, which is embarrassingly parallel for dense matrices, and parallelisable with additional copies and cumulative sums in the CSR case.
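To sketch why row-by-row construction works (my reconstruction from the standard Pauli identity, not this PR's code): a Pauli term with X-bitmask `x`, Z-bitmask `z`, and coefficient `coeff` (Y phases folded into the coefficient) has exactly one entry in each row `i`, at column `i ^ x`, with sign `(-1)^popcount(i & z)` - so each row of the sum can be produced independently.

```rust
use num_complex::Complex64;

/// Accumulate one (phase-folded) Pauli term into a dense row-major matrix.
/// Each row `i` gets a single contribution: column `i ^ x`, signed by the
/// parity of `i & z`.  The outer loop over `i` is what makes the dense
/// algorithm embarrassingly parallel.
fn add_term_dense(out: &mut [Complex64], num_qubits: u32, x: u64, z: u64, coeff: Complex64) {
    let dim = 1usize << num_qubits;
    for i in 0..dim {
        let sign = if (i as u64 & z).count_ones() % 2 == 0 { 1.0 } else { -1.0 };
        out[i * dim + (i ^ x as usize)] += coeff * sign;
    }
}

fn main() {
    // Single-qubit Z (x = 0, z = 1) should give diag(1, -1).
    let mut out = vec![Complex64::new(0.0, 0.0); 4];
    add_term_dense(&mut out, 1, 0, 1, Complex64::new(1.0, 0.0));
    assert_eq!(out[0], Complex64::new(1.0, 0.0));
    assert_eq!(out[3], Complex64::new(-1.0, 0.0));
}
```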
The two new algorithms are an asymptotic complexity improvement for both dense and sparse matrices over the "sum the individual matrices" version. In the dense case, the cost goes from $\mathcal O\bigl(4^{\text{qubits}} \cdot \text{ops}\bigr)$ to $\mathcal O\bigl(4^{\text{qubits}} + 2^{\text{qubits}} \cdot \text{reduced ops}\bigr)$, where the first term is from the zeroing of the matrix, and the second is from updating the elements in-place. $\text{reduced ops}$ is the number of explicitly stored operations after Pauli-operator uniqueness compaction, so is upper-bounded as $4^{\text{qubits}}$. (The Pauli compaction goes as $\mathcal O(\text{ops})$, so is irrelevant to the complexity discussion.)
The CSR form of the algorithm goes as $\mathcal O\bigl(2^{\text{qubits}} \cdot \text{reduced ops} \cdot \lg(\text{reduced ops})\bigr)$, which (I think! - I didn't fully calculate it) is asymptotically the same as before, but in practice the constant factors and intermediate memory use are *dramatically* reduced, and the new algorithm is threadable with an additional $\mathcal O\bigl(2^{\text{qubits}} \cdot \text{reduced ops}\bigr)$ intermediate memory overhead (the serial form has only $\mathcal O(\text{reduced ops})$ memory overhead).
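To put rough numbers on the dense-case improvement (my own arithmetic, using the 18-qubit, 1000-operator point from the benchmarks below, and pessimistically assuming no compaction, i.e. $\text{reduced ops} = \text{ops}$):

$$
4^{18} \cdot 1000 \approx 6.9 \times 10^{13}
\qquad\text{versus}\qquad
4^{18} + 2^{18} \cdot 1000 \approx 6.9 \times 10^{10} + 2.6 \times 10^{8} \approx 6.9 \times 10^{10},
$$

roughly a thousand-fold reduction in element-level work.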
The `object`-coefficients form is left as-is to avoid exploding the complexity in Rust space; these objects are already slow and unsuited for high-performance code, so the effects should be minimal.

Miri tests
As we begin to include more `unsafe` code in the Rust-accelerated components, it is becoming more important for us to test these in an undefined-behaviour sanitiser. This is done in a separate CI job because:

- we do not yet know how stable Miri will be, so we don't want to block on it.
- some dependencies need their version-resolution patching to Miri-compatible versions, but we want to run our regular test suites with the same versions of packages we will be building against.
Performance

Using a setup of:

[setup snippet not captured in this extract]

then timing with:

[timing snippet not captured in this extract]

the resulting graph looks like:

[graph: matrix-construction time against qubit count for Qrusty and Qiskit, each in serial and 8-thread parallel forms]
The solid Qrusty line is the full path, including the construction of the Scipy matrix at the end. The dashed Qrusty line is just to construct the three arrays, since the index types of Qrusty's native CSR representation get converted when moving to Scipy. Both Qrusty lines are parallelised at 8 SMT threads.
The solid Qiskit line is with 8-thread SMT parallelism on my Macbook, and the dashed one is the serial form. In absolute timings for the 18q, 1000-ops case, the full Qrusty path took ~4.54s (or ~3.68s without Scipy construction), while the parallel Qiskit path took ~2.86s (or ~3.91s in serial). That means Qiskit takes under 2/3 of the time when both libraries build a Scipy matrix, or is a 20% speedup when compared against Qrusty building only its native CSR type and leaving Qiskit to build the Scipy CSR.
Older version of the graph that used averages instead of the best times (despite what the axis says).
The big bumps in the parallel lines at 6 and 12 qubits appear legitimate and reproducible, but I'm not sure at all why they exist - maybe a weirdness where we move from one allocator to another mid-algorithm? Update: the bump appears to be caused by the same medium-size allocator buggy shenanigans first detected by @kevinhartman in #10827, since on macOS 14, setting the environment variable `MallocMediumZone=0` removes the big bump, and the bump doesn't seem to appear in the same way on Linux.

Fix #8772.