
Add demo "How to Quantum Just-In-Time Compile Grover's Algorithm with Catalyst" #1219

Merged: 43 commits into master, Nov 7, 2024

Conversation

joeycarter
Contributor

Title: How to Quantum Just-In-Time Compile Grover's Algorithm with Catalyst

Summary: This demo uses the existing Grover's Algorithm tutorial to describe how to just-in-time (JIT) compile a quantum circuit using Catalyst. It also includes runtime benchmarks to demonstrate the performance improvements that JIT compiling with Catalyst offers.

Relevant references: L. K. Grover (1996) "A fast quantum mechanical algorithm for database search"

Possible Drawbacks: None

Related GitHub Issues: None

[sc-72939]


If you are writing a demonstration, please answer these questions to facilitate the marketing process.

  • GOALS — Why are we working on this now?

    • Promote Catalyst by demonstrating the performance improvements it offers by QJIT compiling a relatively simple quantum circuit.
  • AUDIENCE — Who is this for?

    • Users of PennyLane looking to compile and optimize their circuits for better performance.
  • KEYWORDS — What words should be included in the marketing post?

    • Grover's algorithm
    • Catalyst
    • QJIT
  • Which of the following types of documentation is most similar to your file?

  • Tutorial
  • Demo
  • How-to


👋 Hey, looks like you've updated some demos!

🐘 Don't forget to update the dateOfLastModification in the associated metadata files so your changes are reflected in Glass Onion (search and recommendations).

Please hide this comment once the field(s) are updated. Thanks!

@joeycarter joeycarter requested a review from a team September 18, 2024 19:48
@rmoyard rmoyard self-requested a review September 18, 2024 19:52
@rauletorresc left a comment

I would use the lightning.qubit device by default in the baseline algorithm explanation instead of default.qubit, for the same reason explained in the demo: Catalyst does not support the latter at the moment. I would also remove it from the benchmark and show only the results with lightning.qubit. Finally, I would add a small note stating which devices Catalyst does not support. What do you think?

@joeycarter
Contributor Author

> I would use the lightning.qubit device by default in the baseline algorithm explanation instead of default.qubit, for the same reason explained in the demo: Catalyst does not support the latter at the moment. I would also remove it from the benchmark and show only the results with lightning.qubit. Finally, I would add a small note stating which devices Catalyst does not support. What do you think?

My reasoning for using default.qubit as the baseline device was that the spirit of this demo was to take an existing PennyLane circuit, in this case the one from https://pennylane.ai/qml/demos/tutorial_grovers_algorithm/, and make it work with Catalyst. Switching the device from default.qubit to lightning.qubit is one of the required steps, and so including it in this demo makes it explicit what needs to be done for the circuit to work with Catalyst. My preference would be to leave it in.

There's an argument for leaving default.qubit out of the benchmarks, since it consumes the most CPU time to execute. I included it to drive home each incremental performance improvement a user gets as they make the necessary modifications to their circuit. I'll go with popular opinion on whether to include or exclude it.

I would prefer not to list the devices that Catalyst does not support and instead refer to the documentation. If the list of supported devices changes over time, it would be better not to have to keep the demo in sync with it.

Contributor

@rmoyard left a comment

Great demo, can you double check the build failure? After that we can check that it renders properly.

@joeycarter
Contributor Author

> Great demo, can you double check the build failure? After that we can check that it renders properly.

Thanks @rmoyard! The build failure was due to the path to the preview image thumbnail not existing. I've put in the original Grover's algorithm preview image as a placeholder, just to check the build and rendering, which we can replace later on.


github-actions bot commented Sep 19, 2024

Thank you for opening this pull request.

You can find the built site at this link.

Deployment Info:

  • Pull Request ID: 1219
  • Deployment SHA: 795f940483b86d805d8d2d016a00904d4869f7cc
    (The Deployment SHA refers to the latest commit hash the docs were built from)

Note: It may take several minutes for updates to this pull request to be reflected on the deployed site.

@rmoyard rmoyard self-requested a review September 19, 2024 17:54
Contributor

@rmoyard left a comment

Great first demo 💯

Contributor

@dime10 left a comment

Nice work @joeycarter, that's a cool demo! :)

The benchmarking is done quite nicely, I like the comparison of the different stages (include the first load!).

Member

@josh146 left a comment

Wow, really nicely written how-to guide @joeycarter! Super clear and very well written.

I've left some suggestions throughout.

@joeycarter
Contributor Author

I've refactored the timeit benchmarking; since I'm now running the compiled circuit earlier in the script to print out the results, we have to create a new qjit object of the circuit to get the runtime of the first call. I'm doing so with the setup input argument of timeit.

Interestingly, when I run this locally I no longer see a difference between the first-call runtime and subsequent-call runtimes. It's possible that when timeit was executing these commands in its own namespace, the compiled circuit ended up in some funky state that led to the small runtime overhead on the first call.

I'll see if I get the same result in the deployed version. If so, I'll remove the paragraph on the caching overhead, since this is likely the wrong interpretation of the larger runtime in the first call of the compiled circuit.
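The `timeit` pattern described above — using the `setup` argument to build a fresh compiled object so the timed statement is always the *first* call — can be sketched with a toy stand-in for the circuit. This is an illustrative mock, not the demo's actual code: `fake_qjit` just simulates a callable with a one-time warm-up cost.

```python
import timeit

# Stand-in for qjit-compiling a circuit: returns a callable whose first
# invocation pays a one-time "warm-up" cost (mimicking lazy initialization).
def fake_qjit():
    state = {"warm": False}

    def compiled_circuit():
        if not state["warm"]:
            state["warm"] = True
            return sum(i * i for i in range(50_000))  # expensive first call
        return sum(i * i for i in range(1_000))       # cheap steady state

    return compiled_circuit

# First-call runtime: rebuild a fresh object in `setup` so the timed
# statement is the first call of a brand-new compiled circuit.
t_first = timeit.timeit(
    "compiled_circuit()",
    setup="compiled_circuit = fake_qjit()",
    globals=globals(),
    number=1,
)

# Subsequent-call runtime: warm the object up in `setup`, then time repeats.
t_rest = timeit.timeit(
    "compiled_circuit()",
    setup="compiled_circuit = fake_qjit(); compiled_circuit()",
    globals=globals(),
    number=100,
) / 100

print(f"first call:      {t_first:.6f} s")
print(f"subsequent call: {t_rest:.6f} s")
```

Because `setup` runs once per timing pass and is excluded from the measured time, this cleanly separates the first-call cost from the steady-state cost.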

@dime10
Contributor

dime10 commented Sep 25, 2024

> Interestingly, when I run this locally I no longer see a difference between the first-call runtime and subsequent-call runtimes. It's possible that when timeit was executing these commands in its own namespace, the compiled circuit ended up in some funky state that led to the small runtime overhead on the first call.

If the first call overhead is primarily spent on loading shared libraries, then creating a new QJIT object may not recreate that state accurately. This is just a wild guess, but it could be that a lot of the shared libs (besides the program one, since that one should be new) that are needed to execute a catalyst program are still loaded in the Python process.

If you comment out the first call earlier in the demo, does it revert it to the old results?

@joeycarter
Contributor Author

> If the first call overhead is primarily spent on loading shared libraries, then creating a new QJIT object may not recreate that state accurately. This is just a wild guess, but it could be that a lot of the shared libs (besides the program one, since that one should be new) that are needed to execute a catalyst program are still loaded in the Python process.
>
> If you comment out the first call earlier in the demo, does it revert it to the old results?

Ah, yes if I comment out the first call then it reverts to the old results, with the first QJIT call taking significantly longer than subsequent calls. On my machine I get

Native (default.qubit) runtime: (13.31 +/- nan) s
Native (lightning.qubit) runtime: (5.359 +/- 0.023) s
QJIT compilation runtime: (0.4419 +/- nan) s
QJIT (first call) runtime: (0.007347 +/- nan) s
QJIT (subsequent calls) runtime: (0.001503 +/- 0.00021) s

So the overhead appears to come from calling *any* qjit-compiled circuit for the first time in the process, rather than from the first call of each new QJIT object. Loading shared libraries sounds like a plausible explanation; I'll run a profiler on this script to see where the runtime hotspot is.
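For reference, numbers of the form `(mean +/- std) s` like those above can be produced with `timeit.repeat` plus the standard-library `statistics` module. This is a generic sketch, not the demo's actual benchmarking code; note that a single-repeat measurement has no spread, which is why lone measurements show up as `nan`:

```python
import math
import statistics
import timeit

def benchmark(stmt, setup="pass", number=10, repeat=5):
    """Time `stmt`, returning (mean, std) of the per-call runtime in seconds."""
    # repeat() returns one total time per repeat; divide by `number`
    # to get the average time of a single call within that repeat.
    totals = timeit.repeat(stmt, setup=setup, number=number, repeat=repeat)
    per_call = [t / number for t in totals]
    mean = statistics.mean(per_call)
    # With a single repeat there is no spread to report, hence "nan".
    std = statistics.stdev(per_call) if repeat > 1 else float("nan")
    return mean, std

mean, std = benchmark("sum(range(1000))")
print(f"runtime: ({mean:.3g} +/- {std:.2g}) s")
```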

@joeycarter
Contributor Author

I profiled the first and second calls to the qjit-compiled circuit and in fact the hotspot is in the call to jnp.asarray() in CompiledFunction._exec().

If we trace down the function call stack from jnp.asarray, we get to a jax function dispatch.py:390(_device_put_sharding_impl). In the first call, this takes ~0.0086 s, and in the second only ~0.00005 s. Tracing down one step further in the stack, in the second (fast) call, the most time is spent in dispatch.py:331(_put_x), but in the first call, the most time is spent in pxla.py:1669(_get_default_device) (~0.008 s). Tracing down even further in the slow call, a lot of time is spent in XLA, e.g. in xla_bridge.py:737(_discover_and_register_pjrt_plugins) (~0.0047 s) and in xla_client.py:67(make_cpu_client) (~0.0024 s). These functions are not called in the second qjit-object call.

That's a lot of text so here's a visualization! First, the first (slow) call:

Screenshot from 2024-09-25 15-14-39

and the second (fast) call:

Screenshot from 2024-09-25 15-14-52

If I'm understanding all of this correctly, the overhead isn't in Catalyst per se, but in loading a bunch of JAX and XLA things the first time we call jnp.asarray().

In fact, if I throw in a jnp.asarray([0.]) before the first qjit-object call, I can get the first and subsequent calls to be roughly on par with one another! (The first call is still a bit slower, perhaps because of some smaller overheads elsewhere).

This is getting far into the nitty-gritty details of the performance of Catalyst and JAX, and well beyond the scope of this demo, I think. I propose we not show the difference between the first call and the subsequent calls in the benchmarks, since I think it's a reasonable assumption that any user who cares about that level of performance will already be using JAX arrays in their programs and not notice any difference in the first and second calls to an AOT-compiled circuit. @josh146, @dime10, how does that sound?
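For anyone wanting to reproduce this kind of hotspot analysis, the profiling itself needs only the standard library: profile the first and second calls separately with `cProfile` and sort by cumulative time. The function below is a placeholder for the qjit-compiled circuit, where `jnp.asarray()` and the one-time JAX/XLA setup would appear in the real trace:

```python
import cProfile
import io
import pstats

def compiled_circuit():
    # Placeholder workload standing in for the qjit-compiled circuit call.
    return sum(i * i for i in range(10_000))

def profile_call(fn, label):
    """Profile a single call to `fn` and return the stats report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    fn()
    profiler.disable()
    # Sort by cumulative time: this is how hotspots like
    # dispatch.py / pxla.py / xla_bridge.py were identified above.
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
    print(f"--- {label} ---")
    print(out.getvalue())
    return out.getvalue()

first = profile_call(compiled_circuit, "first call")
second = profile_call(compiled_circuit, "second call")
```

Comparing the two reports side by side (or visualizing them, e.g. with snakeviz) makes one-time initialization costs stand out immediately.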

@josh146
Member

josh146 commented Sep 25, 2024

@joeycarter yep that sounds good to me! On another note, we discovered that the decomposition pathway for lightning+noqjit for GroverOperator is dynamic, and uses a magic number to change strategy at <13 qubits:

https://github.com/PennyLaneAI/pennylane-lightning/blob/master/pennylane_lightning/lightning_qubit/lightning_qubit.py#L177-L178

This is likely not optimal, as these magic numbers are not always the best choice and can lead to slowdowns near the boundary (e.g., 10-12 qubits). I'd be curious to rerun the demo at 13 or 14 qubits (where lightning and Catalyst should use the same decomposition strategy) to see what the outcome is.

@dime10
Contributor

dime10 commented Sep 25, 2024

Thanks @joeycarter, nice digging! So the overhead is not related to Catalyst; happy to leave it at that and not show 1st vs 2nd call 👍

@joeycarter
Contributor Author

> Thanks @joeycarter, nice digging! So the overhead is not related to Catalyst; happy to leave it at that and not show 1st vs 2nd call 👍

Thanks @dime10! Sounds good.

@joeycarter joeycarter force-pushed the joeycarter/qjit-grovers-algo-with-catalyst branch from 96e156a to 51b6c4a Compare November 6, 2024 19:30
@joeycarter joeycarter changed the base branch from dev to master November 6, 2024 19:31
@joeycarter
Contributor Author

Heads up, I've rebased this demo onto the master branch.

@joeycarter joeycarter merged commit 22a06cd into master Nov 7, 2024
10 checks passed
@joeycarter joeycarter deleted the joeycarter/qjit-grovers-algo-with-catalyst branch November 7, 2024 19:11
7 participants