feat(distributed_sp1): first working version of sp1 distributed prover #302

Champii · 2024-07-02T02:28:49Z

This PR introduces the `SP1DistributedProver`

Description

These changes add a new TCP server and a new distributed prover for sp1.

It introduce two new roles for a taiko instance that are the orchestrator and the worker.

The orchestrator is responsible for fetching the blocks data as a normal prover would. It will then check in its ip list file distributed.json the list of every worker it has access to, and check their connectivity and send a simple Ping packet to assess their reliability.

It will then compute the execution's checkpoints and then send each checkpoints to a different worker for them to commit the shard. Each worker then send back the commitment data.when all workers answered, the orchestrator instantiate a Challenger to observe the commitments. It then send the challenger and the checkpoints again to each worker so that they can prove them.
The orchestrator then merge them, verify the complete proof and sends it back to the caller.

The checkpoints contain a number of shards that are calculated to be equaly distributed between each worker. In fact, the number of checkpoints match the number of available workers.

In the event of a worker failure during runtime (network or processing error), the unproven checkpoint is pushed back in a queue for any remaining workers to pick.

How to run

The Orchestrator

Create a file distributed.json containing a JSON array of the workers' IP:PORT like this

[
    "1.2.3.4:8081",
    "5.6.7.8:8081"
]

NOTE: The orchestrator can also be a worker in its own pool

Then run raiko as you are used to

export TARGET=sp1
export CPU_OPT=1
# This is used to minimise the RAM usage. Increase this to reduce the proving time
# Note: you don't need to set it on the workers, it is propagated along the request
export SHARD_BATCH_SIZE=1
make run

The Workers

You have to set the ORCHESTRATOR env variable. It will enable the sp1 worker on this instance and only allow connections from this orchestrator address

export TARGET=sp1
export CPU_OPT=1
export ORCHESTRATOR=10.200.0.15
make run

The Script

You have to specify the prover as sp1_distributed.

./script/prove-block.sh taiko_a7 sp1_distributed 333763

What's left to do

Add a version number to the WorkerEnvelope to make sure orchestrator and workers are compatible
Even better error management
Refactor the sp1_specific module that contains the code used for proving
Make clippy happy
Fix the CI tests
Better TCP socket reliability (add a size limit)
Add a target to the makefile to run in worker mode

What's next

Allow the workers to cancel their work if the orchestrator cancels
Encrypt the network traffic
Investigate a regression in the time taken by sp1 to prove a block (15% more time spent proving than the previous used version)
Remove the patched version of Plonky3 that is used to (de)serialize the DuplexChallenger. Propose a PR for that
Add some tests ?

host/src/lib.rs

host/src/server/worker.rs

host/Cargo.toml

provers/sp1/driver/Cargo.toml

provers/sp1/driver/src/distributed/worker/envelope.rs

provers/sp1/driver/src/distributed/worker/socket.rs

smtmfft

quickly go through once, seems good to me.
just 1 thinking that maybe we should organize it using a new independent crate/repo and upstream to sp1. Because it actually has nothing (just few if not none) to do with raiko, as it's a complement to sp1 local & sp1 network, we can call sp1 with SP1_DISTRIBUTE="url.xxx.xxx", just like current sp1 network and risc0 bonsai.

smtmfft · 2024-07-12T06:57:22Z

core/src/interfaces.rs

@@ -119,6 +123,7 @@ impl std::fmt::Display for ProofType {
        f.write_str(match self {
            ProofType::Native => "native",
            ProofType::Sp1 => "sp1",
+            ProofType::Sp1Distributed => "sp1_distributed",


just feel that we don't need to make it explicit, the client does not need to know there is sp1 or sp1_distributed. So merge it into inside sp1 prover might be more reasonable.

@smtmfft So only behave differently based on compile time flags?

I think he talks about runtime ENV variable like SP1_DISTRIBUTE which would indeed be more aligned with the current sp1 behaviour

Yes, like current sp1, you can setup env to customize sp1's behavior without client's awareness.
for example by set/unset SP1_PROVER=network SP1_PRIVATE_KEY=... then raiko get proof from sp1 network/local-cpus, but for client, they are the same sp1 proof.

provers/sp1/driver/src/distributed/worker/pool.rs

smtmfft · 2024-07-12T09:46:08Z

provers/sp1/driver/worker.version

@@ -0,0 +1 @@
+1


better to sth like 0.1.0

Do you think we should directly include the Cargo.toml version field ?

Maybe having a full semver is a bit overkill, as this field is only here to indicate actual breaking changes and incompatibilities between different workers' version. So tracking minors and patches seems irrelevant to me, what do you think ?

Champii · 2024-07-12T11:17:40Z

quickly go through once, seems good to me. just 1 thinking that maybe we should organize it using a new independent crate/repo and upstream to sp1. Because it actually has nothing (just few if not none) to do with raiko, as it's a complement to sp1 local & sp1 network, we can call sp1 with SP1_DISTRIBUTE="url.xxx.xxx", just like current sp1 network and risc0 bonsai.

That was my first approach, but @mratsim advocated for inclusion into raiko directly to avoid having to manage a full sp1 fork. I can put this code back into sp1 but it might adds some complexity

smtmfft · 2024-07-15T00:41:06Z

That was my first approach, but @mratsim advocated for inclusion into raiko directly to avoid having to manage a full sp1 fork. I can put this code back into sp1 but it might adds some complexity

you don't need to manage a full sp1 fork isn't it?, basically what you need is a stable sp1 API set?? I was thinking of it because this solution is more generic than just a raiko sub-module. Anyway, for fork management, I think a reasonable working approach is stick to release to avoid managing unstable forks, as currently taiko-reth does, but sp1 seems do not have one official release so far. BTW: I think once it has, raiko should be switch to that one either to avoid endless upgrading API.

petarvujovic98 · 2024-07-16T12:39:13Z

core/src/interfaces.rs

@@ -119,6 +123,7 @@ impl std::fmt::Display for ProofType {
        f.write_str(match self {
            ProofType::Native => "native",
            ProofType::Sp1 => "sp1",
+            ProofType::Sp1Distributed => "sp1_distributed",


@smtmfft So only behave differently based on compile time flags?

petarvujovic98 · 2024-07-16T12:49:30Z

provers/sp1/driver/src/distributed/worker/pool.rs

+            if let WorkerResponse::Commitment {
+                commitments,
+                shards_public_values,
+            } = response
+            {
+                commitments_vec.extend(commitments);
+                shards_public_values_vec.extend(shards_public_values);
+            } else {
+                return Err(WorkerError::InvalidResponse);
+            }


Could be refactored into let ... else form. Not that important, it makes it into a guard clause which is more readable IMO.

Something like:

Suggested change

if let WorkerResponse::Commitment {

commitments,

shards_public_values,

} = response

{

commitments_vec.extend(commitments);

shards_public_values_vec.extend(shards_public_values);

} else {

return Err(WorkerError::InvalidResponse);

}

let WorkerResponse::Commitment {

commitments,

shards_public_values,

} = response else {

return Err(WorkerError::InvalidResponse);

};

commitments_vec.extend(commitments);

shards_public_values_vec.extend(shards_public_values);

This is indeed more readable :)

petarvujovic98 · 2024-07-16T12:50:12Z

provers/sp1/driver/src/distributed/worker/pool.rs

+            if let WorkerResponse::Proof(partial_proof) = response {
+                proofs.extend(partial_proof);
+            } else {
+                return Err(WorkerError::InvalidResponse);
+            }


Similarly here (see above comment)

petarvujovic98 · 2024-07-16T12:51:31Z

provers/sp1/driver/src/distributed/worker/protocol.rs

+            WorkerProtocol::Request(req) => write!(f, "Request({})", req),
+            WorkerProtocol::Response(res) => write!(f, "Response({})", res),


This could just be a variable capture instead of passing it to the macro:

Suggested change

WorkerProtocol::Request(req) => write!(f, "Request({})", req),

WorkerProtocol::Response(res) => write!(f, "Response({})", res),

WorkerProtocol::Request(req) => write!(f, "Request({req})"),

WorkerProtocol::Response(res) => write!(f, "Response({res})"),

Yeah I should definitely use this feature more, I think it didn't exist when I first learned Rust and I've been stuck with this since

petarvujovic98 · 2024-07-16T12:52:22Z

provers/sp1/driver/src/distributed/worker/socket.rs

+    async fn read_data(&mut self) -> Result<Vec<u8>, WorkerError> {
+        let size = self.socket.read_u64().await? as usize;
+
+        log::debug!("Receiving data with size: {:?}", size);


Suggested change

log::debug!("Receiving data with size: {:?}", size);

log::debug!("Receiving data with size: {size:?}");

petarvujovic98 · 2024-07-16T12:52:36Z

provers/sp1/driver/src/distributed/worker/socket.rs

+                    // TODO: handle the case where the data is bigger than expected
+                }
+                Err(e) => {
+                    log::error!("failed to read from socket; err = {:?}", e);


Suggested change

log::error!("failed to read from socket; err = {:?}", e);

log::error!("failed to read from socket; err = {e:?}");

petarvujovic98 · 2024-07-29T12:57:49Z

provers/sp1/driver/src/lib.rs

-            .verify(&proof, &vk)
-            .map_err(|_| ProverError::GuestError("Sp1: verification failed".to_owned()))?;
+        client.verify(&proof, &vk).map_err(|e| {
+            ProverError::GuestError(format!("Sp1: verification failed: {:#?}", e).to_owned())


Suggested change

ProverError::GuestError(format!("Sp1: verification failed: {:#?}", e).to_owned())

ProverError::GuestError(format!("Sp1: verification failed: {e:#?}").to_owned())

mratsim · 2024-07-31T07:39:16Z

Cargo.toml

-sp1-helper = { git = "https://github.com/succinctlabs/sp1.git", branch = "main" }
+sp1-sdk = { git = "https://github.com/succinctlabs/sp1.git", rev = "dd032eb23949828d244d1ad1f1569aa78155837c" }
+sp1-zkvm = { git = "https://github.com/succinctlabs/sp1.git", rev = "dd032eb23949828d244d1ad1f1569aa78155837c" }
+sp1-helper = { git = "https://github.com/succinctlabs/sp1.git", rev = "dd032eb23949828d244d1ad1f1569aa78155837c" }


This should use the v1.0.1 tag instead of the commit hash, it would be easier for future maintenance as people won't try to hunt "what is the specific feature we depend on that was added in that commit?"

I'm waiting for taikoxyz/sp1#1 to be merged in the taiko branch and I'll use that particular revision as a source instead, as the tag v1.0.1 will be obsolete. Should we add our own versioning system on top of sp1's and refer to that ?

maybe v1.0.1-taiko-1, you're the SP1 maintainer now ;). cc @smtmfft @Brechtpd @petarvujovic98

mratsim · 2024-07-31T07:40:37Z

Cargo.toml

+p3-challenger = { git = "https://github.com/Champii/Plonky3.git", branch = "serde_patch" }
+p3-poseidon2 = { git = "https://github.com/Champii/Plonky3.git", branch = "serde_patch" }
+p3-baby-bear = { git = "https://github.com/Champii/Plonky3.git", branch = "serde_patch" }
+p3-symmetric = { git = "https://github.com/Champii/Plonky3.git", branch = "serde_patch" }


this should be added to the taikoxyz org, or maybe a taikoxyz-patches special org to avoid pollution if across Raiko and Gwyneth we expect lots of long-term patches. cc @Brechtpd

I agree. Could anybody fork Plonky3 so that I can make a PR for that too ? :)

mratsim · 2024-07-31T07:44:27Z

host/src/lib.rs

@@ -88,6 +103,10 @@ impl Opts {
        "0.0.0.0:8080".to_string()
    }

+    fn default_sp1_worker_address() -> String {
+        "0.0.0.0:8081".to_string()


Suggested change

"0.0.0.0:8081".to_string()

// Placeholder address

"0.0.0.0:8081".to_string()

AFAIK in some cases 0.0.0.0 is/was use for broadcasting so I think it should be clarified it's a placeholder until we have a valid address.

It is the listening address of the worker socket, I always saw this format to indicate to listen on every interfaces. I'll do some more research but that's a sensible default in my opinion.

Champii requested review from Brechtpd and petarvujovic98 July 2, 2024 02:28

Champii mentioned this pull request Jul 2, 2024

WIP: First very dirty prototype of SP1 distributed proof #283

Closed

Champii had a problem deploying to test-environment July 2, 2024 02:30 — with GitHub Actions Failure

Champii temporarily deployed to test-environment July 2, 2024 02:37 — with GitHub Actions Inactive

Champii changed the title ~~WIP: First working version of SP1 Distributed Prover~~ [WIP] feat(distributed_sp1): First working version of SP1 Distributed Prover Jul 2, 2024

Champii commented Jul 2, 2024

View reviewed changes

Champii temporarily deployed to test-environment July 2, 2024 17:14 — with GitHub Actions Inactive

Champii force-pushed the sp1_distributed_prover branch from 3229632 to a0272b8 Compare July 2, 2024 17:25

Champii temporarily deployed to test-environment July 2, 2024 17:26 — with GitHub Actions Inactive

Champii temporarily deployed to test-environment July 2, 2024 18:11 — with GitHub Actions Inactive

Champii force-pushed the sp1_distributed_prover branch 2 times, most recently from 1c98633 to 21a6338 Compare July 2, 2024 18:23

Champii temporarily deployed to test-environment July 2, 2024 18:25 — with GitHub Actions Inactive

Champii force-pushed the sp1_distributed_prover branch from 21a6338 to 19084ab Compare July 2, 2024 18:40

Champii temporarily deployed to test-environment July 2, 2024 18:41 — with GitHub Actions Inactive

Champii force-pushed the sp1_distributed_prover branch from 19084ab to 0b10003 Compare July 2, 2024 18:45

Champii temporarily deployed to test-environment July 2, 2024 18:46 — with GitHub Actions Inactive

Champii force-pushed the sp1_distributed_prover branch from 0b10003 to 03ff234 Compare July 3, 2024 13:54

Champii temporarily deployed to test-environment July 3, 2024 13:55 — with GitHub Actions Inactive

Champii force-pushed the sp1_distributed_prover branch from 03ff234 to 1dc27c3 Compare July 3, 2024 14:33

Champii temporarily deployed to test-environment July 3, 2024 14:35 — with GitHub Actions Inactive

Champii force-pushed the sp1_distributed_prover branch from 1dc27c3 to c52bc22 Compare July 3, 2024 15:27

Champii temporarily deployed to test-environment July 3, 2024 15:28 — with GitHub Actions Inactive

Champii force-pushed the sp1_distributed_prover branch from c52bc22 to 5435c64 Compare July 3, 2024 15:44

Champii temporarily deployed to test-environment July 3, 2024 15:45 — with GitHub Actions Inactive

Champii force-pushed the sp1_distributed_prover branch from 5435c64 to 85fff43 Compare July 3, 2024 16:02

Champii temporarily deployed to test-environment July 3, 2024 16:03 — with GitHub Actions Inactive

Champii temporarily deployed to test-environment July 4, 2024 07:02 — with GitHub Actions Inactive

Champii force-pushed the sp1_distributed_prover branch from 2680485 to 31ef3a2 Compare July 4, 2024 07:14

Champii force-pushed the sp1_distributed_prover branch 2 times, most recently from 4db0ce9 to e14f6d6 Compare July 10, 2024 13:27

Champii temporarily deployed to test-environment July 10, 2024 13:28 — with GitHub Actions Inactive

Champii force-pushed the sp1_distributed_prover branch from e14f6d6 to b7bf25c Compare July 10, 2024 13:37

Champii temporarily deployed to test-environment July 10, 2024 13:37 — with GitHub Actions Inactive

Champii force-pushed the sp1_distributed_prover branch from b7bf25c to 3bad047 Compare July 10, 2024 13:41

Champii temporarily deployed to test-environment July 10, 2024 13:42 — with GitHub Actions Inactive

smtmfft reviewed Jul 12, 2024

View reviewed changes

petarvujovic98 approved these changes Jul 16, 2024

View reviewed changes

Champii force-pushed the sp1_distributed_prover branch 2 times, most recently from 1a7cdae to 810d241 Compare July 24, 2024 09:39

Champii had a problem deploying to test-environment July 24, 2024 09:41 — with GitHub Actions Failure

Champii force-pushed the sp1_distributed_prover branch from 810d241 to 0c4a1ea Compare July 24, 2024 09:49

Champii had a problem deploying to test-environment July 24, 2024 09:51 — with GitHub Actions Failure

Champii force-pushed the sp1_distributed_prover branch from 0c4a1ea to debe988 Compare July 24, 2024 09:53

Champii had a problem deploying to test-environment July 24, 2024 09:54 — with GitHub Actions Failure

First working version of SP1 Distributed Prover

d8738fd

Champii force-pushed the sp1_distributed_prover branch from debe988 to d8738fd Compare July 24, 2024 09:59

Champii had a problem deploying to test-environment July 24, 2024 10:00 — with GitHub Actions Failure

petarvujovic98 approved these changes Jul 24, 2024

View reviewed changes

Champii added 3 commits July 26, 2024 18:10

Update to new sp1 version

d890b6a

TMP disabled kzg check

76813bb

Add debug for sp1 verification failure

dcd6bc5

petarvujovic98 reviewed Jul 29, 2024

View reviewed changes

Restore kzg check

09fb775

mratsim approved these changes Jul 31, 2024

View reviewed changes

kzg

e258346

mratsim mentioned this pull request Aug 1, 2024

Distributed prover taikoxyz/sp1#1

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(distributed_sp1): first working version of sp1 distributed prover #302

feat(distributed_sp1): first working version of sp1 distributed prover #302

Champii commented Jul 2, 2024 •

edited

Loading

smtmfft left a comment

smtmfft Jul 12, 2024

petarvujovic98 Jul 16, 2024

Champii Jul 16, 2024

smtmfft Jul 16, 2024

smtmfft Jul 12, 2024

Champii Jul 12, 2024

Champii commented Jul 12, 2024

smtmfft commented Jul 15, 2024

petarvujovic98 Jul 16, 2024

petarvujovic98 Jul 16, 2024

Champii Jul 16, 2024

petarvujovic98 Jul 16, 2024

petarvujovic98 Jul 16, 2024

Champii Jul 16, 2024

petarvujovic98 Jul 16, 2024

petarvujovic98 Jul 16, 2024

petarvujovic98 Jul 29, 2024

mratsim Jul 31, 2024

Champii Aug 1, 2024

mratsim Aug 1, 2024

mratsim Jul 31, 2024

Champii Aug 1, 2024

mratsim Jul 31, 2024

Champii Aug 1, 2024

		WorkerProtocol::Request(req) => write!(f, "Request({})", req),
		WorkerProtocol::Response(res) => write!(f, "Response({})", res),

	log::debug!("Receiving data with size: {:?}", size);
	log::debug!("Receiving data with size: {size:?}");

	log::error!("failed to read from socket; err = {:?}", e);
	log::error!("failed to read from socket; err = {e:?}");

	ProverError::GuestError(format!("Sp1: verification failed: {:#?}", e).to_owned())
	ProverError::GuestError(format!("Sp1: verification failed: {e:#?}").to_owned())

	"0.0.0.0:8081".to_string()
	// Placeholder address
	"0.0.0.0:8081".to_string()

feat(distributed_sp1): first working version of sp1 distributed prover #302

Are you sure you want to change the base?

feat(distributed_sp1): first working version of sp1 distributed prover #302

Conversation

Champii commented Jul 2, 2024 • edited Loading

This PR introduces the SP1DistributedProver

Description

How to run

The Orchestrator

The Workers

The Script

What's left to do

What's next

smtmfft left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Champii commented Jul 12, 2024

smtmfft commented Jul 15, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Champii commented Jul 2, 2024 •

edited

Loading

This PR introduces the `SP1DistributedProver`