NodeJS multi-thread support #64

Open
dapplion opened this issue Mar 13, 2021 · 19 comments

@dapplion
Contributor

Current NodeJS bindings are not able to run reliably in a multi-threaded setup with worker_threads. Most of the time the parent process crashes with a segmentation fault. I'm working on creating a minimal reproducible POC that always crashes, but it's proving very difficult.

First of all, could you say whether the current SWIG NodeJS bindings are safe to use across multiple threads? There are some directions in the NodeJS docs about loading the same bindings in multiple threads. I wonder whether the current SWIG setup follows any of these directions. https://nodejs.org/api/addons.html#addons_worker_support

In our particular case I would be happy with a threaded setup where threads don't share any resources and only communicate with the main thread through messages.

@dot-asm
Collaborator

dot-asm commented Mar 13, 2021

As I said yesterday: "There is a remaining issue with Workers, and it's being worked on." In other words, the problem is recognized and even understood, so you don't have to spend time on a PoC, just be a little bit patient :-) Meanwhile, require("blst") in the main script before you create workers, even if the main script doesn't use it; this minimizes the chance of a crash at startup...
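
For illustration, a minimal sketch of that workaround ("./blst" and "./verify-worker.js" are placeholder paths, not the actual module ids):

const { Worker } = require("worker_threads");

// Load the bindings in the main thread first, even if this script never calls
// into them, so the one-time SWIG initialization runs before any worker starts.
require("./blst");

// Workers spawned afterwards can require the bindings themselves.
const worker = new Worker("./verify-worker.js");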

@dapplion
Contributor Author

Thanks! Would you mind sharing a bit about your understanding of the problem?

Meanwhile, require("blst") in the main script before you create workers, even if the main script doesn't use it; this minimizes the chance of a crash at startup...

A dirty workaround that seems to work consistently is to duplicate the .node file so that each copy is required by only one thread.
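
Roughly, the hack amounts to something like this (the file names and the workerId field are made up for illustration):

// Hypothetical: blst_0.node, blst_1.node, ... are byte-for-byte copies of the
// compiled addon, so each thread loads its own instance and nothing is shared.
const { workerData } = require("worker_threads");
const blst = require(`./blst_${workerData.workerId}.node`);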

@dot-asm
Collaborator

dot-asm commented Mar 13, 2021

The problem is two-fold: a) initialization of SWIG private structures is not MT-safe; b) it's not sufficient to make the init context-aware, it has to be isolate-aware. The first problem manifests as crashes at require time and can be mitigated by require-ing blst before creating workers; I have a solution for it. The second problem customarily manifests as a crash when a thread or the whole script finishes; this one is being worked on. Duplicating the .node file should indeed do the trick for both, but it is obviously suboptimal.

@dot-asm
Collaborator

dot-asm commented Mar 22, 2021

@dapplion, could you test swig/swig#1973?

@dapplion
Contributor Author

@dapplion, could you test swig/swig#1973?

Haven't had a chance to test it yet. Will update when I do; let me know if you have any updates too. Thanks!

@dapplion
Contributor Author

@dapplion, could you test swig/swig#1973?

Early tests seem to work fine! Thank you so much for pushing this 🎉 Will work on a POC inside Lodestar to confirm it's all good.

@dapplion
Contributor Author

@dot-asm Now that we have unlocked the gates of parallelization, do you think it would be possible to send a pairing context between Javascript workers? The current SWIG-produced Pairing class doesn't expose any way to get a cloneable data structure that can be sent in a Worker message.

  • Is the Pairing context "serializable" in some way? i.e. can we get its bytes as a Buffer?

@dot-asm
Collaborator

dot-asm commented Apr 1, 2021

This is what I customarily call a "wrong-attitude" question :-) When it comes to software, the question shouldn't be whether something is possible, but how to achieve the goal. So let's start with the goal. Or rather, take the next step from the stated goal and send the pairing context between workers. What is possible to send? I mean, I don't actually know. Is it possible to send ArrayBuffers? To express it in a more programmer-friendly manner, would the following be useful/usable as a TypeScript prototype:

--- a/bindings/node.js/blst.hpp.ts
+++ b/bindings/node.js/blst.hpp.ts
@@ -190,7 +190,8 @@ export interface Pairing {
     aug?: bytes
   ): BLST_ERROR;
   commit(): void;
-  merge(ctx: Pairing): BLST_ERROR;
+  asArrayBuffer(): ArrayBuffer;
+  merge(ctx: Pairing | ArrayBuffer): BLST_ERROR;
   finalverify(sig?: PT): boolean;
 }
 

One can probably even say that merge could accept only ArrayBuffer...
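
For illustration, a rough sketch of how the proposed asArrayBuffer/merge pair could be used across workers. Only asArrayBuffer and merge(ArrayBuffer) come from the prototype above; the Pairing constructor and aggregate argument shapes are assumptions, and worker / rootPairing stand for objects created in the surrounding setup:

// Worker side (sketch): build a partial pairing context and hand it to the main thread.
const { parentPort, workerData } = require("worker_threads");
const blst = require("./blst");

const pairing = new blst.Pairing(/* constructor arguments assumed */);
for (const set of workerData.sets)
  pairing.aggregate(set.pk, set.sig, set.msg);   // argument shape assumed
pairing.commit();
const ab = pairing.asArrayBuffer();              // proposed addition
parentPort.postMessage(ab, [ab]);                // transfer the buffer instead of copying it

// Main-thread side (sketch): merge each partial context into a root Pairing,
// then call finalverify() once every worker has reported.
worker.on("message", (ab) => rootPairing.merge(ab));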

@dapplion
Contributor Author

dapplion commented Apr 1, 2021

What is possible to send?

Primitive types, objects and arrays of primitive types, and TypedArray instances (Buffer, ArrayBuffer, Uint8Array, etc.). The stricter definition is: any argument consumable by this function https://nodejs.org/api/v8.html#v8_v8_serialize_value
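
For instance, all of the following survive the v8 serializer, while native-backed objects are exactly what doesn't (a quick check, nothing blst-specific):

const v8 = require("v8");

// Plain data, typed arrays, Buffers and ArrayBuffers serialize fine:
v8.serialize({ n: 1, list: [1, 2, 3], bytes: new Uint8Array(32), buf: Buffer.alloc(8) });

// Objects backed by native memory, such as the SWIG-wrapped Pairing, cannot be
// cloned this way — which is why the context would have to be exposed as raw bytes.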

Is it possible to send ArrayBuffer

Yes!

To express it in a more programmer-friendly manner, would the following be useful/usable as a TypeScript prototype

This API sketch would work well yes.


Some questions to study the viability:

  • Do you know what is the cost in CPU cycles (or some other metric) of turning the Pairing into an ArrayBuffer and vice-versa?
  • What is the expected size in bytes of a Pairing context (as a function of the number of signatures aggregated, if variable)?

To give some context for these questions: I'm wondering what the optimal strategy is to verify a large number of signatures in Javascript / Typescript. A naive approach is to just slice the signature set, divide it among workers, and AND the resulting booleans, as sketched below.
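
A sketch of that naive split (signatureSets and runWorker are placeholders, not part of any existing API):

// Hypothetical: divide the signature sets among N workers; each worker verifies
// its slice independently and returns a boolean, and the main thread ANDs them.
const numWorkers = 4;
const chunks = Array.from({ length: numWorkers }, (_, i) =>
  signatureSets.filter((_, j) => j % numWorkers === i));
const results = await Promise.all(chunks.map((chunk) => runWorker(chunk))); // runWorker: placeholder
const allValid = results.every(Boolean);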

However, there is another strategy: compute partial pairing contexts in each worker and merge them at the end. Seeing that Go, Rust and Nim use that strategy, it would be worth studying its viability: hence my request to get the Pairing context as raw bytes.

It's important to note that in Javascript the communication between threads has significant lag, between 0.05 and 0.2 ms one way. If the message is big enough, serialization becomes slow and turns into a bottleneck too. However, a SharedArrayBuffer can help for big payloads.

Maybe a better strategy would be to do the multi-threading in C land, inside the binding, where inter-thread communication is orders of magnitude faster and sharing memory is okay. So, do you know if it's possible to generate native bindings for Node with SWIG that are capable of doing the multi-threading themselves?

@dot-asm
Collaborator

dot-asm commented Apr 2, 2021

This API sketch would work well yes.

Then add this at the end of the SWIGJAVASCRIPT_V8 section in blst.swg.

// allow Pairing context to be passed and merged as ArrayBuffer...
%extend blst::Pairing {
    // return the raw context pointer; the 'out' typemap below converts it
    blst::Pairing* asArrayBuffer() { return $self; }
}
%typemap(out) blst::Pairing* asArrayBuffer() {
    // copy the pairing context into a newly allocated ArrayBuffer
    size_t sz = blst_pairing_sizeof();
    auto ab = v8::ArrayBuffer::New(v8::Isolate::GetCurrent(), sz);
    memcpy(ab->GetData(), $1, sz);
    $result = ab;
}
%typemap(in) (const blst::Pairing* ctx) %{
    // accept either an ArrayBuffer or a wrapped Pairing object
    if ($input->IsArrayBuffer()) {
        auto ab = v8::Local<v8::ArrayBuffer>::Cast($input);
        if (ab->ByteLength() < blst_pairing_sizeof())
            SWIG_exception_fail(SWIG_TypeError, "in method '$symname', "
                                                "<ArrayBuffer> is too short");
        $1 = reinterpret_cast<$1_ltype>(ab->GetData());
    } else {
        // based on SWIGTYPE* in $SWIG_LIB/typemaps/swigtype.swg
        void *argp = nullptr;
        int res = SWIG_ConvertPtr($input, &argp, $descriptor, $disown);
        if (!SWIG_IsOK(res)) {
            SWIG_exception_fail(SWIG_ArgError(res), "in method '$symname', "
                                                    "argument $argnum of type '$type'");
        }
        $1 = reinterpret_cast<$1_ltype>(argp);
    }
%}

If you confirm that it's working as expected, I'll push to the repository...

Do you know what is the cost in CPU cycles (or some other metric) of turning the Pairing into an ArrayBuffer and vice-versa?

I don't quite follow. If you want to be in the business of counting CPU cycles, how come you ended up on Javascript side? :-) :-) :-) But on a serious note, I can't really afford to delve into this to the depth you might expect. As you can see [above], the overhead cost is allocating some memory from the V8 heap plus a memcpy, and checking the type and casting on the receiving side. And that's about as far as I'd go.

What is the expected size in bytes of a Pairing context (as a function of the number of signatures aggregated, if variable)?

It's always ~3K. Formally speaking, one can truncate a commit-ted Pairing context to less than 1K, but I'd rather not go there, not without a good reason.

@dapplion
Contributor Author

dapplion commented Apr 2, 2021

If you want to be in the business of counting CPU cycles, how come you ended up on Javascript side?

🙃🙃 I'll just say that I'm learning Rust..

On a serious note, our node spends between 50% and 80% of total CPU time doing BLS signature verification, so if we have to spend time optimizing something, this is a great target. Offloading all that work to a worker thread will be a massive improvement too.

the overhead cost is allocating some memory from the V8 heap plus a memcpy, and checking the type and casting on the receiving side. And that's about as far as I'd go.

That's good enough thanks!

It's always ~3K.

Thanks, I don't think it would be a problem at all then, no need to truncate.


I'll report back when I can do a test with your SWIG snippet.

@dot-asm Is your branch swig/swig#1973 safe to use in production?

@dot-asm
Collaborator

dot-asm commented Apr 2, 2021

@dot-asm Is your branch swig/swig#1973 safe to use in production?

I personally would say so, yes.

@dot-asm
Collaborator

dot-asm commented Apr 30, 2021

What's the standing with asArrayBuffer add-on interface?

On a related note, my impression is that passing an ArrayBuffer between threads doesn't actually involve any data duplication or serialization. Or rather, I see no reason why it would have to. The ArrayBuffer data storage appears to be allocated off the V8 pool, and it can be detached and reattached, so the transfer of an ArrayBuffer can be achieved by passing a pointer to the off-pool storage...
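
That matches the transfer-list semantics of postMessage; a small sketch of the behaviour (nothing blst-specific, "./receiver.js" is a placeholder):

const { Worker } = require("worker_threads");
const worker = new Worker("./receiver.js");

const ab = new ArrayBuffer(3072);      // roughly the size of a Pairing context
worker.postMessage(ab, [ab]);          // listed as transferable: ownership moves, no copy

console.log(ab.byteLength);            // 0 — the buffer is detached in the sending thread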

@dapplion
Contributor Author

dapplion commented May 1, 2021

What's the standing with asArrayBuffer add-on interface?

On a related note, my impression is that passing an ArrayBuffer between threads doesn't actually involve any data duplication or serialization. Or rather, I see no reason why it would have to. The ArrayBuffer data storage appears to be allocated off the V8 pool, and it can be detached and reattached, so the transfer of an ArrayBuffer can be achieved by passing a pointer to the off-pool storage...

Yeah, that sounds right. I haven't had time to implement the transfer-Pairing strategy, so I can't comment on that front yet. I did some benchmarks and even though transferring an ArrayBuffer to threads is done with pointers, there's still a time cost in message passing, which is around 1ms per one-way transfer in a few machines I've tested.

Because of that, the current strategy is to bundle as many signature sets as possible per message, to reduce communication between workers and the main thread without under-utilizing the workers.

@dot-asm
Collaborator

dot-asm commented May 3, 2021

there's still a time cost in message passing, which is around 1ms per one-way transfer in a few machines I've tested.

"1ms" as in "millisecond"? If so, then it doesn't sound right. LAN is faster than that. If real, then node should be notified...

@dapplion
Contributor Author

dapplion commented May 3, 2021

there's still a time cost in message passing, which is around 1ms per one-way transfer in a few machines I've tested.

"1ms" as in "millisecond"? If so, then it doesn't sound right. LAN is faster than that. If real, then node should be notified...

😆 Yeah, it's ridiculous, but that's the state of the art. Can you run this locally to confirm I'm not crazy? https://github.com/ChainSafe/blst-ts/blob/master/benchmark/workerMessagePassing.js

@dot-asm
Collaborator

dot-asm commented May 4, 2021

Can you run this locally to confirm I'm not crazy?

I can run it, but I can't confirm you're not crazy:-):-):-) Not qualified, you know...:-):-):-)
But on a serious note! For starters, I don't actually see 1ms:

0 main -> worker 0.7323275699999999 ms
0 worker -> main 0.11640998600000001 ms
1 main -> worker 0.429437391 ms
1 worker -> main 0.123062587 ms
2 main -> worker 0.338589356 ms
2 worker -> main 0.13179326133333333 ms
3 main -> worker 0.295062732 ms
3 worker -> main 0.1400632265 ms
...

However, I reckon that these are not actually representative numbers. First, it's appropriate to disregard the first results, because you have to give the JIT compiler a chance to kick in. Secondly, since you use setTimeout with an increasing timeout value, you allow the processor to ramp down its operating frequency, which manifests itself as operations appearing slower. This is apparently why main-to-worker averages slower than worker-to-main: setTimeout sleeps right before the main-to-worker transfer, so the frequency has dropped, but by the worker-to-main transfer it's ramping back up... Either way, if I fix the frequency and discount the results from the first iteration of the outer loop, I get the following:

1 main -> worker 0.057152096 ms
1 worker -> main 0.060167767999999996 ms
2 main -> worker 0.06846348 ms
2 worker -> main 0.062043496 ms
3 main -> worker 0.07300487333333333 ms
3 worker -> main 0.06437548666666666 ms
...

In other words, messages are passed in ~60μs ("μs" as in "microseconds") at the nominal processor frequency. It's still arguably higher than expected, but I wouldn't be surprised if it's an adequate result for the snippet in question. I would imagine that the process.hrtime() output gets serialized to JSON, which would take its toll...
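
For what it's worth, a minimal round-trip probe along those lines, which avoids setTimeout between sends and discards the warm-up iterations (a sketch, not the benchmark from the repository):

const { Worker, isMainThread, parentPort } = require("worker_threads");

if (isMainThread) {
  const worker = new Worker(__filename);
  const samples = [];
  let t0;
  worker.on("message", () => {
    samples.push(Number(process.hrtime.bigint() - t0) / 1e6); // round-trip, ms
    if (samples.length < 1000) {
      t0 = process.hrtime.bigint();
      worker.postMessage(null);                               // send again immediately
    } else {
      const steady = samples.slice(100);                      // drop JIT warm-up
      console.log("avg round-trip:", steady.reduce((a, b) => a + b, 0) / steady.length, "ms");
      worker.terminate();
    }
  });
  t0 = process.hrtime.bigint();
  worker.postMessage(null);
} else {
  parentPort.on("message", () => parentPort.postMessage(null)); // echo back
}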

@dapplion
Contributor Author

dapplion commented May 4, 2021

I can run it, but I can't confirm you're not crazy:-):-):-) Not qualified, you know...:-):-):-)

Actually, you are super qualified to disregard my initial results! Thanks for taking the time to analyze it; things definitely make sense now. Happy to know the latency is low in practice.

@dot-asm
Collaborator

dot-asm commented May 4, 2021

Just in case, for reference: I used node v14.16. That doesn't mean I know whether there were relevant improvements, or whether prior versions were "bad." It was just a convenient choice.
