Dynamic fetch the variable count on PServer #7764
Comments
In the current design, if we want to do something at "runtime", we must implement it in C++ in an operator. In sparse updating (using …
Agree with @typhoonzero: "one RPC call to signal the parameter server." To be specific, I think that call needs to indicate the dynamic number of parameters the trainer is going to send (same as @Yancey1989's proposal). Because the indication call may arrive before the parameter-send calls, the parameter server should start the next round only when the number of parameters received matches the indicated count, rather than when the indication call arrives.
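The waiting rule described above can be sketched as a small barrier object. This is a minimal illustration, not Paddle's actual pserver code; all names (`DynamicBarrier`, `Announce`, `OnReceived`) are hypothetical:

```cpp
// Sketch: the pserver records the count announced by the signal RPC and
// releases the barrier only when the number of received parameter sends
// matches it, regardless of which call arrived first.
#include <condition_variable>
#include <mutex>

class DynamicBarrier {
 public:
  // Called when the trainer's "signal" RPC arrives.
  void Announce(int expected) {
    std::lock_guard<std::mutex> lk(mu_);
    expected_ = expected;
    cv_.notify_all();
  }
  // Called once per received parameter send.
  void OnReceived() {
    std::lock_guard<std::mutex> lk(mu_);
    ++received_;
    cv_.notify_all();
  }
  // Blocks until the received count matches the announced count,
  // then resets state for the next round; returns the count received.
  int Wait() {
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [this] { return expected_ >= 0 && received_ == expected_; });
    int got = received_;
    received_ = 0;
    expected_ = -1;  // -1 means "not yet announced"
    return got;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int expected_ = -1;
  int received_ = 0;
};
```

Note the predicate also requires `expected_ >= 0`, so the barrier cannot release before the signal call has arrived, which is exactly the ordering concern raised above.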
I thought of another way that may be simpler to implement: the client sends …
Thanks to @typhoonzero, I think it's an easy way to implement, and I have some thoughts about it:
Do you mean we don't need to push the SelectedRows into the Queue but execute the Deserialize instantly? If so, I think we need to lock the var and then execute the Deserialize, is that right? And I saw #7801; if we push the var into the Queue, maybe we can use the ThreadPool to speed up the Deserialize.
Yep, do not send it to the queue. And we only need to lock the variable if there is more than one thread writing it at the same time.
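A minimal sketch of the "deserialize in place instead of enqueueing" idea. The types and names here (`SelectedRowsVar`, `ServerScope`, `ApplyUpdate`) are illustrative assumptions, not Paddle's real classes:

```cpp
// Sketch: each variable in the server scope carries its own mutex, which is
// taken only because several RPC handler threads may write the same variable
// concurrently; the incoming rows are applied directly, with no queue.
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <vector>

struct SelectedRowsVar {
  std::mutex mu;              // guards rows/values
  std::vector<int64_t> rows;  // sparse row ids
  std::vector<float> values;  // flattened row data
};

class ServerScope {
 public:
  // Deserialize the incoming message straight into the variable.
  void ApplyUpdate(const std::string& name,
                   const std::vector<int64_t>& rows,
                   const std::vector<float>& values) {
    SelectedRowsVar& var = vars_[name];
    std::lock_guard<std::mutex> lk(var.mu);  // needed only with >1 writer
    var.rows.insert(var.rows.end(), rows.begin(), rows.end());
    var.values.insert(var.values.end(), values.begin(), values.end());
  }

  size_t NumRows(const std::string& name) { return vars_[name].rows.size(); }

 private:
  std::map<std::string, SelectedRowsVar> vars_;
};
```

With a single RPC handler thread the lock is uncontended and effectively free, which is why the comment above says locking is only needed with more than one writer.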
@typhoonzero how would it update the variable in the server's scope directly? Do you mean we add one more block for updating each SelectedRows tensor?
I think we should call …
Agree with @typhoonzero, we cannot use an operator to merge SelectedRows because we cannot confirm the input size of the merge op. FROM @typhoonzero:
I think we don't need to separate these into two parts, just send all variables which have been …
Background
Currently, a trainer instance splits a variable and sends the pieces to multiple pserver instances, and each pserver uses `param_count * fan_in` as the count of variables for one barrier. This is correct for a dense update, but for a sparse update a SelectedRows variable may be an empty Tensor, and we should not send empty variables; otherwise we would do a lot of useless communication.
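As a concrete check of the static formula above (the numbers are made up for illustration), a pserver holding `param_count` splits with `fan_in` trainers expects `param_count * fan_in` sends per barrier:

```cpp
// Illustrative arithmetic for the current static barrier count: a pserver
// holding param_count variable splits, fed by fan_in trainers, waits for
// param_count * fan_in sends before releasing one barrier.
#include <cstddef>

size_t StaticBarrierCount(size_t param_count, size_t fan_in) {
  return param_count * fan_in;
}
```

E.g. 4 splits and 2 trainers give a fixed count of 8, which breaks down as soon as some trainers skip their empty SelectedRows splits.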
Solution
We need a method to calculate the count of variables for one barrier dynamically, instead of using a static number. I think each trainer needs to call a gRPC interface, so the PServer will know how many variables to expect for a barrier.
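On the trainer side, the number to announce is simply the count of non-empty splits. This is a hedged sketch of that counting step only; `Split` and its fields are hypothetical, and the actual gRPC announce call is not shown because its interface is not defined in this issue:

```cpp
// Sketch: before sending, the trainer counts only the SelectedRows splits
// that actually contain rows, and announces that number to each pserver so
// the barrier can wait for the right count.
#include <cstddef>
#include <string>
#include <vector>

struct Split {
  std::string name;  // e.g. the split's variable name
  size_t num_rows;   // 0 for an empty SelectedRows slice
};

// Count the splits that will actually be sent to one pserver.
size_t CountNonEmpty(const std::vector<Split>& splits) {
  size_t n = 0;
  for (const auto& s : splits) {
    if (s.num_rows > 0) ++n;
  }
  return n;
}
```

The trainer would send this count in the signal RPC first, then send only the non-empty splits, matching the dynamic-barrier behavior discussed in the comments above.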