Dynamic fetch the variable count on PServer #7764

Closed
Yancey1989 opened this issue Jan 23, 2018 · 8 comments

@Yancey1989
Contributor

Yancey1989 commented Jan 23, 2018

Background

Currently, a trainer instance splits a variable and sends the pieces to multiple pserver instances, and each pserver uses param_count * fan_in as the number of variables to expect for one barrier.
This is correct for a dense update, but for a sparse update a SelectedRows variable may be an empty Tensor. We should not send the empty variable, because doing so would cause a lot of useless communication.
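
To make the mismatch concrete, here is a minimal sketch of the static count. Only the `param_count * fan_in` formula comes from this issue; the function name and the numbers are hypothetical and are not taken from the PaddlePaddle code base.

```python
def static_barrier_count(param_count: int, fan_in: int) -> int:
    """Number of variables a pserver waits for under the static scheme."""
    return param_count * fan_in


# Hypothetical setup: 3 parameter blocks on this pserver, 2 trainers.
expected = static_barrier_count(param_count=3, fan_in=2)

# Assume each trainer skips one empty SelectedRows gradient to save bandwidth,
# so it sends 2 variables instead of 3.
sent_per_trainer = [2, 2]
received = sum(sent_per_trainer)

# With a static count the barrier is never released: received < expected.
barrier_released = received >= expected
```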

Solution

We need a way to calculate the number of variables for one barrier dynamically, instead of using a static number. I think each trainer needs to call a gRPC interface so that the PServer knows how many variables to expect for a barrier.

@typhoonzero
Contributor

In the current design, if we want to do something at "runtime", we must implement it in C++ as an operator.

In sparse updating (using SelectedRows to update weights), the parameter server doesn't know how many variables it needs to receive from trainers before it can start running its block, so there has to be one RPC call to signal the parameter server.

@helinwang
Contributor

helinwang commented Jan 23, 2018

Agree with @typhoonzero: "one RPC call to signal the parameter server." To be specific, I think that call needs to indicate the dynamic number of parameters the trainer is going to send (same as @Yancey1989's proposal). Because the indication call may arrive before the parameter send calls, the parameter server should start the next round only when the number of parameters received matches, rather than when the indication call arrives.
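
The rule above could be sketched roughly as follows. This is only an illustration of the counting logic, not the gRPC interface itself: the class `DynamicBarrier` and its methods are hypothetical names, and the sketch assumes every trainer announces exactly once per round.

```python
import threading


class DynamicBarrier:
    """Releases a round once every trainer has announced its count and the
    number of received variables equals the announced total."""

    def __init__(self, fan_in: int) -> None:
        self._fan_in = fan_in
        self._announced = {}  # trainer_id -> number of variables it will send
        self._received = 0
        self._lock = threading.Lock()

    def announce(self, trainer_id: int, count: int) -> bool:
        """Indication call: the trainer reports how many variables it sends."""
        with self._lock:
            self._announced[trainer_id] = count
            return self._ready()

    def receive_variable(self) -> bool:
        """Called once for every variable that arrives from any trainer."""
        with self._lock:
            self._received += 1
            return self._ready()

    def _ready(self) -> bool:
        # Arrival of the indication call alone is not enough: all trainers
        # must have announced AND all announced variables must have arrived.
        if len(self._announced) < self._fan_in:
            return False
        return self._received == sum(self._announced.values())


# Usage: the calls may interleave in any order.
barrier = DynamicBarrier(fan_in=2)
steps = [
    barrier.announce(0, 1),      # trainer 0 will send 1 variable
    barrier.receive_variable(),  # arrives before trainer 1 has announced
    barrier.announce(1, 2),      # trainer 1 will send 2 variables
    barrier.receive_variable(),
    barrier.receive_variable(),  # total of 3 received, round can start
]
```

In this sketch the round starts only on the last call, regardless of whether an announcement or a variable arrives first.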

@typhoonzero
Contributor

I thought of another way that may be simpler to implement: the client sends SelectedRows and waits for the send to complete before sending dense variables. When the server receives the SelectedRows message, it should update the variable in the server's scope directly, then the server could run the server-side program as usual.

@Yancey1989
Contributor Author

Yancey1989 commented Jan 24, 2018

Thanks @typhoonzero, I think that's an easy way to implement it, and I have some thoughts about it:

When the server receives the SelectedRows message, it should update the variable in the server's scope directly

Do you mean we don't need to push the SelectedRows into the Queue but execute the Deserialize instantly? If so, I think we need to lock the var and then execute the Deserialize, is that right?

Also, I saw #7801. If we push the var into the Queue, maybe we can use the ThreadPool to speed up the Deserialize.

@typhoonzero
Contributor

Do you mean we don't need to push the SelectedRows into the Queue but execute the Deserialize instantly? If so, I think we need to lock the var and then execute the Deserialize, is that right?

Yep, do not send it to the queue. And we only need to lock the variable if there is more than one thread writing to it at the same time.

@helinwang
Contributor

When the server receives the SelectedRows message, it should update the variable in the server's scope directly

@typhoonzero how does it update the variable in the server's scope directly? Do you mean we add one more block for updating each SelectedRows tensor?

@typhoonzero
Contributor

I think we should call operators::math::scatter::MergeAdd (in selected_rows_functor.h) immediately when receiving one, because we don't know how many input variables the "update" operator should take.
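
As a rough picture of what merging on arrival means, the following pure-Python sketch sums the value rows that share a row index. It only illustrates the idea of a merge-add over SelectedRows-like data; `merge_add` is a hypothetical helper that does not reproduce the signature, data types, or row ordering of the actual `MergeAdd` functor.

```python
def merge_add(rows, values):
    """Sum the value rows that share the same row index.

    rows:   list of row indices, possibly containing duplicates
    values: list of equally sized lists; values[i] belongs to rows[i]
    Returns (unique_rows, merged_values) in first-seen order.
    """
    merged = {}
    for row, value in zip(rows, values):
        if row in merged:
            merged[row] = [a + b for a, b in zip(merged[row], value)]
        else:
            merged[row] = list(value)
    return list(merged.keys()), list(merged.values())


# Row 0 appears twice, so its two value rows are added together.
unique_rows, merged_values = merge_add(
    rows=[0, 3, 0],
    values=[[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
)
# unique_rows   == [0, 3]
# merged_values == [[4.0, 4.0], [2.0, 2.0]]
```

Because each incoming message is folded into the accumulated result as it arrives, the server never needs to know the total number of inputs in advance.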

@Yancey1989
Contributor Author

Yancey1989 commented Jan 25, 2018

Agree with @typhoonzero, we cannot use an operator to merge SelectedRows, because we cannot determine the input size of the merge op in advance.

From @typhoonzero:

the client sends SelectedRows and waits for the send to complete before sending dense variables

I think we don't need to separate these into two parts: just send all variables that have been initialized, and then make a batch barrier, which is more general. Also, if we execute Deserialize on the server side immediately, the client side will wait much longer than if we just put them into the Queue, so I think putting the SelectedRows into the Queue may be more efficient.
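
A minimal sketch of the queue-based variant argued for above, under stated assumptions: `on_receive`, `deserialize`, and `drain_and_deserialize` are hypothetical names, JSON stands in for the real wire format, and Python's standard `ThreadPoolExecutor` stands in for the ThreadPool mentioned earlier in the thread.

```python
import json
import queue
from concurrent.futures import ThreadPoolExecutor

recv_queue = queue.Queue()


def on_receive(payload: bytes) -> None:
    """RPC handler stand-in: enqueue the raw bytes and return at once,
    so the sending client is not blocked by deserialization."""
    recv_queue.put(payload)


def deserialize(payload: bytes) -> dict:
    """Stand-in for Deserialize; JSON is used purely for illustration."""
    return json.loads(payload.decode("utf-8"))


def drain_and_deserialize(max_workers: int = 4) -> list:
    """After the batch barrier, deserialize everything queued in parallel."""
    payloads = []
    while not recv_queue.empty():
        payloads.append(recv_queue.get())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the queued payloads.
        return list(pool.map(deserialize, payloads))


# Usage: two variables arrive, then the barrier triggers the drain.
on_receive(json.dumps({"name": "w1", "rows": [0, 3]}).encode("utf-8"))
on_receive(json.dumps({"name": "w2", "rows": []}).encode("utf-8"))
variables = drain_and_deserialize()
```

The trade-off in this sketch is that the handler returns quickly, while the server pays the deserialization cost later and has to hold the raw payloads in memory until the barrier.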
