Dynamic fetch the variable count on PServer #7764
Comments
In the current design, if we want to do something at "runtime", we must implement it in C++ in an operator. In sparse updating (using …
Agree with @typhoonzero: "one RPC call to signal the parameter server." To be specific, I think that call needs to indicate the dynamic number of parameters the trainer is going to send (same as @Yancey1989's proposal). Because the indication call may arrive before the parameter-send calls, the parameter server should start the next round only when the number of parameters received matches the indicated count, rather than when the indication call arrives.
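The waiting rule described above can be sketched as a small barrier object. This is a minimal illustration, not Paddle's actual pserver code; all names (`DynamicBarrier`, `Announce`, `OnReceived`) are hypothetical:

```cpp
// Sketch: the pserver records the count announced by the signal RPC and
// releases the barrier only when the number of received parameter sends
// matches it, regardless of which call arrived first.
#include <condition_variable>
#include <mutex>

class DynamicBarrier {
 public:
  // Called when the trainer's "signal" RPC arrives.
  void Announce(int expected) {
    std::lock_guard<std::mutex> lk(mu_);
    expected_ = expected;
    cv_.notify_all();
  }
  // Called once per received parameter send.
  void OnReceived() {
    std::lock_guard<std::mutex> lk(mu_);
    ++received_;
    cv_.notify_all();
  }
  // Blocks until the received count matches the announced count,
  // then resets state for the next round; returns the count received.
  int Wait() {
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [this] { return expected_ >= 0 && received_ == expected_; });
    int got = received_;
    received_ = 0;
    expected_ = -1;  // -1 means "not yet announced"
    return got;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int expected_ = -1;
  int received_ = 0;
};
```

Note the predicate also requires `expected_ >= 0`, so the barrier cannot release before the signal call has arrived, which is exactly the ordering concern raised above.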
I thought of another way that may be simpler to implement: the client sends …
Thanks to @typhoonzero, I think it's an easy way to implement, and I have some thoughts about it:
Do you mean we don't need to push the SelectedRows into the Queue but execute the Deserialize instantly? If so, I think we need to lock the var and then execute the Deserialize, is that right? And I saw #7801; if we push the var into the Queue, maybe we can use the ThreadPool to speed up the Deserialize.
Yep, do not send it to the queue. And we only need to lock the variable if there is more than one thread writing it at the same time.
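A minimal sketch of the "deserialize in place instead of enqueueing" idea. The types and names here (`SelectedRowsVar`, `ServerScope`, `ApplyUpdate`) are illustrative assumptions, not Paddle's real classes:

```cpp
// Sketch: each variable in the server scope carries its own mutex, which is
// taken only because several RPC handler threads may write the same variable
// concurrently; the incoming rows are applied directly, with no queue.
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <vector>

struct SelectedRowsVar {
  std::mutex mu;              // guards rows/values
  std::vector<int64_t> rows;  // sparse row ids
  std::vector<float> values;  // flattened row data
};

class ServerScope {
 public:
  // Deserialize the incoming message straight into the variable.
  void ApplyUpdate(const std::string& name,
                   const std::vector<int64_t>& rows,
                   const std::vector<float>& values) {
    SelectedRowsVar& var = vars_[name];
    std::lock_guard<std::mutex> lk(var.mu);  // needed only with >1 writer
    var.rows.insert(var.rows.end(), rows.begin(), rows.end());
    var.values.insert(var.values.end(), values.begin(), values.end());
  }

  size_t NumRows(const std::string& name) { return vars_[name].rows.size(); }

 private:
  std::map<std::string, SelectedRowsVar> vars_;
};
```

With a single RPC handler thread the lock is uncontended and effectively free, which is why the comment above says locking is only needed with more than one writer.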
@typhoonzero how would it update the variable in the server's scope directly? Do you mean we add one more block for updating each SelectedRows tensor?
I think we should call …
Agree with @typhoonzero, we cannot use an operator to merge SelectedRows because we cannot confirm the input size of the merge op. FROM @typhoonzero:
I think we don't need to separate these into two parts, just send all variables which have been …
Background
Currently, a trainer instance splits a variable and sends the pieces to multiple pserver instances, and each pserver uses `param_count * fan_in` as the count of variables for one barrier. This is correct for a dense update, but for a sparse update a SelectedRows variable may be an empty Tensor, and we should not send empty variables; otherwise we would do a lot of useless communication.
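As a concrete check of the static formula above (the numbers are made up for illustration), a pserver holding `param_count` splits with `fan_in` trainers expects `param_count * fan_in` sends per barrier:

```cpp
// Illustrative arithmetic for the current static barrier count: a pserver
// holding param_count variable splits, fed by fan_in trainers, waits for
// param_count * fan_in sends before releasing one barrier.
#include <cstddef>

size_t StaticBarrierCount(size_t param_count, size_t fan_in) {
  return param_count * fan_in;
}
```

E.g. 4 splits and 2 trainers give a fixed count of 8, which breaks down as soon as some trainers skip their empty SelectedRows splits.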
Solution
We need a method to calculate the count of variables for one barrier dynamically, instead of using a static number. I think each trainer needs to call a gRPC interface, so the PServer will know how many variables to expect for a barrier.
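On the trainer side, the number to announce is simply the count of non-empty splits. This is a hedged sketch of that counting step only; `Split` and its fields are hypothetical, and the actual gRPC announce call is not shown because its interface is not defined in this issue:

```cpp
// Sketch: before sending, the trainer counts only the SelectedRows splits
// that actually contain rows, and announces that number to each pserver so
// the barrier can wait for the right count.
#include <cstddef>
#include <string>
#include <vector>

struct Split {
  std::string name;  // e.g. the split's variable name
  size_t num_rows;   // 0 for an empty SelectedRows slice
};

// Count the splits that will actually be sent to one pserver.
size_t CountNonEmpty(const std::vector<Split>& splits) {
  size_t n = 0;
  for (const auto& s : splits) {
    if (s.num_rows > 0) ++n;
  }
  return n;
}
```

The trainer would send this count in the signal RPC first, then send only the non-empty splits, matching the dynamic-barrier behavior discussed in the comments above.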