
How should the scope be implemented in a distributed environment? #3825

Closed
dzhwinter opened this issue Sep 2, 2017 · 4 comments


@dzhwinter
Contributor

dzhwinter commented Sep 2, 2017

Background

In our current scope design, we get a variable's value by its name string. If variables are distributed across multiple GPUs, or even multiple machines, the implementation differs from the local single-machine mode.
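To make the background concrete, here is a minimal sketch (hypothetical, not Paddle's actual implementation) of a scope that resolves variables by name string, with fallback to an enclosing scope as in the scope design doc:

```python
# Hypothetical sketch of name-based variable lookup in a nested scope.
class Scope:
    def __init__(self, parent=None):
        self.parent = parent   # enclosing scope, or None for the global scope
        self.vars = {}         # name -> value

    def new_var(self, name, value):
        self.vars[name] = value
        return value

    def find_var(self, name):
        # Look locally first, then fall back to the parent chain.
        if name in self.vars:
            return self.vars[name]
        if self.parent is not None:
            return self.parent.find_var(name)
        return None

global_scope = Scope()
global_scope.new_var("W", [[0.1, 0.2]])
local_scope = Scope(parent=global_scope)
assert local_scope.find_var("W") == [[0.1, 0.2]]  # resolved via parent scope
```

The question below is what this name-based lookup should mean once the named variable may live on another GPU or another machine.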

Questions

1. Do we really need an API function `value = get_variable_value(remote_server/job_name, var_name)`? And a distributed version of it?
2. How should the scope be partitioned and distributed across GPUs/nodes?
3. Suppose users need the `get_variable_value` interface for debugging, and every node shares a global scope. Should the variable name record its node/GPU location? How can that stay compatible with the current variable-naming convention?
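One possible answer to question 3, sketched purely as an illustration (this naming convention is an assumption, not an existing Paddle convention): embed the node/GPU location into the variable name and provide helpers to split it back out, so name-based lookup keeps working on the bare name.

```python
# Hypothetical location-qualified variable names (illustration only).
def qualify(node, gpu, var_name):
    # e.g. qualify(2, 0, "fc1.w") -> "/node-2/gpu-0/fc1.w"
    return f"/node-{node}/gpu-{gpu}/{var_name}"

def split_location(qualified):
    # Inverse of qualify(): recover (node, gpu, bare_name).
    _, node, gpu, name = qualified.split("/")
    return int(node[len("node-"):]), int(gpu[len("gpu-"):]), name

q = qualify(2, 0, "fc1.w")
assert q == "/node-2/gpu-0/fc1.w"
assert split_location(q) == (2, 0, "fc1.w")
```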

@dzhwinter
Contributor Author

Related design docs: scope, parameter server.

@wangkuiyi
Collaborator

I tend to believe we shouldn't and couldn't provide

value = get_variable_value(job_name, var_name)

In distributed computing, the values of a variable may change differently on different worker nodes; some nodes run faster than others, so a consistent snapshot of a variable across nodes is hard to define.

For debugging purposes, I think we can add a PrintOp to the net, perhaps guarded by an IfOp:

layer.if(cond, layer.print(x1))
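A rough sketch of what this suggestion amounts to (the `layer.if` / `layer.print` names above are the proposal's, not a real API here; the functions below are hypothetical stand-ins):

```python
# Hypothetical PrintOp: dump the value as a side effect, pass it through.
def print_op(x, out):
    print(x)
    out.append(x)

# Hypothetical IfOp: only run the guarded op when the condition holds,
# so the print fires on, say, one node or one batch instead of everywhere.
def if_op(cond, then_branch, x, out):
    if cond:
        then_branch(x, out)
    else:
        out.append(x)   # condition false: pass the value through silently

collected = []
if_op(cond=True, then_branch=print_op, x=[1.0, 2.0], out=collected)
assert collected == [[1.0, 2.0]]
```

The point of the guard is that an unconditional PrintOp would spam output from every worker on every iteration.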

@helinwang
Contributor

helinwang commented Sep 8, 2017

> 1. Do we really need an API function `value = get_variable_value(remote_server/job_name, var_name)`? And a distributed version of it?

I don't think we need this API, even when training locally. Instead, we can use

[w, b] = pd.eval(target=["W", "b"])

In the implementation, our engine will add a fetch OP to the graph, and `pd.eval` will return its output.
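A minimal sketch of this fetch-OP mechanism (hypothetical; the `Graph` and `eval` below are illustrative stand-ins, not Paddle's engine): `eval` appends one fetch OP per requested target, runs the graph, and returns the fetched values, so no `get_variable_value(server, name)` API is needed.

```python
# Hypothetical graph engine where eval() works via appended fetch OPs.
class Graph:
    def __init__(self, variables):
        self.variables = variables   # name -> value after a run
        self.ops = []

    def add_fetch_op(self, name):
        self.ops.append(("fetch", name))

    def run(self):
        # Execute ops; here only fetch OPs, which copy values out.
        fetched = {}
        for kind, name in self.ops:
            if kind == "fetch":
                fetched[name] = self.variables[name]
        return fetched

def graph_eval(graph, targets):
    for t in targets:
        graph.add_fetch_op(t)     # the engine inserts fetch OPs on demand
    fetched = graph.run()
    return [fetched[t] for t in targets]

g = Graph({"W": [0.5], "b": [0.1]})
w, b = graph_eval(g, targets=["W", "b"])
assert (w, b) == ([0.5], [0.1])
```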

> 2. How should the scope be partitioned and distributed across GPUs/nodes?

I think #3747 and #3769 already talk about it. It's a big question; maybe we should create a new issue for it?

> 3. Suppose users need the `get_variable_value` interface for debugging, and every node shares a global scope. Should the variable name record its node/GPU location? How can that stay compatible with the current variable-naming convention?

There will not be duplicate names in the graph across multiple nodes. For multiple GPUs, we can give GPU 0 the true variable name and give the replicated variables on the other GPUs generated names.
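Sketched under the assumptions above (the suffix format is illustrative, not a fixed convention): the copy on GPU 0 keeps the true variable name, while replicas on other GPUs get uniquely suffixed names so the graph never contains duplicates.

```python
import itertools

# Global counter guaranteeing unique suffixes across all replicated vars.
_replica_counter = itertools.count()

def replica_names(var_name, num_gpus):
    """Name the copies of one variable across num_gpus devices."""
    names = []
    for gpu in range(num_gpus):
        if gpu == 0:
            names.append(var_name)   # GPU 0 keeps the "true" name
        else:
            names.append(f"{var_name}@replica_{next(_replica_counter)}")
    return names

names = replica_names("fc1.w", 3)
assert names[0] == "fc1.w"       # true name on GPU 0
assert len(set(names)) == 3      # no duplicate names in the graph
```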

@dzhwinter
Contributor Author

This issue was a proposal; it has since been implemented in the latest develop branch.
