
How should the scope be implemented in a distributed environment? #3825

Closed
dzhwinter opened this issue Sep 2, 2017 · 4 comments


@dzhwinter
Contributor

dzhwinter commented Sep 2, 2017

Background

In our current scope design, we get a variable's value by its name string. If variables are distributed across multiple GPUs, or even multiple machines, the implementation differs from the local single-machine mode.
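To make the background concrete, here is a minimal sketch (hypothetical, not Paddle's actual implementation) of a scope that resolves variables by name string, with fallback to an enclosing scope as in the scope design doc:

```python
# Hypothetical sketch of name-based variable lookup in a nested scope.
class Scope:
    def __init__(self, parent=None):
        self.parent = parent   # enclosing scope, or None for the global scope
        self.vars = {}         # name -> value

    def new_var(self, name, value):
        self.vars[name] = value
        return value

    def find_var(self, name):
        # Look locally first, then fall back to the parent chain.
        if name in self.vars:
            return self.vars[name]
        if self.parent is not None:
            return self.parent.find_var(name)
        return None

global_scope = Scope()
global_scope.new_var("W", [[0.1, 0.2]])
local_scope = Scope(parent=global_scope)
assert local_scope.find_var("W") == [[0.1, 0.2]]  # resolved via parent scope
```

The question below is what this name-based lookup should mean once the named variable may live on another GPU or another machine.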

Questions

1. Do we really need an API function `value = get_variable_value(remote_server/job_name, var_name)`? And a distributed version of it?
2. How should the scope be partitioned and distributed across GPUs/nodes?
3. Suppose users need the `get_variable_value` interface for debugging, and every node shares a global scope. Should the variable name record its node/GPU location? How can that stay compatible with the current variable-naming convention?
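One possible answer to question 3, sketched purely as an illustration (this naming convention is an assumption, not an existing Paddle convention): embed the node/GPU location into the variable name and provide helpers to split it back out, so name-based lookup keeps working on the bare name.

```python
# Hypothetical location-qualified variable names (illustration only).
def qualify(node, gpu, var_name):
    # e.g. qualify(2, 0, "fc1.w") -> "/node-2/gpu-0/fc1.w"
    return f"/node-{node}/gpu-{gpu}/{var_name}"

def split_location(qualified):
    # Inverse of qualify(): recover (node, gpu, bare_name).
    _, node, gpu, name = qualified.split("/")
    return int(node[len("node-"):]), int(gpu[len("gpu-"):]), name

q = qualify(2, 0, "fc1.w")
assert q == "/node-2/gpu-0/fc1.w"
assert split_location(q) == (2, 0, "fc1.w")
```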

@dzhwinter
Contributor Author

Related design docs: scope, parameter server.

@wangkuiyi
Collaborator

I tend to believe we shouldn't and couldn't provide

value = get_variable_value(job_name, var_name)

In distributed computing, the values of a variable may change differently on different worker nodes; some nodes run faster than others, so a consistent snapshot of a variable across nodes is hard to define.

For debugging purposes, I think we can add a PrintOp to the net, perhaps guarded by an IfOp:

layer.if(cond, layer.print(x1))
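A rough sketch of what this suggestion amounts to (the `layer.if` / `layer.print` names above are the proposal's, not a real API here; the functions below are hypothetical stand-ins):

```python
# Hypothetical PrintOp: dump the value as a side effect, pass it through.
def print_op(x, out):
    print(x)
    out.append(x)

# Hypothetical IfOp: only run the guarded op when the condition holds,
# so the print fires on, say, one node or one batch instead of everywhere.
def if_op(cond, then_branch, x, out):
    if cond:
        then_branch(x, out)
    else:
        out.append(x)   # condition false: pass the value through silently

collected = []
if_op(cond=True, then_branch=print_op, x=[1.0, 2.0], out=collected)
assert collected == [[1.0, 2.0]]
```

The point of the guard is that an unconditional PrintOp would spam output from every worker on every iteration.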

@helinwang
Contributor

helinwang commented Sep 8, 2017

> 1. Do we really need an API function `value = get_variable_value(remote_server/job_name, var_name)`? And a distributed version of it?

I don't think we need this API, even when training locally. Instead, we can use

[w, b] = pd.eval(target=["W", "b"])

In the implementation, our engine will add a fetch OP to the graph, and `pd.eval` will return its output.
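A minimal sketch of this fetch-OP mechanism (hypothetical; the `Graph` and `eval` below are illustrative stand-ins, not Paddle's engine): `eval` appends one fetch OP per requested target, runs the graph, and returns the fetched values, so no `get_variable_value(server, name)` API is needed.

```python
# Hypothetical graph engine where eval() works via appended fetch OPs.
class Graph:
    def __init__(self, variables):
        self.variables = variables   # name -> value after a run
        self.ops = []

    def add_fetch_op(self, name):
        self.ops.append(("fetch", name))

    def run(self):
        # Execute ops; here only fetch OPs, which copy values out.
        fetched = {}
        for kind, name in self.ops:
            if kind == "fetch":
                fetched[name] = self.variables[name]
        return fetched

def graph_eval(graph, targets):
    for t in targets:
        graph.add_fetch_op(t)     # the engine inserts fetch OPs on demand
    fetched = graph.run()
    return [fetched[t] for t in targets]

g = Graph({"W": [0.5], "b": [0.1]})
w, b = graph_eval(g, targets=["W", "b"])
assert (w, b) == ([0.5], [0.1])
```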

> 2. How should the scope be partitioned and distributed across GPUs/nodes?

I think #3747 and #3769 already talk about it. It's a big question; maybe we should create a new issue for it?

> 3. Suppose users need the `get_variable_value` interface for debugging, and every node shares a global scope. Should the variable name record its node/GPU location? How can that stay compatible with the current variable-naming convention?

There will not be duplicate names in the graph across multiple nodes. For multiple GPUs, we can give GPU 0 the true variable name and give the replicated variables on the other GPUs generated names.
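Sketched under the assumptions above (the suffix format is illustrative, not a fixed convention): the copy on GPU 0 keeps the true variable name, while replicas on other GPUs get uniquely suffixed names so the graph never contains duplicates.

```python
import itertools

# Global counter guaranteeing unique suffixes across all replicated vars.
_replica_counter = itertools.count()

def replica_names(var_name, num_gpus):
    """Name the copies of one variable across num_gpus devices."""
    names = []
    for gpu in range(num_gpus):
        if gpu == 0:
            names.append(var_name)   # GPU 0 keeps the "true" name
        else:
            names.append(f"{var_name}@replica_{next(_replica_counter)}")
    return names

names = replica_names("fc1.w", 3)
assert names[0] == "fc1.w"       # true name on GPU 0
assert len(set(names)) == 3      # no duplicate names in the graph
```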

@dzhwinter
Contributor Author

This issue was a proposal; it has since been implemented in the latest develop branch.
