How should the scope be implemented in a distributed environment? #3825
Comments
Related design docs: scope, parameter server
I tend to believe we shouldn't, and couldn't, provide value = get_variable_value(job_name, var_name). In distributed computing, the value of a variable may change differently on different worker nodes, because some nodes run faster than others, so a snapshot of a variable across nodes is hard to define. For debugging purposes, I think we can add a PrintOp to the net, maybe accompanied by an IfOp: layer.if(cond, layer.print(x1))
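The conditional-print idea above can be sketched in plain Python. This is only a toy illustration: the PrintOp and IfOp classes, the dict-based scope, and the debug_print_every helper are hypothetical stand-ins, not the project's actual operators. The point it tries to show is that each worker prints from its own local scope, so no global snapshot is needed.

```python
# Toy sketch only: PrintOp, IfOp and the dict-based scope are hypothetical
# stand-ins for graph operators, not the project's real API.


class PrintOp:
    """Prints one variable from the scope it is run against."""

    def __init__(self, var_name, sink=print):
        self.var_name = var_name
        self.sink = sink  # injectable so the output can be captured

    def run(self, scope):
        self.sink(f"{self.var_name} = {scope[self.var_name]}")


class IfOp:
    """Runs the wrapped op only when the condition holds for this scope."""

    def __init__(self, cond, op):
        self.cond = cond
        self.op = op

    def run(self, scope):
        if self.cond(scope):
            self.op.run(scope)


def debug_print_every(var_name, every, sink=print):
    """Build an op that prints var_name once every `every` steps."""
    return IfOp(lambda scope: scope["step"] % every == 0,
                PrintOp(var_name, sink))


# Each worker runs the op against its own local scope.
op = debug_print_every("x1", every=2)
for step in range(4):
    local_scope = {"step": step, "x1": step * 0.5}
    op.run(local_scope)
```

Because the op lives inside the net, a slow worker and a fast worker each report the value they actually hold at the moment the op fires, which sidesteps the question of what a cross-node snapshot would mean.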
I don't think we need this API, even when training locally. Instead, we can use [w, b] = pd.eval(target=["W", "b"]). In the implementation, our engine will add an
I think #3747 and #3769 already talk about it. It's a big question; maybe we should create a new issue for it?
There will not be duplicate names in the graph across multiple nodes. For multiple GPUs, we can give the variable on GPU 0 the true variable name, and give the replicated variables on the other GPUs random names.
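A minimal sketch of that naming rule, assuming a hypothetical replica_names helper: GPU 0 keeps the true name and every other GPU gets a randomly generated one. Prefixing the random part with the original name and an "@" separator is an assumption made here for readability; the comment above only asks for random names.

```python
import uuid


def replica_names(var_name, num_gpus):
    """Map each GPU id to the name used for its copy of one variable.

    Hypothetical helper: GPU 0 keeps the true name, the other GPUs get
    a random suffix so that no two copies share a name.
    """
    names = {0: var_name}
    for gpu_id in range(1, num_gpus):
        # uuid4 makes name collisions between replicas practically impossible
        names[gpu_id] = f"{var_name}@{uuid.uuid4().hex}"
    return names


names = replica_names("W", num_gpus=3)
for gpu_id, name in names.items():
    print(gpu_id, name)
```

With this scheme a lookup by the user-visible name always resolves to the copy on GPU 0, while the replicas remain addressable internally through the mapping.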
This issue is a proposal, which has been implemented in the latest develop branch.
Background
In our current scope design, we get a variable's value by its name string. If the variable is distributed across multiple GPUs, or even multiple machines, the implementation differs from local machine mode.
Questions
1. Do we really need an API function value = get_variable_value(remote_server/job_name, var_name)? And a distributed version of it?
2. How should the scope be partitioned and distributed to GPUs/nodes?
3. Suppose users need the get_variable_value interface for debugging, and every node shares a global scope. Should a variable name record its node/GPU location? How can this be made compatible with the current variable name convention?