Parameter server #57
Thanks for summarizing things. I ran into a first issue: I assume dask array already has serializers for numpy ndarray (and sparse too?). Is there an easy way to re-use them? |
Sorry, I was on master, which supports scattering singletons. Try scattering a list: [future] = client.scatter([x]) |
Thanks - that solved the scatter. But with a simplified version of the ps method:

def parameter_server():
    beta = np.zeros(D)
    with worker_client() as c:
        betas = c.channel('betas', maxlen=1)
        future_beta = c.scatter([beta])
        betas.append(future_beta)

I get:

Exception: TypeError("can't serialize <Future: status: finished, key: c85bbd0d1718128e8eb4b46d0b5940d8>",)

Full stacktrace here: https://gist.github.com/MLnick/b42613104470501546e4e0209c58fa34 |
Your future_beta value is a list. Try unpacking
[future_beta] = c.scatter([beta])
|
Ah thanks! Another question - is there a way to control which workers dask arrays will be located on? i.e. Can I ensure that the dask array X is split among workers w1, w2 only? |
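One way to steer placement is the workers= keyword accepted by Client.scatter, Client.persist, and related calls, which restricts data or tasks to a set of workers. A minimal sketch, with placeholder scheduler and worker addresses:

import numpy as np
import dask.array as da
from dask.distributed import Client

client = Client('tcp://scheduler:8786')   # placeholder scheduler address

# Restrict scattered data to a set of workers with the workers= keyword.
# The worker addresses below are placeholders standing in for w1 and w2.
x = np.random.random((1000, 10))
[x_future] = client.scatter([x], workers=['tcp://w1:8789', 'tcp://w2:8789'])

# The same keyword works when persisting a dask array: its chunks are then
# computed and kept only on the listed workers.
X = da.random.random((100000, 10), chunks=(10000, 10))
X = client.persist(X, workers=['tcp://w1:8789', 'tcp://w2:8789'])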
Thanks - it seems for constructing a dask |
I'm having a weird issue with submitted tasks on workers not being able to see updates in the channel. Here's a simple repro:

In [3]: client = Client()

In [4]: def simple_worker():
   ...:     with worker_client() as c:
   ...:         betas = c.channel('betas')
   ...:         print(betas)
   ...:

In [5]: b = client.channel('betas')
distributed.channels - INFO - Add new client to channel, ec0e59cc-4135-11e7-8f77-a45e60e5f579, betas
distributed.channels - INFO - Add new channel betas

In [6]: b.append("foo")

In [7]: b.flush()

In [8]: res = client.submit(simple_worker)

In [9]: <Channel: betas - 0 elements>
distributed.channels - INFO - Add new client to channel, f7644180-4135-11e7-95ee-a45e60e5f579, betas
distributed.batched - INFO - Batched Comm Closed: |
Hrm, try The worker client may not get updates immediately after creating the channel. It could take a small while. Typically we've resolved this in the past by iterating over the channel like |
Ok yeah - it seems to be a very small delay in being able to view the data. The "iter" works well. This brings up a couple potentially useful things for a
|
So, I have the very basics of a working setup for the PS and a worker, using an Adagrad variant of SGD. I haven't properly "distributed" the gradient computation yet (which should be done async, in parallel, per chunk of the data). But I did verify the Adagrad version gives results close to L-BFGS (so the logic is working). Will post further updates as I go. |
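The gist itself isn't reproduced here; as a point of reference, a minimal numpy sketch of an Adagrad-style update applied to a logistic-regression gradient. The learning rate, epsilon, and toy data are illustrative, not Nick's actual settings:

import numpy as np

def logistic_gradient(beta, X_batch, y_batch):
    """Gradient of the logistic loss on one mini-batch (labels in {0, 1})."""
    p = 1.0 / (1.0 + np.exp(-X_batch.dot(beta)))
    return X_batch.T.dot(p - y_batch) / len(y_batch)

def adagrad_update(beta, grad, accum, lr=0.1, eps=1e-8):
    """One Adagrad step: per-coordinate step sizes from accumulated squared gradients."""
    accum += grad ** 2
    beta -= lr * grad / (np.sqrt(accum) + eps)
    return beta, accum

# toy usage
rng = np.random.RandomState(0)
X, y = rng.randn(100, 5), rng.randint(0, 2, 100)
beta, accum = np.zeros(5), np.zeros(5)
for _ in range(50):
    beta, accum = adagrad_update(beta, logistic_gradient(beta, X, y), accum)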
Yes, so I think that rather than extending channels we might want to build a few new constructs including:
I think that these would solve your problems, would probably be useful in other contexts as well, and would probably not be that much work. This is the sort of task that might interest @jcrist if he's interested in getting into the distributed scheduler. All logic in both cases would be fully sequentialized through the scheduler, so this shouldn't require much in the way of concurrency logic (other than ensuring that state is always valid between tornado yield points). Copying (or improving upon) the channels implementation would probably be a good start. If @jcrist is busy or not very interested in this topic I can probably get to it starting Tuesday. Hopefully this doesn't block @MLnick from making progress on ML work. Hopefully we'll be able to progress in parallel. thoughts? |
Checking in here. @MLnick are you still able to make progress (with what time you have). |
Hey, meant to post an update but been a bit tied up. I got the basics working. Here is the gist. Current limitations / TODOs:
Having said that, it seems to work pretty well in principle, and the "sparse pull" version should be straightforward to add. Note the solution found by the SGD version is different from the L-BFGS one. For this small test case it is very close in terms of accuracy metrics, but for larger problems I think some things will need tweaking (e.g. item (1) above, step sizes, iterations and maybe early stopping criteria). Expanding beyond 1 PS will be more involved since the beta data needs to be sharded across the PS nodes. |
Some feedback on the code:
|
How is this typically done? Do the workers always send updates to both parameter servers? |
It really depends on the architecture. Glint, for example, is built with Akka actors and has a "masterless" architecture. There is a single "actor reference" representing the (client connection to) the array on the PS; it exposes a pull/push interface for the sharded array. One could also have a "master" PS node (or coordinator node) that handles the sharding and splitting up and re-routing requests to the relevant PS. It could get a bit involved depending on implementation details. |
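To make the client-side sharding idea concrete, here is a small sketch of key-to-shard routing under contiguous range partitioning. The partitioning scheme and helper names are illustrative; Glint's own layout differs:

import numpy as np

def group_keys_by_shard(keys, dim, n_shards):
    """Group requested parameter indices by the PS shard that owns them,
    assuming contiguous range partitioning of the parameter vector."""
    shard_size = -(-dim // n_shards)          # ceil(dim / n_shards)
    keys = np.asarray(keys)
    owners = keys // shard_size
    return {s: keys[owners == s] for s in np.unique(owners)}

# a worker that needs indices [3, 17, 42] of a length-50 beta on 2 shards
# would issue one pull per shard:
print(group_keys_by_shard([3, 17, 42], dim=50, n_shards=2))
# {0: array([ 3, 17]), 1: array([42])}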
Another thought is perhaps there's a way to have the PS beta be a dask array (distributed across say 2 PS for example). Then the "reference" to that array is passed to workers. When they compute() on that they should automatically pull the data from where it is living on PS nodes, right? What I'm not sure on is the update part - the Dask array on the PS must be updated and perhaps persisted (to force computation and the latest "view")? That's where I'm hazy. But it seems to me that the handling of the sharding / distributing of the params on the PS could be taken care of more automatically by dask arrays.
|
Yeah, my brain went that way as well. First, the simple way to do this is just to have a channel for each parameter server and to manage the slicing ourselves manually. This isn't automatic but also isn't that hard. However, the broader question you bring up is a dask collection (bag, array, dataframe, delayed) that is backed by a changing set of futures. In principle this is the same as a channel, except that rather than pushing data into a deque we would push data into a random access data structure. I might play with this a bit and see how far I get in a short time. I hope that this doesn't block you on near-term progress though. I suspect that we can go decently far on single-parameter server systems. Do you have a sense for what your current performance bottlenecks are? |
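For the "manage the slicing ourselves" option, a minimal numpy sketch of the bookkeeping involved, independent of the channel machinery (function names are illustrative):

import numpy as np

def make_slices(dim, n_servers):
    """Contiguous slices of the parameter vector, one per parameter server."""
    bounds = np.linspace(0, dim, n_servers + 1).astype(int)
    return [slice(lo, hi) for lo, hi in zip(bounds[:-1], bounds[1:])]

def split_beta(beta, slices):
    return [beta[s] for s in slices]          # one piece per PS channel

def join_beta(pieces):
    return np.concatenate(pieces)             # reassemble on the worker side

slices = make_slices(dim=10, n_servers=2)
beta = np.arange(10.0)
pieces = split_beta(beta, slices)             # these would be pushed to per-PS channels
assert np.allclose(join_beta(pieces), beta)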
I played with and modified your script a bit here: https://gist.github.com/MLnick/27d71e2a809a54d82381428527e4f494 This starts multiple worker-tasks concurrently. It also evaluates the change in beta over time. |
Thanks, will take a look. When I get some more time I will try to do the sparse param updates and then test things out on some larger (sparse) data (e.g. Criteo). |
Whoops, it looks like my last comment copied over MLnick's implementation rather than my altered one. Regardless, here is a new one using Queues and Variables: https://gist.github.com/mrocklin/a92785743744b5c698984e16b7065037 Things look decent, although at the moment the parameter server isn't able to keep up with the workers. We may want to either batch many updates at once or switch to asynchronous work to overlap communication latencies. |
This currently depends on dask/distributed#1133. FWIW, to me this approach feels much nicer than relying on channels as before. Some flaws: as mentioned, the parameter server can't keep up. I suspect that we want a @MLnick if you're looking for a narrative for a blogpost, talk, etc., then we might consider the progression from sequential computation on the parameter server, to batched, to asynchronous/batched. I'm personally curious to see the performance implications of these general choices on this problem. We have (or can easily construct) APIs for all of these fairly easily. |
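Not the gist above, just a rough sketch of the shape such a Queue/Variable parameter server could take, including the "apply whatever has accumulated before republishing" form of batching mentioned here. It assumes Queue and Variable behave as introduced in dask/distributed#1133; the queue/variable names, grad_fn, and the stopping condition are all illustrative:

import numpy as np
from dask.distributed import Queue, Variable, worker_client

def parameter_server(dim, n_updates, lr=0.1):
    """Drain queued gradient futures in batches and publish beta through a Variable."""
    with worker_client() as c:
        updates = Queue('updates', client=c)
        beta_var = Variable('beta', client=c)
        beta = np.zeros(dim)
        [beta_future] = c.scatter([beta])
        beta_var.set(beta_future)
        seen = 0
        while seen < n_updates:
            batch = [updates.get()]              # block for at least one update
            while updates.qsize() > 0:           # then drain whatever else is queued
                batch.append(updates.get())
            for grad in c.gather(batch):         # gradient futures -> numpy arrays
                beta = beta - lr * grad
            seen += len(batch)
            [beta_future] = c.scatter([beta])    # one publish covers the whole batch
            beta_var.set(beta_future)
        return beta

def worker(X_chunk, y_chunk, n_steps, grad_fn):
    """Repeatedly pull the latest beta, compute a local gradient, push it as a future."""
    with worker_client() as c:
        updates = Queue('updates', client=c)
        beta_var = Variable('beta', client=c)
        for _ in range(n_steps):
            beta = beta_var.get().result()       # latest published parameters
            [grad_future] = c.scatter([grad_fn(beta, X_chunk, y_chunk)])
            updates.put(grad_future)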
@MLnick any objection to my including this as an example in a small blogpost? |
I'm happy to give you attribution for the work. |
Sounds great - I won't be able to help with it this week but otherwise happy to work on something or help review.
|
I'd be interested in seeing this comparison, especially as the number of workers/communication channels increases. I've looked through the PS in #57 (comment). As far as I can tell, the main benefit of this PS is async communication: the above distributes communication across many worker-PS channels (which means it's not limited by the bandwidth of one worker-PS channel). Correct?
Can you expand on what you mean by this? I am interested in async updates that do not have locks on |
This is a summary of an e-mail between myself and @MLnick
From Nick
Taking the simplest version of say logistic regression, the basic idea is to split up the parameter vector (beta) itself into chunks (so it could be a dask array potentially, or some other distributed dask data structure). Training would then iterate over "mini-batches" using SGD (let's say each mini-batch is a chunk of the X array). In each mini-batch, the worker will "pull" the latest version of "beta" from the parameter server and compute the (local) gradient for the batch. The worker then sends this gradient to the PS, which then performs the update (i.e. updates its part of "beta" using, say, the gradient update from the worker and the step size). The next iteration then proceeds in the same way. This can be sync or async (but typically is either fully async or "bounded stale" async).
The key is to do this effectively as direct communication from the worker doing the mini batch gradient computation, to the worker holding the parameters (the "parameter server"), without involving the master ("client" app) at all, and to only "pull" and "push" the part of beta required for local computation (due to sparsity this doesn't need to be the full beta in many cases). In situations where the data is very sparse (e.g. like the Criteo data) the communication is substantially reduced in this approach. And the model size can be scaled up significantly (e.g. for FMs the model size can be very large).
This is slightly different from the way, say, L-BFGS works currently (and the way I understand ADMM works in dask-glm): more or less, a set of local computations is performed on the distributed data on the workers, and the results are collected back to the "master", where an update step is performed (using L-BFGS or the averaging of ADMM, respectively). This is also the way Spark does things.
What I'm struggling with is quite how to achieve the PS approach in dask. It seems possible to do it in a few different ways, e.g. perhaps it's possible just using simple distributed dask arrays, or perhaps using "worker_client" and/or Channels. The issue I have is how to let each worker "pull" the latest view of "beta" in each iteration, and how to have each worker "push" its local gradient out to update the "beta" view, without the "master" being involved.
I'm looking into the async work in http://matthewrocklin.com/blog/work/2017/04/19/dask-glm-2 also to see if I can do something similar here.
From me
First, there are two nodes that you might consider the "master", the scheduler and the client. This is somewhat of a deviation from Spark, where they are both in the same spot.
Second, what are your communication and computation requirements? A roundtrip from the client to scheduler to worker to scheduler to client takes around 10ms on a decent network. A worker-worker communication would be shorter, definitely, but may also involve more technology. We can do worker-to-worker direct, but I wanted to make sure that this was necessary.
Channels currently coordinate metadata through the scheduler. They work a bit like this:
So there are a few network hops here, although each should be in the millisecond range (I think?).
We could also set up a proper parameter server structure with single-hop communications. Building these things isn't hard. As usual my goal is to extract from this experiment something slightly more general to see if we can hit a broader use case.
So I guess my questions become:
From Nick
The PS idea is very simple at the high level. The "parameter server" can be thought of as a "distributed key-value store". There could be 1 or more PS nodes (the idea is precisely to allow scaling the size of model parameters across multiple nodes, such as in the case of factorization machines, neural networks etc).
A good reference paper is https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
So in theory, at the start of an iteration, a worker node asks the PS for only the parameters it needs to compute its update (in sparse data situations, this might only be a few % of the overall # features, per "partition" or "batch"). This can be thought of as a set of (key, value) pairs where the keys are vector indices and the values are vector values at the corresponding index, of the parameter vector. In practice, each PS node will hold a "slice" of the parameter vector (the paper uses a chord key layout for example), and will work with vectors rather than raw key-value pairs, for greater efficiency.
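A small sketch of that "pull only what the batch touches" step, with a scipy CSR mini-batch and the sharded store faked as a plain dict of numpy arrays. The shard layout and names are illustrative:

import numpy as np
import scipy.sparse as sp

def needed_indices(X_batch):
    """Global feature indices that actually appear in a sparse CSR mini-batch."""
    return np.unique(X_batch.indices)

def pull(shards, shard_size, idx):
    """Fetch only the requested parameter values from the sharded store."""
    values = np.empty(len(idx))
    for j, i in enumerate(idx):
        values[j] = shards[i // shard_size][i % shard_size]
    return values

# toy setup: a 10-dimensional beta split across two "parameter servers"
shard_size = 5
shards = {0: np.arange(5.0), 1: np.arange(5.0, 10.0)}

X_batch = sp.random(4, 10, density=0.2, format='csr', random_state=0)
idx = needed_indices(X_batch)
beta_sub = pull(shards, shard_size, idx)   # only a few % of beta in sparse problems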
It seems like Channels might be a decent way to go about this. Yes, there is some network comm overhead but in practice for a large scale problem, the time to actually send the data (parameters and gradients say) would dominate the few ms of network hops. This cost could also be partly hidden through async operations.
The way I thought about it with Channels, which you touch on is:
To answer your specific questions:
From me
So here is some code just to get things started off:
For what it's worth I expect this code to fail in some way. I think that channels will probably have to be slightly modified somehow. For example currently we're going to record all of the updates that have been sent to the updates channel. We need to have some way of stating that a reference is no longer needed. Channels need some mechanism to consume and destroy references to futures safely.
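The snippet referred to above didn't survive in this copy of the thread. Purely as a rough sketch of what a channel-based starting point might look like, built only from the calls that appear earlier in the thread (worker_client, c.channel, scatter, append, iteration), with an illustrative dimension and step size; this is not the original code:

import numpy as np
from distributed import worker_client

D = 10     # illustrative parameter dimension
LR = 0.1   # illustrative step size

def parameter_server(n_updates):
    """Hold beta, publish it on a 'betas' channel, and apply incoming gradients."""
    beta = np.zeros(D)
    with worker_client() as c:
        betas = c.channel('betas', maxlen=1)
        updates = c.channel('updates')
        [beta_future] = c.scatter([beta])
        betas.append(beta_future)
        for i, grad_future in enumerate(updates):    # every pushed gradient, in order
            beta = beta - LR * grad_future.result()
            [beta_future] = c.scatter([beta])
            betas.append(beta_future)                # maxlen=1 keeps only the latest
            if i + 1 >= n_updates:
                break

def worker(X_chunk, y_chunk):
    """Pull the latest beta, compute a local logistic gradient, push it back."""
    with worker_client() as c:
        betas = c.channel('betas', maxlen=1)
        updates = c.channel('updates')
        for beta_future in betas:                    # iteration waits for a value (see above)
            beta = beta_future.result()
            break
        p = 1.0 / (1.0 + np.exp(-X_chunk.dot(beta)))
        grad = X_chunk.T.dot(p - y_chunk) / len(y_chunk)
        [grad_future] = c.scatter([grad])
        updates.append(grad_future)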