Add csp.md #7706
Conversation
Thank you for the first version of the design doc. I had a few questions that I have posted in the review comments, along with a few suggested fixes.
doc/design/csp.md
Outdated
@@ -0,0 +1,96 @@
# Design Doc: CSP in PaddlePaddle Fluid

## Motivations
Motivations => Motivation
doc/design/csp.md
Outdated
## Motivations

Concurrent programming is important for deep learning. Example applications include
Example applications include => A few example applications are:
doc/design/csp.md
Outdated
Concurrent programming is important for deep learning. Example applications include

1. A thread uses the GPU for computing while the main thread keeps loading the next minibatch, and
Maybe re-write to:
- The main thread keeps reading the next mini-batch while another thread uses the GPU for computing.
- The main thread performs the computation while another thread uploads the local gradients from each trainer to the parameter server.
doc/design/csp.md
Outdated
1. A thread uses the GPU for computing while the main thread keeps loading the next minibatch, and
1. a thread uploads the local gradients to the parameter server while the main thread keeps computing.

Most DL systems, including TensorFlow, Caffe2, and MxNet, can asynchronously execute operators in a graph. However, Fluid doesn't have the concept graph at all, as the design goal of Fluid is a programming language.
concept => concept of a
is a => is that of a
Thanks! Memorized the expressions!
doc/design/csp.md
Outdated
| message passing | MPI |
| bulk synchronous parallel (BSP) | Pregel distributed programming framework |

Because Fluid was designed to be a programming language, we would like to implement CSP.
Because => Since
implement CSP => implement CSP in Fluid.
doc/design/csp.md
Outdated
The type *channel* is conceptually the blocking queue. In Go, its implemented is a [blocking circular queue](https://github.com/golang/go/blob/68ce117cf17b8debf5754bfd476345779b5b6616/src/runtime/chan.go#L31-L50), which supports send and recv. The challenge lies more in select.

The operation select has been in OS kernels long before Go language. All Unix kernels implement system calls *poll* and *select*. They work by inquiry all file descriptors under their monitoring. This takes O(N) time. Since Linux 2.6, a new system call, *epoll*, can do O(1). In BSD systems, there is a similar system call *kqueue*. Go's Linux implementation uses epoll.
operation select => select operation
They work by inquiry all file descriptors under their monitoring. => They monitor multiple file descriptors to see if I/O is possible on any of them.
can do O(1). => can do the same in O(1) time.
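
(For readers less familiar with these system calls, here is a minimal, Fluid-independent sketch of readiness notification using Python's standard `selectors` module, which is epoll-backed on Linux and kqueue-backed on BSD. It monitors file descriptors rather than channels and is only meant to illustrate the OS mechanism discussed above.)

```python
# Readiness-based I/O multiplexing via the standard library: the kernel tells
# us which registered file descriptors are ready, without an O(N) scan in
# user space. Illustration only; not the Fluid API.
import selectors
import socket

sel = selectors.DefaultSelector()      # epoll on Linux, kqueue on BSD/macOS
r, w = socket.socketpair()             # two connected sockets

sel.register(r, selectors.EVENT_READ)  # watch r for readability
w.send(b"ping")                        # makes r readable

for key, _events in sel.select(timeout=1):
    print(key.fileobj.recv(4))         # b'ping'

sel.close()
r.close()
w.close()
```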
doc/design/csp.md
Outdated
The operation select has been in OS kernels long before Go language. All Unix kernels implement system calls *poll* and *select*. They work by inquiry all file descriptors under their monitoring. This takes O(N) time. Since Linux 2.6, a new system call, *epoll*, can do O(1). In BSD systems, there is a similar system call *kqueue*. Go's Linux implementation uses epoll.

It might be a great idea to implement Fluid's select using epoll too. In this design doc, we start from the O(N) way, so could we focus on Python binding and the syntax.
great idea => good idea
could we => we could
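
(To make the "O(N) way" concrete, here is a rough, Fluid-independent sketch in plain Python: a select that simply scans every channel, with blocking queues standing in for channels. `naive_select` is an illustrative name, not a proposed API.)

```python
# O(N) select: repeatedly scan all channels and fire the first case that is
# ready. This is the simple starting point the doc refers to, before an
# epoll-style implementation.
import queue
import random
import time

def naive_select(cases):
    """cases: a list of (channel, callback) pairs; scan until one is ready."""
    while True:
        for ch, callback in cases:        # O(N) scan over all channels
            try:
                value = ch.get_nowait()   # non-blocking: is this channel ready?
            except queue.Empty:
                continue
            return callback(value)
        time.sleep(0.001)                 # nothing ready yet; poll again

ch1, ch2 = queue.Queue(), queue.Queue()
random.choice([ch1, ch2]).put("hello")
naive_select([(ch1, lambda v: print("ch1 received:", v)),
              (ch2, lambda v: print("ch2 received:", v))])
```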
Fluid supports many data types:

1. Tensor,
The numbering is all 1s.
doc/design/csp.md
Outdated
### Select

## Exmaple Programs
Exmaple => Example
doc/design/csp.md
Outdated
Fluid has two fundamental control-flows: *if-else* and *while*. If we are to implement CSP, we need:

1. a new data type: *channel*,
Do we also need something similar to the concept of a Go-routine?
You are very right! I added goroutine.
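
(As an illustration of the goroutine-plus-channel combination, here is a plain-Python analogy of the data-loader example from the Motivation section, using a thread and a blocking queue rather than the Fluid API.)

```python
# One "goroutine" (a thread here) keeps loading minibatches while the main
# thread computes; the blocking queue plays the role of a buffered channel.
import queue
import threading

ch = queue.Queue(maxsize=2)        # buffered channel of capacity 2

def loader():
    for i in range(5):
        ch.put(f"minibatch-{i}")   # blocks when the buffer is full
    ch.put(None)                   # sentinel: no more data

threading.Thread(target=loader, daemon=True).start()

while True:
    batch = ch.get()               # blocks until a minibatch arrives
    if batch is None:
        break
    print("computing on", batch)
```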
Thanks @wangkuiyi for the initial version of the design doc. After reading this doc, I seem to get the answer to my question about the workload and the use-cases that we wish to address using concurrent programming in fluid. Thanks 👍
doc/design/csp.md
Outdated
The operation select has been in OS kernels long before Go language. All Unix kernels implement system calls *poll* and *select*. They work by inquiry all file descriptors under their monitoring. This takes O(N) time. Since Linux 2.6, a new system call, *epoll*, can do O(1). In BSD systems, there is a similar system call *kqueue*. Go's Linux implementation uses epoll.

It might be a great idea to implement Fluid's select using epoll too. In this design doc, we start from the O(N) way, so could we focus on Python binding and the syntax.
could we => we could
After some thinking, my idea is that we could bring Go directly into PaddlePaddle to make use of CSP, as described below. This may not go into the main branch, but it could be an "experimental" feature.
This enables reusing Go's CSP concurrent programming model so that we don't have to implement it again, and it makes C++ act as the driver of the computation. The current Fluid implementation does not have to change, and the new Go implementation can live under a distinct directory as an "experimental" feature.
doc/design/csp.md
Outdated
1. A thread uses the GPU for computing while the main thread keeps loading the next minibatch, and
1. a thread uploads the local gradients to the parameter server while the main thread keeps computing.

Most DL systems, including TensorFlow, Caffe2, and MxNet, can asynchronously execute operators in a graph. However, Fluid doesn't have the concept graph at all, as the design goal of Fluid is a programming language.
Just a note: the big majority of TensorFlow OPs are synchronous; the TF executor runs non-dependent OPs in parallel on different threads.
Thanks for the note!
In Fluid, we should be able to do the same:

```python
ch = fluid.make_chan(dtype=INT)
```
I think a very important element type that fluid's channel should support is `pair`; for example, we want to send `pair(image, label)` using one channel, rather than using multiple channels.
I agree with @helinwang. I also think we can generalize a pair to an n-tuple.
Yes, great point @helinwang
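
(Illustrating the point above in plain Python, with a blocking queue standing in for a channel rather than the Fluid API: one channel can carry `(image, label)` pairs instead of two parallel channels.)

```python
# One channel carries (image, label) pairs as single messages, so the
# receiver never has to re-pair values taken from two separate channels.
import queue

ch = queue.Queue()

image = [[0.1, 0.2], [0.3, 0.4]]     # stand-in for an image tensor
label = 7                            # stand-in for an integer label

ch.put((image, label))               # send the pair as a single message
img, lbl = ch.get()                  # receive and unpack it on the other side
assert (img, lbl) == (image, label)
```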
I have added some suggestions. Would love your feedback on that.
doc/design/csp.md
Outdated
1. A thread uses the GPU for computing while the main thread keeps loading the next minibatch, and
1. a thread uploads the local gradients to the parameter server while the main thread keeps computing.

Most DL systems, including TensorFlow, Caffe2, and MxNet, can asynchronously execute operators in a graph. However, Fluid doesn't have the concept graph at all, as the design goal of Fluid is a programming language.
concept of graph
doc/design/csp.md
Outdated
### CSP v.s. Actor Model

A well-known implementation of Actor Model is the Erlang programming language. In Actor Model, *processes* could send messages to and receive messages from another process given it ID. We can find the three ingredients, process with ID, send, and recv, in MPI too. Indeed, we can rewrite Erlang programs in Python + MPI with possibly fewer lines of code. Our concern with Actor Model is that it doesn't look reasonable to implement process management in a programming language's runtime library; instead, it seems the OS's responsibility to manage processes and libraries like MPI for send/recv.
I think we should write processes/actors, because in the Actor model, processes are actors.
We should also mention in this doc that a major concern with the Actor model is that we need to define the concept of a mailbox. Hence every receiver should know its sender, which might be difficult in our paradigm.
In addition to that, we want channels that can hold more complex element types, e.g., Tensors of float16:

```python
ch = fluid.make_chan(dtype=Tensor, etype=float16)
```
Do you think we should define a new type class in Python that can represent such a hierarchy? Using `etype` will not be scalable if the composition is long.
I guess that our VarDesc should be upgraded to describe such composite types. I am not a Python expert, but it might be reasonable to have a Python class hierarchy.
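
(For discussion, a hypothetical sketch of such a Python class hierarchy. None of these class names exist in Fluid or VarDesc; they only show how nested element types might be described without flattening everything into `dtype`/`etype` keyword arguments.)

```python
# Hypothetical element-type descriptions for channels; not the actual
# VarDesc or Fluid API.
class ElemType:
    """Base class for describing a channel's element type."""

class Scalar(ElemType):
    def __init__(self, dtype):
        self.dtype = dtype              # e.g. "int64", "float16"

class Tensor(ElemType):
    def __init__(self, dtype, shape=None):
        self.dtype = dtype
        self.shape = shape

class Tuple(ElemType):
    def __init__(self, *elems):
        self.elems = elems              # an arbitrary n-tuple of element types

# A channel of (image tensor, integer label) pairs could then be described as:
pair_of_image_and_label = Tuple(Tensor("float16", shape=[3, 224, 224]),
                                Scalar("int64"))
```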
In Fluid, we should be able to do the same:

```python
ch = fluid.make_chan(dtype=INT)
```
I agree with @helinwang. I also think we can generalize a pair to an n-tuple.
@typhoonzero I had the same idea and discussed it with @wangkuiyi on Friday. We came to an understanding that the prominent reason we do not want to do that is that we would like to attempt our own CSP implementation instead of wrapping Go's runtime.
@abhinavarora I see, thank you!
@typhoonzero When we were discussing whether we could wrap our Go RecordIO implementation into a C++ library, @reyoung reminded us that such a wrapper would need to link Go's runtime library, which is too heavy for a C++ library. I think we are facing a similar choice here. Another reason, as @abhinavarora explained, is that we would implement CSP in C++ by ourselves anyway, because we need to grasp the core ideas we present to our users.
I'm afraid this happens only when calling Go from C++ code. If we call C++ kernels from Go, I think it's okay.
Yep, definitely agree with this!
@typhoonzero When you say "port Go to C++", do you mean something like reusing Go's source code? If that is what you mean, I am sorry to tell you that the Go source code is a mixture of Go, C, and assembly, and the C and assembly parts are in Plan 9 syntax and cannot be built using gcc/gas. Also, Go's implementation assumes that the language runtime supports multi-threading, which implies that we would need to port Go's runtime into C++ as well. It is much more work than we might imagine. :-)
@wangkuiyi "When you say "port Go to C++", do you mean something like reuse the Go's source code?" -- No. I mean implement operators and executors in Go, and operators written in Go can call C++ written kernels which can run on any devices. Then people can directly write a Go concurrent program and then compile to a binary which launches a go-runtime embedded with current kernel implementations. Then the control flow is done by Go and calculations is done by kernels, which is easy to implement. But we still need to care about memory allocation, so I put |
I think channel should not be a server which ... I have some code to describe how to use CSP with ... Sorry, I updated the comment to make it more clear.
LGTM. I saw @typhoonzero has some comments, but I could not quite understand #7706 (comment); maybe we can have a separate issue to discuss it?
Fixes #7771