
Add csp.md #7706

Merged
merged 2 commits into PaddlePaddle:develop
Jan 24, 2018
Conversation

wangkuiyi
Collaborator

@wangkuiyi wangkuiyi commented Jan 20, 2018

Fixes #7771


@kavyasrinet kavyasrinet left a comment


Thank you for the first version of the design doc. I had a few questions that I have posted in the review comments, along with a few suggested fixes.

@@ -0,0 +1,96 @@
# Design Doc: CSP in PaddlePaddle Fluid

## Motivations


Motivations => Motivation


## Motivations

Concurrent programming is important for deep learning. Example applications include


Example applications include => Few example applications are :


Concurrent programming is important for deep learning. Example applications include

1. A thread uses the GPU for computing while the main thread keeps loading the next minibatch, and


Maybe re-write to:

  1. The main thread keeps reading the next mini-batch while another thread uses the GPU for computing.
  2. The main thread performs the computation while another thread uploads the local gradients from each trainer to the parameter server.

1. A thread uses the GPU for computing while the main thread keeps loading the next minibatch, and
1. a thread uploads the local gradients to the parameter server while the main thread keeps computing.

Most DL systems, including TensorFlow, Caffe2, and MxNet, can asynchronously execute operators in a graph. However, Fluid doesn't have the concept graph at all, as the design goal of Fluid is a programming language.


concept => concept of a
is a => is that of a

Collaborator Author


Thanks! Memorized the expressions!

| message passing | MPI |
| bulk synchronous parallel (BSP) | Pregel distributed programming framework |

Because Fluid was designed to be a programming language, we would like to implement CSP.


Because => Since
implement CSP => implement CSP in Fluid.


The type *channel* is conceptually the blocking queue. In Go, its implemented is a [blocking circular queue](https://github.com/golang/go/blob/68ce117cf17b8debf5754bfd476345779b5b6616/src/runtime/chan.go#L31-L50), which supports send and recv. The challenge lies more in select.

The operation select has been in OS kernels long before Go language. All Unix kernels implement system calls *poll* and *select*. They work by inquiry all file descriptors under their monitoring. This takes O(N) time. Since Linux 2.6, a new system call, *epoll*, can do O(1). In BSD systems, there is a similar system call *kqueue*. Go's Linux implementation uses epoll.


operation select => select operation
They work by inquiry all file descriptors under their monitoring. => They monitor multiple file descriptors to see if I/O is possible on any of them.
can do O(1). => can do the same in O(1) time.
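The quoted design text describes a channel as conceptually a blocking queue, implemented in Go as a blocking circular queue. As a rough illustration only — the `Channel` class and its `send`/`recv` methods below are hypothetical, not Fluid's or Go's actual implementation — a bounded blocking channel can be modeled in Python:

```python
# A hypothetical sketch of a CSP channel as a bounded blocking queue.
# Not Fluid's or Go's real implementation; names are illustrative.
import queue
import threading

class Channel:
    def __init__(self, capacity=1):
        # A buffered channel of the given capacity. (A truly unbuffered,
        # rendezvous-style channel would need extra handshaking.)
        self._q = queue.Queue(maxsize=capacity)

    def send(self, item):
        self._q.put(item)     # blocks while the buffer is full

    def recv(self):
        return self._q.get()  # blocks while the buffer is empty

def producer(ch, n):
    for i in range(n):
        ch.send(i)

ch = Channel(capacity=2)
t = threading.Thread(target=producer, args=(ch, 5))
t.start()
received = [ch.recv() for _ in range(5)]
t.join()
print(received)  # [0, 1, 2, 3, 4]
```

The send/recv pair gives the backpressure the design relies on: the producer stalls once the buffer of 2 fills, until the consumer drains it.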


The operation select has been in OS kernels long before Go language. All Unix kernels implement system calls *poll* and *select*. They work by inquiry all file descriptors under their monitoring. This takes O(N) time. Since Linux 2.6, a new system call, *epoll*, can do O(1). In BSD systems, there is a similar system call *kqueue*. Go's Linux implementation uses epoll.

It might be a great idea to implement Fluid's select using epoll too. In this design doc, we start from the O(N) way, so could we focus on Python binding and the syntax.


great idea => good idea

Contributor


could we => we could
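The O(N) select strategy discussed in the quoted passage — inquire every channel under monitoring, rather than epoll-style O(1) readiness — can be sketched as a polling loop. This is an illustrative Python model using `queue.Queue`; the `select` function below is hypothetical, not Fluid's actual operator:

```python
# A hypothetical O(N) select: poll every case's channel in turn until one
# is ready, analogous to poll()/select() scanning N file descriptors.
import queue
import time

def select(cases, timeout=5.0):
    """cases: list of (channel, callback) pairs. Scans all N channels per
    iteration (the O(N) cost) and fires the first ready callback."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        for ch, callback in cases:
            try:
                item = ch.get_nowait()  # non-blocking readiness check
            except queue.Empty:
                continue
            return callback(item)
        time.sleep(0.001)  # avoid a hot spin between scans
    raise TimeoutError("no select case became ready")

a, b = queue.Queue(), queue.Queue()
b.put("hello")
result = select([(a, lambda x: ("a", x)), (b, lambda x: ("b", x))])
print(result)  # ('b', 'hello')
```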


Fluid supports many data types:

1. Tensor,


The numbering is all 1s.


### Select

## Exmaple Programs


Exmaple => Example


Fluid has two fundamental control-flows: *if-else* and *while*. If we are to implement CSP, we need:

1. a new data type: *channel*,


Do we also need something similar to the concept of a Go-routine ?

Collaborator Author


You are very right! I added goroutine.

Contributor

@sidgoyal78 sidgoyal78 left a comment


Thanks @wangkuiyi for the initial version of the design doc. After reading it, I got the answer to my question about the workloads and use-cases that we wish to address using concurrent programming in Fluid. Thanks 👍


The operation select has been in OS kernels long before Go language. All Unix kernels implement system calls *poll* and *select*. They work by inquiry all file descriptors under their monitoring. This takes O(N) time. Since Linux 2.6, a new system call, *epoll*, can do O(1). In BSD systems, there is a similar system call *kqueue*. Go's Linux implementation uses epoll.

It might be a great idea to implement Fluid's select using epoll too. In this design doc, we start from the O(N) way, so could we focus on Python binding and the syntax.
Contributor


could we => we could

@typhoonzero
Contributor

After some thinking, I came up with the idea that we could port Go directly to PaddlePaddle to make use of CSP, as below. This may not go into the main branch, but it could be an "experimental" thing.

  • Go side:
    • Program definitions, executors
    • Control operators, can use the channel, select etc.
    • Calculation operators, reference kernel calls
    • Wrapper types (Variable...)
  • C++ side:
    • kernels
    • memory management (CPU and GPU)
    • data types (Variable, Tensor...)

This enables reusing Go's CSP concurrent programming model, so we don't have to implement it again.

This makes C++ act as the driver of the computing devices and Go act as the user API.

The current Fluid implementation does not have to change, and the new Go implementation can live under a distinct directory as an "experimental" feature. TensorFlow has https://github.com/tensorflow/tensorflow/tree/master/tensorflow/go, but that is simply a "graph builder"; we can do more.

1. A thread uses the GPU for computing while the main thread keeps loading the next minibatch, and
1. a thread uploads the local gradients to the parameter server while the main thread keeps computing.

Most DL systems, including TensorFlow, Caffe2, and MxNet, can asynchronously execute operators in a graph. However, Fluid doesn't have the concept graph at all, as the design goal of Fluid is a programming language.
Contributor


Just a note: the vast majority of TensorFlow ops are synchronous; the TF executor runs non-dependent ops in parallel on different threads.

Collaborator Author


Thanks for the note!

In Fluid, we should be able to do the same:

```python
ch = fluid.make_chan(dtype=INT)
```
Contributor

@helinwang helinwang Jan 22, 2018


I think a very important element type that Fluid's channel should support is a pair; for example, we want to send pair(image, label) using one channel, rather than using multiple channels.

Contributor


I agree with @helinwang. I also think we can generalize a pair to an n-tuple.

Collaborator Author


Yes, great point @helinwang
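A sketch of the suggestion above — sending an (image, label) pair through a single channel. Here `queue.Queue` stands in for a hypothetical Fluid channel with a pair element type, since the actual `make_chan` signature for composite elements was still under discussion:

```python
# Illustrative only: a plain queue.Queue models a channel whose element
# type is a pair, so image and label travel together atomically.
import queue

ch = queue.Queue()  # stands in for a hypothetical pair-typed fluid channel

# Producer side: one send per (image, label) pair.
ch.put(([0.1, 0.2, 0.3], 7))

# Consumer side: unpacking keeps the pair intact -- no risk of image and
# label arriving out of sync on two separate channels.
image, label = ch.get()
print(image, label)  # [0.1, 0.2, 0.3] 7
```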

Contributor

@abhinavarora abhinavarora left a comment


I have added some suggestions. Would love your feedback on that.

1. A thread uses the GPU for computing while the main thread keeps loading the next minibatch, and
1. a thread uploads the local gradients to the parameter server while the main thread keeps computing.

Most DL systems, including TensorFlow, Caffe2, and MxNet, can asynchronously execute operators in a graph. However, Fluid doesn't have the concept graph at all, as the design goal of Fluid is a programming language.
Contributor


concept of graph


### CSP v.s. Actor Model

A well-known implementation of Actor Model is the Erlang programming language. In Actor Model, *processes* could send messages to and receive messages from another process given it ID. We can find the three ingredients, process with ID, send, and recv, in MPI too. Indeed, we can rewrite Erlang programs in Python + MPI with possibly fewer lines of code. Our concern with Actor Model is that it doesn't look reasonable to implement process management in a programming language's runtime library; instead, it seems the OS's responsibility to manage processes and libraries like MPI for send/recv.
Contributor


I think we should write processes/actor, because in actor model processes are actors.

Contributor


We should also mention in this doc that a major concern with the Actor model is that we need to define the concept of a mailbox. Hence every receiver should know its sender, which might be difficult in our paradigm.

In addition to that, we want channels that can hold more complex element types, e.g., Tensors of float16:

```python
ch = fluid.make_chan(dtype=Tensor, etype=float16)
```
Contributor


Do you think we should define a new type class in Python that can represent such a hierarchy? Using etype will not be scalable if the composition is long.

Collaborator Author


I guess that our VarDesc should be upgraded to describe such composite types. I am not a Python expert, but it might be reasonable to have a Python class hierarchy.
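One way such a Python class hierarchy might look — purely a hypothetical sketch, none of these class names exist in Fluid — is to make each type a composable object, so nested element types need no flat `etype` argument:

```python
# Hypothetical composable type descriptors; nesting replaces flat
# dtype=/etype= keyword arguments for channels of composite elements.
class Type:
    pass

class Scalar(Type):
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return self.name

class Tensor(Type):
    def __init__(self, elem):
        self.elem = elem  # element type, itself a Type
    def __repr__(self):
        return "Tensor(%r)" % (self.elem,)

class Tuple(Type):
    def __init__(self, *fields):
        self.fields = fields
    def __repr__(self):
        return "Tuple(" + ", ".join(repr(f) for f in self.fields) + ")"

float16 = Scalar("float16")
int64 = Scalar("int64")

# A channel of (image, label) pairs composes naturally, however deep:
pair = Tuple(Tensor(float16), int64)
print(pair)  # Tuple(Tensor(float16), int64)
```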

In Fluid, we should be able to do the same:

```python
ch = fluid.make_chan(dtype=INT)
```
Contributor


I agree with @helinwang. I also think we can generalize a pair to an n-tuple.

@abhinavarora
Contributor

@typhoonzero I had the same idea, and I discussed it with @wangkuiyi on Friday. We came to the understanding that the prominent reason we do not want to do that is that we would like to attempt our own CSP implementation instead of wrapping Go's runtime.

@typhoonzero
Contributor

@abhinavarora I see, Thank you!

@wangkuiyi
Collaborator Author

wangkuiyi commented Jan 23, 2018

@typhoonzero When we were discussing whether we could wrap our Go RecordIO implementation into a C++ library, @reyoung pointed out that such a wrapper would need to link Go's runtime library, which is too heavy for a C++ library. I think we are facing a similar choice here.

Another reason, as @abhinavarora explained: I think we would implement CSP in C++ ourselves anyway, because we need to grasp the core ideas we present to our users.

@typhoonzero
Contributor

@wangkuiyi

When we were discussing whether we could wrap our Go RecordIO implementation into a C++ library, @reyoung pointed out that such a wrapper would need to link Go's runtime library, which is too heavy for a C++ library. I think we are facing a similar choice here.

I'm afraid this happens only when calling Go from C++ code. If we call C++ kernels from Go, I think it's okay.

Another reason, as @abhinavarora explained: I think we would implement CSP in C++ ourselves anyway, because we need to grasp the core ideas we present to our users.

Yep, definitely agree with this!

@wangkuiyi
Collaborator Author

wangkuiyi commented Jan 23, 2018

@typhoonzero When you say "port Go to C++", do you mean something like reusing Go's source code? If that is what you mean, I am sorry to tell you that the Go source code is a mixture of Go, C, and assembly, and the C and assembly parts are in Plan 9 syntax and cannot be built using gcc/gas. Also, Go's implementation assumes that the language runtime supports multi-threading, which implies that we would need to port Go's runtime into C++ as well. That is much more work than we might imagine. :-)

@typhoonzero
Contributor

@wangkuiyi "When you say "port Go to C++", do you mean something like reuse the Go's source code?" -- No. I mean implementing operators and executors in Go; operators written in Go can call kernels written in C++, which can run on any device. Then people can directly write a concurrent Go program and compile it into a binary that launches a Go runtime embedded with the current kernel implementations. The control flow is done by Go and the calculations are done by the kernels, which is easy to implement. But we still need to take care of memory allocation, so I put Variable allocation on the C++ side; all other memory can be handled by Go's GC.

@typhoonzero
Contributor

typhoonzero commented Jan 23, 2018

I think a channel should not be a server that send_op can send messages to. Operators can deal with channels in the current program, but not on a remote server. We may, however, put messages into a channel before sending them. Communication between nodes should be done by RPC ops like listen_and_serv and recv.

I have some code to describe how to use CSP with send/recv #6508


Sorry, I updated the comment to make it clearer.
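The pattern described in this comment — operators touch only a local channel, and messages are buffered in the channel before RPC ops ship them across nodes — can be sketched as follows. `rpc_send` is a hypothetical stand-in for the real RPC ops such as listen_and_serv and recv:

```python
# Illustrative sketch: ops enqueue gradients into a local channel; a
# sender thread drains the channel and performs the (simulated) RPC.
import queue
import threading

sent = []  # stands in for the remote parameter server reached via RPC

def rpc_send(msg):
    sent.append(msg)  # hypothetical RPC call

def sender_loop(ch):
    while True:
        msg = ch.get()
        if msg is None:   # sentinel: shut the sender down
            break
        rpc_send(msg)

ch = queue.Queue()  # the channel lives inside the current program
t = threading.Thread(target=sender_loop, args=(ch,))
t.start()

for grad in ["grad0", "grad1"]:
    ch.put(grad)    # ops only touch the channel, never the network
ch.put(None)
t.join()
print(sent)  # ['grad0', 'grad1']
```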

Contributor

@helinwang helinwang left a comment


LGTM. I saw @typhoonzero has some comments, but I could not quite understand #7706 (comment); maybe we can have a separate issue discussing it?

@abhinavarora abhinavarora merged commit 7ccbc70 into PaddlePaddle:develop Jan 24, 2018
@wangkuiyi wangkuiyi deleted the csp branch January 26, 2018 18:32