Update cpp reader doc #9079

JiayiFeng · 2018-03-14T13:10:23Z

Updates cpp reader document to our newest design.


abhinavarora

There are certain grammatical and sentence structure changes needed. Please correct them and reach out to me in case of any questions.

abhinavarora · 2018-03-14T18:51:35Z

doc/design/cpp_data_feeding.md

+};
+```
+
+A file reader binds with a single file and reads one instance of data from the file at a time. Each type of file reader shall implement its own `ReadNextImpl()`, `HasNext()` and `ReInit()`.


You can write this as: "reads one data instance at a time".

abhinavarora · 2018-03-14T18:52:12Z

doc/design/cpp_data_feeding.md

+
+A file reader binds with a single file and reads one instance of data from the file at a time. Each type of file reader shall implement its own `ReadNextImpl()`, `HasNext()` and `ReInit()`.
+
+The `ReadNextImpl()` is invoked by `ReadNext()`. Besides invoking `ReadNextImpl()`, `ReadNext()` is also in charge of checking the output, making sure that each shape of `LoDTensor` in `*out` is consistent with the one in `dims_`.  


in charge of -> responsible for

abhinavarora · 2018-03-14T18:52:45Z

doc/design/cpp_data_feeding.md

+
+### DecoratedReader
+
+A decorated reader takes another reader(both file reader and decorated reader are OK) as its 'underlying reader'. It gets data from its underlying reader, does some process on them(shuffling,  batching or something else), then yields processed data. The output data of a decorated reader can be a single instance or a batch. `ShuffleReader` and `BatchReader` are both decorated readers.


process -> processing

abhinavarora · 2018-03-14T18:53:14Z

doc/design/cpp_data_feeding.md

 };
 ```

-### `FileReader` and `DecoratedReader`
+All the `FileReader` and `DecoratedReader` share exactly the same interfaces as defined in `ReaderBase`. So they can be decorated for more than one time: We can **shuffle** a reader's outputs and then **batch** the shuffle outputs. The interface consistency also allows related ops use readers without knowing what they are exactly.


All -> Both

interfaces -> interface

for more than one time -> multiple times

batch the shuffle outputs -> batch the shuffled outputs

what they are exactly -> their underlying type
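To illustrate why the shared interface allows stacking decorators, here is a minimal sketch, not the actual Paddle implementation. All `Toy*` names are hypothetical, an instance is simplified to a `std::vector<int>` instead of a list of `LoDTensor`s, and the shuffle reader is assumed to buffer the whole underlying pass before shuffling.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <random>
#include <utility>
#include <vector>

using Instance = std::vector<int>;  // hypothetical stand-in for real tensors

// Hypothetical common interface; every reader below implements it.
class ToyReader {
 public:
  virtual ~ToyReader() = default;
  virtual bool HasNext() = 0;
  virtual Instance ReadNext() = 0;  // callers must check HasNext() first
};

// "File" reader: yields one instance at a time from in-memory data.
class ToyFileReader : public ToyReader {
 public:
  explicit ToyFileReader(std::vector<int> data) : data_(std::move(data)) {}
  bool HasNext() override { return pos_ < data_.size(); }
  Instance ReadNext() override { return {data_[pos_++]}; }

 private:
  std::vector<int> data_;
  std::size_t pos_ = 0;
};

// Decorated reader: buffers the underlying reader's output and shuffles it.
class ToyShuffleReader : public ToyReader {
 public:
  ToyShuffleReader(std::unique_ptr<ToyReader> reader, unsigned seed)
      : reader_(std::move(reader)), rng_(seed) {}
  bool HasNext() override {
    Fill();
    return !buffer_.empty();
  }
  Instance ReadNext() override {
    Fill();
    Instance out = std::move(buffer_.back());
    buffer_.pop_back();
    return out;
  }

 private:
  void Fill() {
    if (!buffer_.empty() || !reader_->HasNext()) return;
    while (reader_->HasNext()) buffer_.push_back(reader_->ReadNext());
    std::shuffle(buffer_.begin(), buffer_.end(), rng_);
  }
  std::unique_ptr<ToyReader> reader_;
  std::mt19937 rng_;
  std::vector<Instance> buffer_;
};

// Decorated reader: concatenates up to batch_size instances into one batch.
class ToyBatchReader : public ToyReader {
 public:
  ToyBatchReader(std::unique_ptr<ToyReader> reader, std::size_t batch_size)
      : reader_(std::move(reader)), batch_size_(batch_size) {}
  bool HasNext() override { return reader_->HasNext(); }
  Instance ReadNext() override {
    Instance batch;
    for (std::size_t i = 0; i < batch_size_ && reader_->HasNext(); ++i) {
      Instance one = reader_->ReadNext();
      batch.insert(batch.end(), one.begin(), one.end());
    }
    return batch;
  }

 private:
  std::unique_ptr<ToyReader> reader_;
  std::size_t batch_size_;
};

// Consumer code only sees ToyReader, not the concrete reader type.
std::vector<Instance> Drain(ToyReader* reader) {
  std::vector<Instance> all;
  while (reader->HasNext()) all.push_back(reader->ReadNext());
  return all;
}
```

Because `ToyBatchReader` only depends on the `ToyReader` interface, it can wrap a file reader or another decorated reader in the same way, which is the property the paragraph under review describes.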

abhinavarora · 2018-03-14T19:02:41Z

doc/design/cpp_data_feeding.md


-These two classes are derived from the `ReaderBase` and will further be derived by more specific readers. Thus, in our design, there are two kinds of readers: file readers and decorated readers. A file reader reads from a file of some specific format, and yield only one instance of data at a time. For example, RecordIO reader, jpg reader, .... A decorated reader takes another reader(both file reader and decorated reader are OK) as its 'underlying reader'. It gets data from its underlying reader, does some processing on them(shuffling, or batching), then yields processed data. The output data of a decorated reader can be a single instance or a batch. `ShuffleReader` and `BatchReader` are both decorated readers.
+So `MultipleReader` is introduced. It is also derived from `ReaderBase`. A `MultipleReader` holds several prefetching `FileReaders` and these readers run concurrently. Another pivotal part of a `MultipleReader` is a buffer channel. The channel collects data yield by all prefetching readers and makes subsequent OPs or decorated readers be able to fetch data without concerning about multiple readers scheduling.


It is great to see that you are using the Buffered Channel to implement this. If you run into any problems with the Buffered Channel, please let me know.

Thanks. Channel has been used in the implementation of DoubleBufferReader. It works really well.

abhinavarora · 2018-03-14T19:07:39Z

doc/design/cpp_data_feeding.md

+
+## Program with Readers
+
+A `Program` holds readers as its persistable variables. These variables are created by `CreateReaderOp` or `OpenFilesOp`. Obviously, these ops shall run only once. So they shall be settled in the `startup_program`. `HasNextOp`, `ResetOp` and `ReadOp` are required by training loop, so they shall be in the `main_program`.


We should drop the word Obviously. It does not look good in formal documents.

abhinavarora · 2018-03-14T19:08:33Z

doc/design/cpp_data_feeding.md

+}
+```
+
+Two things are worth mentioning when considering these two programs:


This line can be written as:
Two important considerations for these programs are as follows:

helinwang · 2018-03-14T21:49:44Z

doc/design/cpp_data_feeding.md

-  // Reinitialize the reader and read the file from the beginning.
-  virtual void ReInit() = 0;
+  // Checks whether the next instance exists.
+  virtual bool HasNext() = 0;


Can we have ReadNext return an EOF error code when there is no next instance? This way we can make this important interface simpler.

Of course we can return EOF. We do need to do something when users invoke ReadNext() while there is no more data: either throwing an exception or returning EOF would work.

But I'm afraid HasNext() is still needed. Sometimes users would like to do something (like saving the model) after each training pass. That requires an interface to check whether a pass is completed.

We could also implement the check by invoking ReadNext() and inspecting the return value, but that seems a little harder to use.
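As a rough sketch of the two options discussed here (assumptions: `ToyReader` and `CountCompletedPasses` are hypothetical names, and an instance is simplified to a single `int`), a reader can offer both `HasNext()` and an EOF-style `ReadNext()` that reports the end of a pass through its return value:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the reader interface under discussion.
class ToyReader {
 public:
  explicit ToyReader(std::vector<int> data) : data_(std::move(data)) {}

  // Checks whether the next instance exists.
  bool HasNext() const { return pos_ < data_.size(); }

  // EOF-style variant: returns false and leaves *out untouched when the
  // pass is over, so the end can also be detected from the return value.
  bool ReadNext(int* out) {
    if (!HasNext()) return false;
    *out = data_[pos_++];
    return true;
  }

  // Restarts reading from the beginning.
  void ReInit() { pos_ = 0; }

 private:
  std::vector<int> data_;
  std::size_t pos_ = 0;
};

// Runs `passes` passes over the reader and counts end-of-pass events,
// i.e. the points where a user might save the model.
int CountCompletedPasses(ToyReader* reader, int passes) {
  int completed = 0;
  for (int p = 0; p < passes; ++p) {
    int instance = 0;
    while (reader->HasNext()) {
      reader->ReadNext(&instance);  // training step would consume `instance`
    }
    ++completed;       // end of pass: e.g. save the model here
    reader->ReInit();  // prepare the next pass
  }
  return completed;
}
```

With `HasNext()` the end-of-pass hook sits naturally after the inner loop; with only the EOF return value, the caller has to attempt a read to find out that the pass is over.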

helinwang · 2018-03-14T21:55:07Z

doc/design/cpp_data_feeding.md

+
+```
+while_op {
+    has_next = has_next_op(double_buffer_reader)


This is a lot of work for the user. I am not convinced that we should expose reset_op and has_next_op to the user. Can we do the following instead?

    while_op {
        // -1 means read multiple epochs forever; in this case the while op should control how many steps to run.
        x = read_op(multi_epoch_reader(double_buffer_reader, -1))
        ... (subsequent training ops)
    }

@helinwang You are right! We have discussed this issue with @JiayiFeng and @panyx0718 before. We will have a MultiEpochReader for users just like you described. Users may not need to use reset and has_next operators. We will not implement reset and has_next operators at first.

The concern about reset and has_next operators is that users might need to print some log and save models at the end of an epoch. It might be useful for users to know it is the end of an epoch by has_next_op and manually reset reader by reset_op. However, we will not implement these operators until there are actual cases.

Great idea. The multi_epoch_reader is also in our design. It is definitely a convenient way for users to configure their readers.

But reset_op and has_next_op still need to be provided, because some users would like to check by themselves whether a pass is completed and then do some customized operations, as I said in my reply to your previous comment.

Thanks for the reply @reyoung @JiayiFeng !

> But reset_op and has_next_op still need to be provided, because some users would like to check by themselves whether a pass is completed and then do some customized operations, as I said in my reply to your previous comment.

That's a good point, but I think we will probably rely on VisualDL for visualization. If users need to do some customized operation, they need to write an OP to do it, which adds complexity. In my opinion, has_next_op plus an if op is a lot of code for the user to learn, so maybe we should not support it?

EDIT: We discussed this offline. In general we don't want users to use has_next_op and reset_op, but they will remain available as low-level APIs for users who want detailed control. Also, if has_next_op were replaced by ReadNext returning an EOF error code, the user couldn't easily access that error code.
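The multi-epoch idea could be sketched roughly as below. This is only an illustration under stated assumptions: `ToyFileReader`, `ToyMultiEpochReader` and `ReadSteps` are hypothetical names, an instance is a single `int`, and the epoch count is assumed to be either `-1` (forever) or at least 1.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical single-pass reader over in-memory data.
class ToyFileReader {
 public:
  explicit ToyFileReader(std::vector<int> data) : data_(std::move(data)) {}
  bool HasNext() const { return pos_ < data_.size(); }
  int ReadNext() { return data_[pos_++]; }
  void ReInit() { pos_ = 0; }

 private:
  std::vector<int> data_;
  std::size_t pos_ = 0;
};

// Hypothetical decorator that hides reset/has_next from the training loop:
// when the underlying pass ends, it re-initializes the reader by itself.
class ToyMultiEpochReader {
 public:
  // epochs == -1 means "read forever"; otherwise epochs is assumed >= 1.
  ToyMultiEpochReader(ToyFileReader* reader, int epochs)
      : reader_(reader), epochs_(epochs) {}

  bool HasNext() {
    if (reader_->HasNext()) return true;
    // The current pass is exhausted; stop if it was the last allowed epoch.
    if (epochs_ != -1) {
      if (finished_ + 1 >= epochs_) return false;
      ++finished_;
    }
    reader_->ReInit();
    return reader_->HasNext();
  }

  // Returns false only when all epochs are consumed (or there is no data).
  bool ReadNext(int* out) {
    if (!HasNext()) return false;
    *out = reader_->ReadNext();
    return true;
  }

 private:
  ToyFileReader* reader_;  // not owned
  int epochs_;
  int finished_ = 0;
};

// The "while op" controls how many steps to run, as suggested above.
std::vector<int> ReadSteps(ToyMultiEpochReader* reader, int steps) {
  std::vector<int> seen;
  int value = 0;
  for (int i = 0; i < steps && reader->ReadNext(&value); ++i) {
    seen.push_back(value);
  }
  return seen;
}
```

In this sketch the training loop never calls a reset or has-next operation directly; the step count alone decides when training stops.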

helinwang · 2018-03-14T22:03:24Z

doc/design/cpp_data_feeding.md

+
+A file reader binds with a single file and reads one instance of data from the file at a time. Each type of file reader shall implement its own `ReadNextImpl()`, `HasNext()` and `ReInit()`.
+
+The `ReadNextImpl()` is invoked by `ReadNext()`. Besides invoking `ReadNextImpl()`, `ReadNext()` is also in charge of checking the output, making sure that each shape of `LoDTensor` in `*out` is consistent with the one in `dims_`.  


Why should ReadNext check that the shape of the output is correct? Shouldn't every OP check that its input has the correct shape? Otherwise, if OP_B's input is the output of OP_A (any OP that is not a reader OP), how can OP_B know its input is correct?

If we don't check shapes in readers, the shape of the real tensors at runtime can differ from the shape set at compile time. We should not allow this to happen.
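A minimal sketch of the check being discussed, with `ReadNext()` wrapping `ReadNextImpl()` and comparing each output shape against the declared one. The `Toy*` names are hypothetical, tensors are reduced to their shape, and the comparison is assumed to be an exact match; the real check may treat some dimensions differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

using Dims = std::vector<std::int64_t>;

// Hypothetical stand-in for LoDTensor: only the shape is modeled.
struct ToyTensor {
  Dims dims;
};

class ToyReaderBase {
 public:
  explicit ToyReaderBase(std::vector<Dims> dims) : dims_(std::move(dims)) {}
  virtual ~ToyReaderBase() = default;

  // Invokes ReadNextImpl() and then verifies that every output tensor has
  // the shape declared at construction ("compile") time.
  void ReadNext(std::vector<ToyTensor>* out) {
    ReadNextImpl(out);
    if (out->size() != dims_.size()) {
      throw std::runtime_error("unexpected number of output tensors");
    }
    for (std::size_t i = 0; i < dims_.size(); ++i) {
      if ((*out)[i].dims != dims_[i]) {
        throw std::runtime_error("output shape differs from declared shape");
      }
    }
  }

 protected:
  virtual void ReadNextImpl(std::vector<ToyTensor>* out) = 0;

 private:
  std::vector<Dims> dims_;  // shapes declared for each output
};

// Test reader that always yields one tensor with a fixed shape.
class ToyConstReader : public ToyReaderBase {
 public:
  ToyConstReader(std::vector<Dims> declared, Dims actual)
      : ToyReaderBase(std::move(declared)), actual_(std::move(actual)) {}

 protected:
  void ReadNextImpl(std::vector<ToyTensor>* out) override {
    out->clear();
    out->push_back(ToyTensor{actual_});
  }

 private:
  Dims actual_;
};

// Returns true if one ReadNext() call passes the shape check.
bool ReadSucceeds(ToyReaderBase* reader) {
  std::vector<ToyTensor> out;
  try {
    reader->ReadNext(&out);
  } catch (const std::runtime_error&) {
    return false;
  }
  return true;
}
```

Placing the check in the non-virtual `ReadNext()` means each concrete reader gets it for free, whatever its `ReadNextImpl()` does.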

helinwang · 2018-03-14T22:17:32Z

doc/design/cpp_data_feeding.md

-### `ReaderHolder`
+To the subsequent two decorated readers, the `MultipleReader` is **a single reader**. They don't need to concern about how prefetch readers are scheduled. They only need to invoke `MultipleReader::ReadNext()` to get the next data from the buffer channel. 
+
+### ReaderHolder

 Different readers belong to different class types. This leads to a problem: How can we drop them into `Variable`s and fetch them out by a unified method? For example, if a Variable holds a `BatchReader`, we can not get it by the following code:


The following code seems perfectly reasonable (get an implementation of an interface, then invoke some function of that interface). I'm curious why we can't do this. Is it a C++-specific reason?

Do you mean why we can't write:

var->Get<ReaderBase>("batch_reader");

?

That is because our Variable doesn't support converting an object to its parent type. If we put a BatchReader into a Variable, we can't fetch it with Get<ReaderBase>(), because they are different types and the Variable doesn't know that they have an inheritance relationship.

Yes, sorry I forgot to paste the code.

Thanks. I thought the compiler would know about the inheritance; I probably missed something and will look into the code in more detail.
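One way to see the issue is a type-erased container that remembers only the exact stored type. The sketch below is an assumption about the general mechanism, not the real `Variable` code: `ToyVariable`, `ToyReaderHolder` and the other `Toy*` names are hypothetical, and a type mismatch returns `nullptr` here for simplicity.

```cpp
#include <cassert>
#include <memory>
#include <typeindex>
#include <typeinfo>
#include <utility>

// Hypothetical type-erased variable: it records the exact static type that
// was stored and knows nothing about base classes of that type.
class ToyVariable {
 public:
  template <typename T>
  void Reset(std::shared_ptr<T> obj) {
    type_ = std::type_index(typeid(T));
    holder_ = std::move(obj);
  }

  // Returns nullptr unless T is exactly the stored type.
  template <typename T>
  T* Get() const {
    if (type_ != std::type_index(typeid(T))) return nullptr;
    return static_cast<T*>(holder_.get());
  }

 private:
  std::type_index type_ = std::type_index(typeid(void));
  std::shared_ptr<void> holder_;
};

struct ToyReaderBase {
  virtual ~ToyReaderBase() = default;
  virtual int ReadNext() = 0;
};

struct ToyBatchReader : ToyReaderBase {
  int ReadNext() override { return 42; }
};

// Wrapper in the spirit of ReaderHolder: one concrete type that can hold
// any reader, so every reader variable is fetched the same way.
class ToyReaderHolder {
 public:
  void Reset(std::shared_ptr<ToyReaderBase> reader) {
    reader_ = std::move(reader);
  }
  int ReadNext() { return reader_->ReadNext(); }

 private:
  std::shared_ptr<ToyReaderBase> reader_;
};
```

Storing a `ToyBatchReader` directly makes `Get<ToyReaderBase>()` fail because the recorded type does not match, while storing a `ToyReaderHolder` gives callers a single type to ask for regardless of the reader inside.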

helinwang · 2018-03-14T22:20:15Z

doc/design/cpp_data_feeding.md

@@ -69,10 +113,59 @@ To solve this problem, we introduce `ReaderHolder` as a wrapper. It acts as an e

 To create and invoke readers, some new ops are introduced:

-### `CreateReaderOp`
+### CreateReaderOp


Why do we need CreateReaderOp if there are creator operator for each kind of reader?

CreateReaderOp is just a general name for all reader creator operators. There is no such CreateReaderOp.

Maybe rename it to "### Operators That Create Readers"? Otherwise the reader could think CreateReaderOp is an actual OP.

JiayiFeng · 2018-03-15T02:24:21Z

Thank you so much @abhinavarora ! Your comments are quite helpful!


abhinavarora

LGTM!

JiayiFeng added 5 commits March 14, 2018 10:31

merge

6519f6c

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

6ea4582

… dev_update_reader_doc

update reader doc

1a9f4e5

replace pdf images with png images

c1234c8

Add introduce of reader registry

852dbca

JiayiFeng requested review from reyoung and dzhwinter March 14, 2018 16:04

abhinavarora suggested changes Mar 14, 2018


helinwang reviewed Mar 14, 2018


some grammatical and sentence structure changes

1961d6b

panyx0718 self-requested a review March 15, 2018 04:53

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

f617ffb

… dev_update_reader_doc

abhinavarora approved these changes Mar 16, 2018


JiayiFeng merged commit 60314ee into PaddlePaddle:develop Mar 16, 2018

JiayiFeng deleted the dev_update_reader_doc branch March 16, 2018 06:22

This was referenced Mar 20, 2018

Multi-pass c++ reader #9260

Closed

Update c++ readers doc #9352

Merged

