Update cpp reader doc #9079

Merged 7 commits on Mar 16, 2018

Collaborator

@JiayiFeng commented Mar 14, 2018

Updates the C++ reader document to reflect our newest design.

Contributor

@abhinavarora left a comment

There are certain grammatical and sentence structure changes needed. Please correct them and reach out to me in case of any questions.

};
```

A file reader binds with a single file and reads one instance of data from the file at a time. Each type of file reader shall implement its own `ReadNextImpl()`, `HasNext()` and `ReInit()`.
Contributor

You can write this as: "reads one data instance at a time".

Collaborator Author

Done.


A file reader binds with a single file and reads one instance of data from the file at a time. Each type of file reader shall implement its own `ReadNextImpl()`, `HasNext()` and `ReInit()`.

The `ReadNextImpl()` is invoked by `ReadNext()`. Besides invoking `ReadNextImpl()`, `ReadNext()` is also in charge of checking the output, making sure that each shape of `LoDTensor` in `*out` is consistent with the one in `dims_`.
Contributor

in charge of -> responsible for

Collaborator Author

Done.


### DecoratedReader

A decorated reader takes another reader(both file reader and decorated reader are OK) as its 'underlying reader'. It gets data from its underlying reader, does some process on them(shuffling, batching or something else), then yields processed data. The output data of a decorated reader can be a single instance or a batch. `ShuffleReader` and `BatchReader` are both decorated readers.
Contributor

process -> processing

Collaborator Author

Done.

};
```

### `FileReader` and `DecoratedReader`
All the `FileReader` and `DecoratedReader` share exactly the same interfaces as defined in `ReaderBase`. So they can be decorated for more than one time: We can **shuffle** a reader's outputs and then **batch** the shuffle outputs. The interface consistency also allows related ops use readers without knowing what they are exactly.
Contributor

All -> Both

Contributor

interfaces -> interface

Contributor

for more than one time -> multiple times

Contributor

batch the shuffle outputs -> batch the shuffled outputs

Contributor

what they are exactly -> their underlying type

Collaborator Author

Done.
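
To illustrate the point in the reviewed paragraph that a shared interface lets readers be decorated multiple times, here is a minimal sketch. Everything in it is an assumption made for illustration: the `int` instance type, the `bool ReadNext(std::vector<int>*)` signature, the in-memory `VectorReader` standing in for a file reader, and the helper `ReadAllBatches()`. It is not the actual Paddle code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <random>
#include <utility>
#include <vector>

// Hypothetical stand-in for ReaderBase: one call yields a group of ints.
class ReaderBase {
 public:
  virtual ~ReaderBase() = default;
  // Fills *out with the next item; returns false when data is exhausted.
  virtual bool ReadNext(std::vector<int>* out) = 0;
};

// Stand-in for a file reader: yields one instance at a time.
class VectorReader : public ReaderBase {
 public:
  explicit VectorReader(std::vector<int> data) : data_(std::move(data)) {}
  bool ReadNext(std::vector<int>* out) override {
    if (pos_ >= data_.size()) return false;
    out->assign(1, data_[pos_++]);
    return true;
  }

 private:
  std::vector<int> data_;
  std::size_t pos_ = 0;
};

// Decorated reader: buffers the underlying reader's output and shuffles it.
class ShuffleReader : public ReaderBase {
 public:
  ShuffleReader(std::unique_ptr<ReaderBase> reader, unsigned seed)
      : reader_(std::move(reader)), rng_(seed) {}
  bool ReadNext(std::vector<int>* out) override {
    if (!filled_) Fill();
    if (pos_ >= buffer_.size()) return false;
    *out = buffer_[pos_++];
    return true;
  }

 private:
  void Fill() {
    std::vector<int> item;
    while (reader_->ReadNext(&item)) buffer_.push_back(item);
    std::shuffle(buffer_.begin(), buffer_.end(), rng_);
    filled_ = true;
  }
  std::unique_ptr<ReaderBase> reader_;
  std::mt19937 rng_;
  std::vector<std::vector<int>> buffer_;
  std::size_t pos_ = 0;
  bool filled_ = false;
};

// Decorated reader: concatenates up to batch_size underlying outputs.
class BatchReader : public ReaderBase {
 public:
  BatchReader(std::unique_ptr<ReaderBase> reader, std::size_t batch_size)
      : reader_(std::move(reader)), batch_size_(batch_size) {
    assert(batch_size > 0);
  }
  bool ReadNext(std::vector<int>* out) override {
    out->clear();
    std::vector<int> item;
    while (out->size() < batch_size_ && reader_->ReadNext(&item)) {
      out->insert(out->end(), item.begin(), item.end());
    }
    return !out->empty();
  }

 private:
  std::unique_ptr<ReaderBase> reader_;
  std::size_t batch_size_;
};

// Stacks shuffle and batch over five instances and returns every batch.
std::vector<std::vector<int>> ReadAllBatches() {
  std::vector<int> data = {1, 2, 3, 4, 5};
  std::unique_ptr<ReaderBase> file(new VectorReader(data));
  std::unique_ptr<ReaderBase> shuffled(new ShuffleReader(std::move(file), 42u));
  BatchReader batched(std::move(shuffled), 2);
  std::vector<std::vector<int>> batches;
  std::vector<int> batch;
  while (batched.ReadNext(&batch)) batches.push_back(batch);
  return batches;
}
```

`BatchReader` only sees a `ReaderBase`, so it works the same whether the reader underneath is a file reader or another decorated reader.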


These two classes are derived from the `ReaderBase` and will further be derived by more specific readers. Thus, in our design, there are two kinds of readers: file readers and decorated readers. A file reader reads from a file of some specific format and yields only one instance of data at a time, for example a RecordIO reader or a jpg reader. A decorated reader takes another reader (either a file reader or a decorated reader) as its 'underlying reader'. It gets data from its underlying reader, does some processing on them (shuffling or batching), then yields the processed data. The output data of a decorated reader can be a single instance or a batch. `ShuffleReader` and `BatchReader` are both decorated readers.
So `MultipleReader` is introduced. It is also derived from `ReaderBase`. A `MultipleReader` holds several prefetching `FileReaders`, and these readers run concurrently. Another pivotal part of a `MultipleReader` is a buffer channel. The channel collects the data yielded by all prefetching readers, so that subsequent OPs or decorated readers can fetch data without worrying about how the multiple readers are scheduled.
Contributor

It is great to see that you are using the Buffered Channel to implement this. If you run into any kind of problem with the Buffered Channel, please let me know.

Collaborator Author

Thanks. Channel has been used in the implementation of DoubleBufferReader. It works really well.
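
The `MultipleReader` idea from the reviewed text (several prefetching readers feeding one buffer channel) can be sketched roughly as follows. This is only an illustration under stated assumptions: `BufferedChannel` here is a hand-written bounded queue, not Paddle's Channel, the "files" are in-memory vectors of `int`, and `SumAll()` is a hypothetical helper.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical bounded channel; not Paddle's actual Channel implementation.
class BufferedChannel {
 public:
  explicit BufferedChannel(std::size_t capacity) : capacity_(capacity) {
    assert(capacity > 0);
  }
  // Blocks while the buffer is full.
  void Send(int value) {
    std::unique_lock<std::mutex> lock(mu_);
    not_full_.wait(lock, [this] { return queue_.size() < capacity_; });
    queue_.push(value);
    not_empty_.notify_one();
  }
  // Blocks until data arrives; returns false once closed and drained.
  bool Receive(int* out) {
    std::unique_lock<std::mutex> lock(mu_);
    not_empty_.wait(lock, [this] { return !queue_.empty() || closed_; });
    if (queue_.empty()) return false;
    *out = queue_.front();
    queue_.pop();
    not_full_.notify_one();
    return true;
  }
  void Close() {
    std::lock_guard<std::mutex> lock(mu_);
    closed_ = true;
    not_empty_.notify_all();
  }

 private:
  std::size_t capacity_;
  std::mutex mu_;
  std::condition_variable not_full_;
  std::condition_variable not_empty_;
  std::queue<int> queue_;
  bool closed_ = false;
};

// Sketch of a MultipleReader: one prefetch thread per in-memory "file".
class MultipleReader {
 public:
  MultipleReader(std::vector<std::vector<int>> files, std::size_t capacity)
      : files_(std::move(files)), channel_(capacity) {
    scheduler_ = std::thread([this] {
      std::vector<std::thread> workers;
      for (const auto& file : files_) {
        workers.emplace_back([this, &file] {
          for (int v : file) channel_.Send(v);
        });
      }
      for (auto& w : workers) w.join();
      channel_.Close();  // All prefetchers are done: signal end of data.
    });
  }
  ~MultipleReader() {
    // Drain leftovers so blocked senders can finish, then join.
    int ignored = 0;
    while (channel_.Receive(&ignored)) {
    }
    scheduler_.join();
  }
  // Consumers see a single reader; scheduling stays hidden behind the channel.
  bool ReadNext(int* out) { return channel_.Receive(out); }

 private:
  std::vector<std::vector<int>> files_;
  BufferedChannel channel_;
  std::thread scheduler_;
};

// Reads everything from three "files" and returns the sum of all instances.
int SumAll() {
  std::vector<std::vector<int>> files = {{1, 2, 3}, {10, 20}, {100}};
  MultipleReader reader(files, 2);
  int sum = 0;
  int value = 0;
  while (reader.ReadNext(&value)) sum += value;
  return sum;
}
```

The order in which instances arrive depends on thread scheduling, so the sketch checks only an order-independent result. Building it needs thread support enabled (for example `-pthread` with GCC or Clang).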


## Program with Readers

A `Program` holds readers as its persistable variables. These variables are created by `CreateReaderOp` or `OpenFilesOp`. Obviously, these ops shall run only once. So they shall be settled in the `startup_program`. `HasNextOp`, `ResetOp` and `ReadOp` are required by training loop, so they shall be in the `main_program`.
Contributor

We should drop the word Obviously. It does not look good in formal documents.

Collaborator Author

Done.

}
```

Two things are worth mentioning when considering these two programs:
Contributor

This line can be written as:
Two important considerations for these programs are as follows:

Collaborator Author

Done.

// Reinitialize the reader and read the file from the beginning.
virtual void ReInit() = 0;
// Checks whether the next instance exists.
virtual bool HasNext() = 0;
Contributor

Can we have ReadNext return EOF error code when there is no next? This way we can make this important interface simpler.

Collaborator

Excellent!

Collaborator Author

@JiayiFeng, Mar 15, 2018

Of course we can return EOF. We do need to do something when users try to invoke ReadNext() while there is no next data. It can be throwing an exception or returning an EOF. Either one is OK.

But I'm afraid HasNext() is still needed. Sometimes users would like to do something (like saving the model) after each training pass. That requires an interface to check whether a pass is completed.

Maybe we could also implement the check by trying to invoke ReadNext() and checking the return value, but that seems a little hard.


```
while_op {
has_next = has_next_op(double_buffer_reader)
Contributor

This is a lot of work for the user. I am not convinced that we should expose reset_op and has_next_op to the user. Can we do:

while_op {
    // -1 means read multi epoch forever, in this case while op should control how many steps.
    x = read_op(multi_epoch_reader(double_buffer_reader, -1))
    ... (subsequent training ops)
}

Collaborator

@reyoung, Mar 15, 2018

@helinwang You are right! We have discussed this issue with @JiayiFeng and @panyx0718 before. We will have a MultiEpochReader for users just like you described. Users may not need to use reset and has_next operators. We will not implement reset and has_next operators at first.

The concern about the reset and has_next operators is that users might need to print logs and save models at the end of an epoch. It might be useful for users to detect the end of an epoch with has_next_op and manually reset the reader with reset_op. However, we will not implement these operators until there are actual use cases.

Collaborator Author

Great idea. The multi_epoch_reader is also in our design. It is definitely a convenient way for users to configure their readers.

But reset_op and has_next_op still need to be provided, because some users would like to check for themselves whether a pass is completed and then perform customized operations, as in my reply to your previous comment.

Contributor

@helinwang, Mar 15, 2018

Thanks for the reply @reyoung @JiayiFeng !

> But reset_op and has_next_op still need to be provided. For some users would like to check whether a pass is completed by themselves and then do some customized operations. Just like my reply to your previous comment.

That's a good point, but I think we will probably rely on visual DL for visualization. If the user needs to do some customized operation, they need to write an OP to do it, which adds complexity. In my opinion, has_next_op and the if op are a lot of code for the user to learn; maybe we should not support them?

EDIT: we discussed this offline. In general we don't want the user to use has_next_op and reset_op, but they will be available as low-level APIs for users who want detailed control. Another point: if has_next_op is replaced by ReadNext returning an EOF error code, the user can't easily access that error code.
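
For readers following this thread, here is a rough sketch of what a multi-epoch wrapper with the `-1` convention proposed above could look like. The class names, the `int` instance type, the `bool ReadNext(int*)` signature (returning `false` in place of an EOF code), and the helper `ReadSteps()` are all assumptions for illustration, not the design that was merged.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical reader over an in-memory "file".
class VectorReader {
 public:
  explicit VectorReader(std::vector<int> data) : data_(std::move(data)) {}
  bool HasNext() const { return pos_ < data_.size(); }
  void ReInit() { pos_ = 0; }
  // Returns false instead of reading past the end (the "EOF" idea).
  bool ReadNext(int* out) {
    if (!HasNext()) return false;
    *out = data_[pos_++];
    return true;
  }

 private:
  std::vector<int> data_;
  std::size_t pos_ = 0;
};

// Hypothetical multi-epoch decorator. epochs < 0 means "repeat forever".
class MultiEpochReader {
 public:
  MultiEpochReader(VectorReader* reader, int epochs)
      : reader_(reader), remaining_(epochs) {
    assert(reader != nullptr);
  }
  bool ReadNext(int* out) {
    if (remaining_ == 0) return false;
    if (reader_->ReadNext(out)) return true;
    // The underlying reader is exhausted: one epoch is complete.
    if (remaining_ > 0 && --remaining_ == 0) return false;
    reader_->ReInit();
    return reader_->ReadNext(out);  // Still false if the file is empty.
  }

 private:
  VectorReader* reader_;
  int remaining_;
};

// Reads at most max_steps instances; the caller's loop bounds the
// "forever" case, as suggested for while_op in the comment above.
std::vector<int> ReadSteps(int epochs, int max_steps) {
  VectorReader file(std::vector<int>{1, 2, 3});
  MultiEpochReader reader(&file, epochs);
  std::vector<int> seen;
  int value = 0;
  for (int step = 0; step < max_steps && reader.ReadNext(&value); ++step) {
    seen.push_back(value);
  }
  return seen;
}
```

With such a wrapper the training loop never calls `HasNext()` or `ReInit()` itself. The trade-off raised in the thread remains: the loop also no longer sees where an epoch ends.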


A file reader binds with a single file and reads one instance of data from the file at a time. Each type of file reader shall implement its own `ReadNextImpl()`, `HasNext()` and `ReInit()`.

The `ReadNextImpl()` is invoked by `ReadNext()`. Besides invoking `ReadNextImpl()`, `ReadNext()` is also in charge of checking the output, making sure that each shape of `LoDTensor` in `*out` is consistent with the one in `dims_`.
Contributor

Why should ReadNext check that the shape of the output is correct? Shouldn't every OP check whether its input has the correct shape? Otherwise, if OP_B's input is the output of OP_A (any OP that is not a reader OP), how can OP_B know its input is correct?

Collaborator Author

If we don't check shapes in readers, the shapes of the actual tensors at runtime can differ from the shapes set at compile time. We should not allow this to happen.
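
A minimal sketch of the split described in the reviewed text, where a non-virtual `ReadNext()` calls the reader-specific `ReadNextImpl()` and then validates shapes against `dims_`. The `Tensor` struct, the exact-match comparison, the exception type, and the `FixedShapeReader` and `ShapeAccepted()` helpers are assumptions for illustration; the real `LoDTensor` check may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

using Shape = std::vector<std::int64_t>;

// Hypothetical stand-in for LoDTensor: only the shape matters here.
struct Tensor {
  Shape shape;
};

class ReaderBase {
 public:
  explicit ReaderBase(std::vector<Shape> dims) : dims_(std::move(dims)) {}
  virtual ~ReaderBase() = default;

  // Non-virtual wrapper: delegates to the subclass, then validates shapes.
  void ReadNext(std::vector<Tensor>* out) {
    ReadNextImpl(out);
    if (out->size() != dims_.size()) {
      throw std::runtime_error("unexpected number of tensors");
    }
    for (std::size_t i = 0; i < dims_.size(); ++i) {
      if ((*out)[i].shape != dims_[i]) {
        throw std::runtime_error("tensor shape differs from declared shape");
      }
    }
  }

 protected:
  virtual void ReadNextImpl(std::vector<Tensor>* out) = 0;

 private:
  std::vector<Shape> dims_;  // Shapes declared when the reader is created.
};

// Hypothetical reader that always yields one tensor with a fixed shape.
class FixedShapeReader : public ReaderBase {
 public:
  FixedShapeReader(std::vector<Shape> declared, Shape produced)
      : ReaderBase(std::move(declared)), produced_(std::move(produced)) {}

 protected:
  void ReadNextImpl(std::vector<Tensor>* out) override {
    out->assign(1, Tensor{produced_});
  }

 private:
  Shape produced_;
};

// Returns true if ReadNext accepts a tensor of the produced shape.
bool ShapeAccepted(const Shape& declared, const Shape& produced) {
  std::vector<Shape> dims(1, declared);
  FixedShapeReader reader(dims, produced);
  std::vector<Tensor> out;
  try {
    reader.ReadNext(&out);
  } catch (const std::runtime_error&) {
    return false;
  }
  return true;
}
```

Keeping the check in the base class means every concrete reader gets it without repeating the code in each `ReadNextImpl()`.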

### `ReaderHolder`
To the subsequent two decorated readers, the `MultipleReader` is **a single reader**. They don't need to be concerned with how the prefetch readers are scheduled. They only need to invoke `MultipleReader::ReadNext()` to get the next data from the buffer channel.

### ReaderHolder

Different readers belong to different class types. This leads to a problem: How can we drop them into `Variable`s and fetch them out by a unified method? For example, if a Variable holds a `BatchReader`, we cannot get it by the following code:
Contributor

The following code seems perfectly reasonable (get an implementation of an interface, and then invoke some function of that interface). I'm curious why we can't do this. Is there a C++-specific reason?

Collaborator Author

@JiayiFeng, Mar 15, 2018

Do you mean why we can't write:

var->Get<ReaderBase>("batch_reader");

?

That is because our Variable doesn't support converting an object to its parent type. If we drop a BatchReader into a Variable, we can't fetch it with Get<ReaderBase>(), because they are different types and the Variable doesn't know that they have an inheritance relationship.

Contributor

@helinwang, Mar 15, 2018

Yes, sorry I forgot to paste the code.

Thanks. I thought the compiler would know about the inheritance. I probably missed something and will look into the code in more detail.
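
The limitation discussed in this thread can be illustrated with a simplified, hypothetical `Variable` that records only the exact type it was given. This is an assumption about the mechanism, not Paddle's real `Variable`; the `Reset`/`Get` signatures, the `Name()` method, and the two helper functions are invented for the sketch. It also shows why a single wrapper type such as `ReaderHolder` sidesteps the problem.

```cpp
#include <memory>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <utility>

// Hypothetical, simplified Variable: remembers the exact stored type only.
class Variable {
 public:
  template <typename T>
  void Reset(std::shared_ptr<T> value) {
    type_ = std::type_index(typeid(T));
    data_ = std::move(value);
  }
  // Returns nullptr unless T is exactly the stored type. The type-erased
  // storage keeps no record of base classes, so a base type does not match.
  template <typename T>
  T* Get() const {
    if (type_ != std::type_index(typeid(T))) return nullptr;
    return static_cast<T*>(data_.get());
  }

 private:
  std::type_index type_{typeid(void)};
  std::shared_ptr<void> data_;
};

class ReaderBase {
 public:
  virtual ~ReaderBase() = default;
  virtual std::string Name() const = 0;
};

class BatchReader : public ReaderBase {
 public:
  std::string Name() const override { return "BatchReader"; }
};

// Wrapper with one concrete type that can hold any kind of reader.
class ReaderHolder {
 public:
  explicit ReaderHolder(std::unique_ptr<ReaderBase> reader)
      : reader_(std::move(reader)) {}
  std::string Name() const { return reader_->Name(); }

 private:
  std::unique_ptr<ReaderBase> reader_;
};

// Storing a BatchReader directly: lookup by the base type finds nothing.
bool DirectLookupFails() {
  Variable var;
  var.Reset(std::make_shared<BatchReader>());
  return var.Get<ReaderBase>() == nullptr && var.Get<BatchReader>() != nullptr;
}

// Storing a ReaderHolder: every caller asks for the same type.
std::string LookupThroughHolder() {
  Variable var;
  var.Reset(std::make_shared<ReaderHolder>(
      std::unique_ptr<ReaderBase>(new BatchReader())));
  ReaderHolder* holder = var.Get<ReaderHolder>();
  return holder != nullptr ? holder->Name() : "";
}
```

The compiler does know the inheritance relationship, but once the object sits behind type-erased storage, only the recorded type id is available at lookup time.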

@@ -69,10 +113,59 @@ To solve this problem, we introduce `ReaderHolder` as a wrapper. It acts as an e

To create and invoke readers, some new ops are introduced:

### `CreateReaderOp`
### CreateReaderOp
Contributor

Why do we need CreateReaderOp if there is a creator operator for each kind of reader?

Collaborator Author

CreateReaderOp is just a general name for all reader creator operators. There is no actual operator named CreateReaderOp.

Contributor

@helinwang, Mar 15, 2018

Maybe rename it to "### Operators That Create Readers"? Otherwise the reader could think CreateReaderOp is actually an OP.

Collaborator Author

@JiayiFeng

Thank you so much @abhinavarora ! Your comments are quite helpful!

@panyx0718 self-requested a review March 15, 2018 04:53
Contributor

@abhinavarora left a comment

LGTM!

@JiayiFeng merged commit 60314ee into PaddlePaddle:develop Mar 16, 2018
@JiayiFeng deleted the dev_update_reader_doc branch March 16, 2018 06:22
This was referenced Mar 20, 2018