-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update cpp reader doc #9079
Update cpp reader doc #9079
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are certain grammatical and sentence structure changes needed. Please correct them and reach out to me in case of any questions.
doc/design/cpp_data_feeding.md
Outdated
}; | ||
``` | ||
|
||
A file reader binds with a single file and reads one instance of data from the file at a time. Each type of file reader shall implement its own `ReadNextImpl()`, `HasNext()` and `ReInit()`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can write this as : reads one data instance at a time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
doc/design/cpp_data_feeding.md
Outdated
|
||
A file reader binds with a single file and reads one instance of data from the file at a time. Each type of file reader shall implement its own `ReadNextImpl()`, `HasNext()` and `ReInit()`. | ||
|
||
The `ReadNextImpl()` is invoked by `ReadNext()`. Besides invoking `ReadNextImpl()`, `ReadNext()` is also in charge of checking the output, making sure that each shape of `LoDTensor` in `*out` is consistent with the one in `dims_`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in charge of -> responsible for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
doc/design/cpp_data_feeding.md
Outdated
|
||
### DecoratedReader | ||
|
||
A decorated reader takes another reader(both file reader and decorated reader are OK) as its 'underlying reader'. It gets data from its underlying reader, does some process on them(shuffling, batching or something else), then yields processed data. The output data of a decorated reader can be a single instance or a batch. `ShuffleReader` and `BatchReader` are both decorated readers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
process -> processing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
doc/design/cpp_data_feeding.md
Outdated
}; | ||
``` | ||
|
||
### `FileReader` and `DecoratedReader` | ||
All the `FileReader` and `DecoratedReader` share exactly the same interfaces as defined in `ReaderBase`. So they can be decorated for more than one time: We can **shuffle** a reader's outputs and then **batch** the shuffle outputs. The interface consistency also allows related ops use readers without knowing what they are exactly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All -> Both
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interfaces -> interface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for more than one time -> multiple times
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
batch the shuffle outputs -> batch the shuffled outputs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what they are exactly -> their underlying type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
doc/design/cpp_data_feeding.md
Outdated
|
||
These two classes are derived from the `ReaderBase` and will further be derived by more specific readers. Thus, in our design, there are two kinds of readers: file readers and decorated readers. A file reader reads from a file of some specific format, and yield only one instance of data at a time. For example, RecordIO reader, jpg reader, .... A decorated reader takes another reader(both file reader and decorated reader are OK) as its 'underlying reader'. It gets data from its underlying reader, does some processing on them(shuffling, or batching), then yields processed data. The output data of a decorated reader can be a single instance or a batch. `ShuffleReader` and `BatchReader` are both decorated readers. | ||
So `MultipleReader` is introduced. It is also derived from `ReaderBase`. A `MultipleReader` holds several prefetching `FileReaders` and these readers run concurrently. Another pivotal part of a `MultipleReader` is a buffer channel. The channel collects data yield by all prefetching readers and makes subsequent OPs or decorated readers be able to fetch data without concerning about multiple readers scheduling. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is great to see that you are using the Buffered Channel to implement this. If you run into any kind of the problem with the Buffered Channel, please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Channel
has been used in the implementation of DoubleBufferReader
. It works really well.
doc/design/cpp_data_feeding.md
Outdated
|
||
## Program with Readers | ||
|
||
A `Program` holds readers as its persistable variables. These variables are created by `CreateReaderOp` or `OpenFilesOp`. Obviously, these ops shall run only once. So they shall be settled in the `startup_program`. `HasNextOp`, `ResetOp` and `ReadOp` are required by training loop, so they shall be in the `main_program`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should drop the word Obviously. It does not look good in formal documents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
doc/design/cpp_data_feeding.md
Outdated
} | ||
``` | ||
|
||
Two things are worth mentioning when considering these two programs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line can be written as:
Two important considerations for these programs are as follows:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
doc/design/cpp_data_feeding.md
Outdated
// Reinitialize the reader and read the file from the beginning. | ||
virtual void ReInit() = 0; | ||
// Checks whether the next instance exists. | ||
virtual bool HasNext() = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have ReadNext
return EOF
error code when there is no next? This way we can make this important interface simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course we can return EOF
. We do need to do something when users try to invoke ReadNext()
while there is no next data. It can be throwing an exception or returning an EOF
. Both one is OK.
But I'm afraid the HasNext()
is still needed. Sometimes users would like to do something(like saving model) after each pass training. It requires an interface to check whether a pass is completed.
Maybe we can also implement the checking by trying invoking ReadNext()
and check the return value. But it seems a little hard.
doc/design/cpp_data_feeding.md
Outdated
|
||
``` | ||
while_op { | ||
has_next = has_next_op(double_buffer_reader) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lot of work for the user, I am not convinced that we should expose reset_op
, has_next_op
to the user. Can we do:
while_op {
// -1 means read multi epoch forever, in this case while op should control how many steps.
x = read_op(multi_epoch_reader(double_buffer_reader, -1))
... (subsequent training ops)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@helinwang You are right! We have discussed this issue with @JiayiFeng and @panyx0718 before. We will have a MultiEpochReader
for users just like you described. Users may not need to use reset
and has_next
operators. We will not implement reset
and has_next
operators at first.
The concern about reset
and has_next
operators is that users might need to print some log and save models at the end of an epoch. It might be useful for users to know it is the end of an epoch by has_next_op
and manually reset reader by reset_op
. However, we will not implement these operators until there are actual cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea. The multi_epoch_reader
is also in our design. It is definitely a convenient way for users to config their readers.
But reset_op
and has_next_op
still need to be provided. For some users would like to check whether a pass is completed by themselves and then do some customized operations. Just like my reply to your previous comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reply @reyoung @JiayiFeng !
But reset_op and has_next_op still need to be provided. For some users would like to check whether a pass is completed by themselves and then do some customized operations. Just like my reply to your previous comment.
That's a good point, but I think we will probably rely on visual DL for visualization. If the user need to do some customize operation, he need to write an OP to do it, which adds the complexity. I my opinion has_next_op
and if op
is a lot of code for the user to learn, maybe we should not support it?
EDIT: we discussed offline, in general we don't want the user to use has_next_op
and reset_op
, but they will be low level APIs for the user to use if he want detailed control. Another thing is if has_next_op
is replaced by ReadNext
returning EOF error code, the user can't easily access the error code.
doc/design/cpp_data_feeding.md
Outdated
|
||
A file reader binds with a single file and reads one instance of data from the file at a time. Each type of file reader shall implement its own `ReadNextImpl()`, `HasNext()` and `ReInit()`. | ||
|
||
The `ReadNextImpl()` is invoked by `ReadNext()`. Besides invoking `ReadNextImpl()`, `ReadNext()` is also in charge of checking the output, making sure that each shape of `LoDTensor` in `*out` is consistent with the one in `dims_`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why ReadNext
should check the shape of the output is correct, shouldn't every OP check if it's input has the correct shape? Otherwise if OP_B's input is the output of OP_A (any OP that is not reader OP), how can OP_B know it's input is correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't check shapes in readers, the shape of real tensors at runtime can be different with the setting shape at compile time. We shall not allow this happen.
doc/design/cpp_data_feeding.md
Outdated
### `ReaderHolder` | ||
To the subsequent two decorated readers, the `MultipleReader` is **a single reader**. They don't need to concern about how prefetch readers are scheduled. They only need to invoke `MultipleReader::ReadNext()` to get the next data from the buffer channel. | ||
|
||
### ReaderHolder | ||
|
||
Different readers belong to different class types. This leads to a problem: How can we drop them into `Variable`s and fetch them out by a unified method? For example, if a Variable holds a `BatchReader`, we can not get it by the following code: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following code seems perfectly reasonable (get a implementation of an interface, and then invoke some function of that interface), curious why we can't do this, is it a C++ specific reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean why we can't write:
var->Get<ReaderBase>("batch_reader");
?
That is because our Variable
doesn't support convert an object to its parent type. If we drop a BatchReader
into a Variable
, we can't fetch it by Get<ReaderBase>()
for they are different types and the Variable
doesn't know that they have an inheritance relationship.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sorry I forgot to paste the code.
Thanks, I thought the compiler would know the inheritance, probably I missed something, will look more detail into the code.
doc/design/cpp_data_feeding.md
Outdated
@@ -69,10 +113,59 @@ To solve this problem, we introduce `ReaderHolder` as a wrapper. It acts as an e | |||
|
|||
To create and invoke readers, some new ops are introduced: | |||
|
|||
### `CreateReaderOp` | |||
### CreateReaderOp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need CreateReaderOp
if there are creator operator for each kind of reader?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CreateReaderOp
is just a general name for all reader creator operators. There is no such CreateReaderOp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename it to "### Operators That Creates Readers"? Otherwise the reader could think CreateReaderOp
is actually an OP.
Thank you so much @abhinavarora ! Your comments are quite helpful! |
… dev_update_reader_doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Updates cpp reader document to our newest design.