Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use terminology *data reader creator* instead of *data reader*, and r… #1412

Merged
merged 4 commits into from
Feb 22, 2017
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
104 changes: 60 additions & 44 deletions doc/design/reader/README.md
Original file line number Diff line number Diff line change
@@ -1,30 +1,40 @@
# Python Data Reader Design Doc

Paddle reads data from data reader during training. It will be passed into `paddle.train` as a parameter.
At training and testing time, PaddlePaddle programs need to read data. To ease the users' work to write data reading code, we define that

- A *reader* is a function that reads data (from file, network, random number generator, etc) and yields data items.
- A *reader creator* is a function that returns a reader function.
- A *reader* decorator is a function, which accepts one or more readers, and returns a reader.

and provide frequently used reader creators and reader decorators.

## Data Reader Interface

Data reader is a function with no parameter that creates a iterable (anything can be used in `for x in iterable`):
Indeed, *data reader* doesn't have to be a function that reads and yields data items. It can be any function with no parameter that creates a iterable (anything can be used in `for x in iterable`):

```
iterable = data_reader()
```

Element produced for the iterable should be a **single** entry of data, **not** a mini batch. That entry of data could be a single item, or a tuple of items. Item should be of [supported type](http://www.paddlepaddle.org/doc/ui/data_provider/pydataprovider2.html?highlight=dense_vector#input-types) (e.g., numpy 1d array of float32, int, list of int)
Element produced from the iterable should be a **single** entry of data, **not** a mini batch. That entry of data could be a single item, or a tuple of items. Item should be of [supported type](http://www.paddlepaddle.org/doc/ui/data_provider/pydataprovider2.html?highlight=dense_vector#input-types) (e.g., numpy 1d array of float32, int, list of int)

An example implementation for single item data reader:
An example implementation for single item data reader creator:

```python
def data_reader_fake_image():
while True:
yield numpy.random.uniform(-1, 1, size=20*20)
def reader_creator_random_image(width, height):
def reader():
while True:
yield numpy.random.uniform(-1, 1, size=width*height)
return reader
```

An example implementation for multiple item data reader:
An example implementation for multiple item data reader creator:
```python
def data_reader_fake_image_and_label():
while True:
yield numpy.random.uniform(-1, 1, size=20*20), False
def reader_creator_random_imageand_label(widht, height, label):
def reader():
while True:
yield numpy.random.uniform(-1, 1, size=width*height), label
return reader
```

## Usage
Expand All @@ -41,9 +51,9 @@ label_layer = paddle.layer.data("label", ...)
paddle.train(paddle.dataset.mnist, {"image":0, "label":1}, 128, 10, ...)
```

## Data Reader Decorators
## Data Reader Decorator

Data reader decorators takes a single or multiple data reader, returns a new data reader. It is similar to a [python decorator](https://wiki.python.org/moin/PythonDecorators), but it does not use `@` syntax.
*Data reader decorator* takes a single or multiple data reader, returns a new data reader. It is similar to a [python decorator](https://wiki.python.org/moin/PythonDecorators), but it does not use `@` syntax.

Since we have a strict interface for data readers (no parameter, return a single data item). Data reader can be used flexiable via data reader decorators. Following are a few examples:

Expand All @@ -61,23 +71,27 @@ buffered_reader = paddle.reader.buffered(paddle.dataset.mnist, 100)

### Compose Multiple Data Readers

For example, we want to use a source of real images (reusing mnist dataset), and a source of fake images as input for [Generative Adversarial Networks](https://arxiv.org/abs/1406.2661).
For example, we want to use a source of real images (reusing mnist dataset), and a source of random images as input for [Generative Adversarial Networks](https://arxiv.org/abs/1406.2661).

We can do:

```python
def data_reader_fake_image():
while True:
yield numpy.random.uniform(-1, 1, size=20*20)

def data_reader_bool(t):
while True:
yield t

true_reader = lambda : data_reader_bool(True)
false_reader = lambda : data_reader_bool(False)

reader = paddle.reader.combine(paddle.dataset.mnist, data_reader_fake_image, true_reader, false_reader)
def reader_creator_random_image(width, height):
def reader():
while True:
yield numpy.random.uniform(-1, 1, size=width*height)
return reader

def reader_creator_bool(t):
def reader:
while True:
yield t
return reader

true_reader = reader_creator_bool(True)
false_reader = reader_creator_bool(False)

reader = paddle.reader.compose(paddle.dataset.mnist, data_reader_creator_random_image(20, 20), true_reader, false_reader)
# Skipped 1 because paddle.dataset.mnist produces two items per data entry.
# And we don't care second item at this time.
paddle.train(reader, {"true_image":0, "fake_image": 2, "true_label": 3, "false_label": 4}, ...)
Expand All @@ -98,29 +112,31 @@ reader = paddle.reader.shuffle(paddle.dataset.mnist, 512)

If a mini batch is returned, data reader need to take care of batch size. But batch size is a concept for training, it makes more sense for user to specify batch size as a parameter for `train`.

Practically, always return a single entry make reusing existing data reader much easier (e.g., if existing data reader return not a single entry but 3 entries, training code will be more complex because it need to handle cases like batch size 2).
Practically, always return a single entry make reusing existing data readers much easier (e.g., if existing reader return not a single entry but 3 entries, training code will be more complex because it need to handle cases like batch size 2).

### Why use a dictionary but not a list to provide mapping?

We decided to use dictionary (`{"image":0, "label":1}`) instead of list (`["image", "label"]`) is because that user can easily resue item (e.g., using `{"image_a":0, "image_b":0, "label":1}`) or skip item (e.g., using `{"image_a":0, "label":2}`).

### How to create custom data reader
### How to create custom data reader creator

```python
def image_reader(image_path, label_path, n):
f = open(image_path)
l = open(label_path)
images = numpy.fromfile(
f, 'ubyte', count=n * 28 * 28).reshape((n, 28 * 28)).astype('float32')
images = images / 255.0 * 2.0 - 1.0
labels = numpy.fromfile(l, 'ubyte', count=n).astype("int")
for i in xrange(n):
yield images[i, :], labels[i] # a single entry of data is created each time
f.close()
l.close()

# use python lambda to change image_reader into a function with no parameters.
reader = lambda : image_reader("/path/to/image_file", "/path/to/label_file", 1024)
def image_reader_creator(image_path, label_path, n):
def reader():
f = open(image_path)
l = open(label_path)
images = numpy.fromfile(
f, 'ubyte', count=n * 28 * 28).reshape((n, 28 * 28)).astype('float32')
images = images / 255.0 * 2.0 - 1.0
labels = numpy.fromfile(l, 'ubyte', count=n).astype("int")
for i in xrange(n):
yield images[i, :], labels[i] # a single entry of data is created each time
f.close()
l.close()
return reader

# images_reader_creator creates a reader
reader = image_reader_creator("/path/to/image_file", "/path/to/label_file", 1024)
paddle.train(reader, {"image":0, "label":1}, ...)
```

Expand All @@ -129,7 +145,7 @@ paddle.train(reader, {"image":0, "label":1}, ...)
An example implementation of paddle.train could be:

```python
def minibatch_decorater(reader, minibatch_size):
def make_minibatch(reader, minibatch_size):
def ret():
r = reader()
buf = [r.next() for x in xrange(minibatch_size)]
Expand All @@ -140,6 +156,6 @@ def minibatch_decorater(reader, minibatch_size):

def train(reader, mapping, batch_size, total_pass):
for pass_idx in range(total_pass):
for mini_batch in minibatch_decorater(reader): # this loop will never end in online learning.
for mini_batch in make_minibatch(reader): # this loop will never end in online learning.
do_forward_backward(mini_batch, mapping)
```