How to batch train with fit_generator()? #7729

Closed
IanKirwan opened this issue Aug 24, 2017 · 6 comments

Comments

@IanKirwan

Apologies if this is the wrong place to raise my issue (please help me out with where best to raise it if that's the case). I'm a novice with Keras and Python, so I hope responses have that in mind.

I'm trying to train a CNN steering model that takes images as input. It's a fairly large dataset, so I created a data generator to work with fit_generator(). It's not clear to me how to make this method train on batches, so I assumed that the generator has to return batches to fit_generator(). The generator looks like this:

def gen(file_name, batchsz = 64):
    csvfile = open(file_name)
    reader = csv.reader(csvfile)
    batchCount = 0
    while True:
        for line in reader:
            inputs = []
            targets = []
            temp_image = cv2.imread(line[1]) # line[1] is path to image
            measurement = line[3] # steering angle
            inputs.append(temp_image)
            targets.append(measurement)
            batchCount += 1
            if batchCount >= batchsz:
                batchCount = 0
                X = np.array(inputs)
                y = np.array(targets)
                yield X, y
        csvfile.seek(0)

It reads a csv file containing telemetry data (steering angle etc.) and paths to image samples, and yields arrays of size batchsz.
The call to fit_generator() looks like this:

    tgen = gen('h:/Datasets/dataset14-no.zero.speed.trn.csv', batchsz = 128) # Train data generator
    vgen = gen('h:/Datasets/dataset14-no.zero.speed.val.csv', batchsz = 128) # Validation data generator
    try:
        #model.fit(X_all, y_all, validation_split=0.2, shuffle=True, nb_epoch=epochs)
        model.fit_generator(
            tgen,
            samples_per_epoch=113526,
            nb_epoch=6,
            validation_data=vgen,
            nb_val_samples=20001
        )

The dataset contains 113526 sample points yet the model training update output reads like this (for example):

  1020/113526 [..............................] - ETA: 27737s - loss: 0.0080
  1021/113526 [..............................] - ETA: 27723s - loss: 0.0080
  1022/113526 [..............................] - ETA: 27709s - loss: 0.0080
  1023/113526 [..............................] - ETA: 27696s - loss: 0.0080

Which appears to be training sample by sample (stochastically?).
The resultant model is useless. I previously trained on a much smaller dataset using .fit() with the whole dataset loaded into memory, and that produced a model that at least works, even if poorly. Clearly something is wrong with my fit_generator() approach. I will be very grateful for some help with this.

@StripedBanana

I don't think you should use a for loop in your generator. The reason is that Keras will spawn multiple threads when using fit_generator, each calling your generator to fetch examples in advance. This helps parallelize data fetching on the CPU.

From your code I understand you want to go through your whole dataset in one epoch of your fit_generator. This makes sense, but unfortunately the method wasn't really designed that way, if I've got it right. You have two ways of doing it:

  • fetch random batches in your while True: loop indefinitely
  • fetch batches by indexing your dataset, and playing with steps_per_epoch to make it stop exactly at the end of your data

I opted for the latter, and it works well, though be careful of the threading nature of the method (it may try to fetch data outside your range, hence the condition in the example below):

def my_generator(data, labels, indices, batch_size, steps):
    """Generator used by `keras.models.Sequential.fit_generator` to yield batches
    of pairs.

    Such a generator is required by the parallel nature of the aforementioned
    Keras function. It can theoretically feed batches of pairs indefinitely
    (looping over the dataset). Ideally, it would be called so that an epoch ends
    exactly with the last batch of the dataset.
    """
    i = 1
    while 1:
        (batch_pairs, batch_labels) = fetch_batch(i, data, labels,
                                                  indices, batch_size)
        if i == steps:
            i = 1 # avoids going too far in the data
            # will preload the first batches for the next epoch
        else:
            i += 1 # go for the next batch
        yield [batch_pairs[:, 0], batch_pairs[:, 1]], batch_labels

Don't mind my fetch_batch function; it basically indexes a batch of data with the index i.
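
For reference, here is a sketch of what the matching fit_generator call might look like with that indexing approach. The names (num_train_samples, train_data, train_labels, train_indices) and the numbers are illustrative, not from the original post, and it assumes the Keras 2 style steps_per_epoch/epochs arguments:

import math

batch_size = 128
# round up so the last, possibly smaller, batch is still covered
steps = int(math.ceil(num_train_samples / float(batch_size)))  # e.g. ceil(113526 / 128) = 887

model.fit_generator(
    my_generator(train_data, train_labels, train_indices, batch_size, steps),
    steps_per_epoch=steps,
    epochs=6)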

@IanKirwan
Author

@StripedBanana Thanks. I'll address the for loop. However, I'm still at a loss. Am I correct in assuming that the generator has to return the data in batches in order for fit_generator() to batch train?

@yuyang-huang
Contributor

Yes, the generator has to return the data in batches. But the problem is that you put inputs = [] inside the for loop. So for each line you read, inputs is cleared and fit_generator always gets a batch of size 1.
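
For illustration, here is a minimal corrected sketch of the original generator with the lists created outside the for loop and reset only after a batch has been yielded (same assumed csv layout: image path in column 1, steering angle in column 3):

import csv
import cv2
import numpy as np

def gen(file_name, batchsz=64):
    csvfile = open(file_name)
    reader = csv.reader(csvfile)
    inputs = []   # accumulate a whole batch before yielding
    targets = []
    while True:
        for line in reader:
            inputs.append(cv2.imread(line[1]))  # line[1] is path to image
            targets.append(float(line[3]))      # line[3] is steering angle
            if len(inputs) >= batchsz:
                yield np.array(inputs), np.array(targets)
                inputs = []   # reset only after the batch has gone out
                targets = []
        csvfile.seek(0)  # loop over the file again for the next epoch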

@IanKirwan
Author

IanKirwan commented Aug 26, 2017

@myutwo150 Thanks. I have now corrected that.
@StripedBanana I'm at a loss with taking the for loop out. Something is going to have to iterate through the data. Even if I delegate the loop to another function and call it from the generator, it will still suffer re-entrancy problems if what you say is correct. However, the Keras documentation provides this as an example generator:

def generate_arrays_from_file(path):
    while 1:
        f = open(path)
        for line in f:
            # create Numpy arrays of input data
            # and labels, from each line in the file
            x, y = process_line(line)
            yield (x, y)
        f.close()

model.fit_generator(generate_arrays_from_file('/my_file.txt'),
        steps_per_epoch=1000, epochs=10)

So I'm not sure it's a problem.

@ViaFerrata

ViaFerrata commented Sep 5, 2017

I would also be interested in that!
As you mentioned, you have to iterate through the data somehow, and I don't understand what the difference between a 'while' and a 'for' loop would be in that scenario.

Other than that, I'd rather use DataFlow from tensorpack if you're concerned about the speed of the generator.
Or you could wait until TF data tensor support has been integrated into Keras (it provides data entirely in C++, which avoids the Python overhead).
Depending on your scenario you may not need the improved speed though - I'd propose checking how many images per second your GPU could potentially train on.
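
If it helps, here is a rough, illustrative way to time the generator on its own; pass in whatever generator you want to benchmark:

import time

def images_per_second(generator, n_batches=50):
    """Rough estimate of how many images per second a generator delivers."""
    start = time.time()
    total = 0
    for _ in range(n_batches):
        xs, ys = next(generator)
        total += len(xs)
    return total / (time.time() - start)

Comparing that number with the images per second your GPU reaches on an in-memory batch tells you whether the generator is actually the bottleneck.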

Anyways, here's another snippet of a generator (with hdf5 files as input rather than csv) if it helps:

def generate_batches_from_hdf5_file(filepath, batchsize):
    """
    Generator that returns batches of images ('xs') and labels ('ys') from a h5 file.
    :param string filepath: Full filepath of the input h5 file, e.g. '/path/to/file/file.h5'.
    :param int batchsize: Size of the batches that should be generated.
    :return: (ndarray, ndarray) (xs, ys): Yields a tuple which contains a full batch of images and labels.
    """
    dimensions = (batchsize, 28, 28, 1) # 28x28 pixel, one channel
 
    while 1:
        f = h5py.File(filepath, "r")
        filesize = len(f['y'])

        # count how many entries we have read
        n_entries = 0
        # as long as we haven't read all entries from the file: keep reading
        while n_entries < (filesize - batchsize):
            # slice the next batch starting at index n_entries
            # create numpy arrays of input data (features)
            xs = f['x'][n_entries : n_entries + batchsize]
            xs = np.reshape(xs, dimensions).astype('float32')

            # and label info. Contains more than one label in my case, e.g. is_dog, is_cat, fur_color,...
            y_values = f['y'][n_entries:n_entries+batchsize]
            ys = np.zeros((batchsize, 2)) # data with 2 different classes (e.g. dog or cat)

            # Select the labels that we want to use, e.g. is dog/cat
            for c, y_val in enumerate(y_values):
                ys[c] = encode_targets(y_val, class_type='dog_vs_cat') # returns categorical labels [0,1], [1,0]

            # we have read one more batch from this file
            n_entries += batchsize
            yield (xs, ys)
        f.close()
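
And a matching call for completeness, assuming model is a compiled Keras model; the file path and batch size below are placeholders, and the step count is derived from the file size so that one epoch covers the file roughly once:

import h5py

filepath = '/path/to/file/file.h5'
batchsize = 64
with h5py.File(filepath, 'r') as f:
    steps = len(f['y']) // batchsize  # full batches per pass over the file

model.fit_generator(
    generate_batches_from_hdf5_file(filepath, batchsize),
    steps_per_epoch=steps,
    epochs=10)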

@stale

stale bot commented Dec 12, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.
