Skip to content
This repository has been archived by the owner on Nov 3, 2022. It is now read-only.

[WIP] initial TensorFlow TFRecord integration with Keras #27

Closed
wants to merge 7 commits into from

Conversation

ahundt
Copy link
Collaborator

@ahundt ahundt commented Feb 19, 2017

About TFRecords: https://www.tensorflow.org/api_guides/python/python_io
Code Source: https://github.com/indraforyou/keras_tfrecord
Primary Author: Indranil Sur @indraforyou
License: MIT (same as Keras)

@ahundt
Copy link
Collaborator Author

ahundt commented Feb 20, 2017

@farizrahman4u How do I configure this code to pass on the theano backend? This code is a keras API for tensorflow TFrecords, so even the theano backend will need access to tensorflow by definition. Perhaps something like what's done in keras/backend/init.py?

@patyork
Copy link
Contributor

patyork commented Feb 20, 2017

By definition, this would be a feature that can only be used with the TensorFlow backend.

I don't know if it's been decided on whether contrib will allow features that only work on one backend. The only Keras functionality that fits that bill is the seperableConv layer (it's TF only currently).

@ahundt
Copy link
Collaborator Author

ahundt commented Feb 20, 2017

Well in theory it could probably run on the theano backend if tf is available strictly for dataset loading, but that should probably be addressed when there is a user that needs it.

@patyork I'm not sure what changes to make, what's the best way to skip this file when the tensorflow backend is not available?

Also, is the proposed datasets location appropriate? I was split between there and the backends folder...

@patyork
Copy link
Contributor

patyork commented Feb 20, 2017

It would run with the Theano backend in very specific circumstances, where the performance impact would negate any benefits that may or may not come from having the image preprocessing in an unusable graph.

  • TF would have to be installed
  • TF would have to be configured to either use the CPU, or to not allocate all GPU memory
  • The output from the TF graph would have to converted to Numpy arrays on the CPU before being passed to the Theano graph (negating any performance benefit of image processing on the GPU)
  • Compilation time of the TF graph alone would probably negate any performance gain, and that graph is not even usable within Theano
  • The data preprocessing is still limited by what TF provides (or you write yourself)

And no, any backend-specific code needs to be in the corresponding *_backend.py file.

Also, you're main problem is a compilation error: from keras.backend import tf should be import tensorflow as tf

@ahundt
Copy link
Collaborator Author

ahundt commented Feb 20, 2017

And no, any backend-specific code needs to be in the corresponding *_backend.py file.

Are you saying I should move tfrecord.py directly into keras_contrib/backend/tensorflow_backend.py before it can be merged? Not completely sure that's the best fit yet, but I could do that.

All TFRecord code will depend on TF unless someone implemented a python+protobuf only backend for loading them, not actually that hard but also probably not high priority for anyone. I might implement something that converts TFRecord datasets to the more commonly used keras dataset formats though.

Also, you're main problem is a compilation error: from keras.backend import tf should be import tensorflow as tf

Thanks, I don't know how I missed that one.

@patyork
Copy link
Contributor

patyork commented Feb 20, 2017

any backend-specific code needs to be in the corresponding *_backend.py file.

I stand by this statement; how you go about moving all references to TensorFlow/tf.* into that file is up to you, but given that this is entirely tensorflow functionality, I don't see how it would be possible to split it up in any meaningful way. Also, the datasets module is about downloading/loading/returning actual numpy data for a specific dataset (e.g. mnist_data = keras.datasets.mnist.load_data()); it is not a place for generic data loading.

If you do manage to split up the code, I'd image the TF specific stuff goes into the TF backend, and any usage of that functionality might be in the preprocessing/image.py file.

@ahundt
Copy link
Collaborator Author

ahundt commented Feb 20, 2017

any backend-specific code needs to be in the corresponding *_backend.py file.
I stand by this statement;

No problem! I'll move it right over.

specific dataset (e.g. mnist_data = keras.datasets.mnist.load_data()); it is not a place for generic data loading.

I plan to write/integrate a couple of specific dataset loaders over the next few weeks. One of those is only available as TFRecords.

@ahundt ahundt changed the title initial TensorFlow TFRecord integration with Keras [WIP] initial TensorFlow TFRecord integration with Keras Feb 21, 2017
@ahundt
Copy link
Collaborator Author

ahundt commented Feb 21, 2017

Before adding tests, is this the best design?

Specifically, is there perhaps a way the existing fit() functions in keras/keras models could be parameterized to reduce the duplicate code necessary to support the new TFRecord data format and thus the backend code could reduce duplication and become much more minimal?

@indraforyou
Copy link

I would argue its a bad design.. I quickly made the code working to meet @ahundt 's bounty deadline. Rewrite of both compile and fit can be avoided.

I would suggest to make an extra param say tfrecord and pass it to compile(). In compile() if the param is True, sample_weight, targets placeholders are not enabled. The param can be saved as a member of the model and be used in the fit function appropriately.

@ahundt
Copy link
Collaborator Author

ahundt commented Feb 22, 2017

@indraforyou so you're suggesting that for upstream Keras-2?

I don't think a parameter explicitly named tfrecord would necessarily be ideal, since it wouldn't help with the next format added. Probably the best option in the end would be like what's done for the tensorflow backend, where a data format backend could be incorporated and specialized for HDF5 and tfrecord.

However, data format backends seems much larger scope. Perhaps the current basic function is sufficient for the moment and a separate issue should be created until there is both more interest and a dataset available?

@tboquet
Copy link
Collaborator

tboquet commented Feb 22, 2017

Nice work! This feature is definitely interesting since it's a problem in a lot of organizations. HDF5 and tfrecord seem to be the 2 formats of choice. I will try to review the PR today and add some comments. The integration in a clean way seems tricky.

@ahundt
Copy link
Collaborator Author

ahundt commented Feb 23, 2017

@farizrahman4u that merge seemed to break travis

@tboquet
Copy link
Collaborator

tboquet commented Feb 23, 2017

We could maybe merge this PR when it will be repaired and warn the user that it is an experimental feature? This way we will be able to test tfRecord with Keras and see where we could integrate these functionalities in the core code. What do you think?

@ahundt
Copy link
Collaborator Author

ahundt commented Feb 24, 2017

@tboquet I would because I've tested it manually, but I can't accept pull requests. If you want to try out the code you can clone it from my fork or do it with github's hub command line command. https://github.com/github/hub

@ibadami
Copy link

ibadami commented Mar 7, 2017

Thank you @indraforyou , @ahundt for this excellent work. This was much needed for my project and it works like a charm. Also this patch was one of the reason I switched from TFLearn to Keras around 5 days ago.

I have few questions to extend this work further.

  • How can one use validation data in fit_tfrecord function?
  • How to do the data augmentation?
  • How to evaluate using TFRecord test set.

ahundt added 2 commits April 26, 2017 12:02
# Conflicts:
#	keras_contrib/backend/tensorflow_backend.py
# Conflicts:
#	keras_contrib/backend/tensorflow_backend.py
@ahundt
Copy link
Collaborator Author

ahundt commented Apr 26, 2017

@tboquet What are the next steps to merge this?

@ibadami have you been using this?

Is anyone interested in pushing this functionality through to be merged?

Copy link
Collaborator

@tboquet tboquet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it is possible to move all the new functions from the tensorflow backend to another module? It will be easier to manage I guess. I'm not sure about this so let's discuss about it.

@@ -153,6 +158,309 @@ def depth_to_space(input, scale, data_format=None):
out = _postprocess_conv2d_output(out, data_format)
return out


def data_to_tfrecord(images, labels, filename):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename this images_to_tfrecord what do you think?

@tboquet
Copy link
Collaborator

tboquet commented Jun 4, 2017

Hi @ahundt, what's the status on this? These are nice functionalities especially with the tf.contrib.data library.

@ahundt
Copy link
Collaborator Author

ahundt commented Jun 5, 2017

@tboquet I think it needs some reworking. I tried to discuss the next steps at tensorflow/tensorflow#8787 but haven't really heard back.

@drewrice2
Copy link

hi @ahundt
thanks for spearheading this movement. commented on the tensorflow issue to show some interest.

@ahundt
Copy link
Collaborator Author

ahundt commented Jun 10, 2017

I've done a second round of this upstream, I'd appreciate some testing + fixes + review from anyone who is interested:
keras-team/keras#6928

I'll probably close this pull request soon since keras-team/keras#6928 will be a significant improvement once it works.

@tboquet
Copy link
Collaborator

tboquet commented Jun 10, 2017

@ahundt sounds good!

@ahundt
Copy link
Collaborator Author

ahundt commented Jun 13, 2017

keras-team/keras#6928 now passes the unit tests so I'm closing this.

@ahundt ahundt closed this Jun 13, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants