Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert RNN generation to TF v2. #1978

Merged
merged 19 commits into from
Aug 19, 2020
Merged

Convert RNN generation to TF v2. #1978

merged 19 commits into from
Aug 19, 2020

Conversation

mihaimaruseac
Copy link
Member

TF v2 offers ways to handle validation and TensorBoard during model callback. We can also do the batching using tf.data instead of writing the loop manually.

I'm trying it this way now to not do too many changes at once but will probably send another PR later to convert to the recommended way.

TF v2 models only save 2 files, not three.
If no checkpoint has been saved, validation and data generation will
fail as they imply loading saved weights to use in the rebatched model.

A better design would be to have callbacks that do validation and
generation and all TensorBoard stuff but for that we first need to
convert to a dataset representation and do automatic batching.
@google-cla google-cla bot added the cla: yes CLA signed. label Aug 18, 2020
@mihaimaruseac
Copy link
Member Author

Fixes #1540

@Dor1s
Copy link
Contributor

Dor1s commented Aug 18, 2020

/gcbrun

Copy link
Contributor

@Dor1s Dor1s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is crazy and LGTM! Let's see what makes the CI unhappy

@Dor1s
Copy link
Contributor

Dor1s commented Aug 18, 2020

/gcbrun

@mihaimaruseac
Copy link
Member Author

The pylint issue on Travis I think can be fixed by pylint-dev/pylint#1542 (comment). TensorBoard recommends the as_default() syntax in the migration guide

I cannot see the GCP build log so I don't know what failed there.

Should I instead try and convert code to use tf.data and callbacks for validation?

@mihaimaruseac
Copy link
Member Author

I disabled it inline, hopefully this would fix the travis build

@mihaimaruseac
Copy link
Member Author

Ok, Travis passes now. Should we try another /gcbrun ?

@inferno-chromium
Copy link
Collaborator

/gcbrun

@mihaimaruseac
Copy link
Member Author

Still fails :( But I cannot see the log

@inferno-chromium
Copy link
Collaborator

Still fails :( But I cannot see the log

======================================================================
FAIL: test_train_rnn (tests.core.bot.tasks.ml_train_task_test.MLRnnTrainTaskIntegrationTest)
Test train RNN model on a simple corpus.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspace/src/python/tests/core/bot/tasks/ml_train_task_test.py", line 193, in test_train_rnn
    self.assertTrue(ml_train_task.get_last_saved_model(self.model_directory))
AssertionError: {} is not true

----------------------------------------------------------------------
Ran 155 tests in 30.966s

FAILED (failures=1, skipped=4)

Ran 1437 tests (111 skipped, 0 errors, 1 failures).

@inferno-chromium
Copy link
Collaborator

Still fails :( But I cannot see the log

======================================================================
FAIL: test_train_rnn (tests.core.bot.tasks.ml_train_task_test.MLRnnTrainTaskIntegrationTest)
Test train RNN model on a simple corpus.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspace/src/python/tests/core/bot/tasks/ml_train_task_test.py", line 193, in test_train_rnn
    self.assertTrue(ml_train_task.get_last_saved_model(self.model_directory))
AssertionError: {} is not true

----------------------------------------------------------------------
Ran 155 tests in 30.966s

FAILED (failures=1, skipped=4)

Ran 1437 tests (111 skipped, 0 errors, 1 failures).

Using python butler.py py_unittest -t core -m

@mihaimaruseac
Copy link
Member Author

Thanks. I'll try to replicate locally, it's possible to be caused by the change in model format

@inferno-chromium
Copy link
Collaborator

/gcbrun

@inferno-chromium
Copy link
Collaborator

/gcbrun

Copy link
Contributor

@Dor1s Dor1s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!!! Thanks @inferno-chromium for helping to debug!

@Dor1s
Copy link
Contributor

Dor1s commented Aug 18, 2020

We can merge this today, but if anyone deploys, I won't be able to monitor the errors as I'm OOO today.

@mihaimaruseac
Copy link
Member Author

I'm ok with waiting :)

@Dor1s Dor1s merged commit 8f1da98 into google:master Aug 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes CLA signed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants