Create CI check to detect training memory leaks #7827

Closed
3 tasks
joejuzl opened this issue Jan 27, 2021 · 0 comments · Fixed by #8249
Labels
area:rasa-oss 🎡 (Anything related to the open source Rasa framework); effort:atom-squad/2 (Label which is used by the Rasa Atom squad to do internal estimation of task sizes); type:maintenance 🔧 (Improvements to tooling, testing, deployments, infrastructure, code style)

joejuzl commented Jan 27, 2021

Description of Problem:
We recently identified a memory leak in the training of Rasa models.
This leak was on the main branch, but it was only detected when a new change made it bad enough to crash CI test workers.
After much investigation, the leak was narrowed down to a piece of TensorFlow code and fixed.
Ideally this memory leak would have been caught sooner, as it would have affected users.
The leak was apparent when either training with a high number of epochs, or training multiple times (like in the test suite).
We would like to have an automated check to test that we don't introduce another memory leak.

Overview of the Solution:
TensorFlow memory leaks can be hard to identify and fix because they often occur while the graph is being executed, which may happen in C code, for example. This means the "leaking" variables are often not visible when inspecting the Python objects in the interpreter. However, you can tell whether a memory leak exists by looking at the total memory usage of the process.
We used https://pypi.org/project/memory-profiler/ to track the memory usage of a Python process while training the TED model, which is how we found the leak mentioned above.
This tool tracks the total memory usage over time and writes it to a file which can be parsed or plotted.
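
As a rough sketch of the "can be parsed" part: running a script under memory-profiler's `mprof run` command writes a text file of samples. The parser below assumes each sample line has the form `MEM <memory in MiB> <unix timestamp>` and skips everything else; that layout should be checked against the installed version. The function name `parse_mprof_samples` and the script name `train_dummy.py` are hypothetical.

```python
from typing import List, Tuple


def parse_mprof_samples(text: str) -> List[Tuple[float, float]]:
    """Return (timestamp_seconds, memory_mib) pairs from mprof-style output.

    Assumes sample lines look like "MEM <memory_in_MiB> <unix_timestamp>";
    every other line (for example the CMDLINE header) is skipped.
    """
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[0] == "MEM":
            samples.append((float(fields[2]), float(fields[1])))
    return samples


# Made-up sample content, only to illustrate the assumed layout.
example = """CMDLINE python train_dummy.py
MEM 100.0 10.0
MEM 150.5 10.1
MEM 200.0 10.2
"""
samples = parse_mprof_samples(example)
peak_mib = max(memory for _, memory in samples)
```

The resulting list of pairs can be fed to whatever trend or threshold check the CI job ends up using.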
To use this in an automated fashion we could:

  • Create a test which trains a model with dummy data but a high number of epochs
  • Run this Python process wrapped in the profiler
  • Analyse the output to see the trend of the total memory usage
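
The three steps above could be sketched roughly as follows. This is only an illustration, not the implementation that was eventually merged: `sample_memory` wraps a hypothetical training callable with memory-profiler's `memory_usage` function, and `growth_per_sample` is one possible way to quantify the trend (a least-squares slope over the samples). Both helper names and the sample numbers are assumptions.

```python
from typing import Callable, List, Sequence


def sample_memory(train_fn: Callable[[], None], interval: float = 0.1) -> List[float]:
    """Run train_fn and return total process memory samples in MiB.

    Requires the memory-profiler package; it is imported lazily so the
    analysis helper below can be used without it.
    """
    from memory_profiler import memory_usage

    return memory_usage((train_fn, (), {}), interval=interval, include_children=True)


def growth_per_sample(samples: Sequence[float]) -> float:
    """Least-squares slope of memory against sample index (MiB per sample)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    covariance = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


# Made-up numbers: a flat profile versus one that grows steadily.
flat = [500.0, 501.0, 499.0, 500.0]
leaking = [500.0, 510.0, 520.0, 530.0]
print(growth_per_sample(flat), growth_per_sample(leaking))
```

A slope close to zero suggests stable memory, while a consistently positive slope over many epochs hints at a leak. What counts as "too steep" would need to be tuned against real training runs.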

We could set a threshold, e.g. 1 GB, that fails the test if it is crossed.
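
A minimal sketch of such a threshold check, assuming the memory samples are already available as a list of values in MiB (the unit memory-profiler reports). The constant `MAX_MEMORY_MIB`, the function name, and the choice to compare against the peak sample are assumptions for illustration only.

```python
from typing import Sequence

# The 1 GB example threshold, expressed in MiB (assumed value).
MAX_MEMORY_MIB = 1024.0


def assert_memory_below(
    samples_mib: Sequence[float], limit_mib: float = MAX_MEMORY_MIB
) -> None:
    """Raise AssertionError if the peak memory sample exceeds limit_mib."""
    peak = max(samples_mib)
    if peak > limit_mib:
        raise AssertionError(
            f"Peak memory {peak:.1f} MiB exceeded the limit of {limit_mib:.1f} MiB"
        )


# Passes silently: the peak of 800 MiB is under the limit.
assert_memory_below([300.0, 800.0])
```

Raising `AssertionError` means the check fails a pytest run in the same way as an ordinary `assert`, so it should behave the same locally and in CI.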

Definition of Done:

  • The check can identify the bug that is mentioned in the description.
  • The check works in the CI.
  • The check works locally.
@joejuzl joejuzl added type:maintenance 🔧 Improvements to tooling, testing, deployments, infrastructure, code style. area:rasa-oss 🎡 Anything related to the open source Rasa framework labels Jan 27, 2021
@wochinge wochinge added the effort:atom-squad/2 Label which is used by the Rasa Atom squad to do internal estimation of task sizes. label Mar 12, 2021