Create CI check to detect training memory leaks #7827
Labels: area:rasa-oss 🎡, effort:atom-squad/2, type:maintenance 🔧
Description of Problem:
We recently identified a memory leak in the training of Rasa models.
The leak existed on the main branch, but it was only detected when a new change made it severe enough to crash CI test workers.
After much investigation the leak was narrowed down to a piece of TensorFlow code and fixed.
Ideally this memory leak would have been caught sooner, since it would have affected users.
The leak was apparent either when training with a high number of epochs or when training multiple times (as in the test suite).
We would like an automated check to ensure that we don't introduce another memory leak.
Overview of the Solution:
TensorFlow memory leaks can be hard to identify and fix because they often occur while the graph is being executed, which may happen in C code, for example. This means the "leaking" variables are often not visible when inspecting all the Python objects in the interpreter. However, you can detect a memory leak by looking at the total memory usage of the process.
We used https://pypi.org/project/memory-profiler/ to track the memory usage of the Python process while training the TED model, which is how the leak mentioned above was found.
The tool tracks total memory usage over time and writes it to a file that can be parsed or plotted.
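As a rough illustration of how memory-profiler could be driven programmatically (rather than via its mprof CLI), the sketch below samples process memory while training repeatedly. `train_model` and `sample_training_memory` are hypothetical placeholders, not existing Rasa functions.

```python
# Sketch only: sample process memory while training several times in a row.
# train_model() is a stand-in for the real training entry point (e.g. TED
# training); it is not an existing function in the Rasa codebase.
from typing import List

from memory_profiler import memory_usage


def train_model() -> None:
    """Placeholder for a single training run."""
    ...


def sample_training_memory(runs: int = 5, interval: float = 0.5) -> List[float]:
    """Return memory samples (in MiB) taken while training `runs` times."""

    def _train_repeatedly() -> None:
        for _ in range(runs):
            train_model()

    # memory_usage accepts a (callable, args, kwargs) tuple and samples the
    # process memory at the given interval until the callable returns.
    return memory_usage((_train_repeatedly, (), {}), interval=interval)
```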
To use this in an automated fashion, we could set a threshold that fails the check when crossed, e.g. 1 GB.
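A minimal sketch of what such a check might look like as a pytest test, assuming the hypothetical `sample_training_memory` helper from the sketch above; the 1 GB limit is only the example figure given here and would need tuning against a known-good baseline:

```python
# Sketch only: fail the CI check if memory grows past a threshold while
# training repeatedly. MAX_MEMORY_GROWTH_MIB mirrors the example 1 GB figure.
MAX_MEMORY_GROWTH_MIB = 1024


def test_training_does_not_leak_memory() -> None:
    samples = sample_training_memory(runs=5)
    growth = max(samples) - samples[0]
    assert growth < MAX_MEMORY_GROWTH_MIB, (
        f"Memory grew by {growth:.0f} MiB across repeated trainings, "
        f"exceeding the {MAX_MEMORY_GROWTH_MIB} MiB threshold."
    )
```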
Definition of Done: