Create CI check to detect training memory leaks #7827
Labels: area:rasa-oss 🎡, effort:atom-squad/2, type:maintenance 🔧
Description of Problem:
We recently identified a memory leak in the training of Rasa models.
The leak existed on the main branch, but it was only detected when a new change made it severe enough to crash CI test workers.
After much investigation the leak was narrowed down to a piece of TensorFlow code and fixed.
Ideally this memory leak would have been caught sooner, since it would have affected users.
The leak was apparent either when training with a high number of epochs or when training multiple times (as in the test suite).
We would like an automated check to ensure that we don't introduce another memory leak.
Overview of the Solution:
TensorFlow memory leaks can be hard to identify and fix because they often occur while the graph is being executed, which may happen in C code, for example. This means the "leaking" variables are often not visible when inspecting all the Python objects in the interpreter. However, you can detect a memory leak by looking at the total memory usage of the process.
We used https://pypi.org/project/memory-profiler/ to track the memory usage of the Python process while training the TED model, which is how the leak mentioned above was found.
The tool tracks total memory usage over time and writes it to a file that can be parsed or plotted.
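As a rough illustration of how memory-profiler could be driven programmatically (rather than via its mprof CLI), the sketch below samples process memory while training repeatedly. `train_model` and `sample_training_memory` are hypothetical placeholders, not existing Rasa functions.

```python
# Sketch only: sample process memory while training several times in a row.
# train_model() is a stand-in for the real training entry point (e.g. TED
# training); it is not an existing function in the Rasa codebase.
from typing import List

from memory_profiler import memory_usage


def train_model() -> None:
    """Placeholder for a single training run."""
    ...


def sample_training_memory(runs: int = 5, interval: float = 0.5) -> List[float]:
    """Return memory samples (in MiB) taken while training `runs` times."""

    def _train_repeatedly() -> None:
        for _ in range(runs):
            train_model()

    # memory_usage accepts a (callable, args, kwargs) tuple and samples the
    # process memory at the given interval until the callable returns.
    return memory_usage((_train_repeatedly, (), {}), interval=interval)
```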
To use this in an automated fashion, we could set a threshold that fails the check when crossed, e.g. 1 GB.
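A minimal sketch of what such a check might look like as a pytest test, assuming the hypothetical `sample_training_memory` helper from the sketch above; the 1 GB limit is only the example figure given here and would need tuning against a known-good baseline:

```python
# Sketch only: fail the CI check if memory grows past a threshold while
# training repeatedly. MAX_MEMORY_GROWTH_MIB mirrors the example 1 GB figure.
MAX_MEMORY_GROWTH_MIB = 1024


def test_training_does_not_leak_memory() -> None:
    samples = sample_training_memory(runs=5)
    growth = max(samples) - samples[0]
    assert growth < MAX_MEMORY_GROWTH_MIB, (
        f"Memory grew by {growth:.0f} MiB across repeated trainings, "
        f"exceeding the {MAX_MEMORY_GROWTH_MIB} MiB threshold."
    )
```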
Definition of Done: