Fix realtime sims in distributed architecture #56

Dinika · 2024-11-13T15:52:14Z

Implements the following changes to fix the new architecture:

- Replace Celery state updates with Redis pub/sub. Motivations:
  - Celery state updates are not event-driven: the server cannot react to a new state update and can only poll continuously. Polling returns the same state repeatedly, so the updates were duplicated (which is why a "hash" was needed).
  - Adding a sleep() (or similar) between polls risks missing state updates that carried data.
  - Because the server polled for state updates continuously, it could not attend to any other simulation request until all the subtasks were complete.
- Start a subprocess to run the simulation inside the Celery worker. If the simulation is not started in a child process, the Neuron simulator is not reset and still holds state from the previous simulations that the same Celery worker ran. This is why we were seeing repeated data for current.
- Reduce the frequency of health checks of Celery workers. The command celery -A bluenaas.infrastructure.celery status takes around 7 seconds to reply. The worker gets busy replying to these health checks and simulation performance goes down.
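As a rough illustration of the first change, here is a minimal sketch of event-driven state updates over Redis pub/sub. It is not the code from this PR: the helper names publish_update and stream_updates, the JSON payload shape, the terminal state names, and the channel name are all assumptions made for illustration. The only redis-py calls it relies on are publish on the client and get_message, unsubscribe, and close on the PubSub object.

```python
import json
from typing import Any, Iterator

# Assumed terminal states; the real project may use different names.
TERMINAL_STATES = {"SUCCESS", "FAILURE"}


def publish_update(client: Any, channel: str, state: str, data: Any = None) -> None:
    """Worker side: push one update to whoever is subscribed to the channel.

    `client` is expected to behave like redis.Redis (it needs `publish`).
    """
    client.publish(channel, json.dumps({"state": state, "data": data}))


def stream_updates(pubsub: Any, channel: str, timeout: float = 1.0) -> Iterator[dict]:
    """Server side: yield each update once, stop after a terminal state.

    `pubsub` is expected to behave like the object returned by
    redis.Redis().pubsub() after pubsub.subscribe(channel).
    """
    try:
        while True:
            # Waits up to `timeout` seconds; returns None if nothing arrived.
            message = pubsub.get_message(ignore_subscribe_messages=True, timeout=timeout)
            if message is None:
                continue
            update = json.loads(message["data"])  # accepts bytes or str
            yield update
            if update["state"] in TERMINAL_STATES:
                break
    finally:
        # Release the subscription whether we finished or failed.
        pubsub.unsubscribe(channel)
        pubsub.close()
```

With redis-py, client would be a redis.Redis instance and pubsub the result of client.pubsub() followed by pubsub.subscribe(channel). A published message is delivered at most once to a connected subscriber, so no hash is needed to drop duplicates, but the server has to subscribe before the worker starts publishing or the early updates are lost.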

Neuron simulator does not reset itself automatically after a call to simulation.run(). This produces incorrect simulation data, because the data also contains information from previous simulations that ran on the same worker. The best way to avoid this (as suggested on the Neuron forums) is to run the simulation inside a child process. Since we cannot use the multiprocessing module with Celery, billiard is used.

See:
- https://www.neuron.yale.edu/phpBB/viewtopic.php?t=4039
- https://stackoverflow.com/questions/54858326/python-multiprocessing-billiard-vs-multiprocessing
- celery/celery#5362 (comment)

Signed-off-by: Dinika Saxena <[email protected]>

pgetta · 2024-11-13T16:06:02Z

Great work, Dinika, the change looks good to me.

Dinika added 5 commits November 13, 2024 09:01

Use redis pub sub instead of celery update states

1648efe

Implement frequency varying simulation

724d32e

Fix frequency varying synaptome simulation

e37f5b5

Handle error in blocking code

3c50a2f

pgetta approved these changes Nov 13, 2024


Dinika added 3 commits November 13, 2024 17:13

Add doc to state why health check frequency is reduced

2707cd1

With get_message

2108d51

Unsubscribe and close channel when work is done

4a49e65
