Rogue Garbage Collector #6087

Closed
Tracked by #1543
mguidon opened this issue Jul 22, 2024 · 5 comments · Fixed by #6564
Labels
bug buggy, it does not work as expected

Comments

@mguidon
Member

mguidon commented Jul 22, 2024

Which deploy/s?

All

Current Behavior

I think I (together with @matusdrobuliak66) could finally reproduce why the garbage collector is treating me like garbage:

  1. I often use the platform from my laptop when I am on the train.
  2. When arriving at my destination, I typically close the lid.
  3. I go to work and forget about it.
  4. A few hours later, I want to start a new service in the same deployment from my desktop machine at the office.
  5. The garbage collector insists on shutting me down all the time.
  6. In Redis I have two client_session_ids: the new one and the old one.
  7. Somehow the GC sees the stale one and assumes all of my sessions are to be terminated.
  8. Deleting the stale key manually allows me to start a service again.
@mguidon added the bug label (buggy, it does not work as expected) on Jul 22, 2024
@pcrespov
Member

pcrespov commented Jul 23, 2024

I wonder if this issue is also affecting the end-to-end tests with anonymous users, which get logged out before the test finishes.

@sanderegg
Member

I think we do have a mess with the session IDs. Also, if you duplicate a tab, I think you get the same one.

@matusdrobuliak66
Contributor

matusdrobuliak66 commented Oct 16, 2024

  • 16.10. Taylor is reporting he cannot open any study in osparc.io
  • I have scaled GC to 0
  • Taylor can open a study
  • After 10 minutes, while his study was already running, I scaled GC back to 1
  • GC triggers the stop of his service

His sessions state:

  • All client sessions have alive and resources keys
  • Only 2 client sessions have both socket_id and project_id in the resources
  • The other 2 have only the socket_id (NO project_id)
  • Also Taylor currently doesn't have any running service
    • Q: what does it mean, and why are there 2 project_ids in the resources of 2 sessions? Might this be the issue?
      • Project IDs: 258a69a4-82fb-11ef-b5f6-0242ac174ade & 879ad9e2-6921-11ef-a2a7-0242ac1745e4
  • The TTL of the alive key keeps being refreshed for all 4 of his sessions

Issue: I guess the GC wrongly decides to shut down the service because of the multiple sessions

@GitHK
Contributor

GitHK commented Oct 17, 2024

> Q: what does it mean, and why are there 2 project_ids in the resources of 2 sessions? Might this be the issue?

Totally normal: this means that each "tab" (session_id) has an open project.

> Project IDs: `258a69a4-82fb-11ef-b5f6-0242ac174ade` & `879ad9e2-6921-11ef-a2a7-0242ac1745e4`
> The TTL of the alive key keeps being refreshed for all 4 of his sessions

Very good to know, good that you checked.

> Issue: I guess the GC wrongly decides to shut down the service because of the multiple sessions

Somewhere in the code there is an issue in how it decides whether a project is active.

@sanderegg
Member

Here are new findings that I think explain what is going on. For info: @matusdrobuliak66, @GitHK, @mguidon, @odeimaiz

Here is the mechanism to uniquely identify the browser tab/project of a user:

  • Every time a browser tab opens osparc, the frontend creates a client_session_id that uniquely identifies the tab,
  • This ID is used to initialize the websocket (via socket.io), and this is recorded in the Redis database,
  • Subsequently the frontend passes the same client_session_id when opening a new project, so the backend can assign the opened project to the user's browser tab,
  • Also the frontend periodically handshakes, and the backend keeps the client_session_id alive key in Redis (it automatically disappears after 15 minutes of inactivity).
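
To make the bookkeeping above concrete, here is a minimal in-memory sketch. It is not the actual backend code: the `SessionRegistry` class, its method names and the key layout are assumptions made for illustration, and a plain dict with an explicit clock stands in for Redis and its key expiry. Only the 15 minute TTL and the `socket_id` / `project_id` resource fields come from this thread.

```python
from dataclasses import dataclass, field

ALIVE_TTL_SECONDS = 15 * 60  # TTL quoted in this thread


@dataclass
class SessionRegistry:
    """In-memory stand-in for the Redis keys (hypothetical layout)."""

    # client_session_id -> {"socket_id": ..., "project_id": ...}
    resources: dict[str, dict[str, str]] = field(default_factory=dict)
    # client_session_id -> timestamp at which the alive key expires
    alive_until: dict[str, float] = field(default_factory=dict)

    def connect_socket(self, client_session_id: str, socket_id: str, now: float) -> None:
        # The websocket is initialized with the tab's client_session_id.
        self.resources.setdefault(client_session_id, {})["socket_id"] = socket_id
        self.heartbeat(client_session_id, now)

    def open_project(self, client_session_id: str, project_id: str) -> None:
        # The project is attached to whatever ID the frontend sends.
        self.resources.setdefault(client_session_id, {})["project_id"] = project_id

    def heartbeat(self, client_session_id: str, now: float) -> None:
        # Each handshake pushes the expiry of the alive key forward.
        self.alive_until[client_session_id] = now + ALIVE_TTL_SECONDS

    def is_alive(self, client_session_id: str, now: float) -> bool:
        return now < self.alive_until.get(client_session_id, 0.0)
```

With this model, a tab that connected at `now=0.0` and never sent another handshake is still alive at 14 minutes and is gone at 16 minutes.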

Here is how the GC works:

  • checks which projects are open,
  • checks whether each open project is linked to a user tab, i.e. that a client_session_id is assigned to it and that it was refreshed,
  • checks that the alive key is present (TTL 15 minutes),
    --> if the alive key is gone, closes the project
    --> if a project is not assigned a client_session_id, closes the project
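
The decision rules above can be sketched as a small pure function. This is a simplified model, not the GC's real implementation: `projects_to_close`, its arguments and the reason strings are hypothetical names, and the set of alive sessions is passed in directly instead of being read from Redis.

```python
from typing import Optional


def projects_to_close(
    open_projects: dict[str, Optional[str]],
    alive_sessions: set[str],
) -> dict[str, str]:
    """Return {project_id: reason} for every project this model would close.

    open_projects maps project_id -> client_session_id (None if unassigned).
    alive_sessions holds the client_session_ids whose alive key still exists.
    """
    to_close: dict[str, str] = {}
    for project_id, client_session_id in open_projects.items():
        if client_session_id is None:
            to_close[project_id] = "no client_session_id assigned"
        elif client_session_id not in alive_sessions:
            to_close[project_id] = "alive key expired"
    return to_close
```

A project stays open only when it is assigned to a session and that session's alive key is still present; either missing condition is enough to close it.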

Now, after playing with @mguidon's laptop, we observed the following:

  • After opening the lid, the original websocket was reconnected,
  • Redis database shows the client_session_id with which it was opened at the time,
  • When clicking on the + button, another client_session_id was passed from the frontend to the backend,
    --> the project starts opening, but it is recorded in Redis under a different client_session_id than the one holding the socket_id
    --> therefore we get a project that has no socket_id assigned (i.e. no tab from the POV of the backend)
    --> therefore the GC rightfully closes the project
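
The mismatch observed above can be illustrated with a toy check over the per-session resources. The dict layout, the session IDs and the helper name `sessions_without_socket` are made up for the example; only the `socket_id` and `project_id` field names come from this thread.

```python
def sessions_without_socket(resources: dict[str, dict[str, str]]) -> list[str]:
    """List the sessions that hold a project_id but no socket_id."""
    return sorted(
        session_id
        for session_id, fields in resources.items()
        if "project_id" in fields and "socket_id" not in fields
    )


# State after reopening the lid, as described above (IDs are invented):
resources = {
    # the websocket reconnected under the original ID
    "old-session": {"socket_id": "sock-1"},
    # the project was opened under a freshly generated ID
    "new-session": {"project_id": "project-1"},
}

print(sessions_without_socket(resources))  # ['new-session']
```

From the backend's point of view, "new-session" owns a project but has no tab attached, which is exactly the case the GC cleans up.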

Actions

  • On reconnection, the frontend, if it creates a new client_session_id, shall check if the websocket client_session_id is different and output a warning,
  • Ideally the frontend instead of re-creating a client_session_id shall re-use the one from the websocket, and this should fix the problem
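
A rough sketch of the two actions, written in Python for illustration only (the real change would live in the frontend). The function `resolve_client_session_id` and its parameters are hypothetical: it prefers the ID already bound to the websocket, logs a warning when the frontend generated a different one, and only creates a new ID when no websocket ID exists yet.

```python
import logging
import uuid
from typing import Optional

logger = logging.getLogger(__name__)


def resolve_client_session_id(
    websocket_session_id: Optional[str],
    candidate_session_id: Optional[str] = None,
) -> str:
    """Pick the client_session_id to send with subsequent requests."""
    if websocket_session_id is None:
        # No websocket yet: keep the candidate or create a fresh ID.
        return candidate_session_id or str(uuid.uuid4())
    if candidate_session_id is not None and candidate_session_id != websocket_session_id:
        # First action: make the mismatch visible.
        logger.warning(
            "client_session_id mismatch: websocket uses %s, frontend generated %s",
            websocket_session_id,
            candidate_session_id,
        )
    # Second action: always reuse the ID the websocket was opened with.
    return websocket_session_id
```

Under this sketch, a project opened after reconnection would carry the same ID as the socket, so it would land in the same Redis entry as the `socket_id`.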

5 participants