[🐛 Bug]: Session is killed and removed from Session Map in the same second that it receives another request #2129
@alequint, thank you for creating this issue. We will troubleshoot it as soon as we can.
@diemol, have you seen something similar reported to the upstream repo before? I guess there is nothing actionable from the container side.
No, I've never seen that reported. If it truly was an issue, several people would be affected. I don't really know what to do with an issue like this. It feels like they are dumping their environment logs here for us to figure out what the issue is.
@VietND96 @diemol the logs above are mainly to give you a clearer understanding of the issue we are facing. In summary: the Hub detects that the session was killed, releases the slot, deletes the session, and then, in the same second, reports (correctly, we would say) that it is unable to execute a request against that just-removed session. What I don't understand is how a test could still run against a session it just killed. Could some async process be trying to use a session that is no longer in use, by accident, given that it happens in the same second the session was killed?
That happens when you are using a session id that is no longer active.
This is a good indicator that your test code is not working well when executed concurrently. Most likely you are using
💬 Please ask questions at:
Hey guys, @diemol @VietND96, I understand the issue is closed, but I would like to give you more context about this. First, a comment from @diemol:
This is not really the case, because we have separate containers, each running a different client for each test we run, so they are not sharing the same variables or even the same environment, though both point to the same Grid via the remote driver. Another comment from @diemol:
@diemol I have been looking at the backlog of issues here, and I noticed some being reported that are quite similar to what we reported. In all of those cases, however, users have difficulty reproducing them, much like me, because they really happen randomly and there is no clear step-by-step way to reach that bad result. Let me point out some of the issues we noticed:
Given that several people are reporting the same issue and struggle to reproduce it reliably, are there any insights or suggestions you could provide for cases like this? Let me know if we can reopen the issue and continue the work here; we are really focused on this problem and can contribute to what others are also reporting.
Can you share a test script we can use to reproduce the issue reliably? With that, we'd be happy to reopen and have a look at this.
In my view, it looks like this instability appears when autoscaling in K8s; your report here also has autoscaling.enabled: true and scalingType: deployment.
Thanks a lot for the insights @VietND96 ... Some comments about your post above:
I think the Hub will not drain the Node automatically; the message (from UP to DRAIN) means, I believe, that at that time the Node was scaled down by the scaler. Before the Node shuts down, the pod executes a preStop hook to send a drain signal and then waits. If the Node was deprovisioned immediately, I believe there was no session in progress on that Node. Otherwise the Node keeps running with status DRAINING, and the Hub will not assign any new session to it. Done sequentially, this process is fine; I just don't know what happens if both the assign and the drain occur in the same second, as in my comment above.
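For illustration, a minimal sketch of what such a preStop hook could look like on a node pod. This is not the chart's actual script; it assumes the Node listens on port 5555, that the registration secret is available as SE_REGISTRATION_SECRET, and it asks the Node itself to drain (the Node then reports the DRAINING state to the Hub):

```yaml
# Illustrative sketch only - the selenium-grid chart ships its own preStop logic.
lifecycle:
  preStop:
    exec:
      command:
        - bash
        - -c
        - |
          # Ask this Node to drain: finish in-flight sessions, accept no new ones.
          curl -s -X POST "http://localhost:5555/se/grid/node/drain" \
            --header "X-REGISTRATION-SECRET: ${SE_REGISTRATION_SECRET}"
          # Wait until the Node process has exited (its status endpoint stops answering).
          while curl -sf "http://localhost:5555/status" > /dev/null; do
            sleep 1
          done
```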
From my observation of how it works, the difference is between the deployment and job scaling types. You can try scalingType: job.
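For readers of this thread, a minimal sketch of the Helm values change being discussed, limited to the keys mentioned in this issue (check your chart version's values.yaml for the full set of options):

```yaml
# Sketch of the values change discussed above, not a complete values file.
autoscaling:
  enabled: true
  # "deployment" scales node pods via a KEDA ScaledObject;
  # "job" creates node pods as Kubernetes Jobs that serve a session
  # and then drain and terminate (per the discussion in this thread).
  scalingType: job
```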
@VietND96 thanks for your insights above and the detailed explanation, I appreciate the help. Just to give you an answer back: I have configured my scaling type to job. It is too soon to say all our problems are fixed, but so far, in the two runs of our Selenium test pipeline configured like that, I did not see the problem mentioned in this issue anymore. No session issues so far! As this was a random issue that used to happen when tests were running in parallel (normally), we need some more runs to evaluate, but the move to job has given much better results than the previous deployment type. I will keep you posted.
@alequint, thank you for your continuous feedback. Let me consider putting this info into the chart README, so people can be aware of it when dealing with Grid autoscaling on Kubernetes.
@VietND96 one more comment about your last sentence above. I have really noticed the behavior you described. I ran a very simple test that just authenticates in our application, and suddenly 15 new nodes (running as jobs) were created, but only the first one had a session assigned to it (which made it enter the draining state). The other 14 pods were left there and never completed. What is interesting is that if I run a test that just instantiates a remote driver, requesting a session, and then does nothing, the Hub receives the request and assigns it to one of the 14 running jobs. When that happens, that pod enters the draining state, and once the session is killed we end up with 13 leftover pods.

The problem really comes when I run tests that actually do something in the browser, in parallel or one after the other in the pipeline, because one run will create 15 pods, another creates 8, and then I end up with a CPU issue and pods stuck in Pending because of the amount of resources in place. I will try the default strategy that you recommended here: #2068 (comment) (a sketch of that setting follows after this comment). During my tests I also ran into two more weird problems, not related to scaling type, but they cause some trouble for our processes. I will open them as separate issues, just FYI in case you have already seen something similar:
I just want to flag these problems in advance in case you know of an existing topic about them off the top of your head, but I will open individual issues for these two items. Thanks for your support.
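Regarding the "default" scaling strategy referenced above, a hedged sketch of how it might be expressed in the chart values, assuming the chart forwards KEDA ScaledJob options under an autoscaling.scaledJobOptions key (the exact key may differ between chart versions):

```yaml
# Assumption: the chart passes this block through to the KEDA ScaledJob spec.
autoscaling:
  enabled: true
  scalingType: job
  scaledJobOptions:
    scalingStrategy:
      # "default" is the strategy recommended in #2068 to avoid
      # over-creating node jobs for a single new session request.
      strategy: default
```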
Regarding
Can you read through this comment to see if anything is similar? #2093 (comment) - I am myself still trying to find a possible case for that issue.
@VietND96 coincidentally, even after deleting and installing the Helm chart again, I still see the new-session issue. In my screenshot, the right side is the code that runs my test, and the left side is the Hub that was supposed to be contacted. And, at this time, the KEDA operator does not show any error, so it is something else. I will read the comment linked above and let you know, taking advantage of the fact that I have just reproduced the state. Let's see.
Regarding
I believe |
Thanks @VietND96, I will check... (using chart 0.25.0 - I need to upgrade). I've seen two flavors of this Terminating issue.
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
This issue will be fixed in SeleniumHQ/selenium#14282
FYI, image tag |
What happened?
Session is killed and removed from Session Map in the same second that it receives another request.
My Selenium tests are running in a Kubernetes cluster (OpenShift), and they are configured to point to a remote Selenium Grid deployed in the same cluster and installed via Helm chart (using the Chrome node and with autoscaling enabled). Versions and configuration details follow:
Versions
4.17.0-20240123
4.17.0-20240123
AutoScaling as configured in Helm values
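The values block itself is not reproduced here; purely for illustration, a minimal sketch based only on the settings stated elsewhere in this thread (autoscaling enabled, scalingType: deployment, image tag 4.17.0-20240123 - the exact keys may differ by chart version):

```yaml
# Illustrative reconstruction only; the original values block is not shown here.
global:
  seleniumGrid:
    imageTag: 4.17.0-20240123
autoscaling:
  enabled: true
  scalingType: deployment
chromeNode:
  enabled: true
```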
I have noticed the following behaviors:
Here is the exception we have in the test container:
Session Id: 6e5053d746543857588f380cafbe73f1
Then here is what happens in the Selenium Hub from the point 6e5053d746543857588f380cafbe73f1 is created (03:04:48.747) until it has its slot released (03:18:12.023) and is then deleted from the Session Map (03:18:12.023). A few milliseconds later, in the same second (03:18:12.265), the Hub complains that it is unable to process a request against the just-removed session. Later in the Hub logs it keeps trying to process that request and keeps complaining that it is unable to find that session id.
Again, we do not see this problem when the test is executed in isolation. We can reproduce it when other tests are running at the same time (each test runs in a different container, but all point to the Hub remotely). Thoughts?
Complete hub log is attached for further analysis here:
selenium-hub-94b85cd4-b6bvs-selenium-hub.log
Command used to start Selenium Grid with Docker (or Kubernetes)
selenium-values.yml has the versions and "autoscaling" settings mentioned above, but the images are pulled from our internal registry. The template used follows (only the Chrome node is enabled):
Operating System
Kubernetes (OpenShift)
Docker Selenium version (image tag)
4.17.0-20240123
Selenium Grid chart version (chart version)
0.25.0