Load being distributed to missing workers too Version 2.7.4.dev14 #2008

Closed
radhakrishnaakamat opened this issue Feb 11, 2022 · 10 comments · Fixed by #2010

@radhakrishnaakamat

When running Locust in distributed mode in Kubernetes, if a pod restarts, the worker's status gets changed to missing. I can see that spawning distributes the users equally to all workers, but in this case it distributes them to missing workers as well, which leads to a reduced spawning rate.
Kindly help!
I have made some changes to add Kubernetes support natively and have shared the details in the Slack channel. Kindly suggest how we can take this further.

@cyberw
Collaborator

cyberw commented Feb 11, 2022

Hi! @mboutet is the person with the most insight into user distribution, but I think he'll need more to go on.

Perhaps you can write a failing unit test that shows your issue? Look here for inspiration, and run tox or pytest to execute it: https://github.com/locustio/locust/blob/master/locust/test/test_dispatch.py
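
Something along these lines, maybe (a rough, untested sketch that assumes the UsersDispatcher API used in that file: a constructor taking worker_nodes and user_classes, new_dispatch(), remove_worker(), and iteration yielding the per-worker user counts):

import unittest

from locust import User
from locust.dispatch import UsersDispatcher
from locust.runners import WorkerNode


class MyUser(User):
    weight = 1


class TestMissingWorkerDispatch(unittest.TestCase):
    def test_no_users_dispatched_to_removed_worker(self):
        workers = [WorkerNode("1"), WorkerNode("2"), WorkerNode("3")]
        dispatcher = UsersDispatcher(worker_nodes=workers, user_classes=[MyUser])

        # worker "2" goes missing before spawning starts
        dispatcher.remove_worker(workers[1])

        dispatcher.new_dispatch(target_user_count=9, spawn_rate=9)
        dispatcher._wait_between_dispatch = 0  # same trick as in test_dispatch.py
        dispatched = next(dispatcher)

        # the missing worker should not receive any users
        self.assertEqual(0, sum(dispatched.get("2", {}).values()))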

(my opinion has always been that it is ok for locust to fail miserably if workers go missing in the middle of a test, because IMO that test is already invalid :)

@mboutet
Contributor

mboutet commented Feb 13, 2022

@radhakrishnaakamat, I also used Locust in k8s at the previous company I worked for. One of the things I did to prevent restarted pods from going missing forever and reappearing as another worker was to run the workers as a stateful set instead of a deployment. Doing so made Locust less flaky because the worker would reconnect with the same ID.

Also, there's already logic in the master runner that removes missing workers from the dispatcher as you can see here:

locust/locust/runners.py

Lines 849 to 856 in 440c612

if client.heartbeat < 0 and client.state != STATE_MISSING:
    logger.info("Worker %s failed to send heartbeat, setting state to missing." % str(client.id))
    client.state = STATE_MISSING
    client.user_classes_count = {}
    if self._users_dispatcher is not None:
        self._users_dispatcher.remove_worker(client)
        if self.rebalancing_enabled() and self.state == STATE_RUNNING and self.spawning_completed:
            self.start(self.target_user_count, self.spawn_rate)

Unless there's a bug in this logic (which is completely possible), the logic proposed in #2010 might be redundant. Am I missing something?

@radhakrishnaakamat
Author

Closing this issue as it is linked with #2010.

@mboutet
Contributor

mboutet commented Feb 13, 2022

Did you want to link the issue with the PR? If so, then you should simply add "Fixes #<the_issue_number>" to the PR's description.

@radhakrishnaakamat
Author

radhakrishnaakamat commented Feb 13, 2022

> Also, there's already logic in the master runner that removes missing workers from the dispatcher [...] Unless there's a bug in this logic (which is completely possible), the logic proposed in #2010 might be redundant. Am I missing something?

Though the worker is removed from the dispatcher, it is still being used in the distribution. So I did some debugging with print statements (any resource on how to debug a distributed setup locally would help a lot, as I am unable to attach a debugger due to gevent) and found that the redistribution didn't happen; only after adding this logic did it start working. I am trying to optimize my changes to remove the worker from the runner environment too, so that it is also consistent in the UI.

Also, I feel that running as a deployment saves resources (not significantly in this case) compared to a stateful set.

@radhakrishnaakamat
Author

> Did you want to link the issue with the PR? If so, then you should simply add "Fixes #<the_issue_number>" to the PR's description.

Thank you, I was trying the same.

@mboutet
Contributor

mboutet commented Feb 13, 2022

If you're using PyCharm, you need to enable gevent compatibility in the settings.

For debugging the distributed scenario, your best bet is to write integration tests such as the ones in:

class TestMasterWorkerRunners(LocustTestCase):

I don't think you're fixing the issue in the right place. The UsersDispatcher should be alright. The issue is likely due to one or more bugs in the runners.py module.

Try to design integration tests such as the ones in

class TestRemoveWorker(unittest.TestCase):

that will reproduce the issue.
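
For example, something roughly like this (an untested sketch in the spirit of TestMasterWorkerRunners; the patched constants, sleeps and assertions are guesses and will probably need tuning):

import unittest
from unittest import mock

import gevent
from locust import User, constant, task
from locust.env import Environment
from locust.runners import STATE_MISSING


class MyUser(User):
    wait_time = constant(1)

    @task
    def noop(self):
        pass


class TestMissingWorkerIntegration(unittest.TestCase):
    def test_no_users_sent_to_missing_worker(self):
        # speed up heartbeats and worker reports so the test runs quickly
        with mock.patch("locust.runners.HEARTBEAT_INTERVAL", new=0.1), mock.patch(
            "locust.runners.WORKER_REPORT_INTERVAL", new=0.3
        ):
            master_env = Environment(user_classes=[MyUser])
            master = master_env.create_master_runner("*", 0)

            workers = []
            for _ in range(3):
                worker_env = Environment(user_classes=[MyUser])
                workers.append(worker_env.create_worker_runner("127.0.0.1", master.server.port))
            gevent.sleep(1)  # give the workers time to connect

            # kill one worker's greenlets so it stops sending heartbeats
            workers[0].greenlet.kill(block=True)
            gevent.sleep(2)
            missing = master.clients[workers[0].client_id]
            self.assertEqual(STATE_MISSING, missing.state)

            # all users should end up on the two remaining workers
            master.start(30, spawn_rate=30)
            gevent.sleep(3)
            self.assertEqual(0, missing.user_count)
            self.assertEqual(30, master.user_count)

            master.quit()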


That being said, I experienced various issues like the ones you're seeing when running in k8s at my last job. I made a few "improvements" on my fork at master...mboutet:mboutet-master (I did not update the tests, so they are failing at the moment).

The main changes are:

  • Add the LOCUST_WORKER_ID_OVERRIDE environment variable on the worker so that a worker running in a pod can keep the same worker id across pod restarts. This needs to be combined with workers running in a stateful set.

  • Implement a periodic rebalance background job (enabled using the --enable-rebalancing flag). This will rebalance the users every 10s.

  • Implement a rebalance_work_queue (using gevent.queue.Queue()) that queues all the requested rebalances from the various places in the master. This queue prevents multiple rebalances from conflicting with each other; instead, they are processed on a FIFO basis (rough sketch below).
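
Roughly, the idea looks like this (an illustrative sketch only, not the actual code from my fork; rebalance stands in for whatever the master does to redistribute users):

import gevent
from gevent.queue import Queue

REBALANCE_INTERVAL = 10  # seconds, the interval used by --enable-rebalancing

rebalance_work_queue = Queue()


def request_rebalance(reason):
    # called from the various places in the master that currently trigger
    # a redistribution directly (worker missing, worker added, ...)
    rebalance_work_queue.put(reason)


def periodic_rebalancer():
    # the periodic background job: it only enqueues a request every 10s
    while True:
        gevent.sleep(REBALANCE_INTERVAL)
        request_rebalance("periodic")


def rebalance_consumer(rebalance):
    # single consumer: requests are applied one at a time, in FIFO order,
    # so two rebalances can never run concurrently and conflict
    while True:
        reason = rebalance_work_queue.get()  # blocks until a request arrives
        rebalance()  # e.g. runner.start(runner.target_user_count, runner.spawn_rate)


# gevent.spawn(periodic_rebalancer)
# gevent.spawn(rebalance_consumer, some_rebalance_callable)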

Obviously, these changes are highly experimental, but I found that they made Locust a lot more stable in a k8s environment with a lot of dynamic events such as restarting pods and auto-scaling workers.

Perhaps you could try my fork in your environment?

@radhakrishnaakamat
Author

> That being said, I experienced various issues like the ones you're seeing when running in k8s at my last job. I made a few "improvements" on my fork at master...mboutet:mboutet-master [...] Perhaps you could try my fork in your environment?

Thank you, this is really helpful.

@b-heimann-senacor

> Add the LOCUST_WORKER_ID_OVERRIDE environment variable on the worker so that a worker running in a pod can keep the same worker id across pod restarts. This needs to be combined with workers running in a stateful set.

I couldn't find any information on this environment variable. Is it still supported?

Currently, I am facing an issue where occasionally a Locust worker loses its connection to the master and continues executing requests forever as an orphan.

@cyberw
Collaborator

cyberw commented Nov 13, 2023

Disconnected workers should probably stop themselves after a while (maybe 60s?) but that is a different issue.
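
Purely to illustrate the idea (hypothetical code, not anything that exists in Locust today; note_master_contact and shutdown are made-up hooks): a worker could track when it last heard from the master and stop itself after a timeout.

import time

import gevent

MASTER_TIMEOUT = 60  # seconds without any message from the master

last_master_contact = time.time()


def note_master_contact():
    # hypothetical hook: call this whenever the worker receives any
    # message (or heartbeat response) from the master
    global last_master_contact
    last_master_contact = time.time()


def orphan_watchdog(shutdown):
    # shutdown is a hypothetical callable that stops the users and exits the worker
    while True:
        gevent.sleep(5)
        if time.time() - last_master_contact > MASTER_TIMEOUT:
            shutdown()
            return


# gevent.spawn(orphan_watchdog, shutdown=some_worker_shutdown_function)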
