Distributed test stopped despite workers running #1707

Closed
ghost opened this issue Feb 18, 2021 · 14 comments · Fixed by #1710

@ghost

ghost commented Feb 18, 2021

Describe the bug

When executing a distributed load test in which a worker node may not send its heartbeat in time (the heartbeat interval is not configurable anymore) because of CPU- and/or I/O-intensive tasks, the whole test can be stopped even though the workers are fine and just busy.

Expected behavior

The test continues to run.

Actual behavior

All workers are stopped by the master after the following messages:

[2021-02-18 10:58:33,241] 7c22a81c40a0/INFO/locust.runners: Worker ffaeb7471fb6_898127e830cc4c7487b6674f88b045fc failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:33,241] 7c22a81c40a0/INFO/locust.runners: Worker da57de88394e_76e79054084547768aa00e0adba033bf failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:33,241] 7c22a81c40a0/INFO/locust.runners: Worker d3c53c424e43_1644c39706a44f118090761360c76fe1 failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:34,241] 7c22a81c40a0/INFO/locust.runners: Worker 95ac4f8adc8a_ce1bb0094a494d7f8a0540ebab54e105 failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:34,242] 7c22a81c40a0/INFO/locust.runners: Worker c08fe40ccea4_b63905214f7846cea3f18cf529cb8767 failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:35,242] 7c22a81c40a0/INFO/locust.runners: Worker 8784191196a2_9a136837b45d468cb46b0edaa3c3697c failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:35,243] 7c22a81c40a0/INFO/locust.runners: Worker a5c9ae640c92_1f323758e42c4b609fa5a050c28bac50 failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:36,243] 7c22a81c40a0/INFO/locust.runners: Worker 3f2bf4b8fc3f_90f59e8d0efb4e6fb424fb2e95c8c50a failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:36,243] 7c22a81c40a0/INFO/locust.runners: The last worker went missing, stopping test.

After logging some more internals, it became evident that the calculation is simply wrong:

... The last worker went missing, stopping test (workers: 15, missing: 15).

...where:

  • workers = self.worker_count (despite actually running 30 workers in my case)
  • missing = len(self.clients.missing)

...however, self.worker_count does not include missing clients in the first place, so the subtraction effectively counts them twice and the condition triggers as soon as half of the workers have gone missing:

if self.worker_count - len(self.clients.missing) <= 0:

So, either self.worker_count needs to include missing clients, or the condition should be changed to this instead:

if self.worker_count <= 0:
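
For illustration, here is a minimal standalone sketch of the two checks using the numbers from the log above (the variable names mirror the snippet; the surrounding runner logic is simplified and is not Locust's actual code):

# Simplified sketch, not Locust's runner code; numbers taken from the log above.
connected = 15  # workers still heartbeating; mirrors self.worker_count,
                # which already excludes missing clients
missing = 15    # mirrors len(self.clients.missing)

# Buggy check: subtracts the missing workers from a count that already
# excludes them, so it fires once half of the workers go missing.
if connected - missing <= 0:
    print("buggy check: stopping test")

# Proposed check: only fires when no connected workers remain.
if connected <= 0:
    print("fixed check: stopping test")
else:
    print("fixed check: test keeps running")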

Steps to reproduce

Create a load test that has a CPU-intensive task that runs for more than 3 seconds on each worker.
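
A minimal locustfile along these lines should be enough to trigger it (the busy-loop duration and task body are illustrative assumptions, not taken from the original report):

# reproduce_cpu_block.py - illustrative only; values are assumptions
import time
from locust import User, constant, task

class CpuHeavyUser(User):
    wait_time = constant(1)

    @task
    def burn_cpu(self):
        # Busy-loop for ~4 seconds without yielding to gevent, so the
        # worker cannot answer the master's heartbeat in time.
        end = time.monotonic() + 4
        while time.monotonic() < end:
            pass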

Environment

  • OS: Ubuntu 20.04 LTS
  • Python version: 3.8
  • Locust version: 1.4.3
  • Locust command line that you ran: docker-compose up --scale worker=30 (see docker-compose file below)
  • Locust file contents (anonymized if necessary): the one I have is too complex to include here

Docker compose file:

version: "3.7"

x-base-service: &base_service
  image: "locustio/locust:latest"
  restart: "no"
  volumes:
    - ./:/mnt/tests:ro
  working_dir: "/mnt/tests"

services:
  locust-master:
    <<: *base_service
    container_name: locust-master
    command: [
      "--master",
      "--headless",
      "--locustfile", "/mnt/tests/${LOCUST_FILE:?Locustfile not specified}",
      "--users", "${NUM_USERS:-10}",
      "--spawn-rate", "${SPAWN_RATE:-7}",
      "--run-time", "${RUN_TIME:-5m}",
      "--stop-timeout", "${STOP_TIMEOUT:-60}",
      "--expect-workers", "${LOCUST_WORKERS:-1}",
      "--host", "${LOCUST_TARGET:?No test target host specified}"
    ]

  worker:
    <<: *base_service
    command: [
      "--worker",
      "--master-host", "locust-master",
      "--locustfile", "/mnt/tests/${LOCUST_FILE:?Locustfile not specified}",
      "--users", "${NUM_USERS:-10}",
      "--spawn-rate", "${SPAWN_RATE:-7}",
      "--host", "${LOCUST_TARGET:?No test target host specified}"
    ]

.env file for the specific run:

LOCUST_FILE=<redacted>
LOCUST_WORKERS=30
NUM_USERS=240
SPAWN_RATE=30
RUN_TIME=15m
STOP_TIMEOUT=60
LOCUST_TARGET=<redacted>
@ghost ghost added the bug label Feb 18, 2021
@cyberw
Collaborator

cyberw commented Feb 19, 2021

You may be right, but any code that blocks the worker for a significant amount of time (without doing async I/O or sleeping, and thus yielding control) must be considered out of scope for Locust, as it will make response times wrong for other concurrent Users.

If you can make a PR (and maybe a test case) then I would be happy to merge it; just know that even then your test will be in kind of a bad place.
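
As a rough sketch of one way to keep such a task from starving the worker's greenlets, the CPU-heavy work can be split into chunks that yield between them; expensive_transform and the chunk size here are made-up stand-ins, not anything from this thread:

import hashlib

import gevent


def expensive_transform(item):
    # Stand-in for CPU-bound work (hypothetical example).
    return hashlib.sha256(str(item).encode() * 1000).hexdigest()


def cpu_heavy_work(items):
    results = []
    for i, item in enumerate(items):
        results.append(expensive_transform(item))
        if i % 100 == 0:
            gevent.sleep(0)  # yield so other greenlets (e.g. the heartbeat sender) can run
    return results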

@ghost
Author

ghost commented Feb 22, 2021

Thanks for the clarification. I haven't gone that deep into the per-worker concurrency strategy (yet).

This issue is not about a long-running task causing the worker to be reported as "missing" per se, but simply about the fact that it only takes half of the workers going missing to tear down the whole test run.

In the example above, I have 30 workers, and as soon as 15 are reported as "missing", the other 15 workers, which are running just fine, are stopped by the master and the test run ends immediately.

If this is intended behavior, then it may not be a bug but rather a configuration enhancement request.

@cyberw
Collaborator

cyberw commented Feb 22, 2021

Ah, I didn't read your initial issue thoroughly enough. If losing only half of the workers makes the test shut down, then that is definitely a bug. I don't have time to look into it myself though...

@ghost
Author

ghost commented Feb 22, 2021

I have created (and tested with docker-compose) the necessary modifications in PR #1710, if you'd like to take a look.

@cyberw
Collaborator

cyberw commented Feb 22, 2021

Cool stuff! Would it be possible to add a unit test for it or is it too hard?

@ghost
Author

ghost commented Feb 22, 2021

Sure, I'll add a unit test. I haven't had time to look into the test setup itself yet.

ghost pushed a commit to eNote-GmbH/locust that referenced this issue Feb 25, 2021
ghost pushed a commit to eNote-GmbH/locust that referenced this issue Feb 25, 2021
ghost pushed a commit to eNote-GmbH/locust that referenced this issue Feb 25, 2021
@ghost
Author

ghost commented Feb 25, 2021

So, I finally added a test to reproduce the issue (mainly by fixing and enhancing an existing test). I rebased the commits to verify the failure before the fix commit.

@gauravgola96

Using locust 1.4.4

[2021-04-13 17:39:38,486] gauravs-MacBook-Pro.local/INFO/locust.runners: Worker gauravs-MacBook-Pro.local_00d5a80398254a89bfee9fc5148aeb2a failed to send heartbeat, setting state to missing.
[2021-04-13 17:39:38,486] gauravs-MacBook-Pro.local/INFO/locust.runners: Worker gauravs-MacBook-Pro.local_f10c89fc41e543db8ed2b8b5d97c285c failed to send heartbeat, setting state to missing.
[2021-04-13 17:39:38,486] gauravs-MacBook-Pro.local/INFO/locust.runners: Worker gauravs-MacBook-Pro.local_d1f5a519e3e4448a8a3cd783687b5c18 failed to send heartbeat, setting state to missing.
[2021-04-13 17:39:38,486] gauravs-MacBook-Pro.local/INFO/locust.runners: Worker gauravs-MacBook-Pro.local_af7ef6e4860f49cd8dfcb6b173ab2dbc failed to send heartbeat, setting state to missing.
[2021-04-13 17:39:38,487] gauravs-MacBook-Pro.local/INFO/locust.runners: Worker gauravs-MacBook-Pro.local_5dad29f6107d4ed383b2b4115df2aa7d failed to send heartbeat, setting state to missing.
[2021-04-13 17:39:38,487] gauravs-MacBook-Pro.local/INFO/locust.runners: Worker gauravs-MacBook-Pro.local_f0e025db99054ecbaff26156d923a1c2 failed to send heartbeat, setting state to missing.
[2021-04-13 17:39:38,487] gauravs-MacBook-Pro.local/INFO/locust.runners: Worker gauravs-MacBook-Pro.local_0b9a1df330d9452d81fac9d0f68844d8 failed to send heartbeat, setting state to missing.
[2021-04-13 17:39:38,487] gauravs-MacBook-Pro.local/INFO/locust.runners: Worker gauravs-MacBook-Pro.local_a26a25aa1b504d29bb218df3d0e4ea15 failed to send heartbeat, setting state to missing.
[2021-04-13 17:39:38,487] gauravs-MacBook-Pro.local/INFO/locust.runners: The last worker went missing, stopping test.

Avg response time is around 2 sec for the task.

master.conf:

locustfile = test_clients/load_test_grpc.py
host = http://0.0.0.0:8000
users = 3000
spawn-rate = 4
web-port = 8089
run-time = 10m
headless = true
master = true
expect-workers = 8

After running for some requests, it gets stuck.

@roquemoyano-tc

I have the same issue. I have tried different versions of Locust, from 1.5.* to 2.0.0*, and I'm always getting:

Worker locust-worker-7f567764d7-5qtv7_0c6224c0bffe4faa89641944ddac097d failed to send heartbeat, setting state to missing.

I'm running it in Kubernetes.

@amaanupstox

@roquemoyano-tc did you find the solution?
I'm also facing the same issue

@roquemoyano-tc

@roquemoyano-tc did you find the solution?
I'm also facing the same issue

No, I didn't. I have tried in AKS and in minikube and I'm getting the same issue. I hope someone here can help.

@cyberw
Collaborator

cyberw commented Aug 11, 2021

What do the worker logs say?

@cyberw
Collaborator

cyberw commented Aug 11, 2021

Actually, you should probably open a new ticket. This ticket was about Locust shutting down despite some workers still running (and connected).

@roquemoyano-tc

Yes, I have opened #1843.
