Distributed test stopped despite workers running #1707

Closed
ghost opened this issue Feb 18, 2021 · 14 comments · Fixed by #1710

@ghost

ghost commented Feb 18, 2021

Describe the bug

When executing a distributed load test in which a worker node may not send its heartbeat in time (the heartbeat interval is not configurable anymore) because of CPU- and/or I/O-intensive tasks, the whole test can be stopped even though the workers are fine and just busy.

Expected behavior

The test continues to run.

Actual behavior

All workers are stopped by the master after the following messages:

[2021-02-18 10:58:33,241] 7c22a81c40a0/INFO/locust.runners: Worker ffaeb7471fb6_898127e830cc4c7487b6674f88b045fc failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:33,241] 7c22a81c40a0/INFO/locust.runners: Worker da57de88394e_76e79054084547768aa00e0adba033bf failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:33,241] 7c22a81c40a0/INFO/locust.runners: Worker d3c53c424e43_1644c39706a44f118090761360c76fe1 failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:34,241] 7c22a81c40a0/INFO/locust.runners: Worker 95ac4f8adc8a_ce1bb0094a494d7f8a0540ebab54e105 failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:34,242] 7c22a81c40a0/INFO/locust.runners: Worker c08fe40ccea4_b63905214f7846cea3f18cf529cb8767 failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:35,242] 7c22a81c40a0/INFO/locust.runners: Worker 8784191196a2_9a136837b45d468cb46b0edaa3c3697c failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:35,243] 7c22a81c40a0/INFO/locust.runners: Worker a5c9ae640c92_1f323758e42c4b609fa5a050c28bac50 failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:36,243] 7c22a81c40a0/INFO/locust.runners: Worker 3f2bf4b8fc3f_90f59e8d0efb4e6fb424fb2e95c8c50a failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:36,243] 7c22a81c40a0/INFO/locust.runners: The last worker went missing, stopping test.

After logging some more internals, it became evident that the calculation is simply wrong:

... The last worker went missing, stopping test (workers: 15, missing: 15).

...where:

  • workers = self.worker_count (despite actually running 30 workers in my case)
  • missing = len(self.clients.missing)

...however, self.worker_count does not include missing clients in the first place, so the subtraction effectively counts them twice and the condition triggers as soon as half of the workers have gone missing:

if self.worker_count - len(self.clients.missing) <= 0:

So, either self.worker_count needs to include missing clients, or the condition should be changed to this instead:

if self.worker_count <= 0:
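
For illustration, here is a minimal standalone sketch of the two checks using the numbers from the log above (the variable names mirror the snippet; the surrounding runner logic is simplified and is not Locust's actual code):

# Simplified sketch, not Locust's runner code; numbers taken from the log above.
connected = 15  # workers still heartbeating; mirrors self.worker_count,
                # which already excludes missing clients
missing = 15    # mirrors len(self.clients.missing)

# Buggy check: subtracts the missing workers from a count that already
# excludes them, so it fires once half of the workers go missing.
if connected - missing <= 0:
    print("buggy check: stopping test")

# Proposed check: only fires when no connected workers remain.
if connected <= 0:
    print("fixed check: stopping test")
else:
    print("fixed check: test keeps running")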

Steps to reproduce

Create a load test that has a CPU-intensive task that runs for more than 3 seconds on each worker.
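
A minimal locustfile along these lines should be enough to trigger it (the busy-loop duration and task body are illustrative assumptions, not taken from the original report):

# reproduce_cpu_block.py - illustrative only; values are assumptions
import time
from locust import User, constant, task

class CpuHeavyUser(User):
    wait_time = constant(1)

    @task
    def burn_cpu(self):
        # Busy-loop for ~4 seconds without yielding to gevent, so the
        # worker cannot answer the master's heartbeat in time.
        end = time.monotonic() + 4
        while time.monotonic() < end:
            pass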

Environment

  • OS: Ubuntu 20.04 LTS
  • Python version: 3.8
  • Locust version: 1.4.3
  • Locust command line that you ran: docker-compose up --scale worker=30 (see docker-compose file below)
  • Locust file contents (anonymized if necessary): the one I have is too complex to include here

Docker compose file:

version: "3.7"

x-base-service: &base_service
  image: "locustio/locust:latest"
  restart: "no"
  volumes:
    - ./:/mnt/tests:ro
  working_dir: "/mnt/tests"

services:
  locust-master:
    <<: *base_service
    container_name: locust-master
    command: [
      "--master",
      "--headless",
      "--locustfile", "/mnt/tests/${LOCUST_FILE:?Locustfile not specified}",
      "--users", "${NUM_USERS:-10}",
      "--spawn-rate", "${SPAWN_RATE:-7}",
      "--run-time", "${RUN_TIME:-5m}",
      "--stop-timeout", "${STOP_TIMEOUT:-60}",
      "--expect-workers", "${LOCUST_WORKERS:-1}",
      "--host", "${LOCUST_TARGET:?No test target host specified}"
    ]

  worker:
    <<: *base_service
    command: [
      "--worker",
      "--master-host", "locust-master",
      "--locustfile", "/mnt/tests/${LOCUST_FILE:?Locustfile not specified}",
      "--users", "${NUM_USERS:-10}",
      "--spawn-rate", "${SPAWN_RATE:-7}",
      "--host", "${LOCUST_TARGET:?No test target host specified}"
    ]

.env file for the specific run:

LOCUST_FILE=<redacted>
LOCUST_WORKERS=30
NUM_USERS=240
SPAWN_RATE=30
RUN_TIME=15m
STOP_TIMEOUT=60
LOCUST_TARGET=<redacted>
@ghost ghost added the bug label Feb 18, 2021
@cyberw
Collaborator

cyberw commented Feb 19, 2021

You may be right, but any code that blocks the worker for a significant amount of time (without doing async I/O or sleeping, and thus yielding control) must be considered out of scope for Locust, as it will make response times wrong for other concurrent Users.

If you can make a PR (and maybe a test case) then I would be happy to merge it; just know that even then your test will be in kind of a bad place.
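
As a rough sketch of one way to keep such a task from starving the worker's greenlets, the CPU-heavy work can be split into chunks that yield between them; expensive_transform and the chunk size here are made-up stand-ins, not anything from this thread:

import hashlib

import gevent


def expensive_transform(item):
    # Stand-in for CPU-bound work (hypothetical example).
    return hashlib.sha256(str(item).encode() * 1000).hexdigest()


def cpu_heavy_work(items):
    results = []
    for i, item in enumerate(items):
        results.append(expensive_transform(item))
        if i % 100 == 0:
            gevent.sleep(0)  # yield so other greenlets (e.g. the heartbeat sender) can run
    return results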

@ghost
Author

ghost commented Feb 22, 2021

Thanks for the clarification. I haven't gone that deep into the per-worker concurrency strategy (yet).

This issue is not about a long-running task causing the worker to be reported as "missing" per se, but simply about the fact that it only takes half of the workers going missing to tear down the whole test run.

In the example above, I have 30 workers, and as soon as 15 are reported as "missing", the other 15 workers, which are running just fine, are stopped by the master and the test run ends immediately.

If this is intended behavior, then it may not be a bug but rather a configuration enhancement request.

@cyberw
Collaborator

cyberw commented Feb 22, 2021

Ah, I didn't read your initial issue thoroughly enough. If losing only half of the workers makes the test shut down, then that is definitely a bug. I don't have time to look into it myself though...

@ghost
Author

ghost commented Feb 22, 2021

I have created (and tested with docker-compose) the necessary modifications in PR #1710, if you'd like to take a look.

@cyberw
Collaborator

cyberw commented Feb 22, 2021

Cool stuff! Would it be possible to add a unit test for it or is it too hard?

@ghost
Author

ghost commented Feb 22, 2021

Sure, I'll add a unit test. I haven't had time to look into the test setup itself yet.

ghost pushed a commit to eNote-GmbH/locust that referenced this issue Feb 25, 2021
ghost pushed a commit to eNote-GmbH/locust that referenced this issue Feb 25, 2021
ghost pushed a commit to eNote-GmbH/locust that referenced this issue Feb 25, 2021
@ghost
Author

ghost commented Feb 25, 2021

So, I finally added a test to reproduce the issue (mainly by fixing and enhancing an existing test). I rebased the commits to verify the failure before the fix commit.

@gauravgola96

Using locust 1.4.4

[2021-04-13 17:39:38,486] gauravs-MacBook-Pro.local/INFO/locust.runners: Worker gauravs-MacBook-Pro.local_00d5a80398254a89bfee9fc5148aeb2a failed to send heartbeat, setting state to missing.
[2021-04-13 17:39:38,486] gauravs-MacBook-Pro.local/INFO/locust.runners: Worker gauravs-MacBook-Pro.local_f10c89fc41e543db8ed2b8b5d97c285c failed to send heartbeat, setting state to missing.
[2021-04-13 17:39:38,486] gauravs-MacBook-Pro.local/INFO/locust.runners: Worker gauravs-MacBook-Pro.local_d1f5a519e3e4448a8a3cd783687b5c18 failed to send heartbeat, setting state to missing.
[2021-04-13 17:39:38,486] gauravs-MacBook-Pro.local/INFO/locust.runners: Worker gauravs-MacBook-Pro.local_af7ef6e4860f49cd8dfcb6b173ab2dbc failed to send heartbeat, setting state to missing.
[2021-04-13 17:39:38,487] gauravs-MacBook-Pro.local/INFO/locust.runners: Worker gauravs-MacBook-Pro.local_5dad29f6107d4ed383b2b4115df2aa7d failed to send heartbeat, setting state to missing.
[2021-04-13 17:39:38,487] gauravs-MacBook-Pro.local/INFO/locust.runners: Worker gauravs-MacBook-Pro.local_f0e025db99054ecbaff26156d923a1c2 failed to send heartbeat, setting state to missing.
[2021-04-13 17:39:38,487] gauravs-MacBook-Pro.local/INFO/locust.runners: Worker gauravs-MacBook-Pro.local_0b9a1df330d9452d81fac9d0f68844d8 failed to send heartbeat, setting state to missing.
[2021-04-13 17:39:38,487] gauravs-MacBook-Pro.local/INFO/locust.runners: Worker gauravs-MacBook-Pro.local_a26a25aa1b504d29bb218df3d0e4ea15 failed to send heartbeat, setting state to missing.
[2021-04-13 17:39:38,487] gauravs-MacBook-Pro.local/INFO/locust.runners: The last worker went missing, stopping test.

Avg response time is around 2 sec for the task.

master.conf:

locustfile = test_clients/load_test_grpc.py
host = http://0.0.0.0:8000
users = 3000
spawn-rate = 4
web-port = 8089
run-time = 10m
headless = true
master = true
expect-workers = 8

After running for some requests, it gets stuck.

@roquemoyano-tc

I have the same issue. I have tried different versions of Locust, from 1.5.* to 2.0.0*, and I'm always getting:

Worker locust-worker-7f567764d7-5qtv7_0c6224c0bffe4faa89641944ddac097d failed to send heartbeat, setting state to missing.

I'm running it in Kubernetes.

@amaanupstox

@roquemoyano-tc did you find the solution?
I'm also facing the same issue

@roquemoyano-tc

@roquemoyano-tc did you find the solution?
I'm also facing the same issue

No, I didn't. I have tried in AKS and in minikube and I'm getting the same issue. I hope someone here can help.

@cyberw
Collaborator

cyberw commented Aug 11, 2021

What do the worker logs say?

@cyberw
Collaborator

cyberw commented Aug 11, 2021

Actually, you should probably open a new ticket. This ticket was about Locust shutting down despite some workers still running (and connected).

@roquemoyano-tc

Yes, I have opened #1843.
