Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ACK for worker connection #2044

Closed
marcinh opened this issue Mar 7, 2022 · 15 comments
Closed

Add ACK for worker connection #2044

marcinh opened this issue Mar 7, 2022 · 15 comments

Comments

@marcinh
Copy link
Contributor

marcinh commented Mar 7, 2022

Is your feature request related to a problem? Please describe.

Currently there is no mechanism of confirmation that worker is successfully connected to master and no retry mechanism in case master didn't acknowledge worker's presence. So when there is any error in communication worker does not know about it and keeps sending reports to master while master keeps discarding those reports

Describe the solution you'd like

Implement ACK mechanism together with retry. Once worker sends 'client_ready' message, master would respond with e.g. 'client_ready_ack' message so worker knows it was successfully connected. If worker fails to receive ACK message within specified time it could retry (n-times, to reconsider number of retries as well) the operation.

Describe alternatives you've considered

--

Additional context

We are using Locust extensively and we observe number of failure in our tests due to lack of workers connected (network problems or issue) and such ack + retry mechanism would improve our executions

@cyberw
Copy link
Collaborator

cyberw commented Mar 7, 2022

Are you sure this is what is your underlying problem? A glitch when connecting is of course possible, but it should be very rare.

Sounds like a reasonable feature though. But you’ll most likely have to build it yourself.

@marcinh
Copy link
Contributor Author

marcinh commented Mar 7, 2022

If you approve the feature request, we will probably work on it :)

@cyberw
Copy link
Collaborator

cyberw commented Mar 7, 2022

Awesome, go for it! Maybe start by writing a test case that illustrates the problem that you're trying to solve.

@marcinh
Copy link
Contributor Author

marcinh commented Mar 8, 2022

What about something like this for success:

    def test_worker_connect_success(self):
        class MyTestUser(User):
            @task
            def the_task(self):
                pass

        with mock.patch("locust.runners.CONNECTION_TIMEOUT", new=1):
            with mock.patch("locust.rpc.rpc.Client", mocked_rpc()) as client:
                worker = self.get_runner(environment=Environment(), user_classes=[MyTestUser])
                client.mocked_send(Message('client_ready_ack', {}, worker.client_id))
                self.assertEqual(1, len(client.outbox))
                self.assertEqual('client_ready', client.outbox[0].type)
                self.assertTrue(worker.connected)

and a failure case:

    def test_worker_connect_failure(self):
        class MyTestUser(User):
            @task
            def the_task(self):
                pass

        with mock.patch("locust.runners.CONNECTION_TIMEOUT", new=0.1):
            with mock.patch("locust.runners.CONNECTION_RETRY", new=1):
                with mock.patch("locust.rpc.rpc.Client", mocked_rpc()) as client:
                    worker = self.get_runner(environment=Environment(), user_classes=[MyTestUser])
                    sleep(0.3)
                    self.assertEqual(2, len(client.outbox))
                    self.assertFalse(worker.connected)

@cyberw
Copy link
Collaborator

cyberw commented Mar 8, 2022

Looks good, but does it actually test the retry code itself?

@marcinh
Copy link
Contributor Author

marcinh commented Mar 23, 2022

It checks for a number of messages in outbox - without the retry there would be only 1

@cyberw
Copy link
Collaborator

cyberw commented Mar 23, 2022

Ah, I see. So it tests that retries are done, but not that they help? Maybe that is the best we can do..

@Nosibb
Copy link

Nosibb commented Apr 15, 2022

Hi @cyberw
Is it ok to throw an exception when a worker cannot connect to the master?

Failure case for this approach:

def test_worker_connect_failure(self):
    class MyTestUser(User):
        @task
        def the_task(self):
            pass

    with mock.patch("locust.runners.CONNECTION_TIMEOUT", new=0.01):
        with mock.patch("locust.runners.CONNECTION_RETRY_COUNT", new=1):
            with mock.patch("locust.rpc.rpc.Client", mocked_rpc()) as client:
                try:
                    self.get_runner(environment=Environment(), user_classes=[MyTestUser])
                except ConnectionError:
                    self.assertEqual(2, len(client.outbox))

@cyberw
Copy link
Collaborator

cyberw commented Apr 15, 2022

I guess? The test case should also verify that the exception is really thrown I think. other than that it looks very reasonable.

@github-actions
Copy link

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale Issue had no activity. Might still be worth fixing, but dont expect someone else to fix it label Jun 15, 2022
@cyberw cyberw removed the stale Issue had no activity. Might still be worth fixing, but dont expect someone else to fix it label Jun 15, 2022
@cyberw cyberw closed this as completed Jun 15, 2022
@alvaro-ajv
Copy link

alvaro-ajv commented Jun 28, 2022

@marcinh Can you please add flags to set a custom time to these variables CONNECT_TIMEOUT and CONNECT_RETRY_COUNT as is implemented in the master to wait for the workers with this flag: --expect-workers-max-wait, so if the flag is present it will wait the time and if not it will work as in the past. This because sometimes the workers can start first and wait for master, specially if the implementation is in Kubernetes.

@cyberw
Copy link
Collaborator

cyberw commented Jun 28, 2022

I dont understand. What is your problem exactly?

@alvaro-ajv
Copy link

alvaro-ajv commented Jun 28, 2022

@cyberw Now with this feature if we start the worker before the master or the master is not ready for any reason, the worker won't connect with the master if the connection timeout excedes, because the CONNECT_TIMEOUT variable is waiting a max of 5 seconds and with 2 retries, that means every worker will wait a max of 15 seconds, this value is completely fixed, so what I'm requesting is a command flag similar to --expect-workers-max-wait in master. In locust version 2.9 the workers wait until the master is available.

@cyberw
Copy link
Collaborator

cyberw commented Jun 28, 2022

Ah. We should probably just change CONNECT_RETRY_COUNT to default to unlimited or something. I didnt fullt understand this behaviour.

@cyberw
Copy link
Collaborator

cyberw commented Jun 28, 2022

I dont usually make PR:s on my phone but here goes :) #2125

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants