
Issue: etcd stops answering after ~250 concurrent HTTP API clients. #11826

Closed
schlitzered opened this issue Apr 28, 2020 · 14 comments

@schlitzered
Contributor

hey,

we are currently trying to adopt etcd for service discovery.

we are running a 3 node etcd cluster using "etcd-v3.4.7-linux-amd64".

our applications talk to etcd using the HTTP v3 API.

we noticed that the etcd cluster locks up after reaching a "high" number of watchers. we were able to reproduce this with ~250 clients per etcd node.

the only way to recover from this situation is to restart the whole cluster.

we currently suspect that we are triggering some kind of bug in etcd, since the cluster only recovers after a restart.

@tangcong
Contributor

can you provide more info? such as a metrics file? how do you reproduce it? how do you use watchers? thanks.

@schlitzered
Contributor Author

reproduce:

basically each client logs into etcd using username/password and starts watching a single prefix. the only unusual part is that there are hundreds of clients.

i am sorry that i cannot provide code to reproduce this right now, because this is embedded in a C++ application. i am currently talking to our devs to see if they can provide the exact calls that are made in the code.

what do you mean by "metrics file"? are you talking about the metrics endpoint?

@schlitzered
Contributor Author

schlitzered commented Apr 29, 2020

okay, i was able to reproduce this issue with some python code & requests:

import json
import requests
import threading
import time

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)


def run(number):
    # authenticate with username/password to obtain a token for this client
    token = requests.post(
        url='https://etcd-svc.example.com:2379/v3/auth/authenticate',
        data=json.dumps({"name": "root", "password": "root_password"}),
        verify=False
    )
    token = token.json()['token']
    print("token is: {0}".format(token))
    # open a streaming watch on a key range, authenticated with the token
    stream = requests.post(
        url='https://etcd-svc.example.com:2379/v3/watch',
        data=json.dumps({
            "create_request": {
                "key": "L2FhLw==",
                "range_end": "L3h4Lw==",
                "serializable": True
            }
        }),
        headers={'Authorization': token},
        verify=False,
        stream=True
    )
    # block on the watch stream and report every event received
    for line in stream.iter_lines():
        if line:
            decoded_line = line.decode('utf-8')
            print("Thread {0} got something".format(number))


if __name__ == '__main__':
    # start one watcher thread every 100ms
    for i in range(500):
        thread = threading.Thread(target=run, args=(i,))
        thread.start()
        print("thread {0} is up".format(i))
        time.sleep(0.1)

with 500 clients, the HTTP part of the API crashes.

i am no longer able to log in to the node using curl:

curl -L https://$(hostname):2379/v3/auth/authenticate -X POST -d '{"name": "root", "password": "root_password"}'

this will time out at some point.

the gRPC API is not affected by this:

calling "./etcdctl member list" and other commands still works.

the interesting part is that some of the HTTP connections initiated by this dummy client are still receiving updates, and some are not.

also, as soon as i stop the python script, curl requests to the affected etcd node work again.
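(For reference: the etcd v3 HTTP gateway expects keys base64-encoded, so the key/range_end values in the script above decode to a plain key range. A quick check using only the standard library:)

```python
import base64

# the etcd v3 HTTP gateway takes keys as base64; decode the watched range
key = base64.b64decode("L2FhLw==").decode()        # start of the watched range
range_end = base64.b64decode("L3h4Lw==").decode()  # end of the watched range
print(key, range_end)  # → /aa/ /xx/
```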

@schlitzered
Contributor Author

i checked a little more, and it seems like etcd has a hard connection limit.

when setting the number of threads above 250, etcd stops answering requests.

@schlitzered schlitzered changed the title Issue: etcd lockup when using http API Issue: etcd stops answering after ~250 concurrent HTTP API clients. Apr 29, 2020
@tangcong
Contributor

tangcong commented Apr 29, 2020

there is no connection limit. the authenticate interface has bad performance, see pr #11735. you can specify the config --metrics=extensive, then you can get the latency of every grpc method.

curl http://hostip:2379/metrics > metrics — it is better if you can provide the metrics file.

@schlitzered
Contributor Author

hmm, have you tried the python code i posted above?
for me it more or less always stops after 250 open threads; after this, no new connections are able to log into etcd, at least not via the HTTP API

@schlitzered
Contributor Author

also, can you please explain what a metrics file is, and how to generate it?

doing the curl request will most likely not work for me, because while the above python script is running, the API no longer responds to HTTP calls

@tangcong
Contributor

tangcong commented Apr 29, 2020 via email

@tangcong
Contributor

tangcong commented Apr 29, 2020 via email

@schlitzered
Contributor Author

i have just updated to the 3.4.9 release, and i am still facing the same issue. but it seems like etcd can now handle slightly more connections: i can now see ~310 established connections, but this is still too few.

here is the output of curl https://$(hostname):2379/metrics

metrics.txt

@schlitzered schlitzered reopened this May 25, 2020
@tangcong
Contributor

Authenticate is very expensive. How is your etcd cpu load? etcd v3.4.9 includes a pr that can improve Auth performance from 18/s to 200/s on a 16-core/32GB machine.

grpc_server_handling_seconds_bucket{grpc_method="Authenticate",grpc_service="etcdserverpb.Auth",grpc_type="unary",le="0.005"} 0
grpc_server_handling_seconds_bucket{grpc_method="Authenticate",grpc_service="etcdserverpb.Auth",grpc_type="unary",le="0.01"} 0
grpc_server_handling_seconds_bucket{grpc_method="Authenticate",grpc_service="etcdserverpb.Auth",grpc_type="unary",le="0.025"} 0
grpc_server_handling_seconds_bucket{grpc_method="Authenticate",grpc_service="etcdserverpb.Auth",grpc_type="unary",le="0.05"} 0
grpc_server_handling_seconds_bucket{grpc_method="Authenticate",grpc_service="etcdserverpb.Auth",grpc_type="unary",le="0.1"} 2
grpc_server_handling_seconds_bucket{grpc_method="Authenticate",grpc_service="etcdserverpb.Auth",grpc_type="unary",le="0.25"} 38
grpc_server_handling_seconds_bucket{grpc_method="Authenticate",grpc_service="etcdserverpb.Auth",grpc_type="unary",le="0.5"} 73
grpc_server_handling_seconds_bucket{grpc_method="Authenticate",grpc_service="etcdserverpb.Auth",grpc_type="unary",le="1"} 77
grpc_server_handling_seconds_bucket{grpc_method="Authenticate",grpc_service="etcdserverpb.Auth",grpc_type="unary",le="2.5"} 255
grpc_server_handling_seconds_bucket{grpc_method="Authenticate",grpc_service="etcdserverpb.Auth",grpc_type="unary",le="5"} 255
grpc_server_handling_seconds_bucket{grpc_method="Authenticate",grpc_service="etcdserverpb.Auth",grpc_type="unary",le="10"} 255
grpc_server_handling_seconds_bucket{grpc_method="Authenticate",grpc_service="etcdserverpb.Auth",grpc_type="unary",le="+Inf"} 255
grpc_server_handling_seconds_sum{grpc_method="Authenticate",grpc_service="etcdserverpb.Auth",grpc_type="unary"} 356.53820076299996
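(For reference, the bucket counts above can be turned into rough latency quantiles with a short script. This is an illustrative sketch using linear interpolation inside a bucket, similar to what Prometheus' histogram_quantile does; the boundaries and cumulative counts are copied from the output above:)

```python
# Rough quantile estimate from cumulative Prometheus histogram buckets.
buckets = [  # (upper bound "le" in seconds, cumulative count) from the output above
    (0.005, 0), (0.01, 0), (0.025, 0), (0.05, 0), (0.1, 2),
    (0.25, 38), (0.5, 73), (1.0, 77), (2.5, 255), (5.0, 255),
    (10.0, 255), (float("inf"), 255),
]

def quantile(q, buckets):
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # linearly interpolate between the bucket boundaries
            span = count - prev_count
            frac = (rank - prev_count) / span if span else 1.0
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound

print(quantile(0.5, buckets))   # estimated median Authenticate latency, seconds
print(quantile(0.99, buckets))  # estimated p99, seconds
```

By this estimate the median Authenticate call takes well over a second (only 77 of 255 calls finished under 1s), which matches the "very expensive" observation.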

@tangcong
Contributor

please see issue #9615; you can also configure --bcrypt-cost to improve performance. @schlitzered
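(Independent of --bcrypt-cost: most of the Authenticate load in the reproduction script comes from every thread logging in separately, so authenticating once and sharing the token across watchers avoids most of the bcrypt work. A sketch of that idea — TokenCache is a hypothetical helper for illustration, not an etcd API:)

```python
import threading

class TokenCache:
    """Authenticate once and share the token across many watcher threads,
    refreshing only when a caller reports the token as invalid."""

    def __init__(self, authenticate):
        self._authenticate = authenticate  # callable returning a fresh token string
        self._lock = threading.Lock()
        self._token = None

    def get(self):
        with self._lock:
            if self._token is None:
                self._token = self._authenticate()  # only hits /v3/auth/authenticate when needed
            return self._token

    def invalidate(self, bad_token):
        with self._lock:
            if self._token == bad_token:  # refresh at most once per expired token
                self._token = None

# usage sketch with a fake authenticate call standing in for the HTTP request
calls = 0
def fake_auth():
    global calls
    calls += 1
    return "token-{0}".format(calls)

cache = TokenCache(fake_auth)
tokens = [cache.get() for _ in range(500)]
print(calls)  # 1: Authenticate ran once instead of 500 times
```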

@schlitzered
Contributor Author

except when restarting etcd, when it takes 800% cpu, the cpu load on etcd is usually well below 10%.

i also just set "--bcrypt-cost 4", and we are still facing issues with HTTP requests that are not answered.

here is a current metrics file:

metrics.txt

@stale

stale bot commented Aug 23, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 23, 2020
@stale stale bot closed this as completed Sep 13, 2020