bug: when apisix starts for a while, its communication with etcd starts to time out #7078
Comments
This error will be reported by ETCD when it cannot receive
Do you have any monitoring data about the networking between APISIX and ETCD? Metrics such as network saturation ratio, errors, and bandwidth utilization would be helpful. |
I thought about this possibility, however, etcdctl can execute smoothly during the period of time when apisix cannot communicate with etcd. Also, when I restarted apisix, the problem disappeared immediately and reappeared after a while.
Good idea, I'll see if I can spot something through the monitoring system. |
Note that ETCDCTL and APISIX do not talk to ETCD in the same way: ETCDCTL uses the gRPC service directly, while APISIX relies on the ETCD gRPC Gateway. APISIX sends REST requests, and the gRPC Gateway (which is embedded in the ETCD server) converts all of them to gRPC streams. |
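To make the difference concrete, here is a minimal Python sketch of the kind of REST request body that goes through etcd's gRPC gateway. It assumes the gateway's /v3/kv/range endpoint and its convention that keys travel base64-encoded in JSON; the helper name build_range_request and the key /apisix/routes are illustrative and not taken from the APISIX source.

```python
import base64
import json


def build_range_request(key: str, prefix: bool = False) -> str:
    """Build the JSON body for a POST to /v3/kv/range on the etcd gRPC gateway.

    Hypothetical helper for illustration; assumes a non-empty key whose
    last byte is below 0xff.
    """
    raw = key.encode("utf-8")
    # The JSON gateway expects byte fields such as "key" to be base64-encoded.
    body = {"key": base64.b64encode(raw).decode("ascii")}
    if prefix:
        # A prefix query uses range_end = key with its last byte incremented.
        end = raw[:-1] + bytes([raw[-1] + 1])
        body["range_end"] = base64.b64encode(end).decode("ascii")
    return json.dumps(body)


if __name__ == "__main__":
    # This body could be sent with any HTTP client, e.g. curl -d '<body>'.
    print(build_range_request("/apisix/routes", prefix=True))
```

etcdctl never builds such a body: it speaks gRPC to the server directly, which is why it can keep working while the gateway path is stuck.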
Thanks for the reminder, I understand that. I just want to point out that maybe the root cause of this problem is not necessarily on etcd. |
Hard to judge. If you have captured any network packets, would you share them? They may be helpful for troubleshooting. |
@tokers The number of connections from apisix to etcd in the ESTABLISHED state is very high and keeps rising:
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:05:03 CST 2022
631
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:05:07 CST 2022
645
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:05:11 CST 2022
660
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:06:48 CST 2022
754
[root@knode10-152-6-32 operation]# date && netstat -apn |grep 5078|grep 10.132.15.138|grep ESTABLISHED|wc -l
Fri May 20 10:07:16 CST 2022
782
|
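The check above can also be scripted so the counts are collected consistently. The following is a rough sketch, not a tool from this thread: it assumes the usual `netstat -apn` column layout (protocol, queues, local address, foreign address, state) and assumes that 10.132.15.138:5078 is the remote endpoint of interest; the function names are hypothetical.

```python
def count_established(netstat_output: str, peer_ip: str, port: int) -> int:
    """Count ESTABLISHED TCP rows whose foreign address is peer_ip:port."""
    target = f"{peer_ip}:{port}"
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected TCP row: proto recv-q send-q local foreign state [pid/program]
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        foreign, state = fields[4], fields[5]
        if state == "ESTABLISHED" and foreign == target:
            count += 1
    return count


def is_steadily_rising(samples: list) -> bool:
    """True when every sample is larger than the one before it."""
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


if __name__ == "__main__":
    # The five counts reported in the comment above.
    print(is_steadily_rising([631, 645, 660, 754, 782]))
```

A count that only ever grows, as in the samples above, suggests connections are being opened and never closed, rather than a burst of legitimate traffic.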
How many APISIX worker processes did you create? |
The CPU of my host is 8 cores, so the number of nginx workers is 8. My apisix configuration is as follows, and the number of workers is as expected:
nginx_config: # config for render the template to generate nginx.conf
  #user: root # specifies the execution user of the worker process.
  # the "user" directive makes sense only if the master process runs with super-user privileges.
  # if you're not root user,the default is current user.
  error_log: /DATA1/apisix/logs/error.log
  error_log_level: warn # warn,error
  worker_processes: auto # if you want use multiple cores in container, you can inject the number of cpu as environment variable "APISIX_WORKER_PROCESSES"
|
From experience, this is a problem with etcd and I suggest you go to the etcd repository for help. |
Thanks for the reply. I have confirmed that the cause of the problem is not in apisix. When I use the curl command line to call the API of grpc-gateway directly, the request is also blocked, so the cause of the problem is indeed in etcd. I plan to close this issue later. Thanks again everyone for the replies, apisix is a great product. |
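For anyone who wants to repeat this check without curl, here is a hedged Python sketch of a gateway probe with a timeout. It assumes the gateway exposes POST /v3/maintenance/status and that plain HTTP (or a system-trusted certificate) is enough; a cluster with private certificates, like the one in this issue, would need an SSL context added. The function name probe_gateway and the opener parameter are hypothetical.

```python
import socket
import urllib.error
import urllib.request


def probe_gateway(base_url: str, timeout: float = 3.0,
                  opener=urllib.request.urlopen) -> str:
    """Return "ok", "timeout", or "error" for one request to the gRPC gateway.

    opener is injectable so the classification logic can be exercised
    without a running etcd.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v3/maintenance/status",
        data=b"{}",
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(request, timeout=timeout) as response:
            return "ok" if response.status == 200 else "error"
    except (socket.timeout, TimeoutError):
        # The request was sent but no response arrived in time.
        return "timeout"
    except urllib.error.URLError as exc:
        # Timeouts while connecting or sending arrive wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "error"
    except OSError:
        return "error"
```

If this reports "timeout" while etcdctl still works against the same node, the blockage is on the gateway path rather than in the client that calls it.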
So, how did you finally solve the problem? My issue is the same as yours. |
The root cause of this problem is a bug in ETCD. ETCD's HTTP/2-based
The official version 3.5.5 has not yet been released, but the bug has been fixed in branch 3.4 and a new version, 3.4.20, has been released. For branch 3.5, you can clone the ETCD source code and compile the
The way to recompile ETCD is as follows:
git checkout release-3.5
make GOOS=linux GOARCH=amd64
For more information, check out this issue: etcd-io/etcd#14169 |
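To decide whether a given cluster needs the upgrade or a rebuild, the reported server version can be compared against the fixed releases. This is a small sketch under assumptions: 3.4.20 comes from the comment above, while treating 3.5.5 as the first fixed 3.5 release is an assumption, since it was still unreleased at the time of writing. The input format follows the version JSON shown in the Environment section of this issue; the names FIXED_IN and has_fix are hypothetical.

```python
import json

# First release on each branch that carries the fix.
# (3, 4) is from this thread; (3, 5) is an assumption, see the note above.
FIXED_IN = {(3, 4): (3, 4, 20), (3, 5): (3, 5, 5)}


def parse_version(text: str) -> tuple:
    """Turn "3.5.4" into (3, 5, 4)."""
    return tuple(int(part) for part in text.split(".")[:3])


def has_fix(version_json: str):
    """Return True or False for a known branch, None for an unlisted branch."""
    version = parse_version(json.loads(version_json)["etcdserver"])
    fixed = FIXED_IN.get(version[:2])
    if fixed is None:
        return None
    return version >= fixed


if __name__ == "__main__":
    # The version reported in this issue's environment.
    print(has_fix('{"etcdserver":"3.5.4","etcdcluster":"3.5.0"}'))
```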
@hansedong Your resolution works extremely well for me. Thanks so much! |
Current Behavior
I have encountered a problem where apisix cannot communicate properly with etcd when I use the admin api.
There are 3 etcd nodes in my environment. When I start apisix, everything is normal. Moreover, I can also operate resources such as Route and Upstreams through the apisix admin api.
However, after a period of time (the specific time is uncertain, generally a few hours), operations via the apisix admin api will time out.
At this point, the following error will appear in the apisix log:
At this time, the request I send to the apisix admin api will also get stuck:
However, there is no problem operating etcd via etcdctl.
This etcd node can be operated through etcdctl, so from the perspective of etcd cluster, etcd is normally served. I also tried to operate the same etcd node with the etcd watch command, and the command was executed normally. So, I guess, there is something wrong with apisix's watch query on etcd.
Then, after I adjusted the log level of the etcd cluster from info to debug, I saw the following logs in etcd:
When apisix is not running, etcd does not produce the above error log. After apisix has been running for a period of time, watch-related errors begin to appear, and the time range is consistent with the error log in apisix error.log.
My configuration file content is as follows:
My etcd certificate is private, but I'm sure my etcd certificate is fine, otherwise apisix wouldn't be able to communicate with etcd properly in the beginning.
Also, when there is a problem with apisix communicating with etcd, I also get stuck executing apisix init_etcd without any output.
What I want to ask is, for this kind of problem, how should I troubleshoot it, and what is the possible point of the problem?
Expected Behavior
No response
Error Logs
apisix error log:
etcd error log:
Steps to Reproduce
I'm not sure how to reproduce this issue
Environment
apisix version: 2.13.1
uname -a: Linux knode10-132-15-138 4.14.105-19-0023 #1 SMP Mon Jan 10 17:53:54 CST 2022 x86_64 x86_64 x86_64 GNU/Linux
openresty -V or nginx -V:
curl http://127.0.0.1:9090/v1/server_info: {"etcdserver":"3.5.4","etcdcluster":"3.5.0"}
luarocks --version: