http server occasionally experiences high latency #3419

Open
shaxiaozz opened this issue Sep 3, 2024 · 4 comments

Comments

@shaxiaozz

shaxiaozz commented Sep 3, 2024

1. Environment Information
Tornado Version: TornadoServer/6.1
Python Version: 3.7
Docker Version: 20.10.7

My Tornado web server runs in Kubernetes, so traffic is forwarded by Envoy. Clients use HTTP/1.1 by default, so HTTP keep-alive is used between Envoy and the Tornado pod:
Envoy ---> Tornado Pod

Tornado startup sample code is as follows (the Tornado pod CPU limit is 4):

from tornado import httpserver, ioloop
from tornado.web import Application

# conf, settings, TestHandlerBase, _initialization and skywalking_agent are
# defined elsewhere in our project.

if __name__ == "__main__":
    if conf.IS_BACKUP:
        application = Application([
            (r"/(\w+)", TestHandlerBase)], **settings)
    else:
        application = Application([
            (r"/(\w+)", TestHandlerBase)])

    num_processes = conf.TORNADO_PROCESS_NUM
    _initialization(num_processes)

    server = httpserver.HTTPServer(application)
    server.bind(9880, '0.0.0.0', backlog=2048)
    server.start(num_processes)  # fork num_processes worker processes

    skywalking_agent()
    print('start service...')
    ioloop.IOLoop.instance().start()

2. Problems encountered
When I stress-tested the Tornado interface, response latency was initially within 100ms, but after about 1 minute of load it rose to more than 300ms, which is intolerable for our business.

We used tcpdump to capture packets and inspect the traffic, and found the following (172.29.222.1 is the Envoy IP, 172.29.86.34 is the Tornado web pod IP):

Packet 20498: Envoy forwards the request to the Tornado pod (time: 11:17:57.3055)
Packet 20499: the Tornado pod responds with an ACK (time: 11:17:57.3056)
Packet 22235: the Tornado pod responds with HTTP data (time: 11:17:57.799)
The entire HTTP response took 494ms.
[screenshot: tcpdump packet capture]
The request reached the Tornado pod at 11:17:57.3056, but the Tornado access log shows that Tornado only started processing it at 11:17:57.751 and finished at 11:17:57.799, when it sent the HTTP response.
[screenshot: Tornado access log]

We don't know where this time is being spent. Is there a tool we can use to observe where it goes?

I hope to get your reply. This problem has troubled us for a long time. Thank you~

@bdarnell
Member

bdarnell commented Sep 9, 2024

There are a few likely possibilities here:

  • Some long-running operation is blocking the event loop thread. Use asyncio debug mode to investigate this possibility
  • Poor load balancing. I've seen significant imbalances between processes when running in multi-process mode, so one process can get overloaded while the others are fine. Since you're already running Envoy as a proxy here, I wouldn't recommend Tornado's multi-process mode. Instead, run multiple independent copies of the Tornado process, each listening on a separate port, and configure Envoy to balance the load between them (see https://www.tornadoweb.org/en/stable/guide/running.html, which covers nginx rather than Envoy; there's a sketch of this layout after this list). This is generally more reliable than letting all the processes share a port. (I've heard that SO_REUSEPORT can also help with load balancing here, but I haven't used it myself.)
  • The system is simply overloaded. You didn't share how you're running the load test but if you give the system more load than it can handle, latency will go up. And in asyncio-based systems, this often manifests as a few unlucky requests experiencing significant delays while others are fine (in contrast to a more gradual decline in performance often seen in multithreaded solutions).
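A minimal sketch of that "one single-process instance per port" layout (TORNADO_PORT and PingHandler are just placeholder names for illustration):

```python
import os

from tornado import httpserver, ioloop, web


class PingHandler(web.RequestHandler):
    def get(self):
        self.write("ok")


if __name__ == "__main__":
    application = web.Application([(r"/ping", PingHandler)])

    # One single-process server per port; Envoy balances across the
    # independent instances instead of Tornado forking workers itself.
    port = int(os.environ.get("TORNADO_PORT", "9880"))
    server = httpserver.HTTPServer(application, xheaders=True)
    server.listen(port, address="0.0.0.0")

    ioloop.IOLoop.current().start()
```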

@shaxiaozz
Author

@bdarnell First of all, thank you very much for your reply~

I debugged according to the 3 points above:
1. Enable asyncio debug mode
I called logging.basicConfig(level=logging.DEBUG) and set the PYTHONASYNCIODEBUG environment variable to 1.

I used Apache JMeter to stress test the Tornado interface with 500 concurrent requests for 180 seconds.
[screenshot: JMeter stress test setup]
During the stress test, I observed occasional callbacks in the DEBUG log that took more than 100 milliseconds.
[screenshot: asyncio DEBUG log output]


2. Poor load balancing
I tried Tornado's xheaders=True, and under the same JMeter load as in point 1 performance was actually slower. I looked for the definition of the xheaders parameter in the source code and only found "These headers are useful when running Tornado behind a reverse proxy or load balancer".

At the same time, I noticed the DefaultHostMatches logic in tornado/routing.py, which seems to be used for routing decisions?
[screenshot: tornado/routing.py]

I don't know what Tornado actually does when xheaders=True. This parameter increased latency in my JMeter tests; maybe that has something to do with my setup?


3. Tornado in single-process mode
Because of our business volume, traffic in the production (PRD) environment is relatively high, so we run in the multi-process, non-blocking startup mode when going live.

I tried setting the CPU count to 1 and scaling out the pod replicas accordingly, but latency under the same JMeter load was still not good...


4. Resource saturation
Since we use a Prometheus + Grafana monitoring stack, we observed during the stress test that the pod's CPU & memory usage was stable and did not reach its limits.

@bdarnell
Member

> During the stress test, I observed occasional callbacks in the DEBUG log that took more than 100 milliseconds.

These messages that say "poll %.3f ms took %.3f ms" are harmless - they're reporting on internal state of the asyncio event loop (basically, how long the loop sat idle waiting for network activity). This message was removed in python 3.8.

The message you're looking for is "Executing %s took %.3f seconds". If you see any of those, you have a problem with a callback that's running too long.
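For illustration, a small sketch of what does and doesn't trigger that warning (handler names here are made up): a handler that blocks the event loop with time.sleep is reported as an over-long callback, while one that awaits asyncio.sleep is not.

```python
import asyncio
import time

from tornado import web


class BlockingHandler(web.RequestHandler):
    def get(self):
        # Blocks the event loop thread; with asyncio debug mode enabled this
        # is reported as "Executing <Handle ...> took 0.500 seconds".
        time.sleep(0.5)
        self.write("done")


class NonBlockingHandler(web.RequestHandler):
    async def get(self):
        # Yields to the event loop; no slow-callback warning is logged and
        # other requests keep being served during the wait.
        await asyncio.sleep(0.5)
        self.write("done")


application = web.Application([
    (r"/block", BlockingHandler),
    (r"/await", NonBlockingHandler),
])
```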

> I don't know what Tornado actually does when xheaders=True.

This mainly affects logging: With xheaders=True, logs will show the IP address from the X-Real-IP header instead of the IP from the TCP connection (which is the load balancer). There's a tiny amount of extra work involved when xheaders=True but I would be surprised if it were a measurable difference.
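A minimal sketch of the effect (the handler name is made up):

```python
from tornado import httpserver, web


class WhoHandler(web.RequestHandler):
    def get(self):
        # With xheaders=True, remote_ip reflects X-Real-IP / X-Forwarded-For
        # (set by the proxy) instead of the proxy's own TCP source address,
        # and protocol honors X-Scheme / X-Forwarded-Proto.
        self.write({"remote_ip": self.request.remote_ip,
                    "protocol": self.request.protocol})


application = web.Application([(r"/who", WhoHandler)])
server = httpserver.HTTPServer(application, xheaders=True)
server.listen(9880)
```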

> I tried setting the CPU count to 1 and scaling out the pod replicas accordingly, but latency under the same JMeter load was still not good...

Hmm, this sounds like the configuration I would recommend so it's worth investigating it further to see what's going on.

> we observed during the stress test that the pod's CPU & memory usage was stable and did not reach its limits.

OK. The question you need to answer is whether the CPU usage is balanced across all the cores or if some cores are at 100% while others are less. Or, when you run one core per pod with more pods, whether the CPU usage is balanced across the pods.
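One quick way to check that from inside a pod is to watch per-core utilisation, for example with the third-party psutil package (a sketch, assuming psutil is available; per-CPU metrics from cAdvisor/node-exporter in Prometheus would show the same thing):

```python
import psutil

# Print per-core CPU utilisation once per second. If one core sits near 100%
# while the others stay mostly idle, a single event loop / worker process is
# the bottleneck even though total pod CPU usage looks fine.
while True:
    per_core = psutil.cpu_percent(interval=1.0, percpu=True)
    print(" ".join(f"cpu{i}: {p:5.1f}%" for i, p in enumerate(per_core)))
```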

@mirrornight

Can you give a complete code example?
