
druid.server.http.defaultQueryTimeout does not set timeout for Historical servers #17475

Open
GWphua opened this issue Nov 14, 2024 · 0 comments


Affected Version

Environments are run on Kubernetes clusters:
Production env: 0.22.1
Test env: 27

Description

Cluster info: 2 Historicals, each with max heap 8 GB and direct memory 16 GB, under Kubernetes limits of 4 CPU and 24 GB memory.

According to the docs, Historical servers can be configured with a query timeout that cancels queries exceeding it. The default value is 300_000 ms, i.e. 5 minutes.

I first encountered this problem when my Historical servers were overloaded with slow queries, and I wanted to set a 90 s timeout on the Historicals to alleviate it. However, setting druid.server.http.defaultQueryTimeout=90000 does not seem to have any effect.

Grafana dashboards still record Historical latencies of up to 8 min, while Brokers show latencies of 5 min, which indicates the default 5 min druid.server.http.defaultQueryTimeout is honored by the Brokers. (Maybe Historicals need more time to process query cancellations? The Cancel a Query docs say Druid performs a "best-effort cancellation", but more detail would help on how this best effort is conducted, and on whether cancellation actually stops the Historical's processing.)
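For reference, the cancellation path I am referring to is Druid's documented DELETE endpoint, which requires the query to have been submitted with an explicit ID; the host, port, and IDs below are placeholders:

```shell
# Cancel a native query by the queryId set in its context (placeholder values):
curl -X DELETE "http://broker-host:8082/druid/v2/my-query-id"

# Cancel a SQL query submitted with an explicit sqlQueryId:
curl -X DELETE "http://broker-host:8082/druid/v2/sql/my-sql-query-id"
```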

Grafana Dashboard

Adding a timeout under the query context does help limit the latency. Below is a trial of varying the timeout in the query context from 500 ms to 5 s and back:
Changing Query Context
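For clarity, this is the context-level timeout documented under Query context. A minimal sketch of a native query body with a 5 s timeout (the interval, granularity, and aggregation are placeholders, not the queries from my dashboard):

```json
{
  "queryType": "timeseries",
  "dataSource": "trips_xaa",
  "intervals": ["2020-01-01/2021-01-01"],
  "granularity": "all",
  "aggregations": [{ "type": "count", "name": "rows" }],
  "context": { "timeout": 5000 }
}
```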

To further investigate, I set up a test cluster and modified the druid.server.http.defaultQueryTimeout in the historical/runtime.properties file to a very low value (5ms) to trigger a timeout condition. However, the query continued to execute as normal (Historical latencies of up to 10s) despite this change.

I added the following lines to historical/runtime.properties, and rebuilt the Druid image to run on my Kubernetes setup:

druid.server.http.defaultQueryTimeout=90000
druid.server.http.maxQueryTimeout=90000
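One way to confirm whether the rebuilt image actually picked up these properties is to dump the Historical's effective runtime properties from its status endpoint (a troubleshooting sketch; host and port are placeholders for the Historical's HTTP endpoint):

```shell
# List the effective runtime properties and filter for the timeout settings
curl -s "http://historical-host:8083/status/properties" | grep -i queryTimeout
```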

Despite these configurations, my Historical servers began experiencing issues loading segments, resulting in an inability to query Druid's provided trips_xaa datasource. The error message indicated that the configured timeout had reverted to the default of 5 minutes.

Timeout Error

In an effort to troubleshoot, I double-checked the configuration spelling and reviewed ServerConfig.java, but found no discrepancies. This raises the possibility that the documentation misrepresents the ability to configure timeouts for Historical servers, and that the setting may only apply to Broker instances.

Request for Clarification

If Druid indeed lacks the capability to adjust timeout settings for Historical servers, or if the configuration name has changed, I kindly request an update to the documentation for clarity.

Thank you for your attention to this matter!
