When running queries via the Query component, we often get partial results and errors back from various Sidecars that look like the one below. The source is definitely a Prometheus/Sidecar IP and not a Store component. See graphic.
We broke out tcpdump to figure out the exact problem. The Prometheus remote read API is returning an HTTP 400 status code to the Sidecar, which propagates the error to Thanos. The actual error message is below:
As we are trying to build a transition path from direct Prometheus queries to querying data via Thanos, we have not yet lowered the retention of data on our Prometheus VMs. So they still have 30 days of data (and 60 in some cases). It appears that some of our Prometheus instances are large enough that Prometheus remote read queries end up returning a large amount of data and at times exceed this limit. It also seems to vastly slow down query performance.
The preferred thing to do here, obviously, is to run Prometheus VMs with limited retention, since the data will exist in GCS/Store. We did this in our devel environment and the problems went away, query performance increased, and using downsampled results seemed to work much better. However, I find myself in an interesting position. I don't feel comfortable calling my Thanos Query service "production" yet, and I don't feel comfortable reducing retention on teams' Prometheus VMs without a production Thanos Query service running. There may also be other reasons folks wish to keep longer than a day or three of retention on their Prometheus VMs.
I'm wondering if this can be solved by adding a command line option to the Sidecar that takes a duration like 3d. The Sidecar process would then advertise and use the more recent of the TSDB's mint or now-3d as the mint argument for Prometheus's remote read API. This would effectively limit queries from the Sidecar to the Prometheus DB to the last 3 days; the remaining long-term data would be served from GCS. This gives the best of both worlds: it limits queries against Prometheus while allowing us to keep retention until we have teams moved over to the Thanos endpoints.
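For illustration, a minimal Go sketch of the clamping I have in mind. The flag name and helper function are hypothetical, not existing Thanos code; the idea is just max(mint, now-limit):

```go
// Minimal sketch of the proposed behaviour, assuming a hypothetical
// --limit-duration flag of e.g. 3d: the Sidecar would never advertise or
// query further back than now minus that duration.
package main

import (
	"fmt"
	"time"
)

// effectiveMinTime returns the later of the TSDB's own mint and now-limit,
// i.e. max(mint, now-limit). Anything older than this would be served by
// the Store gateway (GCS) instead of Prometheus remote read.
func effectiveMinTime(tsdbMint time.Time, limit time.Duration, now time.Time) time.Time {
	floor := now.Add(-limit)
	if tsdbMint.After(floor) {
		return tsdbMint
	}
	return floor
}

func main() {
	now := time.Now()
	tsdbMint := now.Add(-30 * 24 * time.Hour) // 30 days of local retention
	limit := 3 * 24 * time.Hour               // hypothetical --limit-duration=3d

	fmt.Println("advertised mint:", effectiveMinTime(tsdbMint, limit, now))
}
```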
Thanos, Prometheus and Golang version used
Thanos 0.4.0 with Golang 1.12.1.
What happened
Prometheus remote read APIs return 400 errors when the exact same query runs on the native Prometheus VM without issue and quite quickly.
What you expected to happen
Queries to return correctly.
How to reproduce it (as minimally and precisely as possible):
Use a query that will search through most/all of the TSDB blocks on a Prometheus VM, like count (up) by (job), and execute it through Thanos Query. If the TSDB blocks on the Prometheus VM are large enough, combined with enough retention, Thanos Query will start producing this error as the time range is increased. Sometimes it's 5 or 6 days, sometimes it's 1 to 4 weeks for me. A rough reproduction sketch is below.
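A small Go helper I use to reproduce this, assuming Thanos Query's Prometheus-compatible HTTP API is reachable at http://localhost:10902 (adjust the address, range, and step for your setup):

```go
// Runs count (up) by (job) over a widening time range against Thanos Query's
// /api/v1/query_range endpoint and reports the HTTP status for each range.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	end := time.Now()
	for _, days := range []int{3, 7, 14, 28} {
		start := end.Add(-time.Duration(days) * 24 * time.Hour)

		params := url.Values{}
		params.Set("query", "count (up) by (job)")
		params.Set("start", fmt.Sprintf("%d", start.Unix()))
		params.Set("end", fmt.Sprintf("%d", end.Unix()))
		params.Set("step", "300") // 5m resolution

		resp, err := http.Get("http://localhost:10902/api/v1/query_range?" + params.Encode())
		if err != nil {
			fmt.Println(days, "days:", err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%d days -> HTTP %d, %d bytes\n", days, resp.StatusCode, len(body))
	}
}
```

As the range grows past the local retention hot spot, the partial-result errors start appearing in the responses.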
Anything else we need to know
Asking questions before I start coding. Adding a CLI option for this looks pretty simple.
IIRC I saw this requirement mentioned somewhere else with the same context: migrating to Thanos but still holding the data on Prometheus instances.
So this might be worth adding?
So currently with streaming remote read (Thanos 0.7.0+ and Prometheus 2.13+) this issue should be largely mitigated.
Time slicing of requests for a sidecar might make sense, but we need to be explicit that there are minor differences vs the store gateway. I think a storeapi.min-time limit as mentioned by @povilasv (just min-time, without max time) might be ok. There was an attempt at this in sidecar: time limit requests to Prometheus remote read api #1267. Happy to merge this once the comments are addressed.