When running queries via the Query component, we often get partial results and errors back from various Sidecars that look like the one below. The source is definitely a Prometheus/Sidecar IP and not a Store component. See graphic.
We broke out tcpdump to figure out the exact problem. The Prometheus remote read API is returning an HTTP 400 status code to the Sidecar, which propagates the error to Thanos. The actual error message is below:
As we are trying to build a transition path from direct Prometheus queries to querying data via Thanos, we have not yet lowered the retention of data on our Prometheus VMs. So they still have 30 days of data (and 60 in some cases). It appears that some of our Prometheus instances are large enough that Prometheus remote read queries end up returning a large amount of data and at times exceed this limit. It also seems to vastly slow down query performance.
The preferred thing to do here, obviously, is to run Prometheus VMs with limited retention, since the data will exist in GCS/Store. We did this in our devel environment and the problems went away, query performance increased, and using downsampled results seemed to work much better. However, I find myself in an interesting position. I don't feel comfortable calling my Thanos Query service "production" yet, and I don't feel comfortable reducing retention on teams' Prometheus VMs without a production Thanos Query service running. There may also be other reasons folks wish to keep longer than a day or three of retention on their Prometheus VMs.
I'm wondering if this can be solved by adding a command line option to the Sidecar that takes a duration like 3d. The Sidecar process would then advertise and use the more recent of the TSDB's mint or now-3d as the mint argument for Prometheus's remote read API. This would effectively limit queries from the Sidecar to the Prometheus DB to the last 3 days; the remaining long-term data would be served from GCS. This gives the best of both worlds: it limits queries against Prometheus while allowing us to keep retention until we have teams moved over to the Thanos endpoints.
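For illustration, a minimal Go sketch of the clamping I have in mind. The flag name and helper function are hypothetical, not existing Thanos code; the idea is just max(mint, now-limit):

```go
// Minimal sketch of the proposed behaviour, assuming a hypothetical
// --limit-duration flag of e.g. 3d: the Sidecar would never advertise or
// query further back than now minus that duration.
package main

import (
	"fmt"
	"time"
)

// effectiveMinTime returns the later of the TSDB's own mint and now-limit,
// i.e. max(mint, now-limit). Anything older than this would be served by
// the Store gateway (GCS) instead of Prometheus remote read.
func effectiveMinTime(tsdbMint time.Time, limit time.Duration, now time.Time) time.Time {
	floor := now.Add(-limit)
	if tsdbMint.After(floor) {
		return tsdbMint
	}
	return floor
}

func main() {
	now := time.Now()
	tsdbMint := now.Add(-30 * 24 * time.Hour) // 30 days of local retention
	limit := 3 * 24 * time.Hour               // hypothetical --limit-duration=3d

	fmt.Println("advertised mint:", effectiveMinTime(tsdbMint, limit, now))
}
```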
Thanos, Prometheus and Golang version used
Thanos 0.4.0 with Golang 1.12.1.
What happened
Prometheus remote read APIs return 400 errors when the exact same query runs on the native Prometheus VM without issue and quite quickly.
What you expected to happen
Queries to return correctly.
How to reproduce it (as minimally and precisely as possible):
Use a query that will search through most/all of the TSDB blocks on a Prometheus VM, like count (up) by (job), and execute it through Thanos Query. If the TSDB blocks on the Prometheus VM are large enough, combined with enough retention, Thanos Query will start producing this error as the time range is increased. Sometimes it's 5 or 6 days, sometimes it's 1 to 4 weeks for me. A rough reproduction sketch is below.
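A small Go helper I use to reproduce this, assuming Thanos Query's Prometheus-compatible HTTP API is reachable at http://localhost:10902 (adjust the address, range, and step for your setup):

```go
// Runs count (up) by (job) over a widening time range against Thanos Query's
// /api/v1/query_range endpoint and reports the HTTP status for each range.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	end := time.Now()
	for _, days := range []int{3, 7, 14, 28} {
		start := end.Add(-time.Duration(days) * 24 * time.Hour)

		params := url.Values{}
		params.Set("query", "count (up) by (job)")
		params.Set("start", fmt.Sprintf("%d", start.Unix()))
		params.Set("end", fmt.Sprintf("%d", end.Unix()))
		params.Set("step", "300") // 5m resolution

		resp, err := http.Get("http://localhost:10902/api/v1/query_range?" + params.Encode())
		if err != nil {
			fmt.Println(days, "days:", err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%d days -> HTTP %d, %d bytes\n", days, resp.StatusCode, len(body))
	}
}
```

As the range grows past the local retention hot spot, the partial-result errors start appearing in the responses.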
Anything else we need to know
Asking questions before I start coding. Adding a CLI option for this looks pretty simple.
IIRC I saw this requirement mentioned somewhere else with the same context: migrating to Thanos but still holding the data on Prometheus instances.
So this might be worth adding?
So currently with streaming remote read (Thanos 0.7.0+ and Prometheus 2.13+) this issue should be largely mitigated.
Time slicing of requests for a sidecar might make sense, but we need to be explicit that there are minor differences vs the store gateway. I think a storeapi.min-time limit as mentioned by @povilasv (just min-time, without max time) might be ok. There was an attempt at this in sidecar: time limit requests to Prometheus remote read api #1267. Happy to merge this once the comments are addressed.