store: i/o timeouts with new memcached config #1979

Open

kamilhristov opened this issue Jan 10, 2020 · 53 comments

@kamilhristov

kamilhristov commented Jan 10, 2020

👋 Hello again and thanks for the quick response in #1974

Thanos version used: latest master with #1975

Object Storage Provider: AWS S3

What happened: I am testing the new memcached config for Thanos Store and I am seeing a lot of timeouts. The timeouts only go away when I set the timeout to 10s 🤕

Memcached servers look OK and the number of connections doesn't exceed max_idle_connections.

The Thanos Store process and the Memcached servers are in the same AZ.

I tried the default config and also tweaked most of the parameters, but to no avail.

Any recommendations?

Arguments:

thanos store --index-cache.config-file=/app/cache-config.yaml --chunk-pool-size=6GB --objstore.config-file=/app/objstore-config.yaml

Config:

type: MEMCACHED
config:
  addresses:
    - "****.0001.euw1.cache.amazonaws.com:11211"
    - "****.0002.euw1.cache.amazonaws.com:11211"
  timeout: 1s
  max_idle_connections: 500
  max_async_concurrency: 200
  max_async_buffer_size: 10000
  max_get_multi_concurrency: 200
  max_get_multi_batch_size: 0
  dns_provider_update_interval: 10s

Logs:

level=warn ts=2020-01-10T11:57:31.215656814Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46YH1MSD2N9PZ7M839E0SS:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:35492->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:31.215238251Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46YH1MSD2N9PZ7M839E0SS:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:35490->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:31.076957889Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46FXKKFSNF3D9TVXB13G1H:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:35496->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.831121025Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJS43Y6CEXAMVT0HZVC480E:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:34270->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.830995908Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJS43Y6CEXAMVT0HZVC480E:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:50162->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.754847072Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJSHHT4N25BRSMM27M3QH8K:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:50160->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:30.754715844Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DXJSHHT4N25BRSMM27M3QH8K:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="write tcp 172.17.0.3:46124->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.148112488Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:34404->10.42.8.43:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.148818833Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:47196->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.108056739Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:48012->10.42.8.140:11211: i/o timeout"
level=warn ts=2020-01-10T11:57:29.105953129Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.3:34276->10.42.8.43:11211: i/o timeout"
@bwplotka
Member

cc @pracucci @GiedriusS (:

@pracucci
Contributor

pracucci commented Jan 10, 2020

Thanks @kamilhristov for the report. I can see from your config that you're using AWS ElastiCache: which instance type are you using for ElastiCache?

@pracucci
Contributor

Could you also show the following metrics graphed?

  • rate(thanos_memcached_operations_total[1m])
  • histogram_quantile(1, rate(thanos_memcached_operation_duration_seconds_bucket[1m]))
  • histogram_quantile(0.9, rate(thanos_memcached_operation_duration_seconds_bucket[1m]))
  • sum by (item_type) (rate(thanos_store_index_cache_hits_total[1m]) / rate(thanos_store_index_cache_requests_total[1m]))

Random thoughts:

  • We run memcached inside the same Kubernetes cluster where Thanos is running (on GKE) and we see a 100th percentile latency <= 500ms for memcached operations
  • A timeout when fetching items may be caused by a very large number of keys fetched within a single getmulti request, which you could limit by setting max_get_multi_batch_size: 1000 (1k items per single request looks like a reasonable limit to me; see the sketch below), but this wouldn't explain the timeout when storing items (items are stored individually)
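
For illustration only, here is that cap applied to the config from the issue description; everything else is unchanged and the address is a placeholder:

type: MEMCACHED
config:
  addresses:
    - "your-memcached-host:11211"
  timeout: 1s
  max_idle_connections: 500
  max_async_concurrency: 200
  max_async_buffer_size: 10000
  max_get_multi_concurrency: 200
  max_get_multi_batch_size: 1000   # cap on keys per single getmulti request
  dns_provider_update_interval: 10s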

@kamilhristov
Author

I am using the t3.medium instance type. It is a small one, but I thought it should be sufficient since the in-memory index cache size was 1GB.

Here's a screenshot of the graphs:
Screen Shot 2020-01-10 at 21 40 55

I can also test with m5.large or similar. Also happy to provide more info if needed.

@pracucci
Contributor

So, from your graphs I can see that the 90th percentile latency is good, while the 100th is not. In my experience with AWS, performance on t2/t3 instances is unpredictable, so as a next step I would suggest trying the m5 before reiterating on the root cause analysis. Could you give it a try, please?

@kamilhristov
Author

Agree that t2/t3 could cause strange issues :)

Just tried with m5.large, and there are still timeouts for both get and set commands.

Here is a screenshot with the new instance type.

Screen Shot 2020-01-10 at 22 17 58

@pracucci
Contributor

  1. Thanks for trying the m5
  2. Could you show the same charts but with irate() instead of rate()?
  3. Could you try increasing the max idle connections setting and setting max_get_multi_batch_size: 1000?

@kamilhristov
Author

kamilhristov commented Jan 10, 2020

I've increased max idle connections to 1k and max_get_multi_batch_size to 1k.

Now I am getting:

level=error ts=2020-01-10T20:46:07.001594337Z caller=memcached.go:126 msg="failed to cache series in memcached" err="the async buffer is full"

Along with:

level=warn ts=2020-01-10T20:46:09.910966063Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.4:52078->10.42.8.159:11211: i/o timeout"

Current connections graph. The drop is after restarting Thanos Store with new parameters.

Screen Shot 2020-01-10 at 22 52 48

And new screenshot with irates 📈

Screen Shot 2020-01-10 at 22 54 10

@pracucci
Contributor

level=error ts=2020-01-10T20:46:07.001594337Z caller=memcached.go:126 msg="failed to cache series in memcached" err="the async buffer is full"

This can be easily fixed by increasing max_async_buffer_size.

@pracucci
Contributor

level=warn ts=2020-01-10T20:46:09.910966063Z caller=memcached_client.go:277 msg="failed to fetch items from memcached" err="read tcp 172.17.0.4:52078->10.42.8.159:11211: i/o timeout"

I still have no clue about this one. Do the memcached metrics show anything that catches your attention? Among the metrics directly exported by memcached you should see the items fetched (we batch reads on the Thanos side): what's the rate there?
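
One quick way to eyeball those server-side counters is something like the following (a sketch, assuming netcat is available; the hostname is a placeholder):

# dump memcached's own counters and keep the get-related ones
printf 'stats\r\nquit\r\n' | nc your-memcached-host 11211 | grep -E 'cmd_get|get_hits|get_misses'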

@pracucci
Contributor

Another hypothesis is that the set operations are saturating the fetches. Just for testing, could you try the following settings?

max_async_concurrency: 10
max_async_buffer_size: 1000000
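
Pulling the suggestions in this thread together, the index-cache fragment under test would look roughly like this (illustrative values, not a recommendation):

max_idle_connections: 1000        # raised from the initial 500
max_get_multi_concurrency: 200
max_get_multi_batch_size: 1000    # cap keys per single getmulti request
max_async_concurrency: 10         # keep async sets from starving gets
max_async_buffer_size: 1000000    # larger buffer to absorb queued sets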

@pracucci
Contributor

I've also asked about the memcached instance type. Could you also confirm that Thanos Store is not running on a t2/t3 instance, please? I don't think so, but just to confirm.

@pracucci
Contributor

I've run some benchmarks in our cluster, using two tools:

  1. memtier_benchmark by RedisLabs
  2. bench: a quick and dirty benchmark tool based on the Thanos MemcachedClient (code here)

Both benchmarks run against the same memcached server (1 instance), running in the same Kubernetes cluster where the benchmarking tool is running.

The results

$ memtier_benchmark --server=10.32.21.9 --port=11211 --protocol=memcache_text --requests=10000 --clients=200 --ratio=1:1 --data-size-range=100-200 --key-pattern=S:S --multi-key-get=100

4         Threads
200       Connections per thread
10000     Requests per client


ALL STATS
=========================================================================
Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
-------------------------------------------------------------------------
Sets        77044.82          ---          ---      5.23000     13995.64
Gets        77044.82     77044.82         0.00      5.22700     15107.53
Waits           0.00          ---          ---      0.00000          ---
Totals     154089.65     77044.82         0.00      5.22900     29103.17

Request Latency Distribution
Type     <= msec         Percent
------------------------------------------------------------------------
[REDACTED]
SET      56.000       100.00
---
[REDACTED]
GET      56.000       100.00
---
$ ./bench
Clients: 200
Reqs / client: 10000
Keys / req: 100
Data size range: 100 - 200
Config MaxIdleConnections: 400

All clients started. Now waiting to complete.
SET ops: 2000000
GETMULTI avg: 5 ms max: 76 ms
GETMULTI ops: 2000000 ops/s: 36058.11037331169
Total time: 55.466023574s

Comparison on GetMulti()

  • Raw performance: 77044 ops/s, 5.2ms avg latency, 56ms max latency
  • Benchmark based on the Thanos client: 36058 ops/s, 5ms avg latency, 76ms max latency

The performance of the Thanos client doesn't look bad to me. It's about 50% of the raw performance, which looks "reasonable" to me (it shows there's room for improvement, but it's not bad compared to a C++ tool designed to squeeze out performance with no business logic around it).

What's next

Other than the latest comments above:

  • Could you run the same benchmark?
  • Could you make sure that your Thanos Store doesn't hit its CPU limit? Another reason why you may experience i/o timeouts is the Thanos Store not having enough CPU resources.

@kamilhristov
Author

kamilhristov commented Jan 12, 2020

Sorry for the late response and thanks for spending time on this.

  1. After setting max_async_concurrency, max_async_buffer_size, and all the other recommended settings, I noticed that there are no failing write commands - only reads are failing.

  2. Thanos Store is running on r5.large - not a huge one, but it is perfectly utilized. Just in case, I bumped the instance to xlarge; now the CPU usage is around 50% but there are still timeouts.

  3. Benchmark results are not as good as yours but still seem OK for our scale.

r5.large

4         Threads
200       Connections per thread
10000     Requests per client


ALL STATS
=========================================================================
Type         Ops/sec     Hits/sec   Misses/sec      Latency       KB/sec
-------------------------------------------------------------------------
Sets        28019.71          ---          ---     14.19600      5089.95
Gets        28019.71     28019.71         0.00     14.19200      5494.32
Waits           0.00          ---          ---      0.00000          ---
Totals      56039.43     28019.71         0.00     14.19400     10584.26


Request Latency Distribution
Type     <= msec         Percent
------------------------------------------------------------------------
SET     130.000       100.00
---
GET     140.000       100.00


Clients: 200
Reqs / client: 10000
Keys / req: 100
Data size range: 100 - 200
Config MaxIdleConnections: 400

All clients started. Now waiting to complete.
SET ops: 2000000
GETMULTI avg: 18 ms max: 190 ms
GETMULTI ops: 2000000 ops/s: 10675.525173500942
Total time: 3m7.344413272s

Then, with r5.xlarge, the max latency is similar but we get more throughput.

Clients: 200
Reqs / client: 10000
Keys / req: 100
Data size range: 100 - 200
Config MaxIdleConnections: 400

All clients started. Now waiting to complete.
SET ops: 2000000
GETMULTI avg: 9 ms max: 219 ms
GETMULTI ops: 2000000 ops/s: 20164.66655510673
Total time: 1m39.183390637s

CloudWatch is showing ~100,000,000 GET commands per minute when executing the benchmark, while Thanos Store generates ~25,000/min when hitting a dashboard.

It might be something in our infrastructure that is causing this. I will investigate more on Monday, but we will most likely stick with the in-memory cache for now. It seems like a better option for our scale. Perhaps when our Thanos data grows bigger, it will make more sense to offload the index cache to an external service.

No need to invest more time in this, as it doesn't seem to be caused by Thanos Store itself. Thanks again for helping 🙇

@pracucci
Contributor

Thanks for all your help. I will continue to spend a bit more time on this myself, and I will get back to you if I find out anything.

but will most likely stick with the in-memory cache for now

Agree. The in-memory cache is the best option for the majority of the use cases.

@GiedriusS
Member

I have played around with the new memcached support, and my anecdotal evidence says that this happens when memcached cannot keep up, because not enough memory has been allocated to it and/or the timeouts are too low. Increasing the memory size with -m or enabling large pages support with -L helps. I will go ahead and release 0.10.0, but we could definitely improve the documentation here with recommendations on how many resources are needed and suggestions for what the values in the index cache configuration ought to be. Please let us know if you find a stable way to reproduce this.
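
For reference, a sketch of what that looks like when starting memcached; the sizes and limits below are placeholders to adapt, not recommendations:

# give the cache 4GB (-m, in MB), enable large memory pages (-L),
# and allow more concurrent connections (-c)
memcached -m 4096 -L -c 1024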

@kamilhristov
Author

Could this be caused by the key size?
When max_item_size is 10MB, I am seeing object too large errors in Thanos Store:

level=warn ts=2020-01-13T08:23:13.964559815Z caller=memcached_client.go:266 msg="failed to store item to memcached" key=P:01DY46FXKKFSNF3D9TVXB13G1H:DcQKMLmxz9NRLFUvIYH7xHEXCRQCie5wnCttjitdnqw err="memcache: unexpected response line from \"set\": \"SERVER_ERROR object too large for cache\\r\\n\""
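
A sketch of the two knobs involved, assuming memcached is running with its default 1MB per-item limit (the values below are illustrative):

# Thanos cache config fragment: don't even try to store items above the server's limit
max_item_size: 1MiB

# ...or raise memcached's own per-item cap instead (server-side -I flag)
memcached -I 10m -m 4096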

I was able to reproduce the timeouts when executing the benchmark script with a larger data size range. Keep in mind that this doesn't even hit the limit, so the actual Thanos Store keys are even bigger.

Clients: 200
Reqs / client: 1000
Keys / req: 100
Data size range: 150000 - 300000
Config MaxIdleConnections: 400

All clients started. Now waiting to complete.
level=warn msg="failed to fetch items from memcached" err="write tcp 10.42.8.248:61243->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61161->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61409->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:60915->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61263->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61367->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61253->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61347->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61185->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:60967->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61551->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61449->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61099->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61419->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="write tcp 10.42.8.248:61631->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61239->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61079->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61501->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61531->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61007->10.42.8.159:11211: i/o timeout"
level=warn msg="failed to fetch items from memcached" err="read tcp 10.42.8.248:61293->10.42.8.159:11211: i/o timeout"

@stale

stale bot commented Feb 12, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Feb 12, 2020
@stale stale bot closed this as completed Feb 19, 2020
@kuberkaul

Facing the same issue. Is there any documentation or work done on this, @pracucci?
I partitioned my bucket by time into 3 stores, each using an r5.large Memcached node (3 nodes in total) [I have put both the Memcached endpoint and all node endpoints in the store config], and I'm facing a lot of timeout and async buffer issues.

@kakkoyun
Member

kakkoyun commented Apr 8, 2020

@kuberkaul Could you check the tips in this conversation: https://cloud-native.slack.com/archives/CL25937SP/p1585919142164500? You may find them useful.

Although I haven't tested it, I'm working on a related configuration in thanos-io/kube-thanos#106.

If you still think this is a valid bug, we can reopen the issue.

@pracucci
Contributor

pracucci commented Apr 8, 2020

I talked offline to @kuberkaul and shared some tips. In parallel, I'm investigating a potential issue causing unfair prioritisation within the memcached client due to the usage of the gate. I will open a PR to fix it if confirmed (will work on it shortly).

@joshulyne

Hi, also facing the same issue 😞 Any further update on this? I've been looking through some of the suggestions and tips from here and from the Slack channel and have been tweaking the memcached configuration, yet I'm still running into timeout or "failed to wait for turn: context canceled" issues while trying to fetch items from memcached. Any advice or update would be much appreciated, thanks!

@pracucci
Contributor

pracucci commented May 6, 2020

I'm still running into timeout or "failed to wait for turn: context canceled" issues

This log means that a query request has been cancelled. For example, if you open a Grafana dashboard and refresh it while the panels are still loading, the previous queries are cancelled; if they're cancelled while waiting for a memcached connection slot, you get that error logged.

@joshulyne

@pracucci Ah, that makes sense -- thanks for the explanation. I'm still running into the timeout errors though :/ Any further development or thoughts on that?

@wiardvanrij
Member

I'm also currently experiencing this issue while tuning Thanos Store components. Perhaps this issue can be re-opened?

@stale

stale bot commented Aug 3, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Aug 3, 2021
@GiedriusS GiedriusS removed the stale label Aug 3, 2021
@Antiarchitect
Contributor

Antiarchitect commented Oct 1, 2021

Are there any updates on this? I'm also suffering from this problem. Memcached 1.6.12; various configs of the Thanos cache and of Memcached itself have been tested.

stats
STAT pid 1
STAT uptime 79
STAT time 1633097669
STAT version 1.6.12
STAT libevent 2.1.8-stable
STAT pointer_size 64
STAT rusage_user 9.217141
STAT rusage_system 25.720073
STAT max_connections 1024
STAT curr_connections 574
STAT total_connections 1291
STAT rejected_connections 0
STAT connection_structures 575
STAT response_obj_oom 0
STAT response_obj_count 61
STAT response_obj_bytes 147456
STAT read_buf_count 227
STAT read_buf_bytes 3719168
STAT read_buf_bytes_free 2572288
STAT read_buf_oom 0
STAT reserved_fds 20
STAT cmd_get 42291
STAT cmd_set 771773
STAT cmd_flush 0
STAT cmd_touch 0
STAT cmd_meta 0
STAT get_hits 30560
STAT get_misses 11731
STAT get_expired 0
STAT get_flushed 0
STAT delete_misses 0
STAT delete_hits 0
STAT incr_misses 0
STAT incr_hits 0
STAT decr_misses 0
STAT decr_hits 0
STAT cas_misses 0
STAT cas_hits 0
STAT cas_badval 0
STAT touch_hits 0
STAT touch_misses 0
STAT auth_cmds 0
STAT auth_errors 0
STAT bytes_read 2337979587
STAT bytes_written 469912325
STAT limit_maxbytes 34359738368
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT time_in_listen_disabled_us 0
STAT threads 4
STAT conn_yields 0
STAT hash_power_level 17
STAT hash_bytes 1048576
STAT hash_is_expanding 0
STAT slab_reassign_rescues 0
STAT slab_reassign_chunk_rescues 0
STAT slab_reassign_evictions_nomem 0
STAT slab_reassign_inline_reclaim 0
STAT slab_reassign_busy_items 0
STAT slab_reassign_busy_deletes 0
STAT slab_reassign_running 0
STAT slabs_moved 0
STAT lru_crawler_running 0
STAT lru_crawler_starts 2
STAT lru_maintainer_juggles 120086
STAT malloc_fails 0
STAT log_worker_dropped 0
STAT log_worker_written 0
STAT log_watcher_skipped 0
STAT log_watcher_sent 0
STAT log_watchers 0
STAT unexpected_napi_ids 0
STAT round_robin_fallback 0
STAT bytes 449291309
STAT curr_items 129957
STAT total_items 771773
STAT slab_global_page_pool 0
STAT expired_unfetched 0
STAT evicted_unfetched 0
STAT evicted_active 0
STAT evictions 0
STAT reclaimed 0
STAT crawler_reclaimed 0
STAT crawler_items_checked 127371
STAT lrutail_reflocked 1481
STAT moves_to_cold 681095
STAT moves_to_warm 5587
STAT moves_within_lru 345
STAT direct_reclaims 0
STAT lru_bumps_dropped 0
END

@GiedriusS GiedriusS reopened this Oct 1, 2021
@GiedriusS
Member

I am currently digging into this one, so I'm reopening. It's still valid.

@Antiarchitect
Contributor

Antiarchitect commented Oct 1, 2021

Some errors from my side:

level=warn ts=2021-10-01T14:36:29.347655096Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=92 firstKey=subrange:01FGVC4Q3N8WKDRRMKZ8V0P39Z/chunks/000151:344480000:344496000 err=EOF
level=warn ts=2021-10-01T14:36:29.347697715Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=392 firstKey=subrange:01FGVC4Q3N8WKDRRMKZ8V0P39Z/chunks/000151:428720000:428736000 err=EOF
level=warn ts=2021-10-01T14:36:29.348214182Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=72 firstKey=subrange:01FGVC4Q3N8WKDRRMKZ8V0P39Z/chunks/000151:349008000:349024000 err="memcache: unexpected line in get response: \"ERROR\\r\\n\""
level=warn ts=2021-10-01T14:36:29.348253084Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=61 firstKey=subrange:01FGVC4Q3N8WKDRRMKZ8V0P39Z/chunks/000151:450112000:450128000 err="memcache: unexpected line in get response: \"ERROR\\r\\n\""
level=warn ts=2021-10-01T14:36:29.348263002Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=95 firstKey=subrange:01FGVC4Q3N8WKDRRMKZ8V0P39Z/chunks/000151:346720000:346736000 err=EOF
level=warn ts=2021-10-01T14:36:29.348667642Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=236 firstKey=subrange:01FGVC4Q3N8WKDRRMKZ8V0P39Z/chunks/000151:451984000:452000000 err=EOF

There are also the good old i/o timeout and read/write broken pipe errors.

@GiedriusS
Member

#4742 should greatly improve the situation. @Antiarchitect, if you have time, maybe you could try running Thanos with that pull request?

@Antiarchitect
Contributor

@GiedriusS - that's problematic in my situation, as I use Bitnami Thanos images with a lot of magic inside. So I hope your PR will be accepted, and I'll try the new release that includes it as soon as possible.

@stale

stale bot commented Jan 9, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jan 9, 2022
@GiedriusS GiedriusS removed the stale label Jan 10, 2022
@uvaisibrahim

Is there any update on this?

I am also getting the errors below while trying to query historical data.

level=warn ts=2022-03-08T21:07:40.894536137Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=6 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:342336000:342352000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.899116373Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=21 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:376128000:376144000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.899390159Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=101 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:364272000:364288000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.899632943Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=45 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:335872000:335888000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.899892628Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=78 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:339424000:339440000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.900115794Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=16 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:366992000:367008000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.901373702Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=3 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:366416000:366432000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.901437644Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=86 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:343088000:343104000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.901620772Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=6 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:354096000:354112000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.901786015Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=77 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:378976000:378992000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.901964644Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=53 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:400560000:400576000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.902197572Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=37 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:383280000:383296000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.902400874Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=18 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:377664000:377680000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.917716417Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=48 firstKey=subrange:01FPHZZWXRFEA8SAR1GFV929SK/chunks/000001:230000000:230016000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.917780801Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=113 firstKey=subrange:01FPHZZWXRFEA8SAR1GFV929SK/chunks/000001:238448000:238464000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.917982688Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=126 firstKey=subrange:01FPHZZWXRFEA8SAR1GFV929SK/chunks/000001:235616000:235632000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.918115958Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=51 firstKey=subrange:01FPHZZWXRFEA8SAR1GFV929SK/chunks/000001:225808000:225824000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.918333518Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=122 firstKey=subrange:01FPHZZWXRFEA8SAR1GFV929SK/chunks/000001:232976000:232992000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.947289769Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=104 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:388032000:388048000 err="memcache: connect timeout to 10.98.42.174:11211"
level=warn ts=2022-03-08T21:07:40.947434684Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=37 firstKey=subrange:01FWYKM19J41JKFFY3A02K5BZ2/chunks/000002:355552000:355568000 err="memcache: connect timeout to 10.98.42.174:11211"

I have two questions:

  1. Where does the bucket cache config help? Does it help to fetch historical data (which is not available in the Prometheus/Thanos sidecar pod) more quickly when querying, or does it help to load the indexes into memory while starting the Thanos Store pods?
  2. How can I figure out the cause of the errors above?

My Thanos Store memcached configurations are given below.
index caching config - memcached

type: memcached
config:
  addresses: [thanos-memcached.kts-plb-duck-plb-w2.svc.cluster.local:11211]
  timeout: 10s
  max_idle_connections: 400
  max_async_concurrency: 1000
  max_async_buffer_size: 0
  max_get_multi_concurrency: 0
  max_get_multi_batch_size: 0
  dns_provider_update_interval: 10s
  max_item_size: 0

bucket caching config - memcached

type: memcached
config:
  addresses: [thanos-memcached.kts-plb-duck-plb-w2.svc.cluster.local:11211]
  timeout: 10s
  max_idle_connections: 400
  max_async_concurrency: 1000
  max_async_buffer_size: 0
  max_item_size: 0
  max_get_multi_concurrency: 0
  max_get_multi_batch_size: 0
  dns_provider_update_interval: 10s
chunk_subrange_size: 16000
max_chunks_get_range_requests: 3
chunk_object_attrs_ttl: 24h
chunk_subrange_ttl: 24h
blocks_iter_ttl: 5m
metafile_exists_ttl: 2h
metafile_doesnt_exist_ttl: 15m
metafile_content_ttl: 24h

I set the value 0 for max_get_multi_batch_size, max_async_buffer_size, max_item_size, and max_get_multi_concurrency to allow unlimited capacity, so that the limits shouldn't cause any issues.

Thanks in advance

@stale

stale bot commented Jun 12, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jun 12, 2022
@kakkoyun kakkoyun removed the stale label Jun 13, 2022
@stale

stale bot commented Aug 13, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Aug 13, 2022
@carlosjgp

Still valid!

@stale stale bot removed the stale label Sep 21, 2022
@robert-becker-hs

robert-becker-hs commented Nov 22, 2022

After trying all the suggestions from this post without any success, one of our engineers eventually found out that the issue was due to a connection problem with the Memcached nodes. We are running Thanos in EKS and are using ElastiCache Memcached with 8 nodes. We have 18 thanos-store pods running on Kubernetes nodes.

Logs

level=warn ts=2022-03-08T21:07:40.918115958Z caller=memcached_client.go:412 name=caching-bucket msg="failed to fetch items from memcached" numKeys=51 firstKey=subrange:01FPHZZWXRFEA8SAR1GFV929SK/chunks/000001:225808000:225824000 err="memcache: connect timeout to 1.22.1.2:11211"

nslookup and netcat

## Using nslookup to get IP from Memcache cluster
$ nslookup thanos-production.xxx.cache.amazonaws.com
...
Non-authoritative answer:
Name:	thanos-production.xxx.cache.amazonaws.com
Address: 1.22.1.2

## Trying to ping IP and port using network
$ nc -zv 1.22.1.2 11211
# nc command hangs and there is no output

Outcome

As it turns out, we were sometimes able to run netcat successfully from our Kubernetes nodes, while other times it would just hang and could not connect to the Memcached cluster. After restarting the problematic Memcached nodes, things started working, and now all thanos-store pods are able to connect to Memcached. We don't know the root cause of the connection issue, but it's a step in the right direction. Maybe this helps some other folks!

@damnever
Contributor

damnever commented Sep 7, 2023

Perhaps the network bandwidth has been exhausted; that was our situation.

jakubgs added a commit to status-im/infra-role-cortex that referenced this issue Mar 10, 2024
Based on comments in this Cortex issue:
thanos-io/thanos#1979

I figured lowering parallelism can fix timeout errors.

Signed-off-by: Jakub Sokołowski <[email protected]>
@percy78

percy78 commented Jul 9, 2024

As this issue is still open, I'm taking the opportunity to describe a similar fault that somebody else may come across. Last year we deployed our on-prem observability cluster for dev, stg, and prd. The three applications we're currently observing are Loki for logs, Mimir for metrics, and Tempo for traces. The Grafana server and Keycloak are also part of the same cluster, with a MariaDB Galera cluster as their backend, all in Kubernetes (RKE2).

As the cluster is spread across three sites, we have worker nodes spread evenly between the two main sites, plus a small third site catering for split-brain. Thus we're using zone-aware replication for the Loki, Mimir, and Tempo ingesters and for Mimir's store-gateway and Alertmanager.

For the backend we store everything in MinIO deployed in the same cluster, one MinIO cluster in each site, with replication between them. MinIO is deployed only on the two main sites, using active-active replication between the two.

All had been working fine until a week ago, when Mimir suddenly started to behave badly in PROD. We first noticed this when the store-gateway gave the first "bucket-index file too old" error. I had to log on to MinIO with mc and delete the bucket index along with its previous versions, of which there were a lot, surprisingly, since I thought I had a lifecycle rule in place allowing only 5 non-current versions.

Anyhow, after deleting these we were able to query the data again, although this time with horrendous performance. We have 30 days of retention in MinIO, and although we can retrieve the data, it takes forever. The logs for the store-gateway pods show this:

2024-07-09T16:55:31.004840484+03:00 ts=2024-07-09T13:55:31.002044734Z caller=memcached_client.go:332 level=warn name=chunks-cache msg="failed to fetch items from memcached" numKeys=11 firstKey=subrange:anonymous/01J2AQ6M6338P7TVZ1CY791PKS/chunks/000001:34336000:34352000 err="memcache: connect timeout to 172.16.3.204:11211"

I've checked the memory and CPU on the worker nodes catering for MinIO; they have 4 CPUs and 16GB each, four on each site, and nothing indicates that we're running out of memory or CPU.

I have deleted Mimir completely, deleted the cached data in the PVC for the store-gateway, and deleted the MinIO deployment, all to no avail. I'm out of ideas now.

Could it be that the bucket, which is now 5+TB, is too big for the store-gateway to trundle through? That doesn't make sense to me, since S3 buckets can be petabytes big.

@robert-becker-hs

@percy78 Try restarting your Memcached nodes one at a time, or find out which node the IP with the connection timeout above belongs to. This "solved" the issue for us.

@percy78

percy78 commented Jul 10, 2024

@robert-becker-hs good idea; however, I tried that already. All worker nodes were rebooted last week. Basically we know that it works, more or less; it's just that it now takes minutes to retrieve the metrics, whereas it only took seconds before. I just upgraded MinIO to the latest Helm chart, 5.2.0, from the previous version 5.1.0, but there's no improvement. I checked the Mimir Helm chart and it's a bit outdated (5.0.0, while the latest is 5.3.0), so I will test that, but I'm not confident it will solve the issue.

@percy78

percy78 commented Jul 10, 2024

I just restarted the index-cache pods and now I can see this from the store-gateway:

2024-07-10T18:59:35.399162395+03:00 ts=2024-07-10T15:59:35.398956054Z caller=grpc_logging.go:64 level=warn duration=95.553445ms method=/gatewaypb.StoreGateway/Series err="rpc error: code = Internal desc = expand series set: preload series: read series range: get index range reader: Access Denied." msg=gRPC

@percy78

percy78 commented Jul 15, 2024

My issue has now been resolved, albeit with data loss. I simply created a new bucket in MinIO with active replication between the two sites and then changed Mimir to point to this new bucket. Consequently we lost the historical data, but now I can see metrics from Friday morning, i.e. three days back, and it's quick again.

Note: I could potentially use the mc copy command, going day by day, to ingest the data from the old bucket into the new one, so the historical data would be there for the stakeholders to view again. However, we agreed as a team not to pursue this option, as I might carry the same issue over into the new bucket.
Basically, if it works, don't fix it.
