add internal cache for /api/v1/services with scheduled update #1554
Conversation
Here are two related issues on this topic to review before we talk about
new code and where:
cache control:
#718 (comment)
elasticsearch performance: #1526
Do you mind commenting inside each? In particular, in #1526 note how much you
are ingesting and whether you are doing any sampling.
Sure, I've seen these issues before, but I'll look through them again and leave comments.
One last thing to verify: we recently compensated for this in #1538.
Yes, we're already using it, it's a great thing and helps a lot! But I
suppose if we increase our workload significantly (and yes, we will do it),
we will again face the same problem (and we won't be able to decrease
QUERY_LOOPBACK because it's not affordable for our users).
cool. just covering the bases..
I think we'll end up with a decision about where to introduce a cache if we
do (in UI code vs in the ES impl).
Also, whether that cache needs to be managed like it is here (the response is
so slow that you can't rely on users to ever succeed). I kind of prefer not
having machinery in-process; regardless of the cache impl, it would be handy
for it to be user-scheduled. Maybe there's something we can do in the UI to
not block, but also not make repeated calls, when a user first asks for
service and span names?
Then there's also the chance we toy with the data format. I forget which issue
is tracking that one, but flattening service+span names inside ES might end
up being a way out.
Moved my last comment to #1526, which was the issue I thought I was on!
@semyonslepov assuming w/ current perf we might be able to close this?
So, the results of the feature with the new indexes are really good on response times. For our current setup and number of users, it's good enough. We will try to live with the upstream version anyway (obviously that's much better than keeping our local patches and synchronizing them on every new release). In the bad case (I don't expect it now, but our setup is very young and we don't really know the number of our future users yet), we'll have to return to these patches or some similar machinery, because that approach doesn't depend on user count/RPS. (Or maybe there will be another, better way to avoid such problems; for example, we thought about trying Cassandra instead of ES, though we haven't used it with Zipkin yet.)
I would love to hear if you end up with hundreds of users who all refresh
browser cache/expire these headers :) would be a sign of a very effective
deployment.
I do think we will get better, and probably the simplest option would be a
caching intermediary, as it could be applied to everything, and invalidation
isn't terribly important with service names. It's simpler than changing
storage (plus I'm not positive about the perf differences here or with
Cassandra either). We could also consider an optional caching decorator for
StorageComponent.
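As an illustration of the caching-intermediary idea, a reverse proxy in front of zipkin-server could cache just the slow name endpoints. This is a hypothetical sketch, not something shipped with Zipkin; the paths, TTL, and backend address are assumptions:

```nginx
# Hypothetical nginx config: cache the expensive name endpoint for a short
# TTL so repeated UI loads don't each hit Elasticsearch.
proxy_cache_path /var/cache/nginx/zipkin keys_zone=zipkin_names:1m max_size=64m;

server {
  listen 80;

  # Cache only the endpoint that is expensive but rarely changes.
  location /api/v1/services {
    proxy_pass http://127.0.0.1:9411;               # assumed zipkin-server address
    proxy_cache zipkin_names;
    proxy_cache_valid 200 5m;                       # serve cached names for 5 minutes
    proxy_cache_use_stale error timeout updating;   # stale names are acceptable
  }

  location / {
    proxy_pass http://127.0.0.1:9411;               # everything else passes through
  }
}
```

Invalidation is just the TTL here, which is fine for service names since they change slowly.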
I do think we will change the format stored in ES by year end, though this
will likely only help with trace queries, as service and span names are now
as simple as possible.
Meanwhile thanks for all the feedback. You have been very helpful in this!
We use Zipkin with Elasticsearch in AWS and have the following problem: with a big amount of data, a request to /api/v1/services takes too long, and we often see timeouts in the UI.
This is an attempt to implement opt-in internal caching of the getServiceNames() result with scheduled updates. We are already using this patch in our own environment, and it improves our users' experience a bit. Hope it can help upstream (maybe with a different implementation).
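The patch itself isn't shown in this thread, but the idea can be sketched as a small wrapper that refreshes the slow query on a schedule and always serves the latest snapshot. This is a minimal, hypothetical sketch (names and structure are assumptions, not the actual patch):

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/**
 * Hypothetical sketch of an internal cache with scheduled updates: the slow
 * loader (e.g. getServiceNames() against ES) runs on a timer, never on the
 * user's request path.
 */
final class ScheduledCache<T> {
  private final AtomicReference<T> latest = new AtomicReference<>();
  private final Supplier<T> loader;

  ScheduledCache(Supplier<T> loader, ScheduledExecutorService scheduler, long refreshSeconds) {
    this.loader = loader;
    latest.set(loader.get()); // prime synchronously so the first read never misses
    scheduler.scheduleAtFixedRate(this::refresh, refreshSeconds, refreshSeconds, TimeUnit.SECONDS);
  }

  private void refresh() {
    try {
      latest.set(loader.get()); // replace the snapshot atomically
    } catch (RuntimeException e) {
      // backend slow or down: keep serving the previous (stale) snapshot
    }
  }

  T get() {
    return latest.get(); // always fast: no backend call on the read path
  }
}
```

Reads are lock-free and constant-time regardless of user count/RPS, which matches the motivation above: the cost moves to a fixed background schedule instead of growing with traffic.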