Linear response time on services query in elasticsearch #1526
Another thing I thought of is that we should collect notes about what sort of data people are putting into zipkin. For example, historically spans had very few tags and people used sampling. OpenTracing users in particular tend to put a lot of tags (each of which ends up in the subquery for service names). It will be interesting to know how many spans there are and how big they tend to be when users experience problems. |
There are some statistics about our "bad" experience:
- ~6 GB of traces in each daily index (keeping the last 5 indices)
- ~1 KB per span
- 3-4 tags per span
- running on m3.medium.elasticsearch in AWS, if it matters

The main workload is writing, and some users occasionally come to the web interface to look at their traces. |
Thanks, very interesting. You have very few but large tags. Is it safe to do the math to extrapolate span count from this?
|
I have been thinking about this and believe adding a new index for service and span name is the best option. This would be just like cassandra where we keep a separate document for service/span lookup:
Backfilling this index would be relatively simple and could be done in a myriad of ways. Unlike adding new data to the span index, this approach doesn't risk people accidentally clobbering their spans to support it. It also doesn't depend on any specific span representation (current or future).

In implementation, the easiest would be a daily index like we use for dependency links. We could set the document ID to service and span name so that the data doesn't grow unbounded. If daily is too coarse a granularity, we could make the ID naming convention include the hour of the day or minute of the hour and still significantly bound the data size.

We could make the read side look at both, and only execute the slow nested query when the ideal index contains no data. This means users don't have to do anything to start using it. Operationally, I like this approach: being similar to cassandra means less cognitive overhead for cross-storage maintainers.

@openzipkin/elasticsearch @semyonslepov wdyt? |
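To make the proposal above concrete, here is a minimal sketch in Java of what writing one such service/span lookup document could look like. All names here (the type name, field names, index pattern) follow the discussion in this thread rather than the actual zipkin-storage-elasticsearch code; the point is the deterministic `service|span` id in the daily index, so repeated writes simply overwrite the same document and the data stays bounded.

```java
import java.time.Instant;
import java.time.ZoneOffset;

// Illustrative only: index pattern, type name, and field names are assumptions
// based on this thread, not the actual implementation.
public class ServiceSpanDocSketch {
  /** Returns {index, type, id, json} for a hypothetical service/span lookup document. */
  static String[] serviceSpanDoc(long timestampMillis, String serviceName, String spanName) {
    String day = Instant.ofEpochMilli(timestampMillis)
        .atZone(ZoneOffset.UTC).toLocalDate().toString();  // e.g. 2017-04-18
    String index = "zipkin-" + day;                        // same daily index as spans
    String id = serviceName + "|" + spanName;              // bounded: one doc per pair per day
    String json = "{\"serviceName\":\"" + serviceName + "\",\"spanName\":\"" + spanName + "\"}";
    return new String[] {index, "servicespan", id, json};
  }
}
```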
This sounds great! A couple of questions:
- When is this new index populated? Is it when a span is written to storage?
- Having the ability to set the granularity of the index name would be awesome. So, it would be nice to set the granularity to weekly or monthly also. That way, we may not need to clean up this index if we so desire.
|
> - When is this new index populated? Is it when a span is written to storage?

yes when written, although we can also provide a script to apply to old data

> - Having the ability to set the granularity of the index name would be awesome. So, it would be nice to set the granularity to weekly or monthly also. That way, we may not need to clean up this index if we so desire.
It is important to note that all of this assumes the index is the same as
the others.. this is just another document type in the existing index
(another beyond span and dependencyLink). The current strategy for cleaning
up any daily index would be curator or similar, and through that you can
choose to retain as much as you like (caveat being performance will
eventually degrade)
So the naming pattern I'm referring to is on the *document id* .. and doing
something different would only be needed if we want less than a day
granularity on queries.
ex. The proposal is that inside the zipkin-2017-04-18/serviceSpan index/type, there would be documents with ids like accounting|get-users and accounting|update-users (this is the exact approach used for dependencyLink).
If this is good enough, we should leave it at that. If for some reason less
than daily is needed, we could add a timestamp, though it will make query
more complex and also visually looking at the data will be annoying. In
case you don't know.. I don't like this option :)
ex. for hourly
00-accounting|get-users,
01-accounting|get-users, 01-accounting|update-users,
02-accounting|get-users, ... 23-accounting|get-users
* to do this, we'd need to store a timestamp in the json, too, which
would make the query a bit more expensive because if we go daily we don't
need to do anything except grab everything in the daily bucket.
|
Thanks for the explanation. +1 in this case. I was asking about a longer-term index since it would mean cleaning up one more index when we purge the data. But if this is added as an additional type on the daily zipkin index, daily granularity is an awesome solution. |
np and thanks for your feedback! will wait for other feedback before
heading towards impl.
|
Yes, fair enough. Actual counts may differ, but it's correct on average. About indexes: sounds good and transparent for users. Am I right that we will use the same index for span name lookup, and that it's going to be faster too? |
> About indexes: sounds good and transparent for users. Am I right that we will use the same index for span name lookup, and that it's going to be faster too?

yes this index will support both. basically the query will be a lot simpler, so it should be much faster. What won't be faster are trace queries, as they still use nested stuff.
I won't be able to benchmark as I don't know how besides a laptop test, which isn't likely very realistic.. hopefully you can help let us know how much better it goes. is that ok?
|
So, our data still isn't huge enough, but of course I will run some measurements on the current and new versions and post the results. |
+1 for a new index, but is there any payload, or just an existence check, when persisting service/span names? |
> +1 for a new index, but is there any payload, or just an existence check, when persisting service/span names?
yes, there would have to be a payload so we can do the span name query. it
would be a document including the service name and the span name
ex. for id = accounting|update-users
the document would be:
{ "serviceName": "accounting", "spanName": "update-users" }
That way, when we do a span name query, we can retrieve the spanName keys
with a term query where serviceName is the input.
This is needed because you cannot do a query against an id field.
make sense?
|
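As an illustration of the query described above, here is a rough sketch of the kind of Elasticsearch request body that could return span names for one service: a term query on serviceName plus a terms aggregation on spanName. Field names come from the example document in this thread; the exact request zipkin sends may differ.

```java
// Builds the kind of request body discussed above: filter servicespan documents
// by serviceName, then aggregate the distinct spanName values.
public class SpanNameQuerySketch {
  static String spanNameQuery(String serviceName) {
    return "{\n"
        + "  \"size\": 0,\n"
        + "  \"query\": { \"term\": { \"serviceName\": \"" + serviceName + "\" } },\n"
        + "  \"aggs\": { \"spanName\": { \"terms\": { \"field\": \"spanName\", \"size\": 2147483647 } } }\n"
        + "}";
  }

  public static void main(String[] args) {
    System.out.println(spanNameQuery("accounting")); // e.g. POST to zipkin-*/servicespan/_search
  }
}
```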
Great, this improvement on service/span name query will greatly improve user experience. |
ok moving to impl as no one is against. thx for feedback, folks
|
will take a bit to refactor properly, but the more important code change at query time will look something like this...

service name query:

```diff
-    SearchRequest.Filters filters = new SearchRequest.Filters();
-    filters.addRange("timestamp_millis", beginMillis, endMillis);
-    SearchRequest request = SearchRequest.forIndicesAndType(indices, SPAN)
-        .filters(filters)
-        .addAggregation(Aggregation.nestedTerms("annotations.endpoint.serviceName"))
-        .addAggregation(Aggregation.nestedTerms("binaryAnnotations.endpoint.serviceName"));
+    SearchRequest request = SearchRequest.forIndicesAndType(indices, SERVICE_SPAN)
+        .addAggregation(Aggregation.terms("serviceName", Integer.MAX_VALUE));
```

span name query:

```diff
-    SearchRequest.Filters filters = new SearchRequest.Filters();
-    filters.addRange("timestamp_millis", beginMillis, endMillis);
-    filters.addNestedTerms(asList(
-        "annotations.endpoint.serviceName",
-        "binaryAnnotations.endpoint.serviceName"
-    ), serviceName.toLowerCase(Locale.ROOT));
-    SearchRequest request = SearchRequest.forIndicesAndType(indices, SPAN)
-        .filters(filters)
-        .addAggregation(Aggregation.terms("name", Integer.MAX_VALUE));
+    SearchRequest request = SearchRequest.forIndicesAndType(indices, SERVICE_SPAN)
+        .term("serviceName", serviceName.toLowerCase(Locale.ROOT))
+        .addAggregation(Aggregation.terms("spanName", Integer.MAX_VALUE));
```
|
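For readers less familiar with zipkin's internal SearchRequest builder, the new service-name query above roughly corresponds to a plain terms aggregation over the servicespan documents, with no nested query and no timestamp range filter. A hedged sketch of that body (the real request may differ in details):

```java
// A terms aggregation over all servicespan documents in the selected daily
// indices; the nested subqueries and range filter from the old code are gone.
public class ServiceNameQuerySketch {
  static final String SERVICE_NAME_QUERY = "{\n"
      + "  \"size\": 0,\n"
      + "  \"aggs\": { \"serviceName\": { \"terms\": { \"field\": \"serviceName\", \"size\": 2147483647 } } }\n"
      + "}";

  public static void main(String[] args) {
    System.out.println(SERVICE_NAME_QUERY); // e.g. POST to zipkin-2017-04-18/servicespan/_search
  }
}
```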
#1560 is ready, just looking to make a backport script or at least pseudocode for adding "servicespan" into old indexes |
@semyonslepov @tramchamploo @mansu @devinsba @jcarres-mdsol can any of you try zipkin 1.23? I'd like to get someone to use it in real life before propagating to other projects. |
if any of you do get a chance, along with response time on /api/v1/services and /api/v1/spans please also report back any differences in storage usage, too. A lot of mapping options are now turned off, so it should be less though I don't know what percentage real life usage will end up as. |
I'll need to update the zipkin-aws image to be based off this in order for me to try it, gonna do a local build some time today |
@devinsba try 0.2.2 :) |
Here are some results from my tests:

/api/v1/services (1 RPS):
- Test 1: ~1,000,000 documents in the index, ~250 service names with 1 span name each
  - zipkin 1.23.0:
- Test 2: ~1,500,000 documents in the index, ~450 service names with 1 span name each
  - zipkin 1.21.0:
  - zipkin 1.23.0: (same as Test 1)

As the amount of data increases, the difference grows.

/api/v1/spans (10 RPS), one test with ~5,000,000 documents in the index, ~250 service names, 1-2 span names per service:
- zipkin 1.21.0:
- zipkin 1.23.0:

I saw no significant difference in storage usage for these types of requests under my test conditions. |
@semyonslepov thanks for the feedback! fair to say overall better? :) yeah I wouldn't expect zipkin CPU to go down based on this change, though I would expect elasticsearch's CPU to go down. zipkin is actually doing slightly more, but if its CPU load when consuming becomes an issue we can probably profile a bit. |
Made some more tests today to check the earlier theories. Regarding ES CPU, I don't see it going down with the 1.23.0 release. If anything, in my tests it goes up on the same workload (sending ~1500 spans per second). (I'm not sure about the cleanness of my tests, maybe there is some noise, but I ran them twice for each release. It would be good to hear from other users.) On the same ES configuration I saw about 50-55% ES CPU load with the 1.21.0 release and 70% with 1.23.0 (70% vs 95% in another test). Zipkin CPU doesn't increase as quickly on more powerful configurations (an m3.large instance with 2 CPU cores and 7.5 GB RAM) as it did in yesterday's tests (a t2.micro instance with 1 CPU core and 1 GB RAM; today's CPUs are slightly more powerful). So this difference is not so critical. |
For contrast, at ~800 RPS peak for our ES cluster I am seeing no discernible change in CPU usage (maybe a 1% increase, likely just increased load); our max CPU is only about 15%, though, because we have more instances than we need. No change on the app either. We are on ES 2.3 with 5 m3.large data nodes. |
Ahh.. I think I know what might increase the load on ES. It is probably less about the finer-tuned indexing of spans and more about having to index service and span names separately, which wasn't done before. If this becomes an issue, or we just want to sort it out, we can employ a deduping approach used in cassandra: essentially we don't write the same names twice from the same node (rather than overwriting). If your tests are clean enough (as they seem to be) we could test whether deduping brings the load back down.
|
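A minimal sketch of the deduping idea mentioned above, assuming a simple in-process TTL cache keyed by service|span (the actual approach used in cassandra may differ):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Skip re-indexing a service/span pair this node has written recently, instead
// of overwriting the same document for every span. Class and method names are
// assumptions for illustration only.
public class ServiceSpanDedup {
  private final Map<String, Long> recentlyWritten = new ConcurrentHashMap<>();
  private final long ttlMillis;

  ServiceSpanDedup(long ttlMillis) {
    this.ttlMillis = ttlMillis;
  }

  /** Returns true if the pair should be written, false if it was written within the TTL. */
  boolean shouldWrite(String serviceName, String spanName, long nowMillis) {
    String key = serviceName + "|" + spanName;
    Long last = recentlyWritten.get(key);
    if (last != null && nowMillis - last < ttlMillis) return false;
    recentlyWritten.put(key, nowMillis);
    return true;
  }
}
```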
Getting some reports of eventual timeouts on service and span name queries on elasticsearch. Ex. at 80 GB/day you will eventually time out (reported by @pauldraper).
Ultimately, one big issue is the shape of the json, which requires nested queries to access the service name of a span. The new reporting format (#1499) will eventually allow us to fix this part, but we can't tie the fix for a performance problem to a specific span format!
In cassandra3, we started with a materialized view, but had to switch to manual indexing for similar performance reasons (#1374). The downside is that it amplifies writes and is also less straightforward for those who might be writing json directly to ES (as opposed to via a collector).
Regardless of what we do, it would be nice to have a perf test that can show evidence of linear latency patterns on service or span name queries. This would catch any system whose indexes aren't ideal.
We have some next steps to consider...
cc @openzipkin/elasticsearch
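One rough way to gather the kind of evidence asked for above would be a small probe that repeatedly times the service-name endpoint while data accumulates, so a linear growth in latency shows up as a trend rather than a one-off number. A sketch, assuming a zipkin server on the default port (the URL and iteration count are assumptions):

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Repeatedly times GET /api/v1/services and prints each measurement.
public class ServiceQueryLatencyProbe {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://localhost:9411/api/v1/services"); // assumed zipkin address
    for (int i = 0; i < 100; i++) {
      long start = System.nanoTime();
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      try (InputStream in = conn.getInputStream()) {
        while (in.read() != -1) ;                               // drain the response body
      }
      long tookMillis = (System.nanoTime() - start) / 1_000_000;
      System.out.println("attempt " + i + ": " + tookMillis + " ms");
      Thread.sleep(1_000);
    }
  }
}
```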