The limit problem #1142

Closed
luoyongjiee opened this issue Jun 24, 2016 · 20 comments · Fixed by #1177

Comments

@luoyongjiee

[screenshots: zipkin-pic1, zipkin-pic2]
Hello, when I query in the zipkin-ui (the data is stored in Cassandra):
if the limit param is 10, it shows "Showing: 6 of 6";
if the limit param is 20, it shows "Showing: 12 of 12".
What is the problem? Thank you!

@codefromthecrypt
Member

codefromthecrypt commented Jun 24, 2016 via email

@luoyongjiee
Author

I found that if no new requests come in, when I query, the JSON results are returned in descending timestamp order. Why?
It looks strange, but if the storage is MySQL, it always returns the correct number.

@codefromthecrypt
Member

codefromthecrypt commented Jun 24, 2016

I found that if no new requests come in, when I query, the JSON results are returned in descending timestamp order. Why?

I don't 100% understand the nature of your question, so I'll assume you are
asking only about why the order is descending on timestamp (vs something else
related to "no new requests come in").

Not all Zipkin storage backends include arbitrary ordering capabilities.
Timestamps are a natural partition, as well as something straightforward to
sort and apply further predicates to. In other words, it is implementable.
Descending based on a timestamp provides more stable data than ascending
when we keep in mind that the default timestamp is now (for example, if
you are using zipkin you are often responding to an issue, vs clicking
refresh until an issue occurs).

The server-side sort order is a fairly well documented and tested part of
zipkin (other predicates directly apply given this ordering assumption,
including limit, duration, etc.). The good news is that you can assume it
works and report bugs if it doesn't:
http://zipkin.io/zipkin-api/#/paths/%252Ftraces

It looks strange, but if the storage is MySQL, it always returns the
correct number.

Similar to the other issue you've raised, we need a way to test this. I
understand you have screen shots, but until you can produce JSON that
another person can use to reproduce the problem, we cannot validate or
solve the issue.

I'd recommend using the API, and returning the JSON you mention that works
in MySQL but doesn't in Cassandra. Even better if you can send a failing
test patch for
https://github.com/openzipkin/zipkin/blob/master/zipkin/src/test/java/zipkin/storage/SpanStoreTest.java

@yurishkuro
Contributor

fwiw, I know exactly why LIMIT is not working correctly with Cassandra. In MySQL all the data is in one place, so however complex the query is, it is first satisfied against all AND clauses and then a limit is applied. With Cassandra, each AND condition may need to be resolved against a different index table, by doing direct shard key lookup. So instead of (x AND y AND z) % LIMIT, the Cassandra SpanStore implementation does (x % LIMIT) AND (y % LIMIT) AND (z % LIMIT). The resulting intersection can easily produce < LIMIT results, quite often 0 if you have many AND clauses and LIMIT is small.
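
A minimal sketch (not Zipkin's actual code) of the intersection behaviour described above, assuming each index lookup has already applied its own LIMIT and returned a set of trace ids:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class IndexIntersection {
  /** Intersects per-index results, i.e. (x % LIMIT) AND (y % LIMIT) AND (z % LIMIT). */
  static Set<Long> intersect(List<Set<Long>> perIndexResults) {
    Set<Long> traceIds = new LinkedHashSet<>(perIndexResults.get(0));
    for (Set<Long> next : perIndexResults.subList(1, perIndexResults.size())) {
      traceIds.retainAll(next); // each intersection can only shrink the result
    }
    return traceIds; // can easily hold fewer than LIMIT ids, often zero
  }
}
```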

@codefromthecrypt
Member

@yurishkuro You are right to mention the above when multiple conditions exist. We should document this somewhere besides the code, probably cassandra's README I guess. We should also make a failing test that we can skip in the cassandra module.

All that said, I still think we need failing JSON for the issue as reported, because it isn't a complex query. Unless you know otherwise, it still seems unexpected to return fewer than limit results when the query is simple, right?

In the above screen shots, there are no query conditions except serviceName, and the cassandra logic appears to only use limit once: for select-trace-ids-by-service-name. The select-traces query (which gets the traces given ids) uses a different limit, a very large one, maxTraceCols, which defaults to 100000.

Ex. here's a trace for a simple serviceName query in cassandra:

[screen shot 2016-06-25 at 9 19 32 am]

@yurishkuro
Contributor

Yes, making the LIMIT issue reproducible would be nice.

On a simple query, I wonder if this is because the same trace ID gets returned multiple times. Most index tables in Cassandra allow dups of (search_key -> trace_id) because timestamps are used to differentiate the records. Doing otherwise would've resulted in lots of tombstones, degrading performance. The LIMIT clause does not know that the same trace_id is being returned.

@codefromthecrypt
Member

codefromthecrypt commented Jun 25, 2016 via email

@codefromthecrypt
Member

codefromthecrypt commented Jun 25, 2016 via email

@codefromthecrypt
Member

So it looks like I can reproduce this issue, as it came up here: #1141 (comment)

@codefromthecrypt
Member

Here's the summary of what I "think" is going on.

The service_name_index will store only one trace_id per:

(bucket, timestamp (millisecond), service_name)

This means it can miss traces that happen against the same service in the same millisecond.

@luoyongjiee can you check your data to see if this is the case?
@yurishkuro @michaelsembwever @danchia can you verify what I'm understanding above makes sense?

I'm able to reproduce this by issuing identical spans that vary only on ids and timestamps (ex in #1141). I use the following query to validate.

SELECT bucket,blobAsBigint(timestampAsBlob(ts)),trace_id FROM service_name_index WHERE bucket in (0,1,2,3,4,5,6,7,8,9) and service_name='semper';

@yurishkuro
Contributor

This means it can miss traces that happen against the same service in the same millisecond.

Yep, sounds right, given PRIMARY KEY ((service_name, bucket), ts). Assuming they also hit the same bucket.

Would've been better with this key:

PRIMARY KEY ((service_name, bucket), ts, trace_id)
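
For illustration, here's a sketch of how that key could look in the table definition (column types are assumptions based on this thread, not the actual zipkin schema files):

```sql
CREATE TABLE IF NOT EXISTS zipkin.service_name_index (
    service_name text,      -- the service the trace touched
    bucket       int,       -- spreads writes across partitions
    ts           timestamp,
    trace_id     bigint,
    PRIMARY KEY ((service_name, bucket), ts, trace_id)
) WITH CLUSTERING ORDER BY (ts DESC, trace_id ASC);
```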

@michaelsembwever
Member

michaelsembwever commented Jun 26, 2016

In Cassandra primary keys: timeuuid should be used instead of timestamp.
This avoids the problem of clobbering traces within the same millisecond.

Regarding having to query individual partitions to get a limit on each: this feature has been introduced in newer versions of Cassandra with the " … PER PARTITION LIMIT x" CQL syntax.

@danchia

danchia commented Jun 27, 2016

Agree that timeuuid is the typical pattern in Cassandra for this. However, since the timestamps and trace_ids here are application-assigned and cannot be changed, in my opinion it would be better to promote the trace_id field to the PRIMARY KEY as suggested by @yurishkuro.

@codefromthecrypt
Member

codefromthecrypt commented Jun 28, 2016

@luoyongjiee update. I've reproduced this problem in a unit test (important so it doesn't creep back in). Yuri's suggestion works fine, but the index needs to be created. We'll have a release with the fix out by tomorrow.

codefromthecrypt pushed a commit that referenced this issue Jun 28, 2016
A schema bug resulted in Cassandra not indexing more than bucket count
(10) trace ids per millisecond+search input. This manifested as fewer
traces retrieved by UI search or API query than expected. For example,
if you had 1000 traces that happened on the same service in the same
millisecond, only 10 would be returned.

The indexes affected are `service_span_name_index`, `service_name_index`
and `annotations_index` and this was a schema-only change. Those with
existing zipkin installations should recreate these indexes to solve the
problem.

Fixes #1142
@codefromthecrypt
Member

Reverted 0d51d90 as it needs more work. We need the query to return only unique trace ids, or to repeat the query until we reach the limit.

Ex.

cqlsh:zipkin> SELECT bucket,blobAsBigint(timestampAsBlob(ts)),trace_id FROM service_name_index WHERE bucket in (0,1,2,3,4,5,6,7,8,9) and service_name='zipkin-server' limit 10;

 bucket | system.blobasbigint(system.timestampasblob(ts)) | trace_id
--------+-------------------------------------------------+----------------------
      0 |                                   1467174838732 |  4603096813731895486
      0 |                                   1467174836587 | -1336993720941103457
      0 |                                   1467174787202 |  6361320438414089915
      0 |                                   1467174787201 |  6361320438414089915
      0 |                                   1467174787201 |  9078688442428604055
      0 |                                   1467174781499 |  5991165789192187628
      0 |                                   1467174778689 | -8155071232048349133
      0 |                                   1467174778448 |  -395579676103804384
      0 |                                   1467174777530 |  3288432430321170335
      0 |                                   1467174776014 |  1836663428203143817

(10 rows)

@codefromthecrypt
Member

So, I'll wait for the experts for ideas. I want to get distinct trace_ids, as opposed to a row for every span reported in the trace (often 2 rows per span). @michaelsembwever @yurishkuro @danchia rescue request :)

@codefromthecrypt
Member

codefromthecrypt commented Jun 29, 2016

I think this is the last update I have on this issue.

QueryRequest.limit is always higher than the number of results returned from Cassandra queries, regardless of the schema adjustment we made (though the schema change does help with precision). That's because the limit is applied to redundant index entries, which are deduped client-side. The redundant index count is related to the span count per trace.

It is a fool's errand to attempt to deduplicate trace ids on the (cassandra) server side because trace ids aren't a partition key. We can only do distinct clauses on partition keys. The only way left is to compensate on the (cassandra) client side: zipkin, in this case.

Here are two concrete proposals:

Change CassandraSpanConsumer to cache trace id indexes locally

By caching trace id indexes locally, we can ensure that at least in the same collector shard, we don't write the same unique input more than once per trace. This will be most effective for those who run a single collector or consistently route trace ids to collector instances. Even randomly routed collectors should see smaller indexes, when spans in the same trace are bundled when reported from tracers.

Change CassandraSpanStore to fetch more trace ids than QueryRequest.limit

The trace id query returns very little data: trace id and timestamp. One option is to just prefetch more than limit and dedupe client side. Since this issue is amplified at low trace counts, we can simply make a floor of 100 and dedupe to QueryRequest.limit. This could be a first step before we do something like multiple expanding queries.
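
As a rough sketch of this second proposal (names are illustrative, not the actual CassandraSpanStore change): over-fetch index rows, then keep only the first `limit` distinct trace ids.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class OverFetch {
  /** indexRows can repeat a trace id many times (roughly one row per span). */
  static Set<Long> dedupeToLimit(List<Long> indexRows, int limit) {
    Set<Long> distinct = new LinkedHashSet<>(); // keeps the newest-first order from the index
    for (Long traceId : indexRows) {
      distinct.add(traceId);
      if (distinct.size() == limit) break; // stop once we have `limit` distinct ids
    }
    return distinct;
  }
}
// e.g. fetch Math.max(100, limit) index rows, then dedupeToLimit(rows, limit)
```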

codefromthecrypt pushed a commit that referenced this issue Jul 1, 2016
The previous code had a mechanism to reduce writes to two indexes:
`service_name_index` and `service_span_name_index`. This mechanism would
prevent writing the same names multiple times. However, it is only
effective on a per-thread basis (as names were stored in thread locals).

In practice, this code is invoked at collection, and collectors have
many request threads per transport. By changing to a shared loading
cache, we can extend the deduplication to all threads. By extracting a
class to do this, we can test the edge cases and make it available for
future work, such as the other indexes.

TODO: write unit tests

See #1142
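
A minimal sketch of the shared-cache idea from that commit (class name and key format here are illustrative, not the extracted class), using a bounded Guava cache so deduplication works across all collector threads rather than per thread:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

class IndexWriteDeduper {
  // Bounded so memory stays flat even with many distinct index inputs in flight.
  private final Cache<String, Boolean> seen =
      CacheBuilder.newBuilder().maximumSize(100_000).build();

  /** Returns true only for the first thread to see this (input, traceId) pair. */
  boolean shouldIndex(String input, long traceId) {
    return seen.asMap().putIfAbsent(input + '/' + traceId, Boolean.TRUE) == null;
  }
}
```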
@codefromthecrypt
Member

For the second remediation (fetch more ids), I've thought a lot and I think the best way is to have a multiplier. For example, fetch 10x more ids than you want (the right value relates to how much variance in span data there is per trace). This is nice because operators can adjust it as they see fit, and it breaks the pattern of users always asking for more than they need.

@codefromthecrypt
Member

While at some sites it will need to be higher, the lowest multiplier I could find that leads to unsurprising results is 3.

codefromthecrypt pushed a commit that referenced this issue Jul 9, 2016
Even when optimized, cassandra indexes will have more rows than distinct
(trace_id, timestamp) needed to satisfy query requests. The side-effect
in most cases is that users get fewer than `QueryRequest.limit` results
back. Lacking the ability to do any deduplication server-side, the only
opportunity left is to address this client-side.

This over-fetches by a multiplier `CASSANDRA_INDEX_FETCH_MULTIPLIER`,
which defaults to 3. For example, if a user requests 10 traces, 30 rows
are requested from indexes, but only 10 distinct trace ids are queried
for span data.

To disable this feature, set `CASSANDRA_INDEX_FETCH_MULTIPLIER=1`

Fixes #1142
codefromthecrypt pushed a commit that referenced this issue Jul 9, 2016
A schema bug resulted in Cassandra not indexing more than bucket count
(10) trace ids per millisecond+search input. This manifested as fewer
traces retrieved by UI search or API query than expected. For example,
if you had 1000 traces that happened on the same service in the same
millisecond, only 10 would be returned.

The indexes affected are `service_span_name_index`, `service_name_index`
and `annotations_index` and this was a schema-only change. Those with
existing zipkin installations should recreate these indexes to solve the
problem.

Fixes #1142
@codefromthecrypt
Member

Final change is in for this issue. Will cut a release post-merge, probably tomorrow: #1177

codefromthecrypt pushed a commit that referenced this issue Jul 10, 2016
…a index (#1177)

* Over-fetches cassandra trace indexes to improve UX

Even when optimized, cassandra indexes will have more rows than distinct
(trace_id, timestamp) needed to satisfy query requests. The side-effect
in most cases is that users get fewer than `QueryRequest.limit` results
back. Lacking the ability to do any deduplication server-side, the only
opportunity left is to address this client-side.

This over-fetches by a multiplier `CASSANDRA_INDEX_FETCH_MULTIPLIER`,
which defaults to 3. For example, if a user requests 10 traces, 30 rows
are requested from indexes, but only 10 distinct trace ids are queried
for span data.

To disable this feature, set `CASSANDRA_INDEX_FETCH_MULTIPLIER=1`

Fixes #1142

* Fixes Cassandra indexes that lost traces in the same millisecond (#1153)

A schema bug resulted in Cassandra not indexing more than bucket count
(10) trace ids per millisecond+search input. This manifested as fewer
traces retrieved by UI search or API query than expected. For example,
if you had 1000 traces that happened on the same service in the same
millisecond, only 10 would be returned.

The indexes affected are `service_span_name_index`, `service_name_index`
and `annotations_index` and this was a schema-only change. Those with
existing zipkin installations should recreate these indexes to solve the
problem.

Fixes #1142