The limit problem #1142

Closed
luoyongjiee opened this issue Jun 24, 2016 · 20 comments · Fixed by #1177

Comments

@luoyongjiee

[screenshots: zipkin-pic1, zipkin-pic2]
Hello, when I query in the zipkin-ui (the data is stored in Cassandra):
if the limit param is 10, it shows "Showing: 6 of 6";
if the limit param is 20, it shows "Showing: 12 of 12".
What is the problem? Thank you!

@codefromthecrypt
Member

codefromthecrypt commented Jun 24, 2016 via email

@luoyongjiee
Author

I found that if no new requests come in, when I query, the JSON results are returned in descending timestamp order. Why?
It looks strange, but if the storage is MySQL, it always returns the correct number.

@codefromthecrypt
Member

codefromthecrypt commented Jun 24, 2016

I found that if no new requests come in, when I query, the JSON results are returned in descending timestamp order. Why?

I don't 100% understand the nature of your question, so I'll assume you are
asking only about why the order is descending on timestamp (vs something else
related to "no new requests come in").

Not all Zipkin storage backends include arbitrary ordering capabilities.
Timestamps are a natural partition, as well as something straightforward to
sort and apply further predicates to. In other words, it is implementable.
Descending based on a timestamp provides more stable data than ascending
when we keep in mind that the default timestamp is now (for example, if
you are using zipkin you are often responding to an issue, vs clicking
refresh until an issue occurs).

The server-side sort order is a fairly well documented and tested part of
zipkin (other predicates directly apply given this ordering assumption,
including limit, duration, etc.). The good news is that you can assume it
works and report bugs if it doesn't:
http://zipkin.io/zipkin-api/#/paths/%252Ftraces

It looks strange, but if the storage is MySQL, it always returns the
correct number.

Similar to the other issue you've raised, we need a way to test this. I
understand you have screen shots, but until you can produce JSON that
another person can use to reproduce the problem, we cannot validate or
solve the issue.

I'd recommend using the API, and returning the JSON you mention that works
in MySQL but doesn't in Cassandra. Even better if you can send a failing
test patch for
https://github.com/openzipkin/zipkin/blob/master/zipkin/src/test/java/zipkin/storage/SpanStoreTest.java

@yurishkuro
Contributor

fwiw, I know exactly why LIMIT is not working correctly with Cassandra. In MySQL all the data is in one place, so however complex the query is, it is first satisfied against all AND clauses and then a limit is applied. With Cassandra, each AND condition may need to be resolved against a different index table, by doing direct shard key lookup. So instead of (x AND y AND z) % LIMIT, the Cassandra SpanStore implementation does (x % LIMIT) AND (y % LIMIT) AND (z % LIMIT). The resulting intersection can easily produce < LIMIT results, quite often 0 if you have many AND clauses and LIMIT is small.
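
A minimal sketch (not Zipkin's actual code) of the intersection behaviour described above, assuming each index lookup has already applied its own LIMIT and returned a set of trace ids:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class IndexIntersection {
  /** Intersects per-index results, i.e. (x % LIMIT) AND (y % LIMIT) AND (z % LIMIT). */
  static Set<Long> intersect(List<Set<Long>> perIndexResults) {
    Set<Long> traceIds = new LinkedHashSet<>(perIndexResults.get(0));
    for (Set<Long> next : perIndexResults.subList(1, perIndexResults.size())) {
      traceIds.retainAll(next); // each intersection can only shrink the result
    }
    return traceIds; // can easily hold fewer than LIMIT ids, often zero
  }
}
```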

@codefromthecrypt
Member

@yurishkuro You are right to mention the above when multiple conditions exist. We should document this somewhere besides the code, probably cassandra's README I guess. We should also make a failing test that we can skip in the cassandra module.

All that said, I still think we need failing JSON for the issue as reported, because it isn't a complex query. Unless you know otherwise, it still seems unexpected to return fewer than limit results when the query is simple, right?

In the above screen shots, there are no query conditions except serviceName, and the cassandra logic appears to only use limit once: for select-trace-ids-by-service-name. The select-traces query (which gets the traces given ids) uses a different limit, a very large one, maxTraceCols, which defaults to 100000.

Ex. here's a trace for a simple serviceName query in cassandra:

[screen shot 2016-06-25 at 9 19 32 am]

@yurishkuro
Contributor

Yes, making the LIMIT issue reproducible would be nice.

On a simple query, I wonder if this is because the same trace ID gets returned multiple times. Most index tables in Cassandra allow dups of (search_key -> trace_id) because timestamps are used to differentiate the records. Doing otherwise would've resulted in lots of tombstones, degrading performance. The LIMIT clause does not know that the same trace_id is being returned.

@codefromthecrypt
Member

codefromthecrypt commented Jun 25, 2016 via email

@codefromthecrypt
Member

codefromthecrypt commented Jun 25, 2016 via email

@codefromthecrypt
Member

So it looks like I can reproduce this issue, as it came up here: #1141 (comment)

@codefromthecrypt
Member

Here's the summary of what I "think" is going on.

The service_name_index will store only one trace_id per:

(bucket, timestamp (millisecond), service_name)

This means it can miss traces that happen against the same service in the same millisecond.

@luoyongjiee can you check your data to see if this is the case?
@yurishkuro @michaelsembwever @danchia can you verify what I'm understanding above makes sense?

I'm able to reproduce this by issuing identical spans that vary only on ids and timestamps (ex in #1141). I use the following query to validate.

SELECT bucket,blobAsBigint(timestampAsBlob(ts)),trace_id FROM service_name_index WHERE bucket in (0,1,2,3,4,5,6,7,8,9) and service_name='semper';

@yurishkuro
Contributor

This means it can miss traces that happen against the same service in the same millisecond.

Yep, sounds right, given PRIMARY KEY ((service_name, bucket), ts). Assuming they also hit the same bucket.

Would've been better with this key:

PRIMARY KEY ((service_name, bucket), ts, trace_id)
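
For illustration, here's a sketch of how that key could look in the table definition (column types are assumptions based on this thread, not the actual zipkin schema files):

```sql
CREATE TABLE IF NOT EXISTS zipkin.service_name_index (
    service_name text,      -- the service the trace touched
    bucket       int,       -- spreads writes across partitions
    ts           timestamp,
    trace_id     bigint,
    PRIMARY KEY ((service_name, bucket), ts, trace_id)
) WITH CLUSTERING ORDER BY (ts DESC, trace_id ASC);
```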

@michaelsembwever
Member

michaelsembwever commented Jun 26, 2016

In Cassandra primary keys: timeuuid should be used instead of timestamp.
This avoids the problem of clobbering traces within the same millisecond.

Regarding having to query individual partitions to get a limit on each: this feature has been introduced in newer versions of Cassandra with the " … PER PARTITION LIMIT x" CQL syntax.

@danchia

danchia commented Jun 27, 2016

Agree that timeuuid is the typical pattern in Cassandra for this. However, since the timestamps and trace_ids here are application-assigned and cannot be changed, in my opinion it would be better to promote the trace_id field to the PRIMARY KEY as suggested by @yurishkuro.

@codefromthecrypt
Member

codefromthecrypt commented Jun 28, 2016

@luoyongjiee update. I've reproduced this problem in a unit test (important so it doesn't creep back in). Yuri's suggestion works fine, but the index needs to be created. We'll have a release with the fix out by tomorrow.

codefromthecrypt pushed a commit that referenced this issue Jun 28, 2016
A schema bug resulted in Cassandra not indexing more than bucket count
(10) trace ids per millisecond+search input. This manifested as fewer
traces retrieved by UI search or API query than expected. For example,
if you had 1000 traces that happened on the same service in the same
millisecond, only 10 would be returned.

The indexes affected are `service_span_name_index`, `service_name_index`
and `annotations_index` and this was a schema-only change. Those with
existing zipkin installations should recreate these indexes to solve the
problem.

Fixes #1142
@codefromthecrypt
Member

Reverted 0d51d90 as it needs more work. We need the query to return only unique trace ids, or to repeat the query until we reach the limit.

Ex.

cqlsh:zipkin> SELECT bucket,blobAsBigint(timestampAsBlob(ts)),trace_id FROM service_name_index WHERE bucket in (0,1,2,3,4,5,6,7,8,9) and service_name='zipkin-server' limit 10;

 bucket | system.blobasbigint(system.timestampasblob(ts)) | trace_id
--------+-------------------------------------------------+----------------------
      0 |                                   1467174838732 |  4603096813731895486
      0 |                                   1467174836587 | -1336993720941103457
      0 |                                   1467174787202 |  6361320438414089915
      0 |                                   1467174787201 |  6361320438414089915
      0 |                                   1467174787201 |  9078688442428604055
      0 |                                   1467174781499 |  5991165789192187628
      0 |                                   1467174778689 | -8155071232048349133
      0 |                                   1467174778448 |  -395579676103804384
      0 |                                   1467174777530 |  3288432430321170335
      0 |                                   1467174776014 |  1836663428203143817

(10 rows)

@codefromthecrypt
Member

So, I'll wait for the experts for ideas. I want to get distinct trace_ids, as opposed to a row for every span reported in the trace (often 2 rows per span). @michaelsembwever @yurishkuro @danchia rescue request :)

@codefromthecrypt
Member

codefromthecrypt commented Jun 29, 2016

I think this is the last update I have on this issue.

QueryRequest.limit is always higher than the number of results returned from Cassandra queries, regardless of the schema adjustment we made (though the schema change does help with precision). That's because the limit is applied to redundant index entries, which are deduped client-side. The redundant index count is related to the span count per trace.

It is a fool's errand to attempt to deduplicate trace ids on the (cassandra) server side because trace ids aren't a partition key. We can only do distinct clauses on partition keys. The only way left is to compensate on the (cassandra) client side: zipkin, in this case.

Here are two concrete proposals:

Change CassandraSpanConsumer to cache trace id indexes locally

By caching trace id indexes locally, we can ensure that at least in the same collector shard, we don't write the same unique input more than once per trace. This will be most effective for those who run a single collector or consistently route trace ids to collector instances. Even randomly routed collectors should see smaller indexes, when spans in the same trace are bundled when reported from tracers.

Change CassandraSpanStore to fetch more trace ids than QueryRequest.limit

The trace id query returns very little data: trace id and timestamp. One option is to just prefetch more than limit and dedupe client side. Since this issue is amplified at low trace counts, we can simply make a floor of 100 and dedupe to QueryRequest.limit. This could be a first step before we do something like multiple expanding queries.
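
As a rough sketch of this second proposal (names are illustrative, not the actual CassandraSpanStore change): over-fetch index rows, then keep only the first `limit` distinct trace ids.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class OverFetch {
  /** indexRows can repeat a trace id many times (roughly one row per span). */
  static Set<Long> dedupeToLimit(List<Long> indexRows, int limit) {
    Set<Long> distinct = new LinkedHashSet<>(); // keeps the newest-first order from the index
    for (Long traceId : indexRows) {
      distinct.add(traceId);
      if (distinct.size() == limit) break; // stop once we have `limit` distinct ids
    }
    return distinct;
  }
}
// e.g. fetch Math.max(100, limit) index rows, then dedupeToLimit(rows, limit)
```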

codefromthecrypt pushed a commit that referenced this issue Jul 1, 2016
The previous code had a mechanism to reduce writes to two indexes:
`service_name_index` and `service_span_name_index`. This mechanism would
prevent writing the same names multiple times. However, it is only
effective on a per-thread basis (as names were stored in thread locals).

In practice, this code is invoked at collection, and collectors have
many request threads per transport. By changing to a shared loading
cache, we can extend the deduplication to all threads. By extracting a
class to do this, we can test the edge cases and make it available for
future work, such as the other indexes.

TODO: write unit tests

See #1142
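
A minimal sketch of the shared-cache idea from that commit (class name and key format here are illustrative, not the extracted class), using a bounded Guava cache so deduplication works across all collector threads rather than per thread:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

class IndexWriteDeduper {
  // Bounded so memory stays flat even with many distinct index inputs in flight.
  private final Cache<String, Boolean> seen =
      CacheBuilder.newBuilder().maximumSize(100_000).build();

  /** Returns true only for the first thread to see this (input, traceId) pair. */
  boolean shouldIndex(String input, long traceId) {
    return seen.asMap().putIfAbsent(input + '/' + traceId, Boolean.TRUE) == null;
  }
}
```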
@codefromthecrypt
Member

For the second remediation (fetch more ids), I've thought a lot and I think the best way is to have a multiplier. For example, fetch 10x more ids than you want (the right value relates to how much variance in span data there is per trace). This is nice because operators can adjust it as they see fit, and it breaks the pattern of users always asking for more than they need.

@codefromthecrypt
Member

While at some sites it will need to be higher, the lowest multiplier I could find that leads to unsurprising results is 3.

codefromthecrypt pushed a commit that referenced this issue Jul 9, 2016
Even when optimized, cassandra indexes will have more rows than distinct
(trace_id, timestamp) needed to satisfy query requests. The side-effect
in most cases is that users get fewer than `QueryRequest.limit` results
back. Lacking the ability to do any deduplication server-side, the only
opportunity left is to address this client-side.

This over-fetches by a multiplier `CASSANDRA_INDEX_FETCH_MULTIPLIER`,
which defaults to 3. For example, if a user requests 10 traces, 30 rows
are requested from indexes, but only 10 distinct trace ids are queried
for span data.

To disable this feature, set `CASSANDRA_INDEX_FETCH_MULTIPLIER=1`

Fixes #1142
codefromthecrypt pushed a commit that referenced this issue Jul 9, 2016
A schema bug resulted in Cassandra not indexing more than bucket count
(10) trace ids per millisecond+search input. This manifested as fewer
traces retrieved by UI search or API query than expected. For example,
if you had 1000 traces that happened on the same service in the same
millisecond, only 10 would be returned.

The indexes affected are `service_span_name_index`, `service_name_index`
and `annotations_index` and this was a schema-only change. Those with
existing zipkin installations should recreate these indexes to solve the
problem.

Fixes #1142
@codefromthecrypt
Member

Final change is in for this issue. Will cut a release post-merge, probably tomorrow: #1177

codefromthecrypt pushed a commit that referenced this issue Jul 10, 2016
…a index (#1177)

* Over-fetches cassandra trace indexes to improve UX

Even when optimized, cassandra indexes will have more rows than distinct
(trace_id, timestamp) needed to satisfy query requests. The side-effect
in most cases is that users get fewer than `QueryRequest.limit` results
back. Lacking the ability to do any deduplication server-side, the only
opportunity left is to address this client-side.

This over-fetches by a multiplier `CASSANDRA_INDEX_FETCH_MULTIPLIER`,
which defaults to 3. For example, if a user requests 10 traces, 30 rows
are requested from indexes, but only 10 distinct trace ids are queried
for span data.

To disable this feature, set `CASSANDRA_INDEX_FETCH_MULTIPLIER=1`

Fixes #1142

* Fixes Cassandra indexes that lost traces in the same millisecond (#1153)

A schema bug resulted in Cassandra not indexing more than bucket count
(10) trace ids per millisecond+search input. This manifested as fewer
traces retrieved by UI search or API query than expected. For example,
if you had 1000 traces that happened on the same service in the same
millisecond, only 10 would be returned.

The indexes affected are `service_span_name_index`, `service_name_index`
and `annotations_index` and this was a schema-only change. Those with
existing zipkin installations should recreate these indexes to solve the
problem.

Fixes #1142