
Should we batch writes to Cassandra into small, unlogged batches? #961

Closed
codefromthecrypt opened this issue Feb 11, 2016 · 4 comments

@codefromthecrypt
Member

aggregating a side-bar into its own topic:

TODO: summarize discussion once it settles

@codefromthecrypt
Member Author

Harvested from comments on #956

@danchia

one improvement I had planned to test was batching writes to Cassandra into small, unlogged batches, which I think would improve performance.
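For readers unfamiliar with the term, here is a minimal sketch of what such an unlogged batch could look like with the DataStax Java driver 3.x; the class name is made up for illustration, and the `Session` and bound span-insert statements are assumed to be prepared elsewhere.

```java
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Session;
import java.util.List;

final class UnloggedBatchWriter {
  /** Writes a group of already-bound span inserts as one unlogged batch. */
  static void writeBatch(Session session, List<BoundStatement> spanInserts) {
    // UNLOGGED skips the batch log: no atomicity guarantee across partitions, but also
    // no extra write amplification; the whole group goes out in a single round trip.
    BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
    for (BoundStatement insert : spanInserts) {
      batch.add(insert);
    }
    session.execute(batch);
  }
}
```

The trade-off discussed below is that the coordinator receiving the batch must then route each row to its replicas, which is exactly the extra load the next comment warns about.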

@yurishkuro

I've been told quite the opposite by the team with more experience with Cassandra. A batch can span multiple nodes, so batching puts more load on the coordinating node, rather than leaving that work to a client with a token-aware connection.

@danchia

I was basing it on https://dzone.com/articles/efficient-cassandra-write, which suggests there could be some benefit to be had if we have locality.

I think the other thing that was motivating me to batch was that, AFAIK, we wait for each trace to be stored before processing the next one, at least for Kafka. This greatly reduces potential throughput compared to, say, allowing up to X store operations in flight in the collector.

@yurishkuro

Thanks for the link, Daniel. It seems the article agrees with what I said: fan-out from the client is more efficient than batching and sending to a coordinator node. But if the collector reads from Kafka and writes to C* one span at a time, then yes, that would be bad and batching might help. It's just that you'd be optimizing the wrong end. We don't use collectors over Kafka yet (actually working on it for another use case), but our collectors (in Go) have an internal queue feeding N goroutines over a token-aware connection. The same thing can easily be done in the Scala/Java collectors.

Also worth noting that Cassandra connections are multiplexed, I think up to 128 or 256 streams, but we haven't gone that high in the # of threads.
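As a point of reference, a token-aware connection with the DataStax Java driver 3.x is just a load-balancing policy choice. A rough sketch follows; the contact point and keyspace are placeholders, not Zipkin's actual configuration.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TokenAwareConnect {
  public static void main(String[] args) {
    // Token awareness routes each statement to a replica owning its partition key,
    // avoiding the extra coordinator hop that a multi-partition batch would take.
    Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1") // placeholder contact point
        .withLoadBalancingPolicy(
            new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
        .build();
    Session session = cluster.connect("zipkin"); // placeholder keyspace
    // ... prepare statements and issue single-partition writes over this session ...
    cluster.close();
  }
}
```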

@danchia

danchia commented Feb 11, 2016

I'll have to look at the article again, but I think it suggests a (very modest) improvement if we have a lot of partition-key locality. But you're probably right that there will be no improvement, or even a performance decrease.

With regard to batching Kafka writes, each StreamProcessor does indeed wait for C* to finish writing, which would limit throughput a lot if you don't have enough partitions or latency isn't particularly good.

https://github.com/openzipkin/zipkin/blob/master/zipkin-receiver-kafka/src/main/scala/com/twitter/zipkin/receiver/kafka/KafkaStreamProcessor.scala#L21

We could fix this by allowing more than one process invocation in flight to increase throughput.
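A rough sketch of one way to do that with the DataStax Java driver 3.x async API: bound the number of outstanding writes with a semaphore so the consumer only blocks once X operations are already in flight. The class name and the bound are made up for illustration.

```java
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import java.util.concurrent.Semaphore;

/** Issues writes asynchronously, but never more than maxInFlight at a time. */
final class BoundedAsyncWriter {
  private final Session session;
  private final Semaphore inFlight;

  BoundedAsyncWriter(Session session, int maxInFlight) {
    this.session = session;
    this.inFlight = new Semaphore(maxInFlight);
  }

  void write(Statement statement) throws InterruptedException {
    inFlight.acquire(); // blocks the caller once maxInFlight writes are pending
    ResultSetFuture future = session.executeAsync(statement);
    // Release the permit when the write completes (successfully or not).
    future.addListener(inFlight::release, Runnable::run);
  }
}
```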

@yurishkuro
Contributor

I haven't looked deeply into how Kafka consumers work or whether running more than one thread would increase read throughput. But what I was suggesting is putting an intermediate in-process queue in between:

{kafka reader(s)} => {queue} => {executor pool of Cassandra writers}

If the reader(s) can read fast enough, you can tune this to saturate Cassandra write throughput just by increasing the # of threads on the right. The queue can be blocking, to create back pressure.

The other advantage of this approach is that it allows the collectors to receive data via different input sources.
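A minimal sketch of that pipeline in Java, assuming the readers hand the writers already-serialized spans; the class and interface names, queue size, and thread count are illustrative, not Zipkin's actual components.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** {kafka reader(s)} => {bounded queue} => {executor pool of Cassandra writers}. */
final class CollectorPipeline {
  /** Placeholder for whatever component actually writes spans to Cassandra. */
  interface SpanStore {
    void apply(List<byte[]> serializedSpans);
  }

  // Bounded queue: when the Cassandra writers fall behind, readers block on put(),
  // which is the back pressure mentioned above.
  private final BlockingQueue<List<byte[]>> queue = new ArrayBlockingQueue<>(1000);

  /** Called by the Kafka reader thread(s). */
  void onMessage(List<byte[]> serializedSpans) throws InterruptedException {
    queue.put(serializedSpans);
  }

  /** Throughput is tuned by raising writerThreads until Cassandra saturates. */
  void startWriters(int writerThreads, SpanStore store) {
    ExecutorService writers = Executors.newFixedThreadPool(writerThreads);
    for (int i = 0; i < writerThreads; i++) {
      writers.execute(() -> {
        while (!Thread.currentThread().isInterrupted()) {
          try {
            store.apply(queue.take()); // each writer drains the queue and writes to C*
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      });
    }
  }
}
```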

@jorgheymans
Contributor

I'm going to go out on a limb and close this discussion for now. There was no conclusion on which approach is more beneficial, and it's been four years since, so we can assume the current approach works well enough for most sites. Incidentally, work is being done by Grand Moft Cole to upgrade the Cassandra driver to 4.0 (#3154 and others), but I doubt this will have any consequences for this discussion.

If anybody still wants to get into the fine details on this, feel free to reopen and provide hard numbers.
