
Should we batch writes to Cassandra into small, unlogged batches? #961

Closed
codefromthecrypt opened this issue Feb 11, 2016 · 4 comments

@codefromthecrypt
Member

aggregating a side-bar into its own topic:

TODO: summarize discussion once it settles

@codefromthecrypt
Member Author

Harvested from comments on #956

@danchia

one improvement I had planned to test was batching writes to Cassandra into small, unlogged batches, which I think would improve performance.
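For readers unfamiliar with the term, here is a minimal sketch of what such an unlogged batch could look like with the DataStax Java driver 3.x; the class name is made up for illustration, and the `Session` and bound span-insert statements are assumed to be prepared elsewhere.

```java
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Session;
import java.util.List;

final class UnloggedBatchWriter {
  /** Writes a group of already-bound span inserts as one unlogged batch. */
  static void writeBatch(Session session, List<BoundStatement> spanInserts) {
    // UNLOGGED skips the batch log: no atomicity guarantee across partitions, but also
    // no extra write amplification; the whole group goes out in a single round trip.
    BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
    for (BoundStatement insert : spanInserts) {
      batch.add(insert);
    }
    session.execute(batch);
  }
}
```

The trade-off discussed below is that the coordinator receiving the batch must then route each row to its replicas, which is exactly the extra load the next comment warns about.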

@yurishkuro

I've been told quite the opposite by the team with more experience with Cassandra. A batch can span multiple nodes, so batching puts more load on the coordinating node, rather than leaving that work to a client with a token-aware connection.

@danchia

I was basing it on https://dzone.com/articles/efficient-cassandra-write, which suggests there could be some benefit to be had if we have locality.

I think the other thing that was motivating me to batch was that, AFAIK, we wait for each trace to be stored before processing the next one, at least for Kafka. This greatly reduces potential throughput compared to, say, allowing up to X store operations in flight in the collector.

@yurishkuro

Thanks for the link, Daniel. It seems the article agrees with what I said: fan-out from the client is more efficient than batching and sending to a coordinator node. But if the collector reads from Kafka and writes to C* one span at a time, then yes, that would be bad and batching might help. It's just that you'd be optimizing the wrong end. We don't use collectors over Kafka yet (actually working on it for another use case), but our collectors (in Go) have an internal queue feeding N goroutines over a token-aware connection. The same thing can easily be done in the Scala/Java collectors.

Also worth noting that Cassandra connections are multiplexed, I think up to 128 or 256 streams, but we haven't gone that high in the # of threads.
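As a point of reference, a token-aware connection with the DataStax Java driver 3.x is just a load-balancing policy choice. A rough sketch follows; the contact point and keyspace are placeholders, not Zipkin's actual configuration.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TokenAwareConnect {
  public static void main(String[] args) {
    // Token awareness routes each statement to a replica owning its partition key,
    // avoiding the extra coordinator hop that a multi-partition batch would take.
    Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1") // placeholder contact point
        .withLoadBalancingPolicy(
            new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
        .build();
    Session session = cluster.connect("zipkin"); // placeholder keyspace
    // ... prepare statements and issue single-partition writes over this session ...
    cluster.close();
  }
}
```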

@danchia

danchia commented Feb 11, 2016

I'll have to look at the article again, but I think it suggests a (very modest) improvement if we have a lot of partition-key locality. But you're probably right that there will be no improvement, or even a performance decrease.

With regard to batching Kafka writes, each StreamProcessor does indeed wait for C* to finish writing, which would limit throughput a lot if you don't have enough partitions or latency isn't particularly good.

https://github.com/openzipkin/zipkin/blob/master/zipkin-receiver-kafka/src/main/scala/com/twitter/zipkin/receiver/kafka/KafkaStreamProcessor.scala#L21

We could fix this by allowing more than one process invocation in flight to increase throughput.
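A rough sketch of one way to do that with the DataStax Java driver 3.x async API: bound the number of outstanding writes with a semaphore so the consumer only blocks once X operations are already in flight. The class name and the bound are made up for illustration.

```java
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import java.util.concurrent.Semaphore;

/** Issues writes asynchronously, but never more than maxInFlight at a time. */
final class BoundedAsyncWriter {
  private final Session session;
  private final Semaphore inFlight;

  BoundedAsyncWriter(Session session, int maxInFlight) {
    this.session = session;
    this.inFlight = new Semaphore(maxInFlight);
  }

  void write(Statement statement) throws InterruptedException {
    inFlight.acquire(); // blocks the caller once maxInFlight writes are pending
    ResultSetFuture future = session.executeAsync(statement);
    // Release the permit when the write completes (successfully or not).
    future.addListener(inFlight::release, Runnable::run);
  }
}
```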

@yurishkuro
Contributor

I haven't looked deeply into how Kafka consumers work or whether running more than one thread would increase read throughput. But what I was suggesting is putting an intermediate in-process queue in between:

{kafka reader(s)} => {queue} => {executor pool of Cassandra writers}

If the reader(s) can read fast enough, you can tune this to saturate Cassandra write throughput just by increasing the # of threads on the right. The queue can be blocking, to create back pressure.

The other advantage of this approach is that it allows the collectors to receive data via different input sources.
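A minimal sketch of that pipeline in Java, assuming the readers hand the writers already-serialized spans; the class and interface names, queue size, and thread count are illustrative, not Zipkin's actual components.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** {kafka reader(s)} => {bounded queue} => {executor pool of Cassandra writers}. */
final class CollectorPipeline {
  /** Placeholder for whatever component actually writes spans to Cassandra. */
  interface SpanStore {
    void apply(List<byte[]> serializedSpans);
  }

  // Bounded queue: when the Cassandra writers fall behind, readers block on put(),
  // which is the back pressure mentioned above.
  private final BlockingQueue<List<byte[]>> queue = new ArrayBlockingQueue<>(1000);

  /** Called by the Kafka reader thread(s). */
  void onMessage(List<byte[]> serializedSpans) throws InterruptedException {
    queue.put(serializedSpans);
  }

  /** Throughput is tuned by raising writerThreads until Cassandra saturates. */
  void startWriters(int writerThreads, SpanStore store) {
    ExecutorService writers = Executors.newFixedThreadPool(writerThreads);
    for (int i = 0; i < writerThreads; i++) {
      writers.execute(() -> {
        while (!Thread.currentThread().isInterrupted()) {
          try {
            store.apply(queue.take()); // each writer drains the queue and writes to C*
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      });
    }
  }
}
```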

@jorgheymans
Contributor

I'm going to go out on a limb and close this discussion for now. There was no conclusion on which approach is more beneficial, and it's been four years since, so we can assume the current approach works well enough for most sites. Incidentally, work is being done by Grand Moft Cole to upgrade the Cassandra driver to 4.0 (#3154 and others), but I doubt this will have any consequences for this discussion.

If anybody still wants to get into the fine details on this, feel free to reopen and provide hard numbers.
