Should we batch writes to Cassandra into small, unlogged batches? #961
Comments
Harvested from comments on #956
I think the other thing that was motivating me to batch was that AFAIK we wait for each trace to be stored before processing the next one, at least for Kafka. This greatly reduces potential throughput compared with, say, allowing up to X store operations in flight in the collector.
Also worth noting that Cassandra connections are multiplexed, I think up to 128 or 256 streams, but we haven't gone that high in the # of threads.
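To make the "up to X store operations in flight" idea concrete, here is a minimal sketch using a `Semaphore` to cap pending asynchronous writes. `AsyncStore`, `storeAll` and `sleepQuietly` are hypothetical names made up for illustration; they are not part of the collector or of the Cassandra driver, and the fake store in `main` only sleeps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

final class InFlightLimiterSketch {
  /** Hypothetical stand-in for an asynchronous storage write. */
  interface AsyncStore {
    CompletableFuture<Void> store(String trace);
  }

  /**
   * Submits every trace without waiting for the previous one to finish,
   * blocking only while maxInFlight stores are pending.
   * Returns the peak number of concurrently pending stores observed.
   */
  static int storeAll(Iterable<String> traces, AsyncStore store, int maxInFlight) {
    Semaphore permits = new Semaphore(maxInFlight);
    AtomicInteger inFlight = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    for (String trace : traces) {
      permits.acquireUninterruptibly(); // the consumer loop pauses here once X stores are pending
      peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
      store.store(trace).whenComplete((ok, err) -> {
        // Error handling is omitted; a real collector would log or count failures.
        inFlight.decrementAndGet();
        permits.release();
      });
    }
    // Wait for the remaining stores by collecting every permit, then hand them back.
    permits.acquireUninterruptibly(maxInFlight);
    permits.release(maxInFlight);
    return peak.get();
  }

  private static void sleepQuietly(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(8);
    // Fake store: pretends each write takes about 5 ms.
    AsyncStore fakeStore = trace -> CompletableFuture.runAsync(() -> sleepQuietly(5), pool);
    List<String> traces = new ArrayList<>();
    for (int i = 0; i < 100; i++) traces.add("trace-" + i);
    int peak = storeAll(traces, fakeStore, 4);
    pool.shutdown();
    System.out.println("peak in-flight stores: " + peak);
  }
}
```

The limit would presumably be tuned against the per-connection stream count mentioned above, but that is a guess rather than something measured here.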
I'll have to look at the article again, but I think if we have a lot of partition key locality the article suggests a (very modest) improvement. But you're probably right that there will be no improvement, or even a performance decrease. Regarding batching Kafka writes, each StreamProcessor does indeed wait for C* to finish writing, which would limit throughput a lot if you don't have enough partitions or latency isn't particularly good. We could fix this by allowing for more than 1
I haven't looked deeply into how Kafka consumers work and whether running >1 thread will increase read throughput. But what I was suggesting is putting an intermediate in-process queue
If reader(s) can read fast enough, you can tune this to saturate Cassandra write throughput just by increasing the # of threads on the right. The queue can be blocking, to cause back pressure. The other advantage of this approach is in allowing the collectors to receive data via different input sources.
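A rough sketch of that queue arrangement, assuming a bounded `ArrayBlockingQueue` between the reader and a configurable number of writer threads. `QueuedWriterSketch`, `drain` and the `store` callback are hypothetical names for illustration only; the callback stands in for whatever actually writes to Cassandra.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

final class QueuedWriterSketch {
  // Unique sentinel object telling a writer thread to stop (compared by identity).
  private static final String POISON = new String("poison");

  /**
   * Pushes every input item through a bounded queue to writerThreads workers.
   * Returns the number of items handed to the store callback.
   */
  static int drain(Iterable<String> input, Consumer<String> store,
                   int queueCapacity, int writerThreads) {
    BlockingQueue<String> queue = new ArrayBlockingQueue<>(queueCapacity);
    AtomicInteger written = new AtomicInteger();
    Thread[] writers = new Thread[writerThreads];
    for (int i = 0; i < writerThreads; i++) {
      writers[i] = new Thread(() -> {
        try {
          for (String item = queue.take(); item != POISON; item = queue.take()) {
            store.accept(item); // the slow storage write happens here
            written.incrementAndGet();
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      writers[i].start();
    }
    try {
      // put() blocks while the queue is full, which is the back pressure on the reader.
      for (String item : input) queue.put(item);
      // One sentinel per writer so every thread shuts down.
      for (int i = 0; i < writerThreads; i++) queue.put(POISON);
      for (Thread writer : writers) writer.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return written.get();
  }

  public static void main(String[] args) {
    List<String> spans = new ArrayList<>();
    for (int i = 0; i < 1000; i++) spans.add("span-" + i);
    // The no-op store callback is a placeholder for a real write.
    int written = drain(spans, span -> { }, 64, 4);
    System.out.println("written: " + written);
  }
}
```

Because the reader only ever calls `put`, any input source could feed the same queue, and write throughput is tuned with `writerThreads` alone. Failure handling in the store callback is left out of this sketch.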
I'm going to go out on a limb and close this discussion for now. There was no conclusion on which approach is more beneficial, and it's been four years since, so we can assume that the current approach works well enough for most sites. Incidentally, work is being done by Grand Moft Cole to upgrade the Cassandra driver to 4.0 (#3154 and others), but I doubt this will have any consequences for this discussion. If anybody still wants to get into the fine details on this, feel free to reopen and provide the hard numbers.
aggregating a side-bar into its own topic:
TODO: summarize discussion once it settles