-
How is it going with the design? @egaxhaj-figure, @i-norden, I propose to make an update to ADR-038. Can anyone start a draft PR?
-
I would love to see a draft PR for the ADR-038 update specifying the architecture design.
@i-norden, @egaxhaj-figure - are you going to handle the ADR update? I think that will be the best way to discuss the design in more detail.
-
@egaxhaj-figure, @iramiller, and I had a meeting today to discuss the Kafka streaming service plugin for ADR-038.
The primary takeaways from this meeting are:
@egaxhaj-figure is going to begin work on a Kafka plugin that satisfies the ADR-038 streaming service interface. This work will be based on this branch: https://github.com/i-norden/cosmos-sdk/tree/plugin_arch. The service will perform "bulk streaming" of ABCI messages and their resultant state changes, with minimal processing other than the delineation necessary to maintain the grouping of state changes with the ABCI requests that caused them and the ABCI responses they affected. This is similar to the delineation used in the file streaming service, but the exact format is TBD, and hammering out the details of that format (aka #10337) is one of the purposes of this discussion thread.
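For illustration, below is a rough, self-contained sketch (in Go) of the shape such a plugin could take. The listener hook shown is a simplified paraphrase of the interface described in ADR-038 (the real SDK hooks take a context and ABCI request/response types), and `BlockEvent`, `KafkaStreamingService`, and the `publish` callback are hypothetical stand-ins for the eventual wire format and Kafka producer, not part of the SDK.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BlockEvent is a hypothetical envelope for one "bulk stream" grouping:
// an ABCI request, the state changes it caused, and the associated ABCI
// response. The field names are illustrative; the real wire format is
// exactly what #10337 is meant to pin down.
type BlockEvent struct {
	BlockHeight  int64    `json:"block_height"`
	MessageType  string   `json:"message_type"` // "begin_block", "deliver_tx", "end_block"
	TxIndex      *int64   `json:"tx_index,omitempty"`
	Request      []byte   `json:"request"`       // encoded ABCI request
	Response     []byte   `json:"response"`      // encoded ABCI response
	StateChanges [][]byte `json:"state_changes"` // encoded store KV pairs captured by the listeners
}

// KafkaStreamingService sketches the plugin shape. A full implementation
// would satisfy the ADR-038 listener hooks (ListenBeginBlock, ListenDeliverTx,
// ListenEndBlock, plus the store WriteListeners) and forward each grouping
// to a Kafka producer; publish stands in for that producer here.
type KafkaStreamingService struct {
	publish func(key string, value []byte) error
}

// ListenDeliverTx mirrors the hook the SDK calls once per DeliverTx cycle.
// The signature is simplified (plain byte slices instead of the SDK's
// context and ABCI types) to keep the sketch self-contained.
func (k *KafkaStreamingService) ListenDeliverTx(height, txIndex int64, req, res []byte, changes [][]byte) error {
	idx := txIndex
	ev := BlockEvent{
		BlockHeight:  height,
		MessageType:  "deliver_tx",
		TxIndex:      &idx,
		Request:      req,
		Response:     res,
		StateChanges: changes,
	}
	value, err := json.Marshal(ev)
	if err != nil {
		return err
	}
	return k.publish(fmt.Sprintf("%d/deliver_tx/%d", height, txIndex), value)
}

func main() {
	// Wire the plugin to a stub producer that just prints what would be sent.
	svc := &KafkaStreamingService{
		publish: func(key string, value []byte) error {
			fmt.Printf("topic=block_events key=%s value=%s\n", key, value)
			return nil
		},
	}
	_ = svc.ListenDeliverTx(100, 0, []byte("req"), []byte("res"), [][]byte{[]byte("kv-pair")})
}
```

The point of the sketch is the grouping: each record carries the block height, message type, and (for `DeliverTx`) the tx index, so downstream consumers can reassemble a block's groupings regardless of how the stream is later decomposed.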
Specific to Kafka, in addition to the format of the data itself, there is the question of how many topics and/or partitions to publish to. We need to maintain proper ordering of our data, so our topic(s) will only use a single partition. We could publish each type of ABCI grouping (req + state changes + res) to its own topic (`BeginBlock`, `EndBlock`, `DeliverTx`). For `BeginBlock` and `EndBlock` this doesn't complicate things, since there is only one req/res cycle per block and so the Kafka record offset would correspond to the block height. However, with `DeliverTx` there is an indeterminate/variable number of `DeliverTx` messages per block, so we would need additional delineation to maintain proper grouping of a set of `DeliverTx` records and their ordering relative to the `BeginBlock` and `EndBlock` records. Alternatively, we could write all the data out to a single topic (see the keying sketch below).

For many systems, further decomposition of this "bulk stream" into more granular Kafka topics will be desired. To this end, @iramiller is speccing out an auxiliary service for this purpose. This is handled outside of the SDK because 1) we want to minimize backpressure on the regular processes of the Tendermint/Cosmos blockchain+application and 2) this decomposition will be application/system specific. This would fall into a separate ADR, or outside the scope of the SDK altogether.
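To make the delineation question concrete, here is a minimal sketch (in Go) of one way to key records if everything were written to a single topic and partition. The helper names and the zero-padded key layout are hypothetical, not part of ADR-038; the actual scheme is precisely what #10337 needs to settle.

```go
package main

import "fmt"

// Hypothetical key builders for the single-topic layout discussed above.
// Zero-padding the block height and tx index makes the keys sort
// lexicographically in chain order, and the middle segment keeps each
// block's records grouped as begin_block -> deliver_tx... -> end_block.
// (Kafka itself orders by offset within a partition; these keys exist so
// downstream consumers can re-group and verify completeness per block.)

func beginBlockKey(height int64) string {
	return fmt.Sprintf("%020d/0_begin_block", height)
}

func deliverTxKey(height, txIndex int64) string {
	return fmt.Sprintf("%020d/1_deliver_tx/%010d", height, txIndex)
}

func endBlockKey(height int64) string {
	return fmt.Sprintf("%020d/2_end_block", height)
}

func main() {
	// One block with two DeliverTx messages, printed in the order the
	// plugin would receive them from the ABCI hooks.
	fmt.Println(beginBlockKey(12345))
	fmt.Println(deliverTxKey(12345, 0))
	fmt.Println(deliverTxKey(12345, 1))
	fmt.Println(endBlockKey(12345))
}
```

With per-type topics instead, the `BeginBlock`/`EndBlock` keys could collapse to just the block height, while `DeliverTx` would still need the block height plus a tx index to stay grouped and ordered relative to its block.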