PDP 44: Single Routing Key Transactions
Pravega provides transactions for atomically updating a batch of events across routing keys. The transactions are designed such that when users create a new transaction, a new set of “shadow” segments is created for the transaction. These segments are separate from the stream’s active segment set and mimic its key range distribution, so that events written into the transaction are routed to the corresponding shadow segments exactly as they would have been routed had they been written directly into the stream. When a commit request is issued on the transaction, the Pravega controller service coordinates merging the shadow transaction segments into their corresponding stream segments.
This transaction abstraction is very powerful as it allows users to get atomic guarantees for events written across multiple routing keys. However, the lifecycle of a transaction during both the create and commit phases requires distributed coordination across multiple segments, which could be spread across multiple segment store instances.
All the distributed coordination performed by the controller service makes transactions costly, and that cost is neither justified nor amortized when a transaction contains only a small number of events.
Avoiding the overheads of centralized coordination would make transactions much faster to work with. However, allowing multiple routing keys in a transaction inherently means the events may be mapped to different segments, so some form of coordination across multiple segments is inevitable to provide atomic guarantees.
There is a popular use case where users want to write several related events for a single routing key atomically. Using the aforementioned transactions for such atomicity could prove costly, particularly when the number of events being written is small. We anticipate this to be a common use case, and for small single routing key transactions users may expect write performance equivalent to writing non-transactional data into the stream.
By limiting the scope of atomicity to a single routing key, all data is written to a single segment, which requires zero coordination while still providing atomic update guarantees for a batch of events.
We propose to introduce a new writer API for atomically writing a batch of events with the same routing key. This abstraction is scoped to a single routing key and provides atomic write guarantees for the batch. The mechanism for batch updates leverages the Append Batch wire protocol that Pravega writers already use for writing data to Pravega. The protocol is initiated when a writer sets up a new append with the segment store; the serialized bytes corresponding to the events are then transmitted to the segment store in Append Block packets until an Append Block End packet is sent, indicating the end of the batch. The append blocks are buffered on the server until the Append Block End is received, at which point all the data is written into the segment.
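For illustration, the handshake can be sketched as follows. This is a simplified, hypothetical flow: the command names mirror the wire protocol described above, but the `Connection` interface and classes here are stand-ins rather than Pravega's actual wire command types.

```java
import java.util.List;
import java.util.UUID;

// Illustrative sketch of the append-batch handshake; not Pravega's actual WireCommands types.
public class AppendBatchFlow {

    // Hypothetical connection abstraction that transmits one wire command at a time.
    interface Connection {
        void send(String command, Object payload);
    }

    static void writeBatch(Connection connection, UUID writerId, String segment, List<byte[]> events) {
        // 1. Establish the append context for this writer on the target segment.
        connection.send("SetupAppend", writerId + "/" + segment);

        // 2. Stream the serialized events; the server buffers these blocks.
        for (byte[] event : events) {
            connection.send("AppendBlock", event);
        }

        // 3. Mark the end of the batch; only now is the buffered data written to the segment.
        connection.send("AppendBlockEnd", events.size());
    }
}
```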
So we propose to introduce a new API in Pravega writers for writing a batch of events:
```java
/**
* Write an ordered list of events to the stream atomically for a given routing key.
* Events written with the same routing key will be read by readers in exactly the same order they were written.
* Each individual serialized event should be at most {@link Serializer#MAX_EVENT_SIZE} and the
* collective batch should be less than twice {@link Serializer#MAX_EVENT_SIZE}.
*
* Note that the implementation provides retry logic to handle connection failures and service
* host failures. Internal retries will not violate the exactly once semantic so it is better to
* rely on this than to wrap this method with custom retry logic.
*
* @param routingKey A free form string that is used to route messages to readers. Two events written with
* the same routingKey are guaranteed to be read in order. Two events with different routing keys
* may be read in parallel.
* @param events The batch of events to be written to the stream (Null is disallowed)
* @return A CompletableFuture that will complete when the events have been durably stored on the configured
* number of replicas, and are available for readers to see. This future may complete exceptionally
* if this cannot happen, however these exceptions are not transient failures. Failures that occur
* as a result of connection drops or host death are handled internally with multiple retries and
* exponential backoff. So there is no need to attempt to retry in the event of an exception.
*/
CompletableFuture<Void> writeEvents(String routingKey, List<Type> events);
```
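For illustration, a writer using the proposed API might look like the sketch below. The scope, stream, and controller URI are placeholders, and the `writeEvents` overload is assumed to be exposed on `EventStreamWriter` alongside the existing single-event `writeEvent` methods, as proposed in this PDP.

```java
import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.impl.UTF8StringSerializer;
import java.net.URI;
import java.util.Arrays;

public class BatchWriteExample {
    public static void main(String[] args) {
        // Placeholder controller endpoint, scope and stream names.
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090"))
                .build();
        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("myScope", config);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     "myStream", new UTF8StringSerializer(), EventWriterConfig.builder().build())) {
            // All three events share a routing key and are appended atomically:
            // readers either see all of them, in order, or none of them.
            writer.writeEvents("customer-42",
                    Arrays.asList("order-created", "order-paid", "order-shipped"))
                  .join();
        }
    }
}
```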
This writeEvents API takes a routing key and a list of events and writes them into the writer's pending batch of events atomically. This requires changing the pending event mechanism in SegmentOutputWriter to allow more than one event to be encapsulated in a single pending event, so that all events are included in a single batch. From the transport layer's perspective, these events are always bundled together into a single pending event object and are always included as part of a single append batch. We also propose to impose a per-event limit of 8 MB on each event in the batch.
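Conceptually, the change amounts to letting one pending entry carry the concatenated serialization of several events so that they travel as a single append. The sketch below is a simplified, hypothetical stand-in for that idea, not the actual client class:

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Simplified stand-in for a pending entry in the writer: the whole batch is
// serialized into one buffer so that it travels as a single append.
public class PendingBatch {
    static final int MAX_EVENT_SIZE = 8 * 1024 * 1024;  // per-event limit proposed above

    final String routingKey;
    final ByteBuffer data;                               // concatenated serialized events
    final int eventCount;
    final CompletableFuture<Void> ackFuture = new CompletableFuture<>();

    PendingBatch(String routingKey, List<ByteBuffer> serializedEvents) {
        int total = 0;
        for (ByteBuffer event : serializedEvents) {
            if (event.remaining() > MAX_EVENT_SIZE) {
                throw new IllegalArgumentException("Event exceeds the 8 MB per-event limit");
            }
            total += event.remaining();
        }
        ByteBuffer combined = ByteBuffer.allocate(total);
        serializedEvents.forEach(combined::put);
        combined.flip();
        this.routingKey = routingKey;
        this.data = combined;
        this.eventCount = serializedEvents.size();
    }
}
```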
Since the batch is completed only when the writer sends the Append Block End, any partially created batch is discarded by the segment store in case of a connection failure; the client then establishes a new connection with a new setup append and retransmits all the events.
If the client had already issued an Append Block End, it reuses the same writer id while reconnecting to the segment, and as part of the setup append it learns the event sequence number up to which its data has been written. If that sequence number is at or ahead of the starting sequence number of the current batch, the current batch has already been included in the segment and is not retransmitted.
If the writer had already issued the Append Block End but experienced a connection glitch followed by the segment being sealed, the writer checks the state of the sealed segment to determine the last written event sequence number, and if the batch has not been written already, its events are retransmitted to the successor segment.
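The retransmission decision on reconnect can be summarized by a check like the following hypothetical sketch, where the acknowledged event number is whatever the setup append response (or the sealed segment's attributes) reports for this writer id:

```java
// Hypothetical sketch of the retransmission decision after a reconnect.
public class ReconnectDecision {

    static boolean mustRetransmit(long lastAckedEventNumber, long batchStartEventNumber) {
        // If the server has already acknowledged an event number at or beyond the
        // start of this batch, the whole batch was committed atomically before the
        // connection dropped, so resending it would duplicate the events.
        return lastAckedEventNumber < batchStartEventNumber;
    }

    static void onReconnect(long lastAckedEventNumber, long batchStartEventNumber,
                            Runnable resendBatch) {
        if (mustRetransmit(lastAckedEventNumber, batchStartEventNumber)) {
            // Partial blocks were discarded by the segment store; replay the batch
            // (possibly against the successor segment if the original was sealed).
            resendBatch.run();
        }
    }
}
```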
There is a corner case which also affects Pravega's existing exactly-once write guarantees: the client encounters a network glitch after issuing the Append Block End, and when it reconnects it finds that the segment has been deleted. In such cases there is no way for the client to determine whether the previous attempt to write was successful. However, this is very improbable and can happen only if, during the network glitch, the stream is scaled and then truncated, which deletes the segment in question. Exactly-once guarantees cannot be ascertained in this situation for either regular writes or batch writes, as both rely on the segment attributes of the previous segment to determine whether the write succeeded.
The drawback of the single routing key batch approach is that, since it is a client-side-only approach, the client is required to maintain all the state of the batch in its memory, which includes buffering all the in-flight events until the user calls writeBatch. We need to do this primarily to handle failure scenarios like network glitches or segment sealed events. This means we ought to impose limits on the amount of data in a single batch; otherwise it could overwhelm the client's memory and would also need to be buffered on the server.
There is an equivalent cost on the segment store as well, because it buffers the append blocks until an Append Block End is received. Without a limit on the size of an append block, arbitrarily large amounts of data could need to be buffered on the server, which would be detrimental.
Currently the writer's batching size is bounded by a fixed upper bound of 16 MB. That size limit does not impact the user's business logic; it only affects when a batch is flushed to the segment store. With a single routing key batch we intend to accumulate the batch locally and then perform a single flush, so limiting it to 16 MB ensures that the append batches are contained within that size.
So users of the new batch write API will need to ensure that their batch of events stays within the 16 MB limit; otherwise a MaxSizeExceeded exception is thrown back to the application.
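A minimal sketch of that guard is shown below, assuming the check is applied before the batch is accepted into the writer's pending state. The constant and exception type used here are illustrative stand-ins (the proposal names the exception MaxSizeExceeded):

```java
// Hypothetical size guard applied before a batch is accepted by the writer; the
// 16 MB bound matches the existing append batch limit described above.
public class BatchSizeGuard {
    static final long MAX_BATCH_SIZE_BYTES = 16L * 1024 * 1024;

    static void checkBatchSize(long totalSerializedBytes) {
        if (totalSerializedBytes > MAX_BATCH_SIZE_BYTES) {
            // Surfaced to the application rather than silently splitting the batch,
            // since splitting would break the atomicity guarantee.
            throw new IllegalStateException("Batch exceeds the 16 MB append limit");
        }
    }
}
```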