
PDP 47: Pravega Streams: Consumption Based Retention

Prajakta Belgundi edited this page Oct 7, 2020 · 20 revisions

Introduction

Pravega Streams currently support only time-based and space-based truncation. The system does not track whether the messages being truncated have been consumed by Readers (Subscribers). A consumption-based retention mechanism for a Pravega Stream would ensure that:

  1. The Stream carries only unconsumed data and is therefore space efficient.
  2. Messages are not deleted before they have been consumed by all consumers (Subscribers).

This helps particularly when using a Pravega Stream as a message queue, and for edge use cases where data may not need to be kept in the Stream after it has been pushed to the Core.

Terminology

Reader Group (RG) - A Pravega Reader Group that reads from one or more Pravega Streams. A Reader Group can have multiple Readers each reading from 1 or more Segments in the Stream.

Retention Policy (RP) - A Stream configuration that specifies the basis on which the Stream is truncated. The currently supported policies are time-based and space-based. These truncate data in a Stream if it is older than a configured time interval or if the Stream size exceeds a configured limit.

Consumption Based Retention Policy (CBR) - A Retention Policy on a Pravega Stream that requires data to be truncated from the Stream only after all Subscribers have read it. A Stream can have both "durable" and "non-durable" Subscribers, and only the read positions of "durable" Subscribers are tracked by the CBR policy for truncation.

Position - An object representing the position of a Reader in a Pravega Stream. It is a map of the segments owned by the Reader and their read offsets. A Stream-Cut can be generated from the set of "Position" objects of all Readers in the Reader Group.

Stream-Cut - A set of segment/offset pairs for a single Stream that represents a consistent position in the Stream. A Stream-Cut covers the complete key-space at a given point in time. Each offset always points to an event boundary, so no offset points into an incomplete event.
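The relationship between Positions and a Stream-Cut can be sketched as follows. This is a simplified model and not the Pravega client API: the `Position` and `StreamCut` type aliases and the `stream_cut_from_positions` function are hypothetical names, segments are plain integers, and the sketch assumes every segment is owned by exactly one Reader (it ignores unassigned segments).

```python
from typing import Dict, Iterable

# Hypothetical simplified model: segment id -> read offset.
Position = Dict[int, int]   # segments owned by one Reader
StreamCut = Dict[int, int]  # segments across the whole Reader Group


def stream_cut_from_positions(positions: Iterable[Position]) -> StreamCut:
    """Merge the Positions of all Readers in a Reader Group into one Stream-Cut."""
    cut: StreamCut = {}
    for position in positions:
        for segment, offset in position.items():
            # Assumption: a segment is read by at most one Reader in the group.
            if segment in cut:
                raise ValueError(f"segment {segment} is owned by more than one reader")
            cut[segment] = offset
    return cut


# Two Readers that together own segments 0, 1 and 2.
reader_a: Position = {0: 120, 1: 48}
reader_b: Position = {2: 300}
cut = stream_cut_from_positions([reader_a, reader_b])
```

Here `cut` holds one offset per segment, which is what lets it describe a position for the Reader Group as a whole rather than for a single Reader.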

Checkpoint - A system event that causes all Readers in a Reader Group to persist their current read positions to the State Synchronizer. A Checkpoint is necessary for a Reader Group to be able to resume reading from the Stream after it goes down. Checkpointing is a prerequisite for the assignment of unassigned segments in the Stream. For a Reader Group, checkpoints can be generated automatically (periodically) or explicitly on user request. Automated checkpointing is enabled by default. Every time a new checkpoint is generated, its value is persisted in the latestCheckpoint field in RGState in the State Synchronizer.

Automated Checkpoint - A checkpoint generated by the Pravega client at periodic intervals to ensure that Readers persist their read positions. This is done without any user interaction. The names of auto-generated checkpoints are not exposed to users.

Manual Checkpoint - A checkpoint generated on user request. Its name is returned to the user once the checkpoint completes.

Subscriber Reader Group - In "push" based messaging systems, a "durable subscriber" is a message consumer that is guaranteed to receive all messages published to a queue, including messages published while it was inactive. The Message Broker stores messages (subject to space/time constraints) until the Subscriber is back and then delivers those messages to the Subscriber.

For Pravega MQ, a "durable subscriber" (referred to as "Subscriber" in the rest of this document) is a Reader Group whose read positions are considered when computing the truncation point for a Stream with Consumption Based Retention. When such a subscribing Reader Group goes down, the system truncates the Stream based on the last published checkpoint of this Reader Group and does not truncate any data that this Reader Group has not read. However, if the Reader Group does not publish a new checkpoint within a certain "timeout" period, its last updated checkpoint is ignored and truncation proceeds based on the read positions of the remaining Subscribers.
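A minimal sketch of how a truncation point could be derived from Subscriber checkpoints, under stated assumptions. The `Subscriber` class and the `truncation_point` function are hypothetical, not part of Pravega. The sketch assumes that every checkpoint covers the same set of segments (no Stream scaling) and that the truncation point is the per-segment minimum offset across Subscribers. This page does not say what happens when every Subscriber has timed out, so the sketch returns `None` in that case.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

StreamCut = Dict[int, int]  # hypothetical model: segment id -> offset


@dataclass
class Subscriber:
    name: str
    last_checkpoint: StreamCut  # last published read positions
    last_update_time: float     # when that checkpoint was published, in seconds


def truncation_point(subscribers: List[Subscriber], now: float,
                     timeout: float) -> Optional[StreamCut]:
    """Return the per-segment minimum offset over Subscribers that are not timed out."""
    # Ignore Subscribers whose last checkpoint is older than the timeout.
    active = [s for s in subscribers if now - s.last_update_time <= timeout]
    if not active:
        return None  # assumption: no consumption-based truncation point exists
    segments = set(active[0].last_checkpoint)
    for s in active:
        # Assumption: all checkpoints cover the same segments (no scaling).
        if set(s.last_checkpoint) != segments:
            raise ValueError(f"checkpoint of {s.name} covers different segments")
    # Data below the slowest Subscriber's offset has been read by all of them.
    return {seg: min(s.last_checkpoint[seg] for s in active) for seg in segments}


fast = Subscriber("fast", {0: 500, 1: 200}, last_update_time=95.0)
slow = Subscriber("slow", {0: 100, 1: 150}, last_update_time=90.0)
stale = Subscriber("stale", {0: 10, 1: 10}, last_update_time=0.0)

# "stale" last checkpointed 100 s ago, beyond the 60 s timeout, so it is ignored.
point = truncation_point([fast, slow, stale], now=100.0, timeout=60.0)
# point == {0: 100, 1: 150}
```

Non-subscriber Reader Groups never appear in the `subscribers` list, so their read positions have no effect on the result.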

Non-Subscriber Reader Group - In "push" based messaging systems, a "non-durable" subscriber is a message consumer that receives only those messages that are published to the queue while it is online. Messages published while the subscriber was inactive are lost.

For Pravega MQ, a non-durable subscriber is a Reader Group that can pull messages as long as it is up. However, the read position of this Reader Group won't be considered when computing the truncation point for a Stream with Consumption based retention policy. When this Reader Group goes down and comes back up, it may have lost some messages in the queue because those were deleted while it was away. In a way, all current Reader Groups are non-durable subscribers of Pravega Streams.

Publisher - A Pravega writer application that writes messages to a Pravega Stream when the Stream is used as a message queue.

Design
