[docs] PG compatible logical replication architecture #23220

Merged: 8 commits, Jul 17, 2024

---
title: Logical replication in YugabyteDB
headerTitle: CDC - logical replication
linkTitle: CDC - logical replication
description: Learn how YugabyteDB supports asynchronous replication of data changes (inserts, updates, and deletes) to external databases or applications.
headContent: Asynchronous replication of data changes (inserts, updates, and deletes) to external databases or applications
badges: ea
menu:
  preview:
    parent: architecture-docdb-replication
    identifier: architecture-docdb-replication-cdc
weight: 500
type: docs
---

Change data capture (CDC) in YugabyteDB provides technology to ensure that any changes in data due to operations such as inserts, updates, and deletions are identified, captured, and made available for consumption by applications and other tools.

CDC in YugabyteDB is based on the PostgreSQL logical replication model. The fundamental concept is the replication slot: a stream of changes that can be replayed to the client in the order they were made on the origin server, in a manner that preserves transactional consistency. This is the basis for the support for transactional CDC in YugabyteDB. Where the strict requirements of transactional CDC do not apply, multiple replication slots can be used to stream changes from unrelated tables in parallel.

## Architecture

![Logical-Replication-Architecture](/images/architecture/logical_replication_architecture.png)

The following are the main components of the YugabyteDB CDC solution:

1. Walsender - A special-purpose PostgreSQL backend responsible for streaming changes to the client and handling acknowledgments.

2. Virtual WAL (VWAL) - Assembles changes from all the shards of the user tables (under the publication) to maintain transactional consistency.

3. CDCService - Retrieves changes from the WAL of a specified shard, starting from a given checkpoint.

### Data Flow

Logical replication starts by copying a snapshot of the data on the publisher database. After that is done, changes on the publisher are streamed to the client as they occur, in near real time.

To set up logical replication, an application first has to create a replication slot. When a replication slot is created, a "boundary" is established between the snapshot data and the streaming changes. This "boundary", or "consistent_point", is a consistent state of the source database, and corresponds to a commit time (HybridTime value). Data from transactions with commit time <= the commit time of the consistent_point is consumed as part of the initial snapshot. Changes from transactions with commit time > the commit time of the consistent_point are consumed in the streaming phase, in transaction commit time order.
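The boundary rule above can be sketched as follows. This is an illustrative sketch only; field names such as `commit_ht` are assumptions, not YugabyteDB's actual data structures.

```python
# Sketch of the consistent_point boundary: transactions committed at or before
# the consistent_point belong to the initial snapshot; later transactions are
# delivered in the streaming phase, in commit-time order.
def partition_transactions(transactions, consistent_point):
    """Split transactions by commit time relative to the consistent_point."""
    snapshot = [t for t in transactions if t["commit_ht"] <= consistent_point]
    streaming = sorted(
        (t for t in transactions if t["commit_ht"] > consistent_point),
        key=lambda t: t["commit_ht"],  # streamed in commit-time order
    )
    return snapshot, streaming
```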

#### Initial Snapshot

The initial snapshot data for each table is consumed by executing a corresponding snapshot query (SELECT statement) on that table. This snapshot query must be executed as of the database state corresponding to the consistent_point, a state represented by a HybridTime value.

First, a `SET LOCAL yb_read_time TO '<consistent_point commit time> ht'` command should be executed on the connection (session). The SELECT statement corresponding to the snapshot query should then be executed as part of the same transaction.

The HybridTime value to use in the `SET LOCAL yb_read_time` command is the value of the `snapshot_name` field returned by the `CREATE_REPLICATION_SLOT` command. Alternatively, it can be obtained by querying the `pg_replication_slots` view.
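Putting these steps together, a client-side sketch of the snapshot session could assemble the command sequence as follows. The helper function and table name are hypothetical; the `SET LOCAL yb_read_time` syntax is as described above.

```python
def snapshot_commands(table, consistent_point_ht):
    """Commands to run, in order, on a single connection (session)."""
    return [
        "BEGIN",
        # SET LOCAL applies only within the enclosing transaction, so the
        # snapshot query must run in the same transaction.
        f"SET LOCAL yb_read_time TO '{consistent_point_ht} ht'",
        f"SELECT * FROM {table}",  # snapshot query as of the consistent_point
        "COMMIT",
    ]
```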

During snapshot consumption, the snapshot data from all tables is from the same consistent state (consistent_point). At the end of snapshot consumption, the state of the target system corresponds to the consistent_point. The history of the tables as of the consistent_point is retained on the source until the snapshot is fully consumed.

#### Streaming Data Flow

YugabyteDB automatically splits user tables into multiple shards (also called tablets) using either a hash- or range-based strategy. The primary key of each row determines the tablet in which the row resides.

Each tablet has its own WAL. The WAL is not in-memory; it is persisted on disk. Each WAL preserves the changes made by transactions on that tablet, as well as additional metadata related to those transactions.
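As a rough illustration of hash-based splitting, each row's primary key hashes into a fixed range, and each slice of that range maps to one tablet. This is a simplified sketch, not YugabyteDB's actual hash function or tablet count.

```python
import hashlib

NUM_TABLETS = 4  # illustrative; real tables can have many tablets

def tablet_for_key(primary_key: str) -> int:
    """Map a primary key to a tablet by hashing it into a 16-bit space
    that is split evenly across the tablets."""
    h = int.from_bytes(hashlib.sha1(primary_key.encode()).digest()[:2], "big")
    return h * NUM_TABLETS // 65536
```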

**Step 1 - Data flow from the tablets’ WAL to the VWAL**

![CDCService-VWAL](/images/architecture/cdc_service_vwal_interaction.png)

Each tablet sends changes in transaction commit time order. Further, within a transaction, the changes are in the order in which the operations were performed in the transaction.

**Step 2 - Sorting in the VWAL and sending transactions to the Walsender**

![VWAL-Walsender](/images/architecture/vwal_walsender_interaction.png)

VWAL collects changes across multiple tablets, assembles the transactions, assigns an LSN to each change and to each transaction boundary (BEGIN, COMMIT) record, and sends the changes to the Walsender in transaction commit time order.
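Because each tablet already emits its changes in commit-time order, the VWAL's job resembles a k-way merge. The following is a simplified sketch of that idea (field names are assumptions, not the actual VWAL implementation):

```python
import heapq

def merge_tablet_streams(streams):
    """k-way merge of per-tablet change streams, each already sorted by
    commit time, assigning a monotonically increasing LSN to each record."""
    merged = heapq.merge(*streams, key=lambda rec: rec["commit_ht"])
    return [dict(rec, lsn=lsn) for lsn, rec in enumerate(merged, start=1)]
```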

**Step 3 - Walsender to client**

The Walsender sends changes to the output plugin, which filters them according to the slot's publication and converts them into the client's desired format. These changes are then streamed to the client using the appropriate streaming replication protocol determined by the output plugin. YugabyteDB follows the same streaming replication protocols as defined in PostgreSQL.
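The filtering step can be pictured as keeping only the changes for tables that belong to the slot's publication. A minimal sketch, with illustrative record shapes:

```python
def filter_for_publication(changes, publication_tables):
    """Forward only changes whose table is part of the publication."""
    return [c for c in changes if c["table"] in publication_tables]
```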

{{< note title="Note" >}}
<!--TODO (Siddharth): Fix the Link to the protocol section. -->
Refer to the [Replication Protocol](../../../explore/logical-replication/#Streaming-Protocol) section for more details.

{{< /note >}}

{{< tip title="Explore" >}}
<!--TODO (Siddharth): Fix the Link to the getting started section. -->
See [Getting Started with Logical Replication](../../../explore/logical-replication/getting-started) in Explore to set up logical replication in YugabyteDB.

{{< /tip >}}
```diff
@@ -1,7 +1,7 @@
 ---
 title: Change data capture (CDC) in YugabyteDB
-headerTitle: Change data capture (CDC)
-linkTitle: Change data capture (CDC)
+headerTitle: CDC - gRPC replication
+linkTitle: CDC - gRPC replication
 description: Learn how YugabyteDB supports asynchronous replication of data changes (inserts, updates, and deletes) to external databases or applications.
 badges: ea
 aliases:
@@ -10,7 +10,7 @@ menu:
   preview:
     parent: architecture-docdb-replication
     identifier: architecture-docdb-replication-cdc
-    weight: 500
+    weight: 600
 type: docs
```