What makes a connector #834

chubei · 2023-02-08T12:45:51Z

What makes a connector

Validate Connection

During migration and before starting ingestion, validate_connection is called. The connector should test the connection and return the Result. The connector may perform extra validation steps, such as checking whether a required feature of the external database is available.

List Schemas

Lists all schemas available from a connector.

Validate/Get Schemas

In the validate_schemas method, the connector should fetch the actual source schema from the database and then check that:

all requested tables and columns (from the source configuration in dozer-config.yaml) exist in the database
all types used in the database are supported by Dozer
the table's replication changes tracking type can be determined

In the get_schemas method, the connector should map the external database schema to the Dozer schema.
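The existence check described above can be sketched as follows. This is only an illustration under stated assumptions: `RequestedTable` and `missing_items` are hypothetical names, and the source schema is modelled as a plain map from table name to column names rather than Dozer's actual `TableInfo` and `ValidationResults` types.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified stand-in for a table requested in the source configuration.
pub struct RequestedTable {
    pub name: String,
    pub columns: Vec<String>,
}

/// Returns one error message per requested table or column that is missing.
/// `source` maps each table found in the external database to its column names.
pub fn missing_items(
    requested: &[RequestedTable],
    source: &HashMap<String, HashSet<String>>,
) -> Vec<String> {
    let mut errors = Vec::new();
    for table in requested {
        match source.get(&table.name) {
            // The whole table is missing, so its columns are not checked.
            None => errors.push(format!("table `{}` not found", table.name)),
            Some(columns) => {
                for column in &table.columns {
                    if !columns.contains(column) {
                        errors.push(format!("column `{}.{}` not found", table.name, column));
                    }
                }
            }
        }
    }
    errors
}
```

An empty result means every requested table and column exists. Type support and the tracking type would still need separate checks.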

Replication changes tracking types

The replication changes tracking type of a table is determined by its schema and by whether a primary key or unique key is available.

Type	Description
FullChanges	Connector gets the old record on delete/update operations
OnlyPK	Connector only gets the PK of the old record on delete/update operations
Nothing	Connector cannot get any info about the old record. In other words, the table is append only
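The three types in the table can be sketched as a Rust enum. The variant names come from the table; the `tracking_type` rule is an assumption made for illustration (the post does not spell out the exact rule), and both of its parameters are hypothetical.

```rust
/// The three tracking types from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReplicationChangesTrackingType {
    FullChanges,
    OnlyPK,
    Nothing,
}

impl ReplicationChangesTrackingType {
    /// A table is append only when nothing is known about old records.
    pub fn is_append_only(self) -> bool {
        matches!(self, Self::Nothing)
    }
}

/// Assumed rule, for illustration only: a full old record wins, otherwise a
/// primary/unique key gives `OnlyPK`, otherwise the table is append only.
pub fn tracking_type(
    has_primary_key: bool,
    source_sends_full_old_record: bool,
) -> ReplicationChangesTrackingType {
    if source_sends_full_old_record {
        ReplicationChangesTrackingType::FullChanges
    } else if has_primary_key {
        ReplicationChangesTrackingType::OnlyPK
    } else {
        ReplicationChangesTrackingType::Nothing
    }
}
```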

Start

Once started, a connector should begin to output messages to the ingestor that was passed in start, until there is no more data to output. start should never return unless there's an error.

The output message is ((u64, u64), IngestionMessage). The (u64, u64) pair is the identifier of a message, which should be unique and monotonically increasing across all messages the connector sends. There are more contracts regarding this message identifier that must be respected between different runs of a connector; these are discussed in the Checkpointing section.

IngestionMessage is an enum. Its variants are OperationEvent, which contains an Insert, Delete or Update operation, and SnapshottingDone, which indicates that snapshotting is done. See the Checkpointing section for information on snapshotting.
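A rough sketch of the message shape and the ordering requirement is given here. The variant names follow the text; `Record`, the field layout of `Operation`, and `ids_strictly_increasing` are assumptions for illustration and not Dozer's actual definitions. Rust compares tuples lexicographically, so `(0, 2) < (1, 0)`.

```rust
/// Hypothetical, simplified record: one optional string per field.
pub type Record = Vec<Option<String>>;

/// Identifier attached to every output message.
pub type MessageId = (u64, u64);

pub enum Operation {
    Insert { new: Record },
    Delete { old: Record },
    Update { old: Record, new: Record },
}

pub enum IngestionMessage {
    OperationEvent(Operation),
    SnapshottingDone,
}

/// Checks that every identifier is strictly greater than the one before it,
/// which also implies that identifiers are unique.
pub fn ids_strictly_increasing(messages: &[(MessageId, IngestionMessage)]) -> bool {
    messages.windows(2).all(|pair| pair[0].0 < pair[1].0)
}
```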

For Delete and Update operations, the connector needs to output the old record before update or deletion. OnlyPK connectors can fill the old record with Nulls except for the fields that are part of the primary key (see Schema section).
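For an OnlyPK connector, the Null-filled old record could be built roughly like this. `Field` and `old_record_from_pk` are hypothetical names chosen for this sketch, and the primary key is assumed to arrive as (field index, value) pairs.

```rust
/// Hypothetical, simplified field value.
#[derive(Debug, Clone, PartialEq)]
pub enum Field {
    Int(i64),
    String(String),
    Null,
}

/// Builds an old record of `field_count` fields where everything is `Null`
/// except the primary key fields.
/// Panics if a primary key index is not smaller than `field_count`.
pub fn old_record_from_pk(field_count: usize, primary_key: &[(usize, Field)]) -> Vec<Field> {
    let mut record = vec![Field::Null; field_count];
    for (index, value) in primary_key {
        record[*index] = value.clone();
    }
    record
}
```

With a three-field schema whose first field is the primary key, `old_record_from_pk(3, &[(0, Field::Int(7))])` yields a record with `Int(7)` followed by two `Null` fields.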

The last_checkpoint parameter relates to checkpointing and is discussed in the Checkpointing section.

Checkpointing

Dozer uses the checkpointing specification of a connector to decide what to do when it is stopped and restarted. Dozer considers a connector to be fully defined by its configuration. As long as the connector configuration doesn't change from the last run, Dozer treats the connector as unchanged and expects certain checkpointing guarantees from the connector.

A connector must report its checkpointing capacity as one of the following:

Full
None

Full Checkpointing

A Full Checkpointing connector outputs operations from the external database in two phases: Snapshotting and Streaming.

In the Snapshotting phase, the connector creates a snapshot of the external database and outputs Insert operations. The snapshot phase may contain zero operations. When the snapshot finishes, the connector sends a SnapshottingDone message, whose identifier is the snapshot identifier.

In the Streaming phase, the connector maintains a unique mapping from message identifiers to operations (the snapshot identifier should also be tracked).

Dozer may choose to start the connector from the snapshot identifier, from any message identifier in the Streaming phase, or start a new snapshot, indicated by the last_checkpoint parameter of start. Dozer will never ask the connector to start from a message identifier in the Snapshotting phase.

When asked to start a new snapshot, the connector should begin the snapshotting phase all over again. The connector may abandon any previous message identifiers. Dozer will not try to use them.

When started from the snapshot identifier, the connector should output all operations as if the Streaming phase were started fresh. The connector may abandon any message identifiers that are greater than the snapshot identifier and reuse them.

When started from a message identifier in the Streaming phase, the connector should output operations starting after the corresponding operation, exclusively (the operation at that identifier is not output again). The connector may abandon any message identifiers that are greater than said identifier and reuse them.

If a connector is successfully started from the specified message identifier, Dozer guarantees that the API endpoint is always consistent with the external database (up to the streaming delay).

If starting from the specified message identifier cannot be done, Dozer will ask the connector to start a new snapshot. In this case, the Dozer API endpoint stays at its last state until the new snapshotting is done.

Full Checkpointing connector example

For example, a Full Checkpointing connector can output the following messages in the Snapshotting phase:

Message Identifier	Operation
(0, 1)	Insert 0
(0, 2)	Insert 1
(1, 0)	SnapshottingDone

And the following messages in the Streaming phase:

Message Identifier	Operation
(2, 0)	Update 0
(2, 1)	Insert 2
(2, 2)	Delete 1

When started from (1, 0), the connector should output the following (whole Streaming phase):

Message Identifier	Operation
(2, 0)	Update 0
(2, 1)	Insert 2
(2, 2)	Delete 1

When started from (2, 1), the connector should output the following:

Message Identifier	Operation
(2, 2)	Delete 1
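The example above can be replayed with a small sketch. The message data is hard-coded from the tables; `Message`, `snapshot_phase`, `streaming_phase` and `messages_after` are hypothetical names, and a real connector would read from the external database instead of in-memory vectors.

```rust
pub type MessageId = (u64, u64);

/// Simplified message: operations only carry the record number from the example.
#[derive(Debug, Clone, PartialEq)]
pub enum Message {
    Insert(u32),
    Update(u32),
    Delete(u32),
    SnapshottingDone,
}

/// Snapshotting phase from the example; (1, 0) is the snapshot identifier.
pub fn snapshot_phase() -> Vec<(MessageId, Message)> {
    vec![
        ((0, 1), Message::Insert(0)),
        ((0, 2), Message::Insert(1)),
        ((1, 0), Message::SnapshottingDone),
    ]
}

/// Streaming phase from the example.
pub fn streaming_phase() -> Vec<(MessageId, Message)> {
    vec![
        ((2, 0), Message::Update(0)),
        ((2, 1), Message::Insert(2)),
        ((2, 2), Message::Delete(1)),
    ]
}

/// Messages to output for a given `last_checkpoint`:
/// `None` starts a new snapshot, `Some(id)` resumes strictly after `id`.
pub fn messages_after(last_checkpoint: Option<MessageId>) -> Vec<(MessageId, Message)> {
    match last_checkpoint {
        None => snapshot_phase().into_iter().chain(streaming_phase()).collect(),
        Some(checkpoint) => streaming_phase()
            .into_iter()
            .filter(|(id, _)| *id > checkpoint)
            .collect(),
    }
}
```

Because tuples compare lexicographically, the same filter covers both cases: the snapshot identifier (1, 0) is smaller than every Streaming identifier, so the whole Streaming phase is output, while (2, 1) leaves only (2, 2).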

None Checkpointing

A None Checkpointing connector doesn't keep track of message identifiers. Between restarts, it only needs to guarantee that message identifiers are monotonically increasing. This kind of connector should only send Insert operations. With a None Checkpointing connector, Dozer makes no guarantee of consistency between the API endpoint and the external database. However, operations that are ACKed by Dozer are never lost. See the Dozer ACK section.

Dozer ACK

Dozer periodically sends ACKs to the connector to acknowledge the successful processing of operations sent from the connector. The ACK frequency can be configured; the minimum is 1, meaning that every operation is ACKed. Note that ACKing more often lowers processing throughput.

Also note that an ACK of an operation doesn't mean it immediately shows up in API queries. There is a streaming delay between the two events.
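One way to picture the ACK frequency is a counter that emits an ACK once every N processed operations. This is a sketch under assumptions: `AckCounter` and its methods are hypothetical, and the configured value is interpreted as "number of operations per ACK", which matches the statement that a value of 1 ACKs every operation.

```rust
/// Hypothetical helper that decides when an ACK should be sent.
pub struct AckCounter {
    frequency: u64,
    pending: u64,
}

impl AckCounter {
    /// `frequency` is the number of operations per ACK; values below 1 are raised to 1.
    pub fn new(frequency: u64) -> Self {
        Self { frequency: frequency.max(1), pending: 0 }
    }

    /// Records one successfully processed operation.
    /// Returns `Some(id)` when an ACK should be sent for this identifier.
    pub fn on_processed(&mut self, id: (u64, u64)) -> Option<(u64, u64)> {
        self.pending += 1;
        if self.pending >= self.frequency {
            self.pending = 0;
            Some(id)
        } else {
            None
        }
    }

    /// Number of processed operations that have not been ACKed yet.
    pub fn unacked(&self) -> u64 {
        self.pending
    }
}
```

Under this reading, `unacked` never reaches the configured value, which gives a bound on how many processed operations are still waiting for an ACK at any moment.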

Code

/// A `Connector` connects to an external database and streams data to Dozer.
pub trait Connector: Send + Sync {
    /// Validates if the connection can be made. May perform extra validation steps at the connection level.
    ///
    /// Returns `Ok` if the connection is valid, `Err` otherwise.
    fn validate_connection(&self) -> Result<(), ConnectorError>;

    /// Lists all schemas in the external database that are supported by Dozer.
    fn list_schemas(&self) -> Result<Vec<SchemaWithChangesType>, ConnectorError>;

    /// Validates if requested `tables` are valid. A table is valid if:
    ///
    /// - The table and all requested columns exist in the external database.
    /// - All used types are supported by Dozer.
    /// - The table's `ReplicationChangesTrackingType` can be determined.
    fn validate_schemas(&self, tables: &[TableInfo]) -> Result<ValidationResults, ConnectorError>;

    /// Gets the schemas that describe `tables`.
    fn get_schemas(
        &self,
        tables: Vec<TableInfo>,
    ) -> Result<Vec<SchemaWithChangesType>, ConnectorError>;

    /// Starts streaming data from external database to Dozer.
    ///
    /// # Arguments
    ///
    /// * `ingestor` - Data should be sent to this ingestor.
    /// * `tables` - Tables to stream data from. If `None`, all supported tables should be streamed.
    /// * `last_checkpoint` - The last checkpoint to start from. If `None`, the connector should start a new snapshot.
    fn start(
        &self,
        ingestor: Arc<RwLock<Ingestor>>,
        tables: Option<Vec<TableInfo>>,
        last_checkpoint: Option<(u64, u64)>,
    ) -> Result<!, ConnectorError>;
}

! is experimental. We'll use () in actual code.
Dozer ACK is not included in this trait definition.
Checkpointing capacity report is not included in this trait definition.
ReplicationChangesTrackingType (or part of it) will become part of Schema in future.

v3g42 · 2023-02-08T15:34:53Z

Maintainer

This looks great @chubei.

I'm thinking we should consider having AppendOnly on a table or a source instead of a connector.
Kafka and others refer to this as Checkpointing. Maybe we use the same terminology for connectors?
SnapshotThenSeekable -> I believe we can implement it in a way checkpoint is only created on connectors when initial snapshot is taken. So then we don't have to have different type of connector.
Lastly, I was thinking the only reason we ever need to replay is if schema changes. I believe this should be covered under a different concept called migration.

3 replies

chubei Feb 9, 2023
Author

I thought about this too. There's the problem that currently the message identifier is shared across the whole connector, for all output ports. I asked @snork-alt about this and it seems we can't use separate message identifiers for different tables, at least it's the case for postgres. If we want to define AppendOnly as a table level concept, we need to further ask connector to specify if the message identifier is at the connector level or table level.
I'll change Seekable to Checkpointed in main post.
You are right! Seekable connectors can just output an empty snapshot. I'll modify the post.
We also need replay if Dozer is shut down but some operations aren't committed. If the connector can't replay, uncommitted operations are lost.

snork-alt Feb 9, 2023
Maintainer

I'd add a section on message ACKs, specifying how the connector should ACK to the source that a message has been processed and persisted to the pipeline state. It would also be useful to finalize the trait in this doc.

chubei Feb 9, 2023
Author

I'd add a section on message ACKs, specifying how the connector should ACK to the source that a message has been processed and persisted to the pipeline state. It would also be useful to finalize the trait in this doc.

Will you add the section or @karolisg will?

chloeminkyung · 2023-02-10T04:16:05Z

The following are my thoughts:

Having AppendOnly at the source or schema level
For Checkpointing, agreed on using Full or None
The maximum number of operations that can be lost should be indicated to the user when they run with None

1 reply

chubei Feb 10, 2023
Author

AppendOnly: That's specified in replication changes tracking type Nothing.
Maximum number of operations that can be lost: This is the ACK frequency. We can have a default value and print it at the beginning.

mediuminvader · 2023-02-10T05:25:46Z

For Delete and Update operations, the connector needs to output the old record before update or deletion. OnlyPK connectors can fill the old record with Nulls except for the fields that are part of the primary key (see Schema section).

Can you describe a scenario where the NULL filling could happen?
Should we allow sending the old record with all Nulls except the PK? Are duplicates not allowed from connectors?

7 replies

mediuminvader Feb 10, 2023

If duplicates are not allowed it's not a problem, how do we guarantee this?

chubei Feb 10, 2023
Author

I think the connector should guarantee there are no duplicates.

karolisg Feb 10, 2023

How can we guarantee this in the connector?

chubei Feb 10, 2023
Author

If the connector faithfully replicates the external database, there should be no duplicates, right?

karolisg Feb 10, 2023

Yes. I don't think we need to validate data in the connector. If the external database doesn't have a PK, or if for any reason records can be duplicated, we should use Nothing as the replication changes tracking type.
