[Umbrella] InLong offline synchronization feature #9779

Closed
30 of 35 tasks
aloyszhang opened this issue Mar 5, 2024 · 0 comments · Fixed by #10579
aloyszhang commented Mar 5, 2024

Motivation

Currently, InLong provides real-time data synchronization based on the Flink engine, which offers low latency. Offline data synchronization, which is not supported yet, focuses instead on synchronization throughput and efficiency.

To broaden InLong's usage scenarios, we plan to add offline data synchronization. Both modes are built on the Flink compute engine: real-time synchronization tasks run as Flink streaming jobs, while offline synchronization tasks run as Flink batch jobs. This keeps the code for real-time and offline synchronization as consistent as possible and reduces maintenance costs.

Solution

The offline synchronization feature of InLong data integration provides sources and sinks, which correspond to the data source and the data target. Combined with the scheduling system, it synchronizes full or incremental data from the data source to the data target.

InLong schedules offline synchronization tasks through the scheduling system by setting specific trigger times (year, month, day, hour, and minute).
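As a rough illustration of periodic trigger times, the sketch below enumerates the trigger points between a start and an end time. The function name `trigger_times`, the `UNIT_SECONDS` table, and the restriction to fixed-length units (minute, hour, day) are assumptions made for this example; they are not InLong's scheduling API, and month or year periods would need calendar-aware arithmetic.

```python
from datetime import datetime, timedelta

# Hypothetical unit table: only fixed-length units are handled in this sketch.
UNIT_SECONDS = {"minute": 60, "hour": 3600, "day": 86400}


def trigger_times(start, end, interval, unit):
    """Return every trigger time in [start, end], spaced by interval * unit."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    step = timedelta(seconds=interval * UNIT_SECONDS[unit])
    times = []
    current = start
    while current <= end:
        times.append(current)
        current += step
    return times


# Example: a task scheduled every 2 hours on 2024-03-05, from 00:00 to 06:00.
schedule = trigger_times(
    datetime(2024, 3, 5, 0, 0), datetime(2024, 3, 5, 6, 0), 2, "hour"
)
```

Each returned time would correspond to one scheduling instance of the offline task.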

Offline synchronization tasks are created by the Manager (including scheduling information), and the specific data synchronization logic is implemented through the InLong Sort module.

Logical Architecture

image

Key Capabilities

Job Configuration: Supports wizard mode (configuration through a page wizard) and OpenAPI mode.

Scheduling Configuration: Supports wizard mode (configuration through a page wizard) and OpenAPI mode.

Job Type: Supports periodic incremental synchronization and periodic full synchronization.

Scheduling: Built-in simple periodic scheduling; complex capabilities such as task dependencies are provided by third-party scheduling systems.

Data Source: RDBMS, message queues, and big data storage (Hive, StarRocks, Iceberg, etc.)

Data Sink: RDBMS, message queues, and big data storage (Hive, StarRocks, Iceberg, etc.)

Compute Engine: Flink

Offline Job Operation and Maintenance: Job start, stop, and running status monitoring

Special Handling: Dirty data processing capability
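To make the two job types concrete, here is a minimal sketch of how one scheduling instance might decide which data to read. It assumes that an incremental instance covers the period that ends at its schedule time and that a full instance reads everything; the issue does not specify these semantics, and `sync_window` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def sync_window(
    job_type: str, schedule_time: datetime, period: timedelta
) -> Optional[Tuple[datetime, datetime]]:
    """Return the [begin, end) data window for one scheduling instance.

    None means "no time bound", i.e. a full synchronization.
    """
    if job_type == "full":
        return None
    if job_type == "incremental":
        # Assumption: the instance covers the period just before its trigger time.
        return (schedule_time - period, schedule_time)
    raise ValueError(f"unknown job type: {job_type}")


# Example: the hourly instance triggered at 02:00 would read data in [01:00, 02:00).
window = sync_window("incremental", datetime(2024, 3, 5, 2, 0), timedelta(hours=1))
```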

Data Flow Architecture

image

  1. The user creates an offline synchronization task.
  2. The manager saves task information and scheduling information in the DB.
  3. After task approval, the offline synchronization task information is encapsulated.
  4. Register scheduling information with the scheduling system; InLong has a built-in simple scheduling solution (Quartz), while complete scheduling capabilities rely on third-party scheduling systems (DolphinScheduler, US, etc.).
  5. The scheduling system regularly generates scheduling instances.
  6. For the initial run, the manager constructs a Flink batch job.
  7. Submit the Flink batch job to the Flink cluster.
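The steps above can be sketched with in-memory stand-ins for the DB, the scheduling system, and the Flink cluster. All class and method names here are hypothetical and do not come from the InLong code base. The sketch also assumes one reading of step 6: the batch job definition is built on the first scheduling instance and reused for later ones.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OfflineTask:
    task_id: str
    schedule: str  # opaque scheduling information, e.g. "every 1 hour"
    approved: bool = False


@dataclass
class Manager:
    db: Dict[str, OfflineTask] = field(default_factory=dict)  # stand-in for the DB
    scheduler: List[str] = field(default_factory=list)  # stand-in for the scheduling system
    cluster: List[dict] = field(default_factory=list)  # stand-in for the Flink cluster
    jobs: Dict[str, dict] = field(default_factory=dict)  # job definitions built so far

    def create(self, task: OfflineTask) -> None:
        # Steps 1-2: save task and scheduling information.
        self.db[task.task_id] = task

    def approve(self, task_id: str) -> None:
        # Steps 3-4: after approval, register the task with the scheduler.
        task = self.db[task_id]
        task.approved = True
        self.scheduler.append(task_id)

    def on_instance(self, task_id: str, instance_time: str) -> dict:
        # Steps 5-7: a scheduling instance fires; build the job once, then submit.
        if task_id not in self.scheduler:
            raise RuntimeError(f"task {task_id} is not registered for scheduling")
        if task_id not in self.jobs:
            self.jobs[task_id] = {"task_id": task_id, "mode": "BATCH"}
        submission = dict(self.jobs[task_id], instance_time=instance_time)
        self.cluster.append(submission)
        return submission


manager = Manager()
manager.create(OfflineTask("t1", "every 1 hour"))
manager.approve("t1")
first = manager.on_instance("t1", "2024-03-05T00:00")
```

Keeping the scheduler behind a narrow registration step is what would let the built-in Quartz scheduling and a third-party scheduler be swapped without touching job construction.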

Task list

new dev branch

Because this is a large feature for InLong, development happens on a new branch. Once development and testing are complete, the branch is merged back into master.

Manager

Offline Synchronization Task Management: Definition and Management of Offline Synchronization Tasks

Scheduling Management: Scheduling task definition, scheduling instance definition, scheduling task management (CRUD)

Offline Task Submission

Offline Task Operation and Maintenance

  • Start (task submission), stop
  • Retrieve running status
  • Task logs, exceptions
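The operations listed above suggest a small job lifecycle. The following sketch models it as a state machine; the state names, the allowed transitions, and the `OfflineJob` class are assumptions for illustration rather than the Manager's actual model.

```python
from enum import Enum


class JobState(Enum):
    CREATED = "created"
    RUNNING = "running"
    STOPPED = "stopped"
    FINISHED = "finished"
    FAILED = "failed"


# Hypothetical transition table: (action, current state) -> next state.
TRANSITIONS = {
    ("start", JobState.CREATED): JobState.RUNNING,
    ("start", JobState.STOPPED): JobState.RUNNING,
    ("stop", JobState.RUNNING): JobState.STOPPED,
    ("finish", JobState.RUNNING): JobState.FINISHED,
    ("fail", JobState.RUNNING): JobState.FAILED,
}


class OfflineJob:
    def __init__(self, job_id):
        self.job_id = job_id
        self.state = JobState.CREATED
        self.exceptions = []

    def apply(self, action, error=None):
        """Apply an action; reject transitions that are not in the table."""
        key = (action, self.state)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action} a job in state {self.state.name}")
        self.state = TRANSITIONS[key]
        if error is not None:
            self.exceptions.append(error)
        return self.state

    def status(self):
        """Return the running status together with any recorded exceptions."""
        return {
            "job_id": self.job_id,
            "state": self.state.name,
            "exceptions": list(self.exceptions),
        }


job = OfflineJob("job-1")
job.apply("start")
job.apply("fail", error="source connection refused")
```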

Sort

Flink Task Encapsulation: Add support for Flink environment in batch mode

Flink Batch Capability Support
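One way to picture the batch-mode support is as a switch in the job configuration. Flink exposes the `execution.runtime-mode` option (`STREAMING` or `BATCH`) and the `pipeline.name` option, and batch execution requires bounded sources. The helper below only assembles such a configuration as a plain dictionary; `flink_conf` and its parameters are hypothetical, and how InLong Sort actually sets the mode is not stated in this issue.

```python
def flink_conf(job_name, offline, sources_bounded=True):
    """Build Flink configuration entries for a synchronization job.

    offline=True selects batch execution, offline=False selects streaming.
    """
    if offline and not sources_bounded:
        # Flink batch execution only applies to jobs whose sources are bounded.
        raise ValueError("batch mode requires bounded sources")
    return {
        "pipeline.name": job_name,
        "execution.runtime-mode": "BATCH" if offline else "STREAMING",
    }


# The same job definition, differing only in runtime mode.
realtime_conf = flink_conf("orders-sync", offline=False)
offline_conf = flink_conf("orders-sync", offline=True)
```

Because only the runtime mode differs, the source and sink definitions could in principle be shared between the real-time and offline variants of a job.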

InLong Component

Other for not specified component

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!
