Motivation
Currently, InLong provides real-time data synchronization based on the Flink engine, which offers low latency. Offline data synchronization, which is not supported yet, puts more weight on synchronization throughput and efficiency.
To cover more usage scenarios, we plan to add offline data synchronization to InLong. Both modes will be built on the Flink computing engine: real-time synchronization tasks run as Flink stream jobs, while offline synchronization tasks run as Flink batch jobs. This keeps the code for the two kinds of task as consistent as possible and reduces maintenance costs.
Solution
The offline synchronization feature of the InLong dataset integration provides sources and sinks, corresponding to the data sources and data destinations, and works with the scheduling system to synchronize full or incremental data from the source to the destination.
InLong schedules offline synchronization tasks at specific trigger times (year, month, day, hour, and minute) configured through the scheduling system.
The Manager creates offline synchronization tasks together with their scheduling information, and the InLong Sort module implements the actual data synchronization logic.
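As an illustration of periodic trigger times at minute, hour, day, month, and year granularity, here is a minimal sketch in plain Python. It is not InLong's scheduler: the function names `add_months` and `trigger_times`, the unit strings, and the choice to clamp month-end dates are all assumptions made for this example.

```python
from datetime import datetime, timedelta


def add_months(ts, months):
    """Shift ts by whole months, clamping the day to the target month's length."""
    month_index = ts.month - 1 + months
    year = ts.year + month_index // 12
    month = month_index % 12 + 1
    # Last day of the target month: first day of the following month minus one day.
    if month == 12:
        next_month_start = datetime(year + 1, 1, 1)
    else:
        next_month_start = datetime(year, month + 1, 1)
    last_day = (next_month_start - timedelta(days=1)).day
    return ts.replace(year=year, month=month, day=min(ts.day, last_day))


def trigger_times(start, end, interval, unit):
    """Yield every trigger time in [start, end] for a fixed period.

    unit is one of "minute", "hour", "day", "month", "year" (hypothetical names).
    Each trigger is computed from start, so month-end clamping does not drift.
    """
    fixed = {
        "minute": timedelta(minutes=1),
        "hour": timedelta(hours=1),
        "day": timedelta(days=1),
    }
    n = 0
    while True:
        if unit in fixed:
            current = start + fixed[unit] * (n * interval)
        elif unit == "month":
            current = add_months(start, n * interval)
        elif unit == "year":
            current = add_months(start, 12 * n * interval)
        else:
            raise ValueError(f"unsupported unit: {unit}")
        if current > end:
            return
        yield current
        n += 1
```

For example, `list(trigger_times(datetime(2024, 1, 1), datetime(2024, 1, 1, 6), 2, "hour"))` lists the triggers of a job that runs every two hours within that window. A real deployment would delegate this to Quartz or a third-party scheduler rather than compute it by hand.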
Logical Architecture
Key Capabilities
Job Configuration: supports wizard mode (configuration through a page wizard) and OpenAPI mode.
Scheduling Configuration: supports wizard mode (configuration through a page wizard) and OpenAPI mode.
Job Type: supports periodic incremental synchronization and periodic full synchronization.
Scheduling: simple periodic scheduling is built in; more complex capabilities such as task dependencies are provided by third-party scheduling systems.
Data Source: RDBMS, message queues, and big data storage (Hive, StarRocks, Iceberg, etc.).
Data Sink: RDBMS, message queues, and big data storage (Hive, StarRocks, Iceberg, etc.).
Compute Engine: Flink.
Offline Job Operation and Maintenance: job start, job stop, and running status monitoring.
Special Handling: dirty data processing.
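The two job types in the list above differ in how much data each scheduling instance reads. The sketch below shows one way that difference could be modeled. The `OfflineJob` class, its field names, and the rule that an incremental instance covers the half-open window ending at its trigger time are assumptions for illustration, not InLong's actual design.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class OfflineJob:
    """Hypothetical job definition; field names are illustrative only."""
    name: str
    job_type: str        # "incremental" or "full"
    period: timedelta    # scheduling period of the job


def sync_range(job: OfflineJob, trigger_time: datetime) -> Optional[Tuple[datetime, datetime]]:
    """Return the [start, end) time range one scheduling instance should read.

    A full job returns None, meaning "read the whole source".
    An incremental job reads the period that ends at the trigger time (assumed rule).
    """
    if job.job_type == "full":
        return None
    if job.job_type == "incremental":
        return (trigger_time - job.period, trigger_time)
    raise ValueError(f"unknown job type: {job.job_type}")
```

With a daily incremental job triggered at 2024-05-02 00:00, `sync_range` returns the range from 2024-05-01 00:00 up to, but not including, 2024-05-02 00:00, so consecutive instances neither overlap nor leave gaps.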
Data Flow Architecture
The user creates an offline synchronization task.
The Manager saves the task information and scheduling information in the DB.
After the task is approved, the Manager encapsulates the offline synchronization task information.
The Manager registers the scheduling information with the scheduling system. InLong has a built-in simple scheduling solution (Quartz), while complete scheduling capabilities rely on third-party scheduling systems (DolphinScheduler, US, etc.).
The scheduling system periodically generates scheduling instances.
For the initial run, the Manager constructs a Flink batch job.
The Manager submits the Flink batch job to the Flink cluster.
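The flow above can be walked through with an in-memory sketch. Everything here is a stand-in: `FakeScheduler`, `FakeFlinkCluster`, the `Manager` methods, and the job dictionary are hypothetical names invented for this example, and the sketch builds a job description for every scheduling instance, which is a simplification.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class OfflineTask:
    task_id: str
    schedule: str            # free-form description; the format is illustrative only
    approved: bool = False


class FakeScheduler:
    """Stand-in for Quartz or a third-party scheduling system."""

    def __init__(self):
        self.registered: Dict[str, str] = {}

    def register(self, task_id: str, schedule: str) -> None:
        self.registered[task_id] = schedule

    def fire(self, task_id: str, when: datetime) -> dict:
        # Generate one scheduling instance for a registered task.
        if task_id not in self.registered:
            raise KeyError(f"task not registered: {task_id}")
        return {"task_id": task_id, "trigger_time": when}


class FakeFlinkCluster:
    """Stand-in for a Flink cluster; it only records submissions."""

    def __init__(self):
        self.submitted: List[dict] = []

    def submit(self, job: dict) -> str:
        self.submitted.append(job)
        return f"job-{len(self.submitted)}"


class Manager:
    def __init__(self, scheduler: FakeScheduler, cluster: FakeFlinkCluster):
        self.db: Dict[str, OfflineTask] = {}   # plays the role of the DB
        self.scheduler = scheduler
        self.cluster = cluster

    def create_task(self, task: OfflineTask) -> None:
        # Save task information and scheduling information.
        self.db[task.task_id] = task

    def approve(self, task_id: str) -> None:
        # After approval, register the schedule with the scheduling system.
        task = self.db[task_id]
        task.approved = True
        self.scheduler.register(task.task_id, task.schedule)

    def on_instance(self, instance: dict) -> str:
        # Build a batch job description for the instance and submit it.
        task = self.db[instance["task_id"]]
        job = {
            "task_id": task.task_id,
            "trigger_time": instance["trigger_time"],
            "runtime_mode": "BATCH",
        }
        return self.cluster.submit(job)
```

A run through the sketch is `create_task`, then `approve`, then `scheduler.fire(...)` to produce an instance, and finally `on_instance(...)`, after which the fake cluster holds one job whose `runtime_mode` is `"BATCH"`.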
Task list
New dev branch
Because this is a big feature for InLong, development happens on a new branch, which is merged back to master after development and testing are complete.
Manager
Offline Synchronization Task Management: Definition and Management of Offline Synchronization Tasks
Scheduling Management: Scheduling task definition, scheduling instance definition, scheduling task management (CRUD)
Offline Task Submission
Offline Task Operation and Maintenance
Sort
Flink Task Encapsulation: Add support for Flink environment in batch mode
Flink Batch Capability Support
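One way the Sort entry point could choose between stream and batch execution is through Flink's `execution.runtime-mode` option, which accepts `STREAMING`, `BATCH`, or `AUTOMATIC`. The sketch below only builds the option and its `-D` command-line form. The task kinds `"realtime"` and `"offline"` and both function names are assumptions for this example, and how InLong actually passes the mode to Flink is not specified in this issue.

```python
def flink_runtime_options(task_kind: str) -> dict:
    """Map a (hypothetical) InLong task kind to Flink's runtime-mode option."""
    modes = {
        "realtime": "STREAMING",  # real-time synchronization runs as a stream job
        "offline": "BATCH",       # offline synchronization runs as a batch job
    }
    if task_kind not in modes:
        raise ValueError(f"unknown task kind: {task_kind}")
    return {"execution.runtime-mode": modes[task_kind]}


def as_cli_args(options: dict) -> list:
    """Render options as -Dkey=value arguments, sorted for stable output."""
    return [f"-D{key}={value}" for key, value in sorted(options.items())]
```

Because only the runtime mode differs, the same pipeline definition can in principle be reused for both kinds of task, which is the code-sharing benefit described in the motivation.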
InLong Component
Other for not specified component