
[Feature] dss single point operation is modified to be multi-active #1095

Closed · 2 tasks done

wxyn opened this issue Jul 12, 2023 · 1 comment

Comments

wxyn (Contributor) commented Jul 12, 2023

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Problem Description

At present, each of the microservices deployed by DSS in every environment runs as a single node. Whether a service exception or a host exception occurs, there is a significant risk of the service becoming unavailable, which affects the availability of the entire product. In addition, every version upgrade requires stopping all services for 1-2 hours, which also degrades the user experience.
Therefore, all DSS microservices need to be transformed into a multi-active mode, so that the DSS service remains available when an exception occurs on a node.

Description

Implement multi-active deployment of DSS, so that while one set of service machines is under maintenance, the services on the other machines keep working as usual, without affecting users and without them noticing. Based on this, a complete multi-active deployment plan needs to be provided.
If a service is abnormal during the publishing process, an error message is returned indicating that the system has taken a nap and asking the user to try again later.

Use case

No response

Solutions

1. Overall design
To move DSS from supporting only single-node deployment to supporting multi-node, multi-active deployment, the points to consider are: data sharing and synchronization, data consistency, load balancing and failover, and service discovery and registration. The latter two can directly reuse the existing capabilities of Linkis. DSS itself needs to take care of two things. The first is whether caches are involved in service invocation or inside each microservice, in order to avoid data inconsistency. The second is the tasks executed inside a service, such as executing workflow or node tasks, publishing workflow tasks, copying workflow or project tasks, and workflow import/export tasks, to prevent the abnormal task states of a failed node from being returned to users.
1.1 Technical Architecture

| Module (DataSphereStudio) | Category | Selection | Version |
|---|---|---|---|
| Microservices module | Microservice governance | Spring Cloud | Finchley.RELEASE |
| | Service registration and discovery | Nacos | Not involved yet |
| | Unified configuration center | Managis | 1.3.5 |
| | Gateway routing | Spring Cloud Gateway | 2.0.1.RELEASE |
| | Service invocation | OpenFeign | 2.0.0.RELEASE |
| | Service security authentication | UC | Under planning |
| | Interface document engine | GitBook (Swagger) | Not involved yet |
| | Service application monitoring | Spring Cloud Admin | Not involved yet |
| | Service link tracing | Skywalking | Under planning |
| | Service degradation, circuit breaking and rate limiting | Sentinel/Hystrix (to be evaluated and compared) | Under planning |
| | Load balancing between services | Spring Cloud Ribbon | 2.0.0.RELEASE |
| Basic common module | Database | MySQL | 5.1.34 (driver version) |
| | Data access persistence | MyBatis | 3.4.6 |
| | MVC | Spring MVC | 1.19.1 |
| | Load balancing | Nginx | 1.16.1 |
| | Project build and management | Maven | 3.0+ |
| | Distributed lock | Tentative DB implementation (see the sketch below this table) | |
| | Unified distributed cache | To be researched when needed | Not involved yet |
| | Unified log collection and storage | Tentative ELK | Under planning |
| | Message queue | To be researched when needed | Not involved yet |
| | Distributed transaction | To be researched when needed | Not involved yet |
| | Log printing | Log4j2 + slf4j | 2.17.1 |
| | Front-end framework | TypeScript | 3.5.3 |
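
For the tentative DB-based distributed lock, the following is a minimal sketch of one possible implementation, assuming a hypothetical dss_lock table with a UNIQUE index on lock_name; none of the names below come from the actual DSS code.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch of a DB-backed distributed lock: the instance whose INSERT succeeds owns the lock.
// The dss_lock table, its columns and this class are illustrative assumptions only.
public class DbDistributedLockSketch {

    // Try to acquire the lock; relies on a UNIQUE index on dss_lock.lock_name.
    public static boolean tryLock(Connection conn, String lockName, String instanceName) {
        String sql = "INSERT INTO dss_lock (lock_name, owner_instance, create_time) VALUES (?, ?, NOW())";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, lockName);
            ps.setString(2, instanceName);
            return ps.executeUpdate() == 1;
        } catch (SQLException e) {
            // A duplicate-key error means another instance already holds the lock.
            return false;
        }
    }

    // Release the lock held by this instance.
    public static void unlock(Connection conn, String lockName, String instanceName) throws SQLException {
        String sql = "DELETE FROM dss_lock WHERE lock_name = ? AND owner_instance = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, lockName);
            ps.setString(2, instanceName);
            ps.executeUpdate();
        }
    }
}

In practice an expiration column and a cleanup job would also be needed, so that a lock held by a crashed instance does not block the other nodes forever.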

1.2 Business architecture

From the user's perspective, it is imperceptible whether the backend service runs on a single node or on multiple nodes, so the business architecture remains unchanged.

2. Module design

Since the microservices have already been merged into two services during the microservice consolidation, and there are no cache-related calls between the two services, the cache problem does not need to be considered. The focus is therefore on the tasks executed inside a single service: when a node has tasks in progress and that node fails, the other nodes must be able to report to the user that those tasks have failed. A periodic inspection approach is adopted here: a scheduled task checks the task status and saves it to the database so that it can be returned to the user. The scheduled task is controlled by a configuration parameter and runs every 60 seconds by default.
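
As an illustration only, the default 60-second interval could be exposed as a configuration parameter roughly as follows; the property name dss.multiactive.check.interval.ms and the classes below are hypothetical and not taken from the DSS code.

import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Hypothetical sketch of a configurable periodic check that defaults to 60 seconds.
@Configuration
@EnableScheduling
class SchedulingConfigSketch {
}

@Component
class TaskStatusCheckerSketch {

    // fixedDelayString reads the interval from configuration and falls back to 60000 ms.
    @Scheduled(fixedDelayString = "${dss.multiactive.check.interval.ms:60000}")
    public void checkRunningTasks() {
        // Query tasks in Running/Inited state, compare their instance with the live
        // instances in the registry, mark tasks on dead instances as failed, and alert.
    }
}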

2.1 Workflow Publishing Tasks
2.1.1 Open source workflow conversion
Because the open source version has no publish operation and only converts a DSS workflow into a scheduling system workflow, the task state must be saved in the OrchestratorConversionJob. The existing code keeps the job state only in an in-memory cache, so the job state now needs to be stored in the database; the existing dss_orchestrator_job_info table is reused for this. The scheduled task here is CheckOrchestratorConversionJobTask, defined in the orchestrator server module.
(Figure: convertjob)

Step 1: obtain all instances; if they are all alive, return directly, otherwise record the instance information.
Step 2: query the dss_orchestrator_job_info table for tasks that are running or being initialized.
Step 3: compare against the instance information; if the instance of a running task no longer exists in Eureka, the status of that task must be updated to failed.
Step 4: update the task status information.
Step 5: if a node is abnormal, send an alarm message to the developers, including the information about the failed tasks on that node.
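
A simplified sketch of the five steps above is given below; JobInfo, JobInfoMapper, InstanceRegistry and AlarmSender are assumed helper types for illustration and are not the real classes used by CheckOrchestratorConversionJobTask.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the five-step check; all collaborator types are assumptions.
public class CheckOrchestratorConversionJobSketch {

    private final JobInfoMapper jobInfoMapper;   // assumed DAO for dss_orchestrator_job_info
    private final InstanceRegistry registry;     // assumed wrapper around the Eureka instance list
    private final AlarmSender alarmSender;       // assumed alerting helper

    public CheckOrchestratorConversionJobSketch(JobInfoMapper m, InstanceRegistry r, AlarmSender a) {
        this.jobInfoMapper = m;
        this.registry = r;
        this.alarmSender = a;
    }

    public void check() {
        List<String> alive = registry.getAliveInstanceNames();                                   // step 1
        List<JobInfo> active = jobInfoMapper.selectByStatus(Arrays.asList("Inited", "Running")); // step 2
        List<JobInfo> failed = new ArrayList<>();
        for (JobInfo job : active) {                                                             // step 3
            if (!alive.contains(job.getInstanceName())) {
                job.setStatus("Failed");
                failed.add(job);
            }
        }
        failed.forEach(jobInfoMapper::updateStatus);                                             // step 4
        if (!failed.isEmpty()) {
            alarmSender.sendToDevelopers(failed);                                                // step 5
        }
    }
}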

Note that in the ConvertOrchestration method of OrchestratorPluginServiceImpl, the current instance needs to be obtained through the Sender.getThisInstance method and saved to the dss_orchestrator_job_info table. This table also stores the information of the conversion workflow task, and the OrchestratorConversionJob subsequently updates that conversion task information.
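
A minimal sketch of recording the executing instance when the conversion job is created is shown below; OrchestratorJobRecord and the class itself are assumed names, and the import path assumes Linkis 1.x for the Sender API referred to above.

import org.apache.linkis.rpc.Sender;

// Sketch only: capture the current service instance so that the periodic check can
// later tell which node was executing the conversion job.
public class ConversionJobBookkeepingSketch {

    public OrchestratorJobRecord newJobRecord(String jobId) {
        OrchestratorJobRecord record = new OrchestratorJobRecord();
        record.setJobId(jobId);
        record.setStatus("Inited");
        // Sender.getThisInstance returns the identity of the current service instance.
        record.setInstanceName(Sender.getThisInstance());
        return record; // the caller persists this row into dss_orchestrator_job_info
    }
}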

The existing dss_orchestrator_job_info table is reused. The table changes add the instance_name, status and error_msg fields, and rename the updated_time field to update_time.

2.2 Open Source workflow Executes tasks
The existing dss_workflow_task table is used here to record the instance information; the scheduled task is CheckWorkflowExecuteTask, defined in the flow-execution-server module. The overall process is similar to 2.1.1.
(Figure: executetask)
The persist method in WorkflowPersistenceEngine saves instance information, while the change method updates workflow execution information.

2.3 Open source workflow copy task
The existing dss_orchestrator_copy_info table is used here to record the instance information; the scheduled task is CheckOrchestratorCopyTask, defined in the framework-orchestrator-server module. The overall process is similar to 2.1.1.
(Figure: copyOrchestrator)
The copyOrchestrator method in OrchestratorFrameworkServiceImpl saves instance information, while OrchestratorCopyJob updates workflow copy task information.

2.4 Determine whether the scheduled cleanup of CS tasks is supported in the multi-active mode.

3. Data structure / storage design (determine which fields to use and modify the initialization statements for a first-time installation)

3.1 Workflow Publishing Tasks
3.1.1 Add fields instance_name, status and error_msg in table dss_orchestrator_job_info, and change updated_time to update_time

ALTER TABLE `dss_orchestrator_job_info` ADD `instance_name` varchar(128) DEFAULT NULL COMMENT 'An instance of executing a task';
ALTER TABLE `dss_orchestrator_job_info` ADD `status` varchar(128) DEFAULT NULL COMMENT 'Transition Task Status';
ALTER TABLE `dss_orchestrator_job_info` ADD `error_msg` varchar(2048) DEFAULT NULL COMMENT 'Conversion task exception information';
ALTER TABLE `dss_orchestrator_job_info` CHANGE `updated_time` `update_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP;
ALTER TABLE `dss_orchestrator_job_info` MODIFY `job_id` varchar(64) DEFAULT NULL COMMENT 'task id';

3.2 New field instance_name in table dss_workflow_task

ALTER TABLE `dss_workflow_task` ADD `instance_name` varchar(128) DEFAULT NULL COMMENT 'An instance of executing a task' AFTER `status`;

3.3 Add instance_name in table dss_orchestrator_copy_info

ALTER TABLE `dss_orchestrator_copy_info` ADD `instance_name` varchar(128) DEFAULT NULL COMMENT 'An instance of executing a task' AFTER `status`;

3.4 DDL statements of related tables must be updated at the same time for the initial installation

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@wxyn wxyn added the enhancement New feature or request label Jul 12, 2023
@zqburde zqburde added type=NewFeature and removed enhancement New feature or request labels Aug 9, 2023
zqburde (Contributor) commented Aug 9, 2023

Add it in DSS1.1.2
