
[Feature] dss single point operation is modified to be multi-active #1095

Closed · 2 tasks done

wxyn opened this issue Jul 12, 2023 · 1 comment

Comments

wxyn (Contributor) commented Jul 12, 2023

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Problem Description

At present, each of the microservices deployed by DSS in every environment runs as a single node. Whether a service exception or a host exception occurs, there is a significant risk of the service becoming unavailable, which affects the availability of the entire product. In addition, every version upgrade requires stopping all services for 1-2 hours, which also degrades the user experience.
Therefore, all DSS microservices need to be transformed into a multi-active mode, so that the DSS service remains available when an exception occurs on a node.

Description

Implement multi-active deployment of DSS, so that while one set of service machines is under maintenance, the services on the other machines keep working as usual, without affecting users and without them noticing. Based on this, a complete multi-active deployment plan needs to be provided.
If a service is abnormal during the publishing process, an error message is returned indicating that the system has taken a nap and asking the user to try again later.

Use case

No response

Solutions

1. Overall design
To move DSS from supporting only single-node deployment to supporting multi-node, multi-active deployment, the points to consider are: data sharing and synchronization, data consistency, load balancing and failover, and service discovery and registration. The latter two can directly reuse the existing capabilities of Linkis. DSS itself needs to take care of two things. The first is whether caches are involved in service invocation or inside each microservice, in order to avoid data inconsistency. The second is the tasks executed inside a service, such as executing workflow or node tasks, publishing workflow tasks, copying workflow or project tasks, and workflow import/export tasks, to prevent the abnormal task states of a failed node from being returned to users.
1.1 Technical Architecture

| Module (DataSphereStudio) | Category | Selection | Version |
|---|---|---|---|
| Microservices module | Microservice governance | Spring Cloud | Finchley.RELEASE |
| | Service registration and discovery | Nacos | Not involved yet |
| | Unified configuration center | Managis | 1.3.5 |
| | Gateway routing | Spring Cloud Gateway | 2.0.1.RELEASE |
| | Service invocation | OpenFeign | 2.0.0.RELEASE |
| | Service security authentication | UC | Under planning |
| | Interface document engine | GitBook (Swagger) | Not involved yet |
| | Service application monitoring | Spring Cloud Admin | Not involved yet |
| | Service link tracing | Skywalking | Under planning |
| | Service degradation, circuit breaking and rate limiting | Sentinel/Hystrix (to be evaluated and compared) | Under planning |
| | Load balancing between services | Spring Cloud Ribbon | 2.0.0.RELEASE |
| Basic common module | Database | MySQL | 5.1.34 (driver version) |
| | Data access persistence | MyBatis | 3.4.6 |
| | MVC | Spring MVC | 1.19.1 |
| | Load balancing | Nginx | 1.16.1 |
| | Project build and management | Maven | 3.0+ |
| | Distributed lock | Tentative DB implementation (see the sketch below this table) | |
| | Unified distributed cache | To be researched when needed | Not involved yet |
| | Unified log collection and storage | Tentative ELK | Under planning |
| | Message queue | To be researched when needed | Not involved yet |
| | Distributed transaction | To be researched when needed | Not involved yet |
| | Log printing | Log4j2 + slf4j | 2.17.1 |
| | Front-end framework | TypeScript | 3.5.3 |
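
For the tentative DB-based distributed lock, the following is a minimal sketch of one possible implementation, assuming a hypothetical dss_lock table with a UNIQUE index on lock_name; none of the names below come from the actual DSS code.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch of a DB-backed distributed lock: the instance whose INSERT succeeds owns the lock.
// The dss_lock table, its columns and this class are illustrative assumptions only.
public class DbDistributedLockSketch {

    // Try to acquire the lock; relies on a UNIQUE index on dss_lock.lock_name.
    public static boolean tryLock(Connection conn, String lockName, String instanceName) {
        String sql = "INSERT INTO dss_lock (lock_name, owner_instance, create_time) VALUES (?, ?, NOW())";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, lockName);
            ps.setString(2, instanceName);
            return ps.executeUpdate() == 1;
        } catch (SQLException e) {
            // A duplicate-key error means another instance already holds the lock.
            return false;
        }
    }

    // Release the lock held by this instance.
    public static void unlock(Connection conn, String lockName, String instanceName) throws SQLException {
        String sql = "DELETE FROM dss_lock WHERE lock_name = ? AND owner_instance = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, lockName);
            ps.setString(2, instanceName);
            ps.executeUpdate();
        }
    }
}

In practice an expiration column and a cleanup job would also be needed, so that a lock held by a crashed instance does not block the other nodes forever.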

1.2 Business architecture

From the user's perspective, it is imperceptible whether the backend service runs on a single node or on multiple nodes, so the business architecture remains unchanged.

2. Module design

Since the microservices have already been merged into two services during the microservice consolidation, and there are no cache-related calls between the two services, the cache problem does not need to be considered. The focus is therefore on the tasks executed inside a single service: when a node has tasks in progress and that node fails, the other nodes must be able to report to the user that those tasks have failed. A periodic inspection approach is adopted here: a scheduled task checks the task status and saves it to the database so that it can be returned to the user. The scheduled task is controlled by a configuration parameter and runs every 60 seconds by default.
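
As an illustration only, the default 60-second interval could be exposed as a configuration parameter roughly as follows; the property name dss.multiactive.check.interval.ms and the classes below are hypothetical and not taken from the DSS code.

import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Hypothetical sketch of a configurable periodic check that defaults to 60 seconds.
@Configuration
@EnableScheduling
class SchedulingConfigSketch {
}

@Component
class TaskStatusCheckerSketch {

    // fixedDelayString reads the interval from configuration and falls back to 60000 ms.
    @Scheduled(fixedDelayString = "${dss.multiactive.check.interval.ms:60000}")
    public void checkRunningTasks() {
        // Query tasks in Running/Inited state, compare their instance with the live
        // instances in the registry, mark tasks on dead instances as failed, and alert.
    }
}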

2.1 Workflow Publishing Tasks
2.1.1 Open source workflow conversion
Because the open source version has no publish operation and only converts a DSS workflow into a scheduling system workflow, the task state must be saved in the OrchestratorConversionJob. The existing code keeps the job state only in an in-memory cache, so the job state now needs to be stored in the database; the existing dss_orchestrator_job_info table is reused for this. The scheduled task here is CheckOrchestratorConversionJobTask, defined in the orchestrator server module.
(Figure: convertjob)

Step 1: obtain all instances; if they are all alive, return directly, otherwise record the instance information.
Step 2: query the dss_orchestrator_job_info table for tasks that are running or being initialized.
Step 3: compare against the instance information; if the instance of a running task no longer exists in Eureka, the status of that task must be updated to failed.
Step 4: update the task status information.
Step 5: if a node is abnormal, send an alarm message to the developers, including the information about the failed tasks on that node.
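
A simplified sketch of the five steps above is given below; JobInfo, JobInfoMapper, InstanceRegistry and AlarmSender are assumed helper types for illustration and are not the real classes used by CheckOrchestratorConversionJobTask.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the five-step check; all collaborator types are assumptions.
public class CheckOrchestratorConversionJobSketch {

    private final JobInfoMapper jobInfoMapper;   // assumed DAO for dss_orchestrator_job_info
    private final InstanceRegistry registry;     // assumed wrapper around the Eureka instance list
    private final AlarmSender alarmSender;       // assumed alerting helper

    public CheckOrchestratorConversionJobSketch(JobInfoMapper m, InstanceRegistry r, AlarmSender a) {
        this.jobInfoMapper = m;
        this.registry = r;
        this.alarmSender = a;
    }

    public void check() {
        List<String> alive = registry.getAliveInstanceNames();                                   // step 1
        List<JobInfo> active = jobInfoMapper.selectByStatus(Arrays.asList("Inited", "Running")); // step 2
        List<JobInfo> failed = new ArrayList<>();
        for (JobInfo job : active) {                                                             // step 3
            if (!alive.contains(job.getInstanceName())) {
                job.setStatus("Failed");
                failed.add(job);
            }
        }
        failed.forEach(jobInfoMapper::updateStatus);                                             // step 4
        if (!failed.isEmpty()) {
            alarmSender.sendToDevelopers(failed);                                                // step 5
        }
    }
}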

Note that in the ConvertOrchestration method of OrchestratorPluginServiceImpl, the current instance needs to be obtained through the Sender.getThisInstance method and saved to the dss_orchestrator_job_info table. This table also stores the information of the conversion workflow task, and the OrchestratorConversionJob subsequently updates that conversion task information.
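
A minimal sketch of recording the executing instance when the conversion job is created is shown below; OrchestratorJobRecord and the class itself are assumed names, and the import path assumes Linkis 1.x for the Sender API referred to above.

import org.apache.linkis.rpc.Sender;

// Sketch only: capture the current service instance so that the periodic check can
// later tell which node was executing the conversion job.
public class ConversionJobBookkeepingSketch {

    public OrchestratorJobRecord newJobRecord(String jobId) {
        OrchestratorJobRecord record = new OrchestratorJobRecord();
        record.setJobId(jobId);
        record.setStatus("Inited");
        // Sender.getThisInstance returns the identity of the current service instance.
        record.setInstanceName(Sender.getThisInstance());
        return record; // the caller persists this row into dss_orchestrator_job_info
    }
}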

The existing dss_orchestrator_job_info table is reused. The table changes add the instance_name, status and error_msg fields, and rename the updated_time field to update_time.

2.2 Open Source workflow Executes tasks
The existing dss_workflow_task table is used here to record the instance information; the scheduled task is CheckWorkflowExecuteTask, defined in the flow-execution-server module. The overall process is similar to 2.1.1.
(Figure: executetask)
The persist method in WorkflowPersistenceEngine saves instance information, while the change method updates workflow execution information.

2.3 Open source workflow copy task
The existing dss_orchestrator_copy_info table is used here to record the instance information; the scheduled task is CheckOrchestratorCopyTask, defined in the framework-orchestrator-server module. The overall process is similar to 2.1.1.
(Figure: copyOrchestrator)
The copyOrchestrator method in OrchestratorFrameworkServiceImpl saves instance information, while OrchestratorCopyJob updates workflow copy task information.

2.4 Determine whether the scheduled cleanup of CS tasks is supported in the multi-active mode.

3. Data structure / storage design (determine which fields to use and modify the initialization statements for a first-time installation)

3.1 Workflow Publishing Tasks
3.1.1 Add fields instance_name, status and error_msg in table dss_orchestrator_job_info, and change updated_time to update_time

ALTER TABLE `dss_orchestrator_job_info` ADD `instance_name` varchar(128) DEFAULT NULL COMMENT 'An instance of executing a task';
ALTER TABLE `dss_orchestrator_job_info` ADD `status` varchar(128) DEFAULT NULL COMMENT 'Transition Task Status';
ALTER TABLE `dss_orchestrator_job_info` ADD `error_msg` varchar(2048) DEFAULT NULL COMMENT 'Conversion task exception information';
ALTER TABLE `dss_orchestrator_job_info` CHANGE `updated_time` `update_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP;
ALTER TABLE `dss_orchestrator_job_info` MODIFY `job_id` varchar(64) DEFAULT NULL COMMENT 'task id';

3.2 New field instance_name in table dss_workflow_task

ALTER TABLE `dss_workflow_task` ADD `instance_name` varchar(128) DEFAULT NULL COMMENT 'An instance of executing a task' AFTER `status`;

3.3 Add instance_name in table dss_orchestrator_copy_info

ALTER TABLE `dss_orchestrator_copy_info` ADD `instance_name` varchar(128) DEFAULT NULL COMMENT 'An instance of executing a task' AFTER `status`;

3.4 DDL statements of related tables must be updated at the same time for the initial installation

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@wxyn wxyn added the enhancement New feature or request label Jul 12, 2023
@zqburde zqburde added type=NewFeature and removed enhancement New feature or request labels Aug 9, 2023
zqburde (Contributor) commented Aug 9, 2023

Add it in DSS1.1.2
