[Documentation] New controller and agent design docs (futurewei-cloud#156)

This PR adds quite a few design docs including

* Key System Flows

* Alcor Controller Microservices - Mac Manager

* Alcor Database and Cache services

* Alcor Control Agent - major components design

* Communication - Fast path, normal path and rescue path

* System Monitoring

* Communication Protocol with Compute
Liguang Xie authored Apr 10, 2020
1 parent 0ba6274 commit f2831eb
Showing 23 changed files with 1,058 additions and 36 deletions.
3 changes: 1 addition & 2 deletions README.md
We currently support integration with Kubernetes (via CNI plugin) and Mizar Data Plane.
We will continue to integrate with other orchestration systems and data plane implementations.

As a reference, Alcor supports a high performance cloud data plane [Mizar](https://github.com/futurewei-cloud/Mizar),
which is a complementary project of Alcor.
122 changes: 120 additions & 2 deletions docs/visionary_design/controller.adoc
= Alcor Regional Controller Design

// image::images/controller.jpg["Controller architecture", width=1024, link="images/controller.jpg"]
//== Project Scope

== High-Level Architecture

image::images/controller.JPG["Controller architecture", width=1024, link="images/controller.JPG"]

=== Design Principles

* Regional Scope, AZ resilience
* Simple network resource abstraction
* Loosely coupled components for flexible partitioning and easy scale out
* Top-down configuration driving towards eventual consistency
* Decoupling among services
** Database access only through service
** Isolation of database access and cache on the service level
** Enable flexible partitioning for various services
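
The principle of top-down configuration driving toward eventual consistency can be sketched as a reconciliation step: compare the desired state held by the controller against the actual state on a node, and emit the operations that converge them. This is an illustrative sketch only; the function and resource names are hypothetical, not Alcor APIs.

```python
# Hypothetical sketch of top-down configuration converging to eventual
# consistency. The `reconcile` function and the dict-based states are
# illustrative names, not Alcor APIs.

def reconcile(desired: dict, actual: dict) -> list:
    """Compute the operations that move `actual` toward `desired`."""
    ops = []
    for res_id, conf in desired.items():
        if actual.get(res_id) != conf:
            ops.append(("apply", res_id, conf))      # create or update
    for res_id in actual:
        if res_id not in desired:
            ops.append(("delete", res_id))           # remove stale resources
    return ops

# Controller's desired view vs. what a node actually runs:
desired = {"port-1": {"ip": "10.0.0.5"}, "port-2": {"ip": "10.0.0.6"}}
actual = {"port-1": {"ip": "10.0.0.5"}, "port-3": {"ip": "10.0.0.9"}}
ops = reconcile(desired, actual)
```

Re-running the loop until `reconcile` returns no operations is what "eventual consistency" means here: repeated top-down pushes converge the nodes, regardless of transient failures in between.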

== Micro-Service Framework

. One controller instance is one Kubernetes application
. One microservice is one Kubernetes service
. One microservice could consist of multiple service instances (stateless or stateful) to improve availability, scalability and performance

[#ReviewDatabase]
=== Micro-service Snapshot

[width="100%",options="header"]
|====================
|Category|Name|Short Description|Type

.8+^.^|Resource Management Services|VPC Manager| VPC lifecycle management|Stateless
|Subnet Manager| Subnet lifecycle management |Stateless
|Port Manager| Port lifecycle management |Stateless
|Route Manager| Route table and rule management |Stateless
|Private IP Manager| VPC private IP lifecycle management (IPv4/6) |Stateless
|Virtual MAC Manager| Virtual MAC pool management |Stateless
|DNS Manager| DNS/DHCP record management |Stateless
|Virtual IP Manager| Public virtual IP management |Stateless

.4+^.^|Infrastructure Services|Node Manager|Physical node/machine management for the control plane, including bringing nodes in/out of service and maintaining health status|Stateless
|Data Plane Manager|Responsible for sending network configuration to nodes|Stateless
|Gateway Manager|Responsible for gateway management|Stateless
|Resource Pre-Provisioning Manager| TBD |Stateless

.2+^.^|Messaging Services|API Gateway| Responsible for request routing, composition, and protocol translation |Stateless
|Apache Kafka| Messaging services for controller and agent communication |Stateful

.1+^.^|Cache/Database Services|Apache Ignite| Database services to store resource states |Stateful

|====================

=== Concurrency and Event Ordering

Four types of concurrent network resource updates:

[width="100%",options="header"]
|====================
|Concurrent Event Types|Example|Approach

| Operation on decoupled resources
| CRUD of resources under two different, unpeered VPCs
| Free to update simultaneously

| Operation on loosely relevant resources
| Add one port and delete another in the same subnet
a|
- No conflict in resource management
- Network configuration programming: configuration versioning + version-awareness at ACA

| Operation on directly coupled resources
| Delete a VPC while creating a subnet in the same (empty) VPC
a|
- Timestamp issued by API gateway
- Check associated resource status
- DB cleanup for unstaged transactions

| Operation on the same resource
| Update operation and delete operation on the same port
a|
- Customer experience: results may differ depending on execution order
- Resource management: no conflict (using DB concurrency + timestamp versioning)
- Network configuration programming: no conflict

|====================
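
The "version-awareness at ACA" approach in the table above can be illustrated with a small sketch: each network configuration carries a version, and the agent drops any update older than what it has already applied, so out-of-order delivery still converges. The class and field names here are hypothetical, not the actual ACA implementation.

```python
# Illustrative sketch (assumption: each configuration push carries a
# monotonically increasing version issued upstream). Names are hypothetical.

class PortConfigStore:
    def __init__(self):
        self._versions = {}   # port_id -> last applied version
        self._configs = {}    # port_id -> last applied configuration

    def apply(self, port_id, version, config):
        """Apply `config` only if it is newer than the last applied version."""
        if version <= self._versions.get(port_id, -1):
            return False      # stale or duplicate update: drop it
        self._versions[port_id] = version
        self._configs[port_id] = config
        return True

store = PortConfigStore()
store.apply("port-1", 2, {"state": "up"})
applied = store.apply("port-1", 1, {"state": "down"})  # late arrival, ignored
```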

== Availability Zone Resilience

TBD

== Service-to-Service Communication

TBD

//== Design Proposals
//
//=== Proposal A: Database centric design
//
//OpenStack
//Various business logics (implemented via plugin) access to the same database.
//Each service accesses to SQL database with DAO/ADO library.
//
//=== Proposal B: API server centric design
//
//Kubernetes
//Various business logics access to one (partitioned) database through API services.
//
//=== Proposal C: Service centric design
//
//Service mesh
//
//=== Proposal Comparison & Decision
//
//[width="100%",options="header"]
//|====================
//|Design|Pros|Cons
//|Option 1: Database centric design | |Business logic coupling causing maintenance/upgrade challenges, business intra-interference and deep database coupling
//|Option 2: API server centric design |Simplified database access by standard API calls |
//|Option 3: Service centric design| |
//|====================
38 changes: 38 additions & 0 deletions docs/visionary_design/controller_monitoring.adoc
= Controller Service Monitoring

== Architectural Design

* Architectural diagram

=== Microservice Health Metrics

* Counter
* Gauge
* Histogram
* Summary
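
The four metric types above follow the Prometheus model. In a real deployment these come from a client library such as `prometheus_client`; the minimal classes below are only an illustration of what each type records.

```python
# Self-contained sketches of the four metric types (illustrative only;
# a production service would use a metrics client library instead).

class Counter:            # monotonically increasing value (e.g. request count)
    def __init__(self): self.value = 0
    def inc(self, n=1): self.value += n

class Gauge:              # value that can go up and down (e.g. active ports)
    def __init__(self): self.value = 0
    def set(self, v): self.value = v

class Histogram:          # counts of observations per latency bucket
    def __init__(self, buckets=(0.1, 0.5, 1.0)):
        self.buckets = buckets
        self.counts = [0] * (len(buckets) + 1)   # last slot is the +Inf bucket
    def observe(self, v):
        for i, b in enumerate(self.buckets):
            if v <= b:
                self.counts[i] += 1
                return
        self.counts[-1] += 1

class Summary:            # running count and sum, for averages/quantiles
    def __init__(self): self.count, self.total = 0, 0.0
    def observe(self, v):
        self.count += 1
        self.total += v

requests = Counter(); requests.inc()
latency = Histogram(); latency.observe(0.3)   # falls in the <=0.5s bucket
```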

== Service Mesh Monitoring

* Distributed tracing with Istio/Prometheus
** Enable Jaeger engine in Prometheus
** Use trace/span for failure detection

* Useful scenarios
** Grey release
** Service governance
** Service security

== Configure Prometheus/Grafana

Utilize the following dashboards for service monitoring:

* Istio mesh dashboard
* Istio service dashboard
* Istio workload dashboard
* Istio performance dashboard

== Monitoring System Performance Tuning

* Trade-off between collection frequency and performance impact

== Summary
20 changes: 13 additions & 7 deletions docs/visionary_design/data_store.adoc
Liguang Xie <[email protected]>
v0.1, 2019-10-27
:toc: right


== Overview

//[.lead]
Based on <<system-requirements>> and <<FeatureComp>>, Apache Ignite provides a viable option:
* Collocated joins and non-collocated joins
* In-memory indexing

Regarding performance and storage size,
the benchmark results with Yardstick <<ignite_benchmark>> show that Ignite can reach roughly one-third million Ops with sub-millisecond latency on four average server machines (2x Xeon E5-2609 v4 1.7GHz, 96 GB RAM).
The caveat is that the benchmark was conducted with only one client node running 128 client threads, which does not account for network round-trip time in scenarios where 2-phase commit is applied.

The comparison with Cassandra <<ignite_cassandra>> used a more distributed benchmark, YCSB, with three server nodes (the same server configuration as in the Yardstick test).
With 256 client threads, Ignite reached up to 300K READ Ops and 150K READ+UPDATE Ops.

In short, Ignite fits read-intensive and mixed workloads.
With data sharding support, the throughput and latency are expected to meet our system requirements.
Its maximum reliable dataset size can reach hundreds of TBs, which provides sufficient margin to support the fast-growing pace of public cloud.
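
As a rough illustration of how sharding spreads a dataset across server nodes, the sketch below maps each cache key to a stable partition and each partition to a node, in the spirit of Ignite's pluggable affinity function (see <<ignite_affinity_apis>>). The function names and partition count are hypothetical.

```python
# Hash-based sharding sketch (illustrative; Ignite's real affinity
# function is pluggable and handles rebalancing, backups, etc.).
import zlib

def partition_for(key: str, partitions: int = 1024) -> int:
    """Map a cache key to a stable partition id."""
    return zlib.crc32(key.encode()) % partitions

def node_for(key: str, nodes: list, partitions: int = 1024) -> str:
    """Assign the key's partition to one of the server nodes."""
    return nodes[partition_for(key, partitions) % len(nodes)]

nodes = ["node-a", "node-b", "node-c"]
owner = node_for("port-42", nodes)   # same key always lands on the same node
```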

TIP: To get more details about how to scale Ignite cluster to meet the storage requirements,
refer to <<capacity>>.

Mythbusting Database Deployment Options for Big Data: https://www.scylladb.com/wp
- [[[crossaz,14]]] Gridgain data center replication: https://www.gridgain.com/products/software/enterprise-edition/data-center-replication
- [[[ignite_affinity_apis,15]]] Apache Ignite AffinityFunction: https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/affinity/AffinityFunction.html
- [[[ignite_replication,16]]] Apache Ignite Partitioning and Replication: https://apacheignite.readme.io/docs/cache-modes
- [[[ignite_capacity,17]]] Apache Ignite Capacity Planning: https://apacheignite.readme.io/docs/capacity-planning
- [[[ignite_benchmark,18]]] GridGain Benchmarks Results: https://www.gridgain.com/resources/benchmarks/gridgain-benchmarks-results
95 changes: 95 additions & 0 deletions docs/visionary_design/dataplane_abstraction.adoc
= Dataplane Abstraction
Eric Li <sze[email protected]>
v0.1, 2020-03-01
:toc: right

== Introduction

TBD

== Assumptions

. Performance, both control plane programming throughput and latency, is a requirement and priority
. *** UPDATED *** A VPC project network may support more than one type of dataplane (e.g. VXLAN_OVS and VLAN_OVS)

== Current ACA Layered Design

The current ACA implementation already keeps the core goal state (GS) parsing logic in one place regardless of the dataplane; only the actual processing of the VPC/subnet/port configuration is dataplane dependent, thanks to the parallel programming work-item design. We can simply move the dataplane-dependent code into separate classes, or load it as a library/plugin at runtime.

== Basic flow applicable to all options

. Customer creates a VM from the UI or API
. Customer picks an existing VPC and subnet, or creates one in the creation wizard (UI)
. The VM creation API call goes to the Nova scheduler, which picks an appropriate compute host to place the VM
. Nova compute calls the Alcor Controller (formerly the Neutron server) to allocate the network
.. *** UPDATED *** see if we can have Nova compute pass down the host info to the Alcor Controller
. The Nova host agent adds a tap device to connect the VM to OVS br_int
. The Alcor Controller pushes down goal state to the corresponding compute host(s) and network node(s)
. Two major endpoint host goal state updates to note:
.. Port operation CREATE: set up the network device and start dataplane programming
.. Port operation FINALIZE: complete the rest of dataplane programming and mark the device as ready to use
.. Investigate whether we can combine the two port operations
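
The two port operations in the flow above can be sketched as a small state machine: CREATE moves a port into a programming state, FINALIZE completes it and marks the device ready. This is a hypothetical sketch of the sequencing only (the text also raises combining both into one operation); the class and state names are illustrative, not ACA code.

```python
# Illustrative two-step port provisioning sketch; names are hypothetical.

class PortProvisioner:
    def __init__(self):
        self.state = {}     # port_id -> "PROGRAMMING" | "READY"

    def create(self, port_id):
        # set up the network device and start dataplane programming
        self.state[port_id] = "PROGRAMMING"

    def finalize(self, port_id):
        # complete the remaining dataplane programming, then mark ready
        if self.state.get(port_id) != "PROGRAMMING":
            raise ValueError("FINALIZE before CREATE for " + port_id)
        self.state[port_id] = "READY"

p = PortProvisioner()
p.create("port-1")
p.finalize("port-1")
```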

== Abstraction at Alcor Control Plane *** UPDATED ***

. ACA and the Alcor Controller will be configured with the supported network types and a default at startup time (e.g. via config file)
.. Note that this approach does support concurrent network types (e.g. VXLAN and VLAN) on the same host
. When Nova compute calls the Alcor Controller to allocate a network, it can either
.. NOT specify the dataplane type, in which case the Alcor Controller default is used, or
.. Specify the network type; the Alcor Controller will use it for processing, and the ACA running on the compute host will check whether it is supported, else return UNSUPPORTED_NETWORK_TYPE
.. *** QUESTION *** Do we want to go with an always-explicit approach?
. The Alcor Controller can specify which network type to program, or just use the default, when sending down port operation CREATE
.. Note that the Neutron server is always explicit about the network type when sending down port details
. When ACA receives a goal state update with port operation CREATE (in one shot, no longer calling operation FINALIZE)
.. It will set up the network device according to the specified network type, or use the default type if none is specified
.. The Alcor Controller is responsible for aggregating all the port/router update statuses to provide the final "port state up" status
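
The network-type negotiation described above can be sketched as follows: the agent is configured with supported types and a default, and an explicitly requested but unsupported type yields the UNSUPPORTED_NETWORK_TYPE status from the proto schemas in this section. The supported set, default, and function name here are illustrative assumptions.

```python
# Illustrative sketch of ACA-side network type resolution; the supported
# set and default would come from the startup config file in practice.

SUPPORTED = {"VXLAN", "VLAN"}      # hypothetical per-host configuration
DEFAULT = "VXLAN"

def resolve_network_type(requested=None):
    """Return ("OK", type) to program, or the unsupported error status."""
    if requested is None:
        return ("OK", DEFAULT)              # caller left the choice to us
    if requested in SUPPORTED:
        return ("OK", requested)            # explicit and supported
    return ("UNSUPPORTED_NETWORK_TYPE", None)
```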

*src/schema/proto3/goalstate.proto*
[source,protobuf]
------------------------------------------------------------
enum NetworkType { // ***NEW*** *** UPDATED ***
VXLAN = 0; // use the default type configured in compute host ACA
VLAN = 1;
GRE = 2;
GENEVE = 3;
}
/* snipped out */
message GoalState {
NetworkType network_type = 1; // ***NEW***
repeated VpcState vpc_states = 2;
repeated SubnetState subnet_states = 3;
repeated PortState port_states = 4;
repeated SecurityGroupState security_group_states = 5;
}
------------------------------------------------------------

*src/schema/proto3/goalstateprovisioner.proto*
[source,protobuf]
------------------------------------------------------------
enum OperationStatus {
SUCCESS = 0;
FAILURE = 1;
INVALID_ARG = 2;
UNSUPPORTED_NETWORK_TYPE = 3; // ***NEW***
}
/* snipped out */
message GoalStateOperationReply {
repeated GoalStateOperationStatus operation_statuses = 1;
uint32 message_total_operation_time = 2;
message GoalStateOperationStatus {
string resource_id = 1;
ResourceType resource_type = 2;
OperationType operation_type = 3;
OperationStatus operation_status = 4;
uint32 dataplane_programming_time = 5;
uint32 network_configuration_time = 6;
uint32 state_elapse_time = 7;
}
}
------------------------------------------------------------
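
The controller-side aggregation mentioned earlier (providing the final "port state up" status) amounts to checking that every per-resource operation in the `GoalStateOperationReply` succeeded. The sketch below is a hypothetical reduction over status strings mirroring the `OperationStatus` enum, not actual controller code.

```python
# Illustrative aggregation of operation statuses into "port state up".

def port_state_up(operation_statuses):
    """True only if there is at least one status and all report SUCCESS."""
    return bool(operation_statuses) and all(
        s == "SUCCESS" for s in operation_statuses)

statuses = ["SUCCESS", "SUCCESS", "FAILURE"]
up = port_state_up(statuses)    # one host failed, so the port is not up
```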
3 changes: 1 addition & 2 deletions docs/visionary_design/deployment.adoc

=== Compatibility with OpenStack, Kubernetes, and OPNFV

=== Multiple Data Plane Support
* Mizar (eBPF/Geneve)
* OVS/VXLAN (in design)
5 changes: 4 additions & 1 deletion docs/visionary_design/fast_path.adoc
= Control Plane Fast Path
Liguang Xie <lxie@futurewei.com>
v1.0, 2019-08-15
:toc: right

== Introduction

With this fast path, time-critical applications will benefit from the low latency.
The following diagram illustrates the architecture of the network control plane fast path,
and the bi-directional communication channel between the Network Controllers and Control Agents.

image::images/fast_path.GIF["Fast path architecture", width=1024, link="images/fast_path.GIF"]

=== Bi-directional Communication Channel
The top-down communication channel from Controllers to Agents can be used in many scenarios that require low E2E latency for network configuration updates: