[Documentation] New controller and agent design docs (futurewei-cloud#156)

This PR adds quite a few design docs including

* Key System Flows

* Alcor Controller Microservices - Mac Manager

* Alcor Database and Cache services

* Alcor Control Agent - major components design

* Communication - Fast path, normal path and rescue path

* System Monitoring

* Communication Protocol with Compute
Liguang Xie authored Apr 10, 2020
1 parent 0ba6274 commit f2831eb
Showing 23 changed files with 1,058 additions and 36 deletions.
3 changes: 1 addition & 2 deletions README.md
We currently support integration with Kubernetes (via CNI plugin) and Mizar Data Plane.
We will continue to integrate with other orchestration systems and data plane implementations.

As a reference, Alcor supports a high performance cloud data plane [Mizar](https://github.com/futurewei-cloud/Mizar),
which is a complementary project of Alcor.
122 changes: 120 additions & 2 deletions docs/visionary_design/controller.adoc
= Alcor Regional Controller Design

// image::images/controller.jpg["Controller architecture", width=1024, link="images/controller.jpg"]
//== Project Scope

== High-Level Architecture

image::images/controller.JPG["Controller architecture", width=1024, link="images/controller.JPG"]

=== Design Principles

* Regional Scope, AZ resilience
* Simple network resource abstraction
* Loosely coupled components for flexible partitioning and easy scale out
* Top-down configuration driving towards eventual consistency
* Decoupling among services
** Database access only through service
** Isolation of database access and cache on the service level
** Enable flexible partitioning for various services
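
The principle of top-down configuration driving toward eventual consistency can be sketched as a reconciliation step: compare the desired state held by the controller against the actual state on a node, and emit the operations that converge them. This is an illustrative sketch only; the function and resource names are hypothetical, not Alcor APIs.

```python
# Hypothetical sketch of top-down configuration converging to eventual
# consistency. The `reconcile` function and the dict-based states are
# illustrative names, not Alcor APIs.

def reconcile(desired: dict, actual: dict) -> list:
    """Compute the operations that move `actual` toward `desired`."""
    ops = []
    for res_id, conf in desired.items():
        if actual.get(res_id) != conf:
            ops.append(("apply", res_id, conf))      # create or update
    for res_id in actual:
        if res_id not in desired:
            ops.append(("delete", res_id))           # remove stale resources
    return ops

# Controller's desired view vs. what a node actually runs:
desired = {"port-1": {"ip": "10.0.0.5"}, "port-2": {"ip": "10.0.0.6"}}
actual = {"port-1": {"ip": "10.0.0.5"}, "port-3": {"ip": "10.0.0.9"}}
ops = reconcile(desired, actual)
```

Re-running the loop until `reconcile` returns no operations is what "eventual consistency" means here: repeated top-down pushes converge the nodes, regardless of transient failures in between.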

== Micro-Service Framework

. One controller instance is one Kubernetes application
. One microservice is one Kubernetes service
. One microservice could consist of multiple service instances (stateless or stateful) to improve availability, scalability and performance

[#ReviewDatabase]
=== Micro-service Snapshot

[width="100%",options="header"]
|====================
|Category|Name|Short Description|Type

.8+^.^|Resource Management Services|VPC Manager| VPC lifecycle management|Stateless
|Subnet Manager| Subnet lifecycle management |Stateless
|Port Manager| Port lifecycle management |Stateless
|Route Manager| Route table and rule management |Stateless
|Private IP Manager| VPC private IP lifecycle management (IPv4/6) |Stateless
|Virtual MAC Manager| Virtual MAC pool management |Stateless
|DNS Manager| DNS/DHCP record management |Stateless
|Virtual IP Manager| Public virtual IP management |Stateless

.4+^.^|Infrastructure Services|Node Manager|Physical node/machine management for the control plane, including bringing nodes in/out of service and maintaining health status|Stateless
|Data Plane Manager|Responsible for sending network configuration to nodes|Stateless
|Gateway Manager|Responsible for gateway management|Stateless
|Resource Pre-Provisioning Manager| TBD |Stateless

.2+^.^|Messaging Services|API Gateway| Responsible for request routing, composition, and protocol translation |Stateless
|Apache Kafka| Messaging services for controller and agent communication |Stateful

.1+^.^|Cache/Database Services|Apache Ignite| Database services to store resource states |Stateful

|====================

=== Concurrency and Event Ordering

Four types of concurrent network resource updates:

[width="100%",options="header"]
|====================
|Concurrent Event Types|Example|Approach

| Operation on decoupled resources
| CRUD of resources under two different, unpeered VPCs
| Free to update simultaneously

| Operation on loosely relevant resources
| Add one port and delete another in the same subnet
a|
- No conflict in resource management
- Network configuration programming: configuration versioning + version-awareness at ACA

| Operation on directly coupled resources
| Delete a VPC while creating a subnet in the same (empty) VPC
a|
- Timestamp issued by API gateway
- Check associated resource status
- DB cleanup for unstaged transactions

| Operation on the same resource
| Update operation and delete operation on the same port
a|
- Customer experience: results may differ depending on execution order
- Resource management: no conflict (using DB concurrency + timestamp versioning)
- Network configuration programming: no conflict

|====================
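
The "version-awareness at ACA" approach in the table above can be illustrated with a small sketch: each network configuration carries a version, and the agent drops any update older than what it has already applied, so out-of-order delivery still converges. The class and field names here are hypothetical, not the actual ACA implementation.

```python
# Illustrative sketch (assumption: each configuration push carries a
# monotonically increasing version issued upstream). Names are hypothetical.

class PortConfigStore:
    def __init__(self):
        self._versions = {}   # port_id -> last applied version
        self._configs = {}    # port_id -> last applied configuration

    def apply(self, port_id, version, config):
        """Apply `config` only if it is newer than the last applied version."""
        if version <= self._versions.get(port_id, -1):
            return False      # stale or duplicate update: drop it
        self._versions[port_id] = version
        self._configs[port_id] = config
        return True

store = PortConfigStore()
store.apply("port-1", 2, {"state": "up"})
applied = store.apply("port-1", 1, {"state": "down"})  # late arrival, ignored
```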

== Availability Zone Resilience

TBD

== Service-to-Service Communication

TBD

//== Design Proposals
//
//=== Proposal A: Database centric design
//
//OpenStack
//Various business logics (implemented via plugin) access to the same database.
//Each service accesses to SQL database with DAO/ADO library.
//
//=== Proposal B: API server centric design
//
//Kubernetes
//Various business logics access to one (partitioned) database through API services.
//
//=== Proposal C: Service centric design
//
//Service mesh
//
//=== Proposal Comparison & Decision
//
//[width="100%",options="header"]
//|====================
//|Design|Pros|Cons
//|Option 1: Database centric design | |Business logic coupling causing maintenance/upgrade challenges, business intra-interference and deep database coupling
//|Option 2: API server centric design |Simplified database access by standard API calls |
//|Option 3: Service centric design| |
//|====================
38 changes: 38 additions & 0 deletions docs/visionary_design/controller_monitoring.adoc
= Controller Service Monitoring

== Architectural Design

* Architectural diagram

=== Microservice Health Metrics

* Counter
* Gauge
* Histogram
* Summary
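
The four metric types above follow the Prometheus model. In a real deployment these come from a client library such as `prometheus_client`; the minimal classes below are only an illustration of what each type records.

```python
# Self-contained sketches of the four metric types (illustrative only;
# a production service would use a metrics client library instead).

class Counter:            # monotonically increasing value (e.g. request count)
    def __init__(self): self.value = 0
    def inc(self, n=1): self.value += n

class Gauge:              # value that can go up and down (e.g. active ports)
    def __init__(self): self.value = 0
    def set(self, v): self.value = v

class Histogram:          # counts of observations per latency bucket
    def __init__(self, buckets=(0.1, 0.5, 1.0)):
        self.buckets = buckets
        self.counts = [0] * (len(buckets) + 1)   # last slot is the +Inf bucket
    def observe(self, v):
        for i, b in enumerate(self.buckets):
            if v <= b:
                self.counts[i] += 1
                return
        self.counts[-1] += 1

class Summary:            # running count and sum, for averages/quantiles
    def __init__(self): self.count, self.total = 0, 0.0
    def observe(self, v):
        self.count += 1
        self.total += v

requests = Counter(); requests.inc()
latency = Histogram(); latency.observe(0.3)   # falls in the <=0.5s bucket
```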

== Service Mesh Monitoring

* Distributed tracing with Istio/Prometheus
** Enable Jaeger engine in Prometheus
** Use trace/span for failure detection

* Useful scenarios
** Grey release
** Service governance
** Service security

== Configure Prometheus/Grafana

Utilize the following dashboards for service monitoring:

* Istio mesh dashboard
* Istio service dashboard
* Istio workload dashboard
* Istio performance dashboard

== Monitoring System Performance Tuning

* Trade-off between collection frequency and performance impact

== Summary
20 changes: 13 additions & 7 deletions docs/visionary_design/data_store.adoc
Liguang Xie <[email protected]>
v0.1, 2019-10-27
:toc: right


== Overview

//[.lead]
Based on <<system-requirements>> and <<FeatureComp>>, Apache Ignite provides a viable option:
* Collocated joins and non-collocated joins
* In-memory indexing

Regarding performance and storage size,
the benchmark results with Yardstick <<ignite_benchmark>> show that Ignite can reach roughly one-third million Ops with sub-millisecond latency on four average server machines (2x Xeon E5-2609 v4 1.7GHz, 96 GB RAM).
The caveat is that the benchmark was conducted with only one client node running 128 client threads, which does not account for network round-trip time in scenarios where 2-phase commit is applied.

The comparison with Cassandra <<ignite_cassandra>> used a more distributed benchmark, YCSB, with three server nodes (the same server configuration as in the Yardstick test).
With 256 client threads, Ignite reached up to 300K READ Ops and 150K READ+UPDATE Ops.

In short, Ignite fits read-intensive and mixed workloads.
With data sharding support, the throughput and latency are expected to meet our system requirements.
Its maximum reliable dataset size can reach hundreds of TBs, which provides sufficient margin to support the fast-growing pace of public cloud.
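
As a rough illustration of how sharding spreads a dataset across server nodes, the sketch below maps each cache key to a stable partition and each partition to a node, in the spirit of Ignite's pluggable affinity function (see <<ignite_affinity_apis>>). The function names and partition count are hypothetical.

```python
# Hash-based sharding sketch (illustrative; Ignite's real affinity
# function is pluggable and handles rebalancing, backups, etc.).
import zlib

def partition_for(key: str, partitions: int = 1024) -> int:
    """Map a cache key to a stable partition id."""
    return zlib.crc32(key.encode()) % partitions

def node_for(key: str, nodes: list, partitions: int = 1024) -> str:
    """Assign the key's partition to one of the server nodes."""
    return nodes[partition_for(key, partitions) % len(nodes)]

nodes = ["node-a", "node-b", "node-c"]
owner = node_for("port-42", nodes)   # same key always lands on the same node
```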

TIP: To get more details about how to scale Ignite cluster to meet the storage requirements,
refer to <<capacity>>.

Mythbusting Database Deployment Options for Big Data: https://www.scylladb.com/wp
- [[[crossaz,14]]] Gridgain data center replication: https://www.gridgain.com/products/software/enterprise-edition/data-center-replication
- [[[ignite_affinity_apis,15]]] Apache Ignite AffinityFunction: https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/affinity/AffinityFunction.html
- [[[ignite_replication,16]]] Apache Ignite Partitioning and Replication: https://apacheignite.readme.io/docs/cache-modes
- [[[ignite_capacity,17]]] Apache Ignite Capacity Planning: https://apacheignite.readme.io/docs/capacity-planning
- [[[ignite_benchmark,18]]] GridGain Benchmarks Results: https://www.gridgain.com/resources/benchmarks/gridgain-benchmarks-results
95 changes: 95 additions & 0 deletions docs/visionary_design/dataplane_abstraction.adoc
= Dataplane Abstraction
Eric Li <sze[email protected]>
v0.1, 2020-03-01
:toc: right

== Introduction

TBD

== Assumptions

. Performance, both control plane programming throughput and latency, is a requirement and priority
. *** UPDATED *** A VPC project network may support more than one type of dataplane (e.g. VXLAN_OVS and VLAN_OVS)

== Current ACA Layered Design

The current ACA implementation already keeps the core goal state (GS) parsing logic in one place regardless of the dataplane; only the actual processing of the VPC/subnet/port configuration is dataplane dependent, thanks to the parallel programming work-item design. We can simply move the dataplane-dependent code into separate classes, or load it as a library/plugin at runtime.

== Basic flow applicable to all options

. Customer creates a VM from the UI or API
. Customer picks an existing VPC and subnet, or creates one in the creation wizard (UI)
. The VM creation API call goes to the Nova scheduler, which picks an appropriate compute host to place the VM
. Nova compute calls the Alcor Controller (formerly the Neutron server) to allocate the network
.. *** UPDATED *** see if we can have Nova compute pass down the host info to the Alcor Controller
. The Nova host agent adds a tap device to connect the VM to OVS br_int
. The Alcor Controller pushes down goal state to the corresponding compute host(s) and network node(s)
. Two major endpoint host goal state updates to note:
.. Port operation CREATE: set up the network device and start dataplane programming
.. Port operation FINALIZE: complete the rest of dataplane programming and mark the device as ready to use
.. Investigate whether we can combine the two port operations
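
The two port operations in the flow above can be sketched as a small state machine: CREATE moves a port into a programming state, FINALIZE completes it and marks the device ready. This is a hypothetical sketch of the sequencing only (the text also raises combining both into one operation); the class and state names are illustrative, not ACA code.

```python
# Illustrative two-step port provisioning sketch; names are hypothetical.

class PortProvisioner:
    def __init__(self):
        self.state = {}     # port_id -> "PROGRAMMING" | "READY"

    def create(self, port_id):
        # set up the network device and start dataplane programming
        self.state[port_id] = "PROGRAMMING"

    def finalize(self, port_id):
        # complete the remaining dataplane programming, then mark ready
        if self.state.get(port_id) != "PROGRAMMING":
            raise ValueError("FINALIZE before CREATE for " + port_id)
        self.state[port_id] = "READY"

p = PortProvisioner()
p.create("port-1")
p.finalize("port-1")
```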

== Abstraction at Alcor Control Plane *** UPDATED ***

. ACA and the Alcor Controller will be configured with the supported network types and a default at startup time (e.g. via config file)
.. Note that this approach does support concurrent network types (e.g. VXLAN and VLAN) on the same host
. When Nova compute calls the Alcor Controller to allocate a network, it can either
.. NOT specify the dataplane type, in which case the Alcor Controller default is used, or
.. Specify the network type; the Alcor Controller will use it for processing, and the ACA running on the compute host will check whether it is supported, else return UNSUPPORTED_NETWORK_TYPE
.. *** QUESTION *** Do we want to go with an always-explicit approach?
. The Alcor Controller can specify which network type to program, or just use the default, when sending down port operation CREATE
.. Note that the Neutron server is always explicit about the network type when sending down port details
. When ACA receives a goal state update with port operation CREATE (in one shot, no longer calling operation FINALIZE)
.. It will set up the network device according to the specified network type, or use the default type if none is specified
.. The Alcor Controller is responsible for aggregating all the port/router update statuses to provide the final "port state up" status
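
The network-type negotiation described above can be sketched as follows: the agent is configured with supported types and a default, and an explicitly requested but unsupported type yields the UNSUPPORTED_NETWORK_TYPE status from the proto schemas in this section. The supported set, default, and function name here are illustrative assumptions.

```python
# Illustrative sketch of ACA-side network type resolution; the supported
# set and default would come from the startup config file in practice.

SUPPORTED = {"VXLAN", "VLAN"}      # hypothetical per-host configuration
DEFAULT = "VXLAN"

def resolve_network_type(requested=None):
    """Return ("OK", type) to program, or the unsupported error status."""
    if requested is None:
        return ("OK", DEFAULT)              # caller left the choice to us
    if requested in SUPPORTED:
        return ("OK", requested)            # explicit and supported
    return ("UNSUPPORTED_NETWORK_TYPE", None)
```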

*src/schema/proto3/goalstate.proto*
[source,protobuf]
------------------------------------------------------------
enum NetworkType { // ***NEW*** *** UPDATED ***
VXLAN = 0; // use the default type configured in compute host ACA
VLAN = 1;
GRE = 2;
GENEVE = 3;
}
/* snipped out */
message GoalState {
NetworkType network_type = 1; // ***NEW***
repeated VpcState vpc_states = 2;
repeated SubnetState subnet_states = 3;
repeated PortState port_states = 4;
repeated SecurityGroupState security_group_states = 5;
}
------------------------------------------------------------

*src/schema/proto3/goalstateprovisioner.proto*
[source,protobuf]
------------------------------------------------------------
enum OperationStatus {
SUCCESS = 0;
FAILURE = 1;
INVALID_ARG = 2;
UNSUPPORTED_NETWORK_TYPE = 3; // ***NEW***
}
/* snipped out */
message GoalStateOperationReply {
repeated GoalStateOperationStatus operation_statuses = 1;
uint32 message_total_operation_time = 2;
message GoalStateOperationStatus {
string resource_id = 1;
ResourceType resource_type = 2;
OperationType operation_type = 3;
OperationStatus operation_status = 4;
uint32 dataplane_programming_time = 5;
uint32 network_configuration_time = 6;
uint32 state_elapse_time = 7;
}
}
------------------------------------------------------------
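
The controller-side aggregation mentioned earlier (providing the final "port state up" status) amounts to checking that every per-resource operation in the `GoalStateOperationReply` succeeded. The sketch below is a hypothetical reduction over status strings mirroring the `OperationStatus` enum, not actual controller code.

```python
# Illustrative aggregation of operation statuses into "port state up".

def port_state_up(operation_statuses):
    """True only if there is at least one status and all report SUCCESS."""
    return bool(operation_statuses) and all(
        s == "SUCCESS" for s in operation_statuses)

statuses = ["SUCCESS", "SUCCESS", "FAILURE"]
up = port_state_up(statuses)    # one host failed, so the port is not up
```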
3 changes: 1 addition & 2 deletions docs/visionary_design/deployment.adoc

=== Compatibility with OpenStack, Kubernetes, and OPNFV

=== Multiple Data Plane Support
* Mizar (eBPF/Geneve)
* OVS/VXLAN (in design)
5 changes: 4 additions & 1 deletion docs/visionary_design/fast_path.adoc
= Control Plane Fast Path
Liguang Xie <lxie@futurewei.com>
v1.0, 2019-08-15
:toc: right

== Introduction

With this fast path, time-critical applications will benefit from the low latency.
The following diagram illustrates the architecture of the network control plane fast path,
and the bi-directional communication channel between the Network Controllers and Control Agents.

image::images/fast_path.GIF["Fast path architecture", width=1024, link="images/fast_path.GIF"]

=== Bi-directional Communication Channel
The top-down communication channel from Controllers to Agents can be used in many scenarios that require low E2E latency for network configuration updates: