diff --git a/img/orchagent-workflow.png b/img/orchagent-workflow.png
new file mode 100644
index 0000000000..b215afe97c
Binary files /dev/null and b/img/orchagent-workflow.png differ
diff --git a/img/performance.png b/img/performance.png
new file mode 100644
index 0000000000..1720b5cecf
Binary files /dev/null and b/img/performance.png differ
diff --git a/img/pipeline-mode.png b/img/pipeline-mode.png
new file mode 100644
index 0000000000..f2589a63a7
Binary files /dev/null and b/img/pipeline-mode.png differ
diff --git a/img/pipeline-timeline.png b/img/pipeline-timeline.png
new file mode 100644
index 0000000000..952c49417a
Binary files /dev/null and b/img/pipeline-timeline.png differ
diff --git a/img/sonic-workflow.png b/img/sonic-workflow.png
new file mode 100644
index 0000000000..33b0fc29e0
Binary files /dev/null and b/img/sonic-workflow.png differ
diff --git a/img/syncd-workflow.jpg b/img/syncd-workflow.jpg
new file mode 100644
index 0000000000..128cc393c3
Binary files /dev/null and b/img/syncd-workflow.jpg differ
diff --git a/img/zebra.jpg b/img/zebra.jpg
new file mode 100644
index 0000000000..3a3eaf2a32
Binary files /dev/null and b/img/zebra.jpg differ
diff --git a/routing-loading-time-enhancement-HLD.md b/routing-loading-time-enhancement-HLD.md
new file mode 100644
index 0000000000..6e3198a86a
--- /dev/null
+++ b/routing-loading-time-enhancement-HLD.md
@@ -0,0 +1,305 @@

# SONiC Routing Loading Time Enhancement

## High Level Design Document

### Rev 0.1

## Table of Contents

- [1. Scope](#1-scope)
- [2. Definitions/Abbreviations](#2-definitionsabbreviations)
- [3. Overview and Bottleneck](#3-overview-and-bottleneck)
  - [3.1 Bottleneck in orchagent](#31-bottleneck-in-orchagent)
  - [3.2 Bottleneck in syncd](#32-bottleneck-in-syncd)
  - [3.3 Bottleneck in APPL\_DB/Redis](#33-bottleneck-in-appl_dbredis)
  - [3.4 Bottleneck in zebra](#34-bottleneck-in-zebra)
- [4. Requirements](#4-requirements)
- [5. Architecture Design](#5-architecture-design)
- [6. High-Level Design](#6-high-level-design)
  - [6.1. Fpmsyncd](#61-fpmsyncd)
  - [6.2. Orchagent](#62-orchagent)
  - [6.3. Syncd](#63-syncd)
  - [6.4. APPL\_DB/Redis](#64-appl_dbredis)
    - [6.4.1 Producerstatetable](#641-producerstatetable)
    - [6.4.2 Consumerstatetable](#642-consumerstatetable)
  - [6.5. Zebra](#65-zebra)
- [7. WarmRestart Design Impact](#7-warmrestart-design-impact)
- [8. Restrictions/Limitations](#8-restrictionslimitations)
- [9. Testing Requirements/Design](#9-testing-requirementsdesign)
  - [9.1 System test](#91-system-test)
  - [9.2 Performance measurements](#92-performance-measurements)

### Revision

| Rev | Date        | Author         | Change Description |
|:---:|:-----------:|:--------------:|--------------------|
| 0.1 | Aug 16 2023 | Yang FengSheng | Initial Draft      |

## About this Manual
This document provides general information about the routing loading time enhancement in SONiC.

## 1. Scope
This document describes the end-to-end optimizations made to speed up BGP route loading time.

## 2. Definitions/Abbreviations

| Definitions/Abbreviation | Description                             |
| ------------------------ | --------------------------------------- |
| ASIC                     | Application specific integrated circuit |
| BGP                      | Border Gateway Protocol                 |
| SWSS                     | Switch state service                    |
| SYNCD                    | ASIC synchronization service            |
| FPM                      | Forwarding Plane Manager                |
| SAI                      | Switch Abstraction Interface            |
| HW                       | Hardware                                |
| SW                       | Software                                |

## 3. Overview and Bottleneck

With the growth of network scale, the routing loading time of SONiC also increases. For small routing scales, a loading time of tens of seconds is acceptable. However, as the routing scale grows further, route loading performance must be optimized. The following figure shows the SONiC route loading workflow, using BGP as an example.

##### Figure 1. SONiC route loading workflow
![SONiC route loading workflow](img/sonic-workflow.png)
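Each hop in Figure 1 is a producer/consumer hand-off through redis, carrying field/value lists keyed by prefix. As a concrete reference for the fpmsyncd hop described in step 3 below, the sketch here builds such a ROUTE_TABLE entry; the helper function is illustrative (not actual fpmsyncd code), while the field names follow the APPL_DB ROUTE_TABLE schema (`nexthop` and `ifname`, comma-separated for ECMP):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using FieldValueTuple = std::pair<std::string, std::string>;

// Build the ROUTE_TABLE field/value list for one prefix. Multiple
// next-hops (ECMP) and their interfaces are comma-separated lists
// of equal length, per the APPL_DB schema.
std::vector<FieldValueTuple> makeRouteEntry(
        const std::vector<std::string> &nexthops,
        const std::vector<std::string> &ifnames) {
    auto join = [](const std::vector<std::string> &v) {
        std::string out;
        for (std::size_t i = 0; i < v.size(); ++i)
            out += (i ? "," : "") + v[i];
        return out;
    };
    return {{"nexthop", join(nexthops)}, {"ifname", join(ifnames)}};
}
```

Each module in the chain repeats a hand-off of this shape, which is why per-entry overhead in redis multiplies along the pipeline.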
1. Bgpd receives and parses new packets from the BGP socket, processes the BGP UPDATE messages, and notifies zebra of the existence of each new prefix and its associated protocol next-hop.
2. Zebra decodes the message from bgpd and delivers the netlink-route message to fpmsyncd.
3. Fpmsyncd processes the netlink message and pushes this state into APPL_DB.
4. As an APPL_DB subscriber, orchagent consumes the routing information pushed to APPL_DB.
5. After processing the received information, orchagent invokes sairedis APIs to push the route information into ASIC_DB.
6. As an ASIC_DB subscriber, syncd receives the new state generated by orchagent.
7. Syncd processes the information and invokes SAI APIs to inject this state into the corresponding ASIC driver. Finally, the new route is programmed into the ASIC.

**NOTE**: This is not the [whole workflow for routing](https://github.com/sonic-net/SONiC/wiki/Architecture#routing-state-interactions); the kernel part is omitted since this document focuses on performance on the SONiC side.

We have measured the end-to-end BGP loading performance on an Alibaba platform.

| Module | Speed (kilo-routes per second) |
| ------------------------ | -------------------------------|
| Zebra |
### 3.1 Bottleneck in orchagent

##### Figure 2. Orchagent workflow

![Orchagent workflow](img/orchagent-workflow.png)

0. RouteOrch creates a ConsumerStateTable to subscribe to the ROUTE_TABLE_CHANNEL event. When fpmsyncd publishes on the route channel, select() is triggered.
1. Orchagent calls the pops() function to fetch data from APPL_DB, which includes the following operations:
   - Pop prefixes from ROUTE_TABLE_SET.
   - Traverse these prefixes and retrieve the temporary key data in _ROUTE_TABLE corresponding to each prefix.
   - Set the key in ROUTE_TABLE.
   - Delete the temporary key in _ROUTE_TABLE.
2. Orchagent calls the addToSync() function to record the data in a local file, swss.rec.
3. Orchagent calls the doTask() function to parse the data one by one and create routes. Routes are temporarily recorded in the EntityBulker. After parsing all the routes, doTask() flushes the EntityBulker and pushes the routes to ASIC_DB.

These three main processes in orchagent run serially, which consumes a lot of time since the routes are processed one by one. This is the main cause of the low route loading performance.

### 3.2 Bottleneck in syncd

##### Figure 3. Syncd workflow

![Syncd workflow](img/syncd-workflow.jpg)
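The pops() sequence described in section 3.1 can be modeled with in-memory maps standing in for the redis tables. In the real system this logic runs as a Lua script inside redis; the `MockApplDb` struct and the write-operation counter below are purely illustrative. The model makes explicit the extra set/delete traffic that every popped route generates:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using FieldValues = std::map<std::string, std::string>;
using Table = std::map<std::string, FieldValues>;

// In-memory stand-in for the three APPL_DB structures involved in pops().
struct MockApplDb {
    std::set<std::string> routeTableSet; // ROUTE_TABLE_SET: pending prefixes
    Table tempRouteTable;                // _ROUTE_TABLE: entries staged by fpmsyncd
    Table routeTable;                    // ROUTE_TABLE: published entries
    int extraWriteOps = 0;               // SET/DEL commands that only bookkeep
};

// Pop all pending prefixes: publish each staged entry to ROUTE_TABLE and
// delete the staged copy, counting the bookkeeping writes along the way.
std::vector<std::pair<std::string, FieldValues>> pops(MockApplDb &db) {
    std::vector<std::pair<std::string, FieldValues>> entries;
    for (const auto &prefix : db.routeTableSet) {
        auto it = db.tempRouteTable.find(prefix);
        if (it == db.tempRouteTable.end()) continue;
        db.routeTable[prefix] = it->second;  // SET key in ROUTE_TABLE
        ++db.extraWriteOps;
        entries.emplace_back(prefix, it->second);
        db.tempRouteTable.erase(it);         // DEL temporary key in _ROUTE_TABLE
        ++db.extraWriteOps;
    }
    db.routeTableSet.clear();
    return entries;
}
```

Two bookkeeping writes per route is the traffic that section 3.3 later identifies as not contributing to route loading while still consuming redis bandwidth.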
The same scenario occurs in syncd when it pops routes from ASIC_DB and installs them into the ASIC through SAI.

### 3.3 Bottleneck in APPL_DB/Redis
SONiC uses redis to deliver data between modules. As route loading performance improves elsewhere, redis performance must improve as well. Redis becomes the bottleneck when there are too many redis I/O operations, since fpmsyncd and orchagent both operate heavily on APPL_DB at the same time.

The Lua script orchagent uses to pop data from APPL_DB not only performs pop operations but also deletes and sets keys in the table. These delete and set operations do not contribute to the route loading process, yet they take extra time and in return slow down fpmsyncd's redis operations.

Redis operations in fpmsyncd also need optimization. Fpmsyncd flushes its redis pipeline every time it receives a message from zebra, to make sure no data is left in the pipeline, and it publishes ROUTE_TABLE_CHANNEL for every single piece of data pushed to APPL_DB. In short, fpmsyncd flushes the redis pipeline too often and publishes on the redis channel too many times, which also causes low performance in both redis and orchagent.

### 3.4 Bottleneck in zebra

##### Figure 4. Zebra flame graph
![Zebra flame graph](img/zebra.jpg)

Zebra receives messages from bgpd and sends route information to the kernel and fpmsyncd. A bottleneck has been observed in the zapi_route_decode function, which causes a slow start for zebra. In addition, the zebra master thread processes route installation results from the kernel, which affects the performance of sending routes to fpmsyncd in the same thread.

## 4. Requirements

High level requirements:

- Alleviate the bottleneck in orchagent
- Alleviate the bottleneck in syncd
- Alleviate the bottleneck caused by redis flushes in fpmsyncd
- Alleviate the bottleneck caused by redis I/O operations in APPL_DB
- Alleviate the bottleneck in zebra (TBD)
- All modifications should maintain the time sequence of route loading
- All modules should support warm restart after modification
- After optimization, end-to-end BGP loading performance should be improved

Restrictions/Limitations:

- SAI/ASIC performance is out of scope of this HLD document.

## 5. Architecture Design

##### Figure 5. Pipeline architecture
The figure below shows the high-level architecture of the software pipeline enhancement.
![Pipeline architecture](img/pipeline-mode.png)

- Add a flush timer in fpmsyncd so that routes are flushed in larger batches.
- Add a pipeline architecture in orchagent and syncd to process routes in parallel.
- Use a ring buffer to deliver messages between threads. A ring buffer performs better than a mutex-protected buffer, since no switching between kernel mode and user mode is needed.

##### Figure 6. Pipeline timeline
The figure below shows the pipeline timeline compared to the original architecture:

![Pipeline timeline](img/pipeline-timeline.png)
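The ring-buffer hand-off between threads can be sketched as a lock-free single-producer/single-consumer queue. This is a minimal illustration using C++11 atomics; the actual SONiC implementation may differ:

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Minimal lock-free single-producer/single-consumer ring buffer.
// One slot is kept empty to distinguish "full" from "empty", so the
// usable capacity is N - 1. No mutex is taken: the producer only
// writes head_, the consumer only writes tail_, so neither thread
// ever switches into the kernel to block.
template <typename T, std::size_t N>
class RingBuffer {
public:
    bool push(const T &item) {              // called by the producer thread
        const auto head = head_.load(std::memory_order_relaxed);
        const auto next = (head + 1) % N;
        if (next == tail_.load(std::memory_order_acquire))
            return false;                   // full: caller retries later
        buf_[head] = item;
        head_.store(next, std::memory_order_release);
        return true;
    }

    std::optional<T> pop() {                // called by the consumer thread
        const auto tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return std::nullopt;            // empty
        T item = buf_[tail];
        tail_.store((tail + 1) % N, std::memory_order_release);
        return item;
    }

    std::size_t size() const {
        return (head_.load() + N - tail_.load()) % N;
    }

private:
    std::array<T, N> buf_{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};
```

The producer (for example, the thread calling `pops()`) owns the head index and the consumer owns the tail, so a failed `push()` simply means "retry later", which provides natural back-pressure between the pipeline stages.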
Using a pipeline inside orchagent and syncd can theoretically double route loading performance, provided redis I/O speed can keep up with the overall route loading speed.

## 6. High-Level Design

### 6.1. Fpmsyncd
Fpmsyncd needs a timer thread to flush the pipeline. *FLUSH_INTERVAL* controls the interval between two flush operations. An appropriate *FLUSH_INTERVAL* should be chosen so that routes are buffered in the redis pipeline but not delayed for too long. The original flush operations in the main loop may no longer be needed. Since the timer thread and the main thread both operate on the same redis pipeline object, a mutex lock is needed here.

Since data is now buffered in the redis pipeline, the pipeline needs a larger size than the default of 128. The redis pipeline flushes itself when it is full, so a larger size may also reduce redis I/O pressure. *10000* to *15000* is an appropriate size range in our use case.

*FLUSH_INTERVAL* and *REDIS_PIPELINE_SIZE* should be configurable by the user for different use cases.

### 6.2. Orchagent

The three main jobs ```table->pops(entries)```, ```Consumer->addToSync(entries)``` and ```Consumer->drain()``` should work in parallel.
- ```table->pops(entries)``` should be called by the master thread to maintain the time sequence.
- A new thread is added to call ```Consumer->addToSync(entries)``` and ```Consumer->drain()```.
- A ring buffer is used to deliver the ```entries``` popped by the main thread to the new thread.

**NOTE:** ```Consumer->addToSync(entries)``` can be called in the master thread or in the new thread, depending on the ```pops()``` and ```doTask()``` performance.

The ring buffer is used not only to deliver data but also to buffer it. Since SAI does not work well on small batches of data, the new thread should check the data size in the ring buffer before it calls ```Consumer->addToSync(entries)```.
Routes will still be cached in *Consumer->m_toSync* rather than the ring buffer if routeorch fails to push routes to ASIC_DB.

A new Consumer class is defined to work in the pipeline architecture.
```c++
class Consumer_pipeline : public Consumer {
 public:
    /**
     * Table->pops() should be in execute().
     * Called by master thread to maintain time sequence.
     */
    void execute() override;
    /**
     * Main function for the new thread.
     */
    void drain() override;
    /**
     * Needs modification to support warm restart.
     */
    void dumpPendingTasks(std::vector<std::string> &ts) override;
};
```
### 9.2 Performance measurements

![Performance](img/performance.png)

- Time for zebra to receive 500k routes from bgpd and send them to fpmsyncd, while other modules take no action.
- Time for fpmsyncd to receive 500k routes from zebra and push them to APPL_DB, while other modules take no action.
- Time for orchagent to pop 500k routes from APPL_DB and push them to ASIC_DB, while other modules take no action.
- Time for syncd to pop 500k routes from ASIC_DB and install them into the ASIC, while other modules take no action.
- Time for the whole system to load 500k routes from another BGP peer.
\ No newline at end of file
diff --git a/routing-loading-time-enhancement-HLD.pdf b/routing-loading-time-enhancement-HLD.pdf
new file mode 100644
index 0000000000..09bc39e5f0
Binary files /dev/null and b/routing-loading-time-enhancement-HLD.pdf differ