Test dynamic reconfiguration of 1M upstream sites #680

krizhanovsky · 2017-02-11T22:33:29Z

We should guarantee that we can dynamically add the 1,000,001st upstream site with a downtime of less than 1 second. The issue depends on #51, #76, and #659. All the schedulers must be tested.

krizhanovsky · 2018-02-17T14:13:08Z

As described in #76 (comment), there are fundamental issues that prevent the whole reload operation from completing within 1 second. Actually, we don't care how long the operation takes; we care how long Tempesta FW's downtime is. Thus, the test must send requests one after another (just like ping) and measure the delays between successful responses (error responses as well as connection terminations are possible); the maximum delay must not exceed 1 second.
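The ping-style measurement described above could be sketched roughly as follows. This is only an illustration, not the test that was actually implemented: the function names, the probe interval, and the use of `urllib` as the client are all assumptions.

```python
import http.client
import time
import urllib.error
import urllib.request


def max_success_gap(samples):
    """Return the longest interval between consecutive successful responses.

    `samples` is a sequence of (timestamp_seconds, ok) pairs in send order.
    Failed requests are skipped, so they widen the gap between successes.
    """
    ok_times = [ts for ts, ok in samples if ok]
    if len(ok_times) < 2:
        # Not enough successes to put any bound on the downtime.
        return float("inf")
    return max(b - a for a, b in zip(ok_times, ok_times[1:]))


def probe(url, duration, interval=0.01, timeout=1.0):
    """Send requests one after another for `duration` seconds.

    The URL, interval and timeout are hypothetical parameters.
    """
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                ok = 200 <= resp.status < 300
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            # Error responses and terminated connections count as failures.
            ok = False
        samples.append((time.monotonic(), ok))
        time.sleep(interval)
    return samples
```

A test would start `probe()` before triggering the reload and then assert that `max_success_gap(samples)` is below 1 second.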

vankoven · 2018-02-17T16:52:51Z

From the backend servers' perspective, the reconfiguration process takes 3 steps:

1. Parsing the new configuration. No downtime happens at this step.
2. Updating server groups one by one. Only changed groups may go down at this step, and only one at a time.
3. Updating sched_http_rules. Rules are changed on an RCU basis, so no downtime may happen here.

I think we need to check metrics for two cases: downtime for an unmodified server group (should be 0) and downtime for a modified server group (should be less than a second).
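The two cases could be checked with a helper along these lines. It is a sketch under assumptions: the helper name is hypothetical, and "downtime is 0" is interpreted here as "no failed requests at all", which the thread does not spell out.

```python
def check_group_downtime(samples, modified, limit=1.0):
    """Check one server group against its downtime target.

    `samples` is a sequence of (timestamp_seconds, ok) pairs for requests
    routed to that group, in send order.
    """
    if not modified:
        # Assumption: zero downtime means every single request succeeded.
        return all(ok for _, ok in samples)
    ok_times = [ts for ts, ok in samples if ok]
    if len(ok_times) < 2:
        return False  # cannot bound the downtime without two successes
    longest = max(b - a for a, b in zip(ok_times, ok_times[1:]))
    return longest < limit
```

Measuring both kinds of groups in the same run shows whether reloading one group disturbs traffic to the others.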

Previously I prepared stress tests for reconfiguration (reconfiguration under heavy load): https://github.com/tempesta-tech/tempesta/tree/master/tempesta_fw/t/functional/reconf . Most of them use the same assertion:

tempesta/tempesta_fw/t/functional/reconf/reconf_stress.py

Lines 71 to 76 in 3f912aa

    for c in self.clients:
        req, err = c.results()
        # Tempesta must be reconfigured in less than 1 sec. Errors must not
        # happen after reconfig has finished.
        max_err = req / duration
        self.assertTrue((err < max_err), msg='HTTP client detected errors')

The request rate of wrk is relatively stable, so based on the test duration, the total number of sent requests, and the allowed downtime (1 sec) we can define a maximum number of allowed errors (502 responses from Tempesta). The limit must not be exceeded. All the tests pass, which means that at least on a small configuration no errors happen.
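The arithmetic behind that limit can be spelled out as a small sketch. The function name and the explicit `allowed_downtime` parameter are illustrative; the quoted test simply uses `req / duration`, which corresponds to an allowed downtime of 1 second.

```python
def max_allowed_errors(total_requests, duration, allowed_downtime=1.0):
    """Upper bound on error responses for a steady request rate.

    With a stable rate, the number of requests hitting a downtime window
    is roughly rate * window length.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    rate = total_requests / duration  # requests per second
    return rate * allowed_downtime
```

For example, 100,000 requests sent over 10 seconds give a rate of 10,000 requests per second, so up to 10,000 errors fit into a 1-second downtime window.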

Although many use cases are checked in the reconf tests (e.g. sticky sessions, changing the group scheduler, updating sched_http_rules), none of them uses a huge configuration, and all of them check only modified groups.

I believe the ping-style test is close to a reconfiguration-under-load test, since large intervals between pings can lead to false positives.

krizhanovsky · 2018-02-20T01:42:45Z

The test must ensure that it uses optimal configuration options to be able to set up a multi-connection environment. For Nginx:

multi_accept on
huge values for worker_connections, worker_rlimit_nofile, worker_processes, see https://www.nginx.com/blog/tuning-nginx/
relatively large keepalive_timeout to avoid triggering reconnection events too frequently
listen backlog should be several thousands

Also set sysctls:

sysctl -w net.core.somaxconn=8192
sysctl -w net.ipv4.tcp_max_orphans=1000000
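A test could verify these sysctls before it starts by reading them from `/proc/sys`. The sketch below assumes a Linux host; the helper names are hypothetical, and treating the values above as minimums rather than exact settings is also an assumption.

```python
from pathlib import Path

# Values taken from the sysctl commands above, treated here as minimums.
RECOMMENDED = {
    "net.core.somaxconn": 8192,
    "net.ipv4.tcp_max_orphans": 1000000,
}


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root, *name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Read an integer sysctl value."""
    return int(sysctl_path(name, root).read_text().split()[0])


def undersized(recommended=RECOMMENDED, root="/proc/sys"):
    """Return {name: current_value} for sysctls below the recommendation."""
    result = {}
    for name, wanted in recommended.items():
        current = read_sysctl(name, root)
        if current < wanted:
            result[name] = current
    return result
```

The test can skip or fail early when `undersized()` returns a non-empty dictionary, instead of reporting misleading connection errors later.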

krizhanovsky added crucial test labels Feb 11, 2017

krizhanovsky added this to the 0.5.0 Web Server milestone Feb 11, 2017

krizhanovsky mentioned this issue Feb 11, 2017

cfg: hot configuration reloading #51

Closed

krizhanovsky assigned vankoven and keshonok Feb 26, 2017

krizhanovsky removed the crucial label Mar 17, 2017

krizhanovsky mentioned this issue May 25, 2017

Review APM & Ratio scheduler #712

Open

krizhanovsky assigned intelfx and unassigned vankoven and keshonok Jul 26, 2017

krizhanovsky modified the milestones: 0.5.0 Web Server, 0.5 alpha Jan 8, 2018

krizhanovsky assigned vladtcvs and unassigned intelfx Jan 8, 2018

krizhanovsky added the crucial label Jan 9, 2018

krizhanovsky mentioned this issue Jan 9, 2018

Requests scheduling to massive farm of backend servers #76

Closed

vladtcvs mentioned this issue Jan 17, 2018

1M backends #885

Merged

krizhanovsky modified the milestones: 1.0 Tempesta OS, 0.5 alpha Feb 5, 2018

vladtcvs closed this as completed in #885 Mar 15, 2018
