Test dynamic reconfiguration of 1M upstream sites #680

Closed
krizhanovsky opened this issue Feb 11, 2017 · 3 comments · Fixed by #885

@krizhanovsky
Contributor

krizhanovsky commented Feb 11, 2017

We should guarantee that we can dynamically add the 1,000,001st upstream site with a downtime of less than 1 second. The issue depends on #51, #76, and #659. All the schedulers must be tested.

@krizhanovsky
Contributor Author

As described in #76 (comment), there are fundamental issues that prevent the whole reload operation from completing within 1 second. Actually, we don't care how long the operation takes; we care how long Tempesta FW's downtime is. Thus, the test must send requests one after another (just like ping) and measure the delays between successful responses (error responses as well as connection terminations are possible); the maximum delay must not exceed 1 second.
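
The ping-style measurement described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `send_request` is a hypothetical caller-supplied callable that returns `True` on a successful response and `False` on an error response, and `probe` and `max_success_gap` are names introduced here, not part of the existing test framework.

```python
import time
from typing import Callable, Iterable, List, Tuple


def max_success_gap(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest delay between consecutive successful responses.

    `samples` holds (timestamp_seconds, ok) pairs in send order. Failed
    probes are skipped, so they widen the gap between the successes
    around them.
    """
    last_ok = None
    worst = 0.0
    for ts, ok in samples:
        if not ok:
            continue
        if last_ok is not None:
            worst = max(worst, ts - last_ok)
        last_ok = ts
    return worst


def probe(send_request: Callable[[], bool], count: int,
          clock: Callable[[], float] = time.monotonic) -> List[Tuple[float, bool]]:
    """Send `count` requests one after another and record each completion time."""
    samples = []
    for _ in range(count):
        try:
            ok = send_request()  # hypothetical: True on a successful response
        except OSError:
            ok = False  # connection terminations count as failures
        samples.append((clock(), ok))
    return samples
```

A test would start `probe` before triggering the reload and then assert `max_success_gap(samples) <= 1.0`. Note that an outage at the very end of the run is not counted unless the run finishes with a successful response, so the probe should keep running for a while after the reload completes.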

@vankoven
Contributor

From the backend servers' perspective, the reconfiguration process takes 3 steps:

  • Parsing the new configuration. No downtime happens at this step.
  • Updating server groups one by one. Only changed groups may go down at this step, and only one at a time.
  • Updating sched_http_rules. Rules are changed on an RCU basis, so no downtime can happen here.

I think we need to check metrics for two cases: downtime for an unmodified server group (should be 0) and downtime for a modified server group (should be less than a second).

Previously I prepared stress tests for reconfiguration (reconfiguration under heavy load): https://github.com/tempesta-tech/tempesta/tree/master/tempesta_fw/t/functional/reconf . Most of them use the same assertion:

for c in self.clients:
    req, err = c.results()
    # Tempesta must be reconfigured in less than 1 sec. Errors must not
    # happen after reconfig has finished.
    max_err = req / duration
    self.assertTrue((err < max_err), msg='HTTP client detected errors')

The request rate of wrk is relatively stable, so from the test duration, the total number of sent requests, and the allowed downtime (1 sec) we can derive the maximum number of allowed errors (502 responses from Tempesta). This limit must not be exceeded. All the tests pass, which means that at least on a small configuration no errors happen.
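
The error budget behind that assertion can be spelled out as a small worked example. The sketch below assumes a constant request rate, as stated above; `max_allowed_errors` and its `allowed_downtime_s` parameter are names introduced here for illustration (the original assertion implicitly uses 1 second).

```python
def max_allowed_errors(total_requests: int, duration_s: float,
                       allowed_downtime_s: float = 1.0) -> float:
    """Number of requests a constant-rate client sends during the allowed downtime."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    rate = total_requests / duration_s  # requests per second
    return rate * allowed_downtime_s


# Example: 150,000 requests over a 15-second run is a rate of 10,000 req/s,
# so fewer than 10,000 errors are tolerated for a 1-second downtime.
budget = max_allowed_errors(150_000, 15)
```

With `allowed_downtime_s=1.0` this reduces to `req / duration`, which is the `max_err` value computed in the test code.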

Although many use cases are checked in the reconf tests (e.g. sticky sessions, changing the group scheduler, updating sched_http_rules), none of them uses a huge configuration, and all of them check only modified groups.

I believe that the ping-style test ends up close to the reconfiguration-under-load tests, since long intervals between pings can lead to false positives.

@krizhanovsky
Contributor Author

The test must ensure that it uses optimal configuration options so that it can set up an environment with many connections. For Nginx:

Also set sysctls:

sysctl -w net.core.somaxconn=8192
sysctl -w net.ipv4.tcp_max_orphans=1000000
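
A test could verify these settings before it starts. The following is a minimal sketch, assuming a Linux host where sysctls are exposed under `/proc/sys`; `REQUIRED`, `read_sysctl`, and `unmet_sysctls` are hypothetical names introduced here, not part of the existing test framework.

```python
import os
from typing import Dict

# Minimum values taken from the sysctl commands above.
REQUIRED = {
    "net.core.somaxconn": 8192,
    "net.ipv4.tcp_max_orphans": 1000000,
}


def read_sysctl(name: str, root: str = "/proc/sys") -> int:
    """Read an integer sysctl from procfs, e.g. net.core.somaxconn."""
    path = os.path.join(root, *name.split("."))
    with open(path) as f:
        return int(f.read().split()[0])


def unmet_sysctls(required: Dict[str, int],
                  root: str = "/proc/sys") -> Dict[str, int]:
    """Return {name: current_value} for every sysctl below its required value."""
    current = {name: read_sysctl(name, root) for name in required}
    return {n: v for n, v in current.items() if v < required[n]}
```

A test setup step could then fail early with something like `assert not unmet_sysctls(REQUIRED)`, rather than producing misleading connection errors later.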
