
Database backend - stress test and failure test plan #4818

hzy46 opened this issue Aug 17, 2020 · 1 comment

hzy46 commented Aug 17, 2020

Test Environment

The cluster has 10000 existing jobs and about 10 nodes with 50 GPUs. The Hived scheduler is enabled.

In each case, we measure the latency of listing jobs, getting job detail, and submitting a job (fire 10 requests and compute the average latency). All list requests use an offset and a limit of 20.

If there is no load, the latency is about:

| list job | get job detail | submit a job |
| --- | --- | --- |
| 54.1 ms ± 17.2 ms | 56.6 ms ± 29.2 ms | 343 ms ± 125 ms |
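
For reference, the numbers in this issue can be reproduced with a simple loop like the one below. This is only a sketch: the rest-server address, token, and exact API path are placeholders and may need adjusting for your deployment.

```bash
# Minimal sketch of the measurement method: fire 10 requests and average.
# REST_URL, TOKEN and the /api/v2/jobs path are placeholders/assumptions.
REST_URL="http://pai-master.example.com/rest-server"   # replace with your address
TOKEN="REPLACE_WITH_API_TOKEN"

total=0
for i in $(seq 1 10); do
  # %{time_total} is curl's end-to-end time for one request, in seconds
  t=$(curl -s -o /dev/null -w '%{time_total}' \
        -H "Authorization: Bearer $TOKEN" \
        "$REST_URL/api/v2/jobs?offset=0&limit=20")
  total=$(echo "$total + $t" | bc -l)
done
echo "average list-job latency: $(echo "$total / 10" | bc -l) s"
```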

Stress Test

Job with a large task number

Submit 1 job with 250/1000/5000 tasks, open 20+ job detail pages, and check whether this causes cluster instability.
Also check whether we can still submit new jobs and view other jobs' details.
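
A job with a large task number is just a normal job whose task role has many instances. The sketch below shows roughly what such a submission looks like; the job-protocol fields and the submit endpoint are written from memory and should be verified against your OpenPAI version, and the address/token are placeholders.

```bash
# Rough sketch of submitting the 250-task stress job via the rest-server.
cat > stress-250-tasks.yaml <<'EOF'
protocolVersion: 2
name: stress-250-tasks
type: job
prerequisites:
  - name: image
    type: dockerimage
    uri: ubuntu:18.04
taskRoles:
  taskrole:
    instances: 250            # 1000 / 5000 for the other cases
    dockerImage: image
    resourcePerInstance:
      cpu: 1
      memoryMB: 512
      gpu: 0
    commands:
      - sleep 3600            # keep tasks alive so the job detail stays large
EOF

curl -X POST \
     -H "Authorization: Bearer REPLACE_WITH_API_TOKEN" \
     -H "Content-Type: text/yaml" \
     --data-binary @stress-250-tasks.yaml \
     "http://pai-master.example.com/rest-server/api/v2/jobs"
```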

250 tasks

| get detail of this job | list job | get detail of other job | submit a job |
| --- | --- | --- | --- |
| 186 ms ± 71.5 ms | 168 ms ± 45.6 ms | 112 ms ± 42.2 ms | 396 ms ± 61.4 ms |

1000 tasks

| get detail of this job | list job | get detail of other job | submit a job |
| --- | --- | --- | --- |
| 204 ms ± 59.8 ms | 140 ms ± 93.9 ms | 158 ms ± 93.4 ms | 556 ms ± 202 ms |

5000 tasks

| get detail of this job | list job | get detail of other job | submit a job |
| --- | --- | --- | --- |
| 552 ms ± 109 ms | 180 ms ± 131 ms | 157 ms ± 85.4 ms | 496 ms ± 134 ms |

In real use, users will also experience a large transfer time, because the job detail JSON is now 8 MB+.
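
One way to see this transfer cost is to let curl report both payload size and total time for the job detail request. The endpoint, address, token, and job name below are placeholders, as before.

```bash
# Sketch: check how large the job detail response is and how long it takes.
curl -s -o /dev/null \
     -w 'size: %{size_download} bytes, total time: %{time_total} s\n' \
     -H "Authorization: Bearer REPLACE_WITH_API_TOKEN" \
     "http://pai-master.example.com/rest-server/api/v2/jobs/someuser~stress-5000-tasks"
```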

Problems found: tasks have unexpected retries, caused by #4841

Large number of jobs

Quick test: 1000 jobs finish in 410 s / 522 s, so the throughput is about 2 jobs/second. The DB controller is not the bottleneck.

Submit 2/10 jobs per second for 1 hour; each job finishes immediately. Check whether this causes cluster instability, and whether we can still submit new jobs and view other jobs' details. A sketch of the load generator follows.
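
The sustained submission load can be produced with a loop like this. `submit_job` is a hypothetical helper standing in for whatever submission path you use (e.g. the curl POST sketched earlier); it is not a real command.

```bash
# Sketch of the sustained-load generator: RATE jobs per second for one hour.
RATE=2                          # use 10 for the heavier case
END=$(( $(date +%s) + 3600 ))   # one hour from now
while [ "$(date +%s)" -lt "$END" ]; do
  for i in $(seq 1 "$RATE"); do
    submit_job "stress-$(date +%s)-$i" &   # fire-and-forget; each job exits immediately
  done
  sleep 1
done
wait
```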

2 jobs/second for 1 hour

During submission:

| list job | get job detail | submit a job |
| --- | --- | --- |
| 367 ms ± 85 ms | 319 ms ± 140 ms | 500 ms ± 125 ms |

10 jobs/second for 1 hour

During submission:

| list job | get job detail | submit a job |
| --- | --- | --- |
| 1.19 s ± 895 ms | 623 ms ± 477 ms | 7.32 s ± 2.92 s |

Problems found:

  1. DB controller memory issue and concurrency issue: Fix scale problem for database controller #4845
  2. Too many jobs cause Cannot launch job because of BarrierNotPassed #4833

Job with a large task number and large retry times

Submit 1 job with 250 tasks and 100 retries.

| list job | get job detail | submit a job |
| --- | --- | --- |

Problems found:

Cannot view retry history of jobs with large task number and large retry times #4846

Failure Test

Please launch a dev-box first, and stop all services.
Back up the existing data with `sudo cp -r /mnt/paiInternal /mnt/paiInternalBak` on the master node.

  1. Shut down the database with `./paictl.py service stop -n postgresql`, and wait for a while.
    Expect: we cannot query or submit jobs; other services don't fail. Please record the error messages.

Error screenshots were recorded for: view job list, submit job, refresh job detail, and new job detail.

Start the database with `./paictl.py service start -n postgresql`.
All functions should return to normal after a while.

  2. Go to the master node and kill the corresponding process (a combined sketch follows this step's expected result):
  • postgresql: use `ps aux | grep postgres` to find it
  • write-merger/framework-watcher/db-poller: `ps aux | grep write-merger`; `ps aux | grep watcher/framework`; `ps aux | grep poller/index`
  • rest-server: `ps aux | grep 'node index.js'`
  • api server: `ps aux | grep kube-api`
  • framework controller: `ps aux | grep frameworkcontroller`

Expect: All functions should return to normal after a while.
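
As an illustration of this step, the kills can be scripted roughly as below. The patterns mirror the grep commands above and `pkill -f` is just one way to do it; double-check what each pattern matches with `ps aux | grep` before sending signals.

```bash
# Sketch of killing the database-backend processes on the master node.
sudo pkill -f write-merger          # write merger
sudo pkill -f watcher/framework     # framework watcher
sudo pkill -f poller/index          # db poller
sudo pkill -f 'node index.js'       # rest-server
# postgresql / kube-apiserver / frameworkcontroller can be killed the same
# way once identified with ps aux | grep.
```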

  3. Data destroying test

    Step 1: Submit a long-running job in OpenPAI.
    Step 2: Destroy all database data: go to the master node and remove, or randomly delete, files in /mnt/paiInternal.
    Step 3: Restart the PAI cluster with `./paictl.py service stop` and `./paictl.py service start`.
    Expect: The cluster should be OK. All previous job data are lost, but you can still find the long-running job in the webportal.
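
Condensed, the destroy-and-restart sequence on the master node looks roughly like this. It is destructive by design; only run it on a test cluster, and only after the backup described above.

```bash
# Condensed form of the data-destroying test (master node, backup assumed).
sudo rm -rf /mnt/paiInternal/*      # or randomly delete a subset of files
./paictl.py service stop
./paictl.py service start
```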

hzy46 self-assigned this Aug 17, 2020
yiyione mentioned this issue Aug 27, 2020

hzy46 commented Sep 1, 2020

After the stress test, I raised the heap memory limit for the write merger to 2 GB, the watcher to 8 GB, and the poller to 4 GB.

I believe this is enough to handle 30000 active jobs.
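
For reference, a Node.js heap ceiling is usually raised with `--max-old-space-size` (value in MB). The entrypoint file names below are placeholders; the actual wiring lives in the database controller's service configuration.

```bash
# Sketch of how the per-process heap ceilings map to Node.js flags (values in MB).
node --max-old-space-size=2048 write-merger.js        # write merger: 2 GB
node --max-old-space-size=8192 framework-watcher.js   # watcher: 8 GB
node --max-old-space-size=4096 poller.js              # poller: 4 GB
# or, equivalently, via the environment:
# NODE_OPTIONS=--max-old-space-size=8192 node framework-watcher.js
```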
