
Database backend - stress test and failure test plan #4818

hzy46 opened this issue Aug 17, 2020 · 1 comment

hzy46 commented Aug 17, 2020

Test Environment

The cluster has 10000 existing jobs and about 10 nodes with 50 GPUs. The Hived scheduler is enabled.

In each case, we measure the latency of listing jobs, getting job detail, and submitting a job (fire 10 requests and compute the average latency). All list requests use an offset and a limit of 20.

If there is no load, the latency is about:

| list job | get job detail | submit a job |
| --- | --- | --- |
| 54.1 ms ± 17.2 ms | 56.6 ms ± 29.2 ms | 343 ms ± 125 ms |
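
For reference, the numbers in this issue can be reproduced with a simple loop like the one below. This is only a sketch: the rest-server address, token, and exact API path are placeholders and may need adjusting for your deployment.

```bash
# Minimal sketch of the measurement method: fire 10 requests and average.
# REST_URL, TOKEN and the /api/v2/jobs path are placeholders/assumptions.
REST_URL="http://pai-master.example.com/rest-server"   # replace with your address
TOKEN="REPLACE_WITH_API_TOKEN"

total=0
for i in $(seq 1 10); do
  # %{time_total} is curl's end-to-end time for one request, in seconds
  t=$(curl -s -o /dev/null -w '%{time_total}' \
        -H "Authorization: Bearer $TOKEN" \
        "$REST_URL/api/v2/jobs?offset=0&limit=20")
  total=$(echo "$total + $t" | bc -l)
done
echo "average list-job latency: $(echo "$total / 10" | bc -l) s"
```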

Stress Test

Job with a large task number

Submit 1 job with 250/1000/5000 tasks, open 20+ job detail pages, and check whether this causes cluster instability.
Also check whether we can still submit new jobs and view other jobs' details.
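
A job with a large task number is just a normal job whose task role has many instances. The sketch below shows roughly what such a submission looks like; the job-protocol fields and the submit endpoint are written from memory and should be verified against your OpenPAI version, and the address/token are placeholders.

```bash
# Rough sketch of submitting the 250-task stress job via the rest-server.
cat > stress-250-tasks.yaml <<'EOF'
protocolVersion: 2
name: stress-250-tasks
type: job
prerequisites:
  - name: image
    type: dockerimage
    uri: ubuntu:18.04
taskRoles:
  taskrole:
    instances: 250            # 1000 / 5000 for the other cases
    dockerImage: image
    resourcePerInstance:
      cpu: 1
      memoryMB: 512
      gpu: 0
    commands:
      - sleep 3600            # keep tasks alive so the job detail stays large
EOF

curl -X POST \
     -H "Authorization: Bearer REPLACE_WITH_API_TOKEN" \
     -H "Content-Type: text/yaml" \
     --data-binary @stress-250-tasks.yaml \
     "http://pai-master.example.com/rest-server/api/v2/jobs"
```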

250 tasks

| get detail of this job | list job | get detail of other job | submit a job |
| --- | --- | --- | --- |
| 186 ms ± 71.5 ms | 168 ms ± 45.6 ms | 112 ms ± 42.2 ms | 396 ms ± 61.4 ms |

1000 tasks

| get detail of this job | list job | get detail of other job | submit a job |
| --- | --- | --- | --- |
| 204 ms ± 59.8 ms | 140 ms ± 93.9 ms | 158 ms ± 93.4 ms | 556 ms ± 202 ms |

5000 tasks

| get detail of this job | list job | get detail of other job | submit a job |
| --- | --- | --- | --- |
| 552 ms ± 109 ms | 180 ms ± 131 ms | 157 ms ± 85.4 ms | 496 ms ± 134 ms |

In real use, users will also experience a large transfer time, because the job detail JSON is now 8 MB+.
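
One way to see this transfer cost is to let curl report both payload size and total time for the job detail request. The endpoint, address, token, and job name below are placeholders, as before.

```bash
# Sketch: check how large the job detail response is and how long it takes.
curl -s -o /dev/null \
     -w 'size: %{size_download} bytes, total time: %{time_total} s\n' \
     -H "Authorization: Bearer REPLACE_WITH_API_TOKEN" \
     "http://pai-master.example.com/rest-server/api/v2/jobs/someuser~stress-5000-tasks"
```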

Problems found: tasks have unexpected retries, caused by #4841

Large number of jobs

Quick test: 1000 jobs finish in 410 s / 522 s, so the throughput is about 2 jobs/second. The DB controller is not the bottleneck.

Submit 2/10 jobs per second for 1 hour; each job finishes immediately. Check whether this causes cluster instability, and whether we can still submit new jobs and view other jobs' details. A sketch of the load generator follows.
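
The sustained submission load can be produced with a loop like this. `submit_job` is a hypothetical helper standing in for whatever submission path you use (e.g. the curl POST sketched earlier); it is not a real command.

```bash
# Sketch of the sustained-load generator: RATE jobs per second for one hour.
RATE=2                          # use 10 for the heavier case
END=$(( $(date +%s) + 3600 ))   # one hour from now
while [ "$(date +%s)" -lt "$END" ]; do
  for i in $(seq 1 "$RATE"); do
    submit_job "stress-$(date +%s)-$i" &   # fire-and-forget; each job exits immediately
  done
  sleep 1
done
wait
```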

2 jobs/second for 1 hour

During submission:

| list job | get job detail | submit a job |
| --- | --- | --- |
| 367 ms ± 85 ms | 319 ms ± 140 ms | 500 ms ± 125 ms |

10 jobs/second for 1 hour

During submission:

| list job | get job detail | submit a job |
| --- | --- | --- |
| 1.19 s ± 895 ms | 623 ms ± 477 ms | 7.32 s ± 2.92 s |

Problems found:

  1. DB controller memory issue and concurrency issue: Fix scale problem for database controller #4845
  2. Too many jobs cause Cannot launch job because of BarrierNotPassed #4833

Job with a large task number and large retry times

Submit 1 job with 250 tasks and 100 retries.

| list job | get job detail | submit a job |
| --- | --- | --- |

Problems found:

Cannot view retry history of jobs with large task number and large retry times #4846

Failure Test

Please launch a dev-box first, and stop all services.
Back up the existing data with `sudo cp -r /mnt/paiInternal /mnt/paiInternalBak` on the master node.

  1. Shut down the database with `./paictl.py service stop -n postgresql`, and wait for a while.
    Expect: we cannot query or submit jobs; other services don't fail. Please record the error messages.

Error screenshots were recorded for: view job list, submit job, refresh job detail, and new job detail.

Start the database with `./paictl.py service start -n postgresql`.
All functions should return to normal after a while.

  2. Go to the master node and kill the corresponding process (a combined sketch follows this step's expected result):
  • postgresql: use `ps aux | grep postgres` to find it
  • write-merger/framework-watcher/db-poller: `ps aux | grep write-merger`; `ps aux | grep watcher/framework`; `ps aux | grep poller/index`
  • rest-server: `ps aux | grep 'node index.js'`
  • api server: `ps aux | grep kube-api`
  • framework controller: `ps aux | grep frameworkcontroller`

Expect: All functions should return to normal after a while.
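
As an illustration of this step, the kills can be scripted roughly as below. The patterns mirror the grep commands above and `pkill -f` is just one way to do it; double-check what each pattern matches with `ps aux | grep` before sending signals.

```bash
# Sketch of killing the database-backend processes on the master node.
sudo pkill -f write-merger          # write merger
sudo pkill -f watcher/framework     # framework watcher
sudo pkill -f poller/index          # db poller
sudo pkill -f 'node index.js'       # rest-server
# postgresql / kube-apiserver / frameworkcontroller can be killed the same
# way once identified with ps aux | grep.
```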

  3. Data destroying test

    Step 1: Submit a long-running job in OpenPAI.
    Step 2: Destroy all database data: go to the master node and remove, or randomly delete, files in /mnt/paiInternal.
    Step 3: Restart the PAI cluster with `./paictl.py service stop` and `./paictl.py service start`.
    Expect: The cluster should be OK. All previous job data are lost, but you can still find the long-running job in the webportal.
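
Condensed, the destroy-and-restart sequence on the master node looks roughly like this. It is destructive by design; only run it on a test cluster, and only after the backup described above.

```bash
# Condensed form of the data-destroying test (master node, backup assumed).
sudo rm -rf /mnt/paiInternal/*      # or randomly delete a subset of files
./paictl.py service stop
./paictl.py service start
```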

hzy46 self-assigned this Aug 17, 2020
yiyione mentioned this issue Aug 27, 2020

hzy46 commented Sep 1, 2020

After the stress test, I raised the heap memory limit for the write merger to 2 GB, the watcher to 8 GB, and the poller to 4 GB.

I believe this is enough to handle 30000 active jobs.
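
For reference, a Node.js heap ceiling is usually raised with `--max-old-space-size` (value in MB). The entrypoint file names below are placeholders; the actual wiring lives in the database controller's service configuration.

```bash
# Sketch of how the per-process heap ceilings map to Node.js flags (values in MB).
node --max-old-space-size=2048 write-merger.js        # write merger: 2 GB
node --max-old-space-size=8192 framework-watcher.js   # watcher: 8 GB
node --max-old-space-size=4096 poller.js              # poller: 4 GB
# or, equivalently, via the environment:
# NODE_OPTIONS=--max-old-space-size=8192 node framework-watcher.js
```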
