Multi-RBD performance does not scale up well as fio-rbd #939

Open
xin3liang opened this issue Nov 11, 2024 · 7 comments

@xin3liang
Contributor

xin3liang commented Nov 11, 2024

We ran 4k random read/write performance tests on the testbed below and found that NVMe-oF gateway performance across multiple RBD images does not scale as well as fio-rbd.
image
image
image

Hardware

  • Arm CPU: Kunpeng 920, 2.6GHz, 96 CPU cores, 4 numa nodes
  • X86 CPU: Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz, 112 CPU cores, 2 numa nodes
  • Disk: 3 x ES3000 V6 NVMe SSD 3.2T per Arm server
  • Network: 1 x MLNX ConnectX-5 100Gb IB, 1 x 1G TCP

Software

  • OS: openEuler 22.03 LTS SP3, kernel 5.10.0-192.0.0.105.oe2203sp3
  • Ceph: main-nvmeof (requires reverting the commit "nvmeof gw monitor: disable by default")
  • SPDK: 24.05
  • nvmeof: 1.3.2
  • fio: fio-3.29

Deployment and Parameters Tuning

  • Deploy the NVMe-oF gateway and the Ceph cluster with cephadm.
  • To get good backend Ceph performance for testing the NVMe-oF gateway, we set a larger pg_num, set the replica size to 1, and bind each OSD to 4 cores.
  • Rebuild the nvmeof image without the "--enable-debug" option, as a release-type build.
  • Tune the CPU core masks of the SPDK NVMe-oF target and the Ceph client so that their threads are bound to the same NUMA node as the 100Gbit NIC.
  • Make sure there are enough CPU cores for the NVMe-oF target and Ceph client threads so that they do not hit a CPU bottleneck.
Each OSD is bound to 4 cores
Set Ceph size=1 pg_num=16384
# Note: on the nvmeof gw, increase the msg and io enqueue threads; bind the ceph client and spdk tgt to the same NUMA node as the NIC
ms_async_op_threads = 9   # 3->9
librados_thread_count = 10 # 2->10
x86 (4 spdk cores, all threads in NUMA node 1, NIC in NUMA node 1)
tgt_cmd_extra_args = "-m 0xF0000000"
librbd_core_mask = 0xFFFFFFF0000000FFFFFF00000000
arm (6 spdk cores, all threads in NUMA nodes 2 and 3, NIC in NUMA node 2)
tgt_cmd_extra_args = "-m 0x3F000000000000"
librbd_core_mask = 0xFFFFFFFFFFC0000000000000
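
The hex core masks above are hard to read at a glance. As a minimal sketch (the helper names `cores_in_mask` and `as_ranges` are made up for illustration and are not part of SPDK or ceph-nvmeof), the following Python decodes each mask into the CPU core ranges it selects, assuming bit N of the mask stands for core N:

```python
def cores_in_mask(mask: str) -> list[int]:
    """Return the CPU core indices whose bits are set in a hex core mask."""
    value = int(mask, 16)  # base 16 accepts an optional "0x" prefix
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def as_ranges(cores: list[int]) -> str:
    """Collapse a sorted core list into a compact string such as '32-55,84-111'."""
    ranges: list[list[int]] = []
    for core in cores:
        if ranges and core == ranges[-1][1] + 1:
            ranges[-1][1] = core          # extend the current run
        else:
            ranges.append([core, core])   # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


if __name__ == "__main__":
    masks = {
        "x86 spdk":   "0xF0000000",
        "x86 librbd": "0xFFFFFFF0000000FFFFFF00000000",
        "arm spdk":   "0x3F000000000000",
        "arm librbd": "0xFFFFFFFFFFC0000000000000",
    }
    for name, mask in masks.items():
        print(f"{name}: cores {as_ranges(cores_in_mask(mask))}")
```

Under that assumption, the x86 masks decode to cores 28-31 for SPDK and 32-55,84-111 for librbd, and the Arm masks decode to cores 48-53 for SPDK and 54-95 for librbd. On each platform the SPDK and librbd core sets do not overlap.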

FYI, in case anyone is interested in the details of the hybrid x86 and Arm Ceph NVMe-oF gateway cluster deployment, please refer to the attached PDF:
Ceph SPDK NVMe-oF Gateway Evaluation on openEuler on openEuler (1).pdf

Fio Commands and Configs
We run the fio tests on the client node with the following command, using one of the two config files:
RW=randwrite BS=4k IODEPTH=128 fio ./[fio_test-rbd.conf|fio_test-nvmeof.conf] --numjobs=1

(.venv) [root@client1 spdktest]# cat fio_test-rbd.conf
[global]
#stonewall
description="Run ${RW} ${BS} rbd test"
bs=${BS}
ioengine=rbd
clientname=admin
pool=nvmeof
#pool=test-pool
thread=1
group_reporting=1
direct=1
verify=0
norandommap=1
time_based=1
ramp_time=10s
runtime=60m
iodepth=${IODEPTH}
rw=${RW}
#numa_cpu_nodes=0

[test-job1]
rbdname=fio_test_image1

[test-job2]
rbdname=fio_test_image2

[test-job3]
rbdname=fio_test_image3

[test-job4]
rbdname=fio_test_image4

[test-job5]
rbdname=fio_test_image5
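
When scaling the test to more RBD images, the per-image job sections above can be generated rather than written by hand. The sketch below is only an illustration: the function `rbd_job_sections` is hypothetical, and it assumes the images follow the same `fio_test_imageN` naming used in this config.

```python
def rbd_job_sections(num_images: int, image_prefix: str = "fio_test_image") -> str:
    """Build one fio job section per RBD image, numbered from 1."""
    lines = []
    for i in range(1, num_images + 1):
        lines.append(f"[test-job{i}]")
        lines.append(f"rbdname={image_prefix}{i}")
        lines.append("")  # blank line between sections
    return "\n".join(lines)


if __name__ == "__main__":
    # Append the output to the [global] section of fio_test-rbd.conf.
    print(rbd_job_sections(5))
```

Each generated section is a separate fio job, so with `ioengine=rbd` every image is driven by its own job, as in the hand-written config.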

(.venv) [root@client1 spdktest]# cat fio_test-nvmeof.conf
[global]
#stonewall
description="Run ${RW} ${BS} NVMe ssd test"
bs=${BS}
#ioengine=libaio
ioengine=io_uring
thread=1
group_reporting=1
direct=1
verify=0
norandommap=1
time_based=1
ramp_time=10s
runtime=1m
iodepth=${IODEPTH}
rw=${RW}
#numa_cpu_nodes=0

[test-job1]
#filename=/dev/nvme2n1
filename=/dev/nvme2n2

[test-job2]
#filename=/dev/nvme2n3
#filename=/dev/nvme2n4
filename=/dev/nvme4n1
#filename=/dev/nvme4n2

#[test-job3]
#filename=/dev/nvme2n5
##filename=/dev/nvme2n6
#
#[test-job4]
#filename=/dev/nvme2n7
##filename=/dev/nvme2n8
#
#[test-job5]
#filename=/dev/nvme2n9
##filename=/dev/nvme2n10
@xin3liang
Contributor Author

We noticed that one ceph-nvmeof gateway currently creates only one Ceph IO context (RADOS connection) to Ceph, whereas fio creates one Ceph IO context for each running job.

According to the two performance tuning guides below, a single Ceph IO context cannot serve read/write access to many RBD images well.
The RBD grouping strategy (one Ceph IO context per group of images) might help multi-RBD performance scale up.

See P9-10 of:
https://ci.spdk.io/download/2022-virtual-forum-prc/D2_4_Yue_A_Performance_Study_for_Ceph_NVMeoF_Gateway.pdf
Rbd Grouping Strategy:
https://www.intel.com/content/www/us/en/developer/articles/technical/performance-tuning-of-ceph-rbd.html

@xin3liang xin3liang changed the title Multi-rbd performance does not scale well as fio-rbd Multi-RBD performance does not scale well as fio-rbd Nov 11, 2024
@xin3liang xin3liang changed the title Multi-RBD performance does not scale well as fio-rbd Multi-RBD performance does not scale up well as fio-rbd Nov 11, 2024
@caroav
Collaborator

caroav commented Nov 11, 2024

We currently create a cluster context for every X images. This is configurable by the "bdevs_per_cluster" parameter as in ceph-nvmeof.conf. Note that currently this is done per ANA group (and it had some reasons related to failback and blocklisting), but we are going to make it flat again. So you can set this to 1 if you want 1 Ceph IO context per image, or more.
FYI @oritwas @leonidc @baum
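
To make the effect of "bdevs_per_cluster" concrete, here is a simplified model in Python. It is a sketch under stated assumptions, not the gateway's actual allocation code: it assumes images are assigned to cluster contexts in creation order, it ignores the per-ANA-group behavior mentioned above, and the function names and the value 32 are chosen only for illustration.

```python
import math


def cluster_count(num_images: int, bdevs_per_cluster: int) -> int:
    """Number of Ceph cluster contexts needed for num_images (simplified model)."""
    return math.ceil(num_images / bdevs_per_cluster)


def cluster_index(image_index: int, bdevs_per_cluster: int) -> int:
    """Which cluster context the image at a 0-based index would share."""
    return image_index // bdevs_per_cluster


if __name__ == "__main__":
    for per_cluster in (32, 1):
        n = cluster_count(5, per_cluster)
        mapping = [cluster_index(i, per_cluster) for i in range(5)]
        print(f"bdevs_per_cluster={per_cluster}: {n} context(s), image->context {mapping}")
```

In this model, 5 images with a value of 32 all share one context, while a value of 1 gives each image its own context, which mirrors the one-context-per-job behavior of fio-rbd.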

@xin3liang
Contributor Author

We currently create a cluster context for every X images. This is configurable by the "bdevs_per_cluster" parameter as in ceph-nvmeof.conf. Note that currently this is done per ANA group (and it had some reasons related to failback and blocklisting), but we are going to make it flat again. So you can set this to 1 if you want 1 Ceph IO context per image, or more. FYI @oritwas @leonidc @baum

Sounds good, thanks @caroav. I will give it a try.
BTW, I think the configurable parameters in ceph-nvmeof.conf should all be documented somewhere.

@caroav
Collaborator

caroav commented Nov 11, 2024

BTW, regarding the configurable parameters in ceph-nvmeof.conf we might need to document all of them somewhere, I think.

Yes, I need to update the entire upstream nvmeof documentation. I will do it soon.

@xin3liang
Contributor Author

After setting bdevs_per_cluster = 1, performance now scales. Thanks.
Note: in the test data below, SPDK uses 16 CPU cores.

Image
Image

@xin3liang
Contributor Author

xin3liang commented Jan 13, 2025

We encountered an issue where performance drops rapidly as the number of RBD images grows, which might impact the scalability of nvmeof-gw.
For more details, see the SPDK issue: spdk/spdk#3547

Image

@xin3liang
Contributor Author

FYI, binding 5 CPUs per cluster (cpus_per_cluster=5) for the Ceph client threads gives the best IOPS performance.
Test script: run-n-bdev-rbd-test2.sh
$RPC bdev_rbd_register_cluster $cluster --core-mask "$cpu_list" cpus_per_cluster=5
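
The test script itself is not shown in this thread, so the following is only a guess at how a per-cluster core mask could be derived. It splits a list of CPU cores into consecutive groups of `cpus_per_cluster` and turns each group into a hex mask; the function names are hypothetical and the logic may differ from what run-n-bdev-rbd-test2.sh actually does.

```python
def cores_to_mask(cores: list[int]) -> str:
    """Return a hex mask with one bit set per CPU core index."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return hex(mask)


def per_cluster_masks(cores: list[int], cpus_per_cluster: int) -> list[str]:
    """Split cores into consecutive groups and build one mask per group."""
    return [
        cores_to_mask(cores[i:i + cpus_per_cluster])
        for i in range(0, len(cores), cpus_per_cluster)
    ]


if __name__ == "__main__":
    # Ten cores with 5 CPUs per cluster yield two masks: cores 0-4 and cores 5-9.
    print(per_cluster_masks(list(range(10)), 5))  # ['0x1f', '0x3e0']
```

Each resulting mask could then be passed as the `--core-mask` value when registering the corresponding cluster.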

Image
