Performance issues - multi core? #116

Closed

jkaberg opened this issue Jun 1, 2017 · 20 comments

@jkaberg commented Jun 1, 2017

Just tried the 1.3 release and I'm seeing lower transfer numbers (roughly 50-60 MB/s) on an HDD/ZFS pool - speeds are usually around 110 MB/s.

CPU supports AES-NI (24 cores)

grep 'model name' /proc/cpuinfo
model name      : Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
<...>
grep aes /proc/cpuinfo | wc -l
24

gocryptfs speedtest

./gocryptfs -speed
AES-GCM-256-OpenSSL      226.92 MB/s
AES-GCM-256-Go           376.53 MB/s    (selected in auto mode)
AES-SIV-512-Go            87.52 MB/s

With the filesystem mounted and a larger transfer (10 GB) running, I notice one core is under full load, but no additional cores get used.

Is gocryptfs (or the encryption process itself) limited to one core? If so, consider this a feature request for multi-core encryption 😄

If not, any ideas what the bottleneck might be?

@rfjakob (Owner) commented Jun 1, 2017 via email

@jkaberg (Author) commented Jun 1, 2017

@rfjakob I have only tested 1.3 so far; the transfer was done with rsync -avP --progress source target. So it seems my single-core performance is too slow (at least not as fast as the HDDs).

Is it possible to do the encryption in parallel to utilize more cores?

@rfjakob (Owner) commented Jun 1, 2017 via email

@jkaberg (Author) commented Jun 1, 2017

@rfjakob

  1. yeah, 100%
  2. the underlying storage is a ZFS Raidz2 pool with 11 x 4TB SATA3 HGST drives.

Normal transfers (e.g. from the ZFS pool to the same ZFS pool) hit a steady 110-120 MB/s with the same rsync command as above - just not to a gocryptfs mount point on that very same ZFS pool.

@Nodens- commented Jun 1, 2017

This sounds like poor random read/write performance due to raidz2 parity overhead on top of the encryption overhead. Quite possibly the non-fixed stripe sizes of raidz in combination with FUSE are also a factor.

Anything abnormal showing up on iotop?

rfjakob added a commit that referenced this issue Jun 1, 2017
Collect all the plaintext and pass everything to contentenc in
one call.

This will allow easier parallelization of the encryption.

#116
@rfjakob (Owner) commented Jun 6, 2017

@jkaberg A difference between plain rsync and rsync+gocryptfs is that gocryptfs writes the data in 128KB blocks, while rsync probably uses bigger blocks. This is a FUSE limitation - the kernel always splits the data into 128KB blocks.

What throughput do you get when you write to the ZFS with 128KB blocks? Like this:

dd if=/dev/zero of=YOURZFSMOUNT/zero bs=128k

Then, to find out why we are running at 100% CPU: Can you post a cpu profile of gocryptfs? Mount with this option:

gocryptfs -cpuprofile /tmp/cpu.prof

then run the rsync and unmount. Thanks, Jakob
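
(Side note, not required: the resulting profile can also be inspected locally with the standard Go pprof tool, e.g. go tool pprof gocryptfs /tmp/cpu.prof, and "top" at the interactive prompt shows the hottest functions - the binary path here is just illustrative.)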

@Nodens- commented Jun 6, 2017

This is what I meant by stripe sizes in combination with FUSE. The 128 kB block size is probably what is bottlenecking. In that case, compiling a custom kernel with FUSE_MAX_PAGES_PER_REQ set higher than 32 may help alleviate the issue.
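
(For reference: on kernels of this era that limit is a compile-time constant, roughly #define FUSE_MAX_PAGES_PER_REQ 32 in fs/fuse/fuse_i.h, i.e. 32 pages × 4 KiB = 128 kB per request on x86 - the exact location may differ between kernel versions.)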

@rfjakob (Owner) commented Jun 7, 2017

Yes, increasing FUSE_MAX_PAGES_PER_REQ should increase the throughput. However, this is not something I can ask of users.

So I think behaving like dd bs=128k is the best we can do. Being pegged at 100% CPU is probably what keeps us from getting there - but let's see what the CPU profile says.

@jkaberg (Author) commented Jun 7, 2017

@rfjakob Here's the output (/media/xfiles is the ZFS mountpoint)

root@gunder:/media/xfiles# dd if=/dev/zero of=/media/xfiles/zero bs=128k count=10000 conv=fdatasync
10000+0 records in
10000+0 records out
1310720000 bytes (1.3 GB) copied, 0.859343 s, 1.5 GB/s
root@gunder:/media/xfiles# ./gocryptfs encrypted/ unencrypted/
Password:
root@gunder:/media/xfiles# dd if=/dev/zero of=/media/xfiles/unencrypted/zero bs=128k count=10000 conv=fdatasync
10000+0 records in
10000+0 records out
1310720000 bytes (1.3 GB) copied, 9.93545 s, 132 MB/s
root@gunder:/media/xfiles# fusermount -u unencrypted/
root@gunder:/media/xfiles# ./gocryptfs -cpuprofile /tmp/cpu.prof encrypted/ unencrypted/
Writing CPU profile to /tmp/cpu.prof
Note: You must unmount gracefully, otherwise the profile file(s) will stay empty!
Password:
root@gunder:/media/xfiles# dd if=/dev/zero of=/media/xfiles/unencrypted/zero bs=128k count=10000 conv=fdatasync
10000+0 records in
10000+0 records out
1310720000 bytes (1.3 GB) copied, 10.0026 s, 131 MB/s
root@gunder:/media/xfiles# fusermount -u unencrypted/

The cpu profile can be found here

The strange thing is that rsync (using flags -avP) to the same unencrypted/ mount somehow tops out at around 60 MB/s.

@rfjakob (Owner) commented Jun 10, 2017

The CPU profile (rendered as PDF: pprof001.svg.pdf) shows that we spend our time on:

36.8%    gcmAesEnc
14.2%    syscall.Pwrite
 6.9%    nonceGenerator.Get

I have already sped up nonceGenerator.Get quite a bit in 80516ed. We cannot do anything about the pwrite syscall. That leaves gcmAesEnc. My benchmarks suggest that we can get a big improvement by parallelizing the encryption: results.txt
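
To illustrate the direction, here is a minimal sketch of per-block parallel encryption. This is not the actual gocryptfs code: encryptBlocks, the 4 KiB block size and the goroutine-per-block fan-out are placeholders (the real thing would cap the number of workers):

// parallel_sketch.go - illustrative only, not the gocryptfs implementation.
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"sync"
)

const blockSize = 4096 // plaintext bytes per block; placeholder value

// encryptBlocks seals every plaintext block on its own goroutine and
// prepends the per-block random nonce to the ciphertext.
func encryptBlocks(gcm cipher.AEAD, blocks [][]byte) [][]byte {
	out := make([][]byte, len(blocks))
	var wg sync.WaitGroup
	for i, b := range blocks {
		wg.Add(1)
		go func(i int, b []byte) {
			defer wg.Done()
			nonce := make([]byte, gcm.NonceSize())
			rand.Read(nonce) // error handling omitted in this sketch
			out[i] = gcm.Seal(nonce, nonce, b, nil)
		}(i, b)
	}
	wg.Wait()
	return out
}

func main() {
	key := make([]byte, 32)
	rand.Read(key)
	c, _ := aes.NewCipher(key)
	gcm, _ := cipher.NewGCM(c)
	plain := [][]byte{make([]byte, blockSize), make([]byte, blockSize)}
	_ = encryptBlocks(gcm, plain)
}

The nice property is that each block carries its own nonce and authentication tag anyway, so the blocks can be sealed independently and in parallel.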

@rfjakob (Owner) commented Jun 10, 2017

On a 4-core, 8-thread machine (Xeon E31245) we get a superlinear (!!) improvement by switching from one to two threads:

Benchmark1_gogcm-8            	    5000	    282694 ns/op	 463.65 MB/s
Benchmark2_gogcm-8            	   20000	     99704 ns/op	1314.60 MB/s

@jkaberg (Author) commented Jun 10, 2017

Impressive numbers and work, @rfjakob. Would you mind publishing a build for me to test (Linux amd64)?

@jkaberg (Author) commented Jun 10, 2017

Also, good news from libfuse: https://github.com/libfuse/libfuse/releases/tag/fuse-3.0.2

"Internal: calculate request buffer size from page size and kernel page limit instead of using hardcoded 128 kB limit." (libfuse/libfuse@4f8f034)

This should help speed things up a bit 😄

@rfjakob (Owner) commented Jun 10, 2017

The numbers I posted are from a synthetic benchmark (https://github.com/rfjakob/gocryptfs-microbenchmarks); I'm working on getting it into gocryptfs. I probably won't get the same improvement in gocryptfs due to the FUSE overhead. Will keep you updated here!

The page size thing, unfortunately, only applies to architectures other than x86. I believe arm64 and powerpc have a bigger page size, so they would get much bigger blocks.
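
(Worked out: with FUSE_MAX_PAGES_PER_REQ = 32, x86 and its 4 KiB pages give 32 × 4 KiB = 128 kB per request; a kernel built with 64 KiB pages, as some arm64 and powerpc configurations are, would give 32 × 64 KiB = 2 MiB.)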

@rfjakob (Owner) commented Jun 11, 2017

I have added two-way encryption parallelism. If you can test, here is the latest build:
gocryptfs_v1.3-70-gafc3a82_linux-static_amd64.tar.gz

@jkaberg (Author) commented Jun 11, 2017

@rfjakob Indeed, I'm seeing on average a 20 MB/s increase (with rsync). Very nice! 😄

I did a CPU profile for you as well: https://cloud.eth0.im/s/jEwCnsLJElFz8E0

While running the rsync job I noticed my CPU usage does not go above 130%. From the commit messages I reckon you limited it to 2 encryption threads - do you think bumping up that value would make a difference?

@rfjakob (Owner) commented Jun 11, 2017

Great, thanks! Rendered CPU profile: pprof002.svg.pdf

I saw about a 20% increase in my testing, and to be honest, I was a bit underwhelmed. It turns out that the encryption threads often get scheduled to the same core. This gets worse with more threads, which is why I have limited it to two-way parallelism for now.

@rfjakob (Owner) commented Jun 20, 2017

@jkaberg If you want to try again, 3c6fe98 should give it another boost. This gets rid of most of the garbage collection overhead by re-using temporary buffers.
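
Roughly the idea, as a simplified sketch (not the actual gocryptfs code; the pool type and buffer size are placeholders):

// bufferpool_sketch.go - illustrative only, not the gocryptfs implementation.
package main

import (
	"fmt"
	"sync"
)

// Pool of reusable scratch buffers. Handing a buffer back after use avoids
// allocating (and later garbage-collecting) a fresh slice for every block.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 4096) }, // placeholder size
}

// processBlock is a stand-in for encrypting one block into a pooled buffer.
func processBlock(src []byte) {
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf) // return the buffer once we are done with it
	n := copy(buf, src)    // the real code would encrypt into buf instead
	fmt.Println("processed", n, "bytes")
}

func main() {
	processBlock(make([]byte, 4096))
}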

@rfjakob (Owner) commented Jul 1, 2017

I think this can be closed - check out the performance history at performance.txt, the last commits gave us quite a boost.

rfjakob closed this as completed Jul 1, 2017
@jkaberg (Author) commented Jul 3, 2017

@rfjakob Thanks. I'll have a go when I'm back from vacation 😄
