-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Increase the input block size for bgzip. #1768
Conversation
2270f34
to
aa6f354
Compare
Edited the buffer down to 256Kb instead of 1Mb as it seems to be sufficient (tested on tmpfs, fast NFS and lustre). |
Hmm, it's still variable! The effect of a bigger block size is more memcpy as we have fewer direct reads (as @daviesrob points out a readv can partially solve that as we can read direct to caller buff + remainder to look-ahead cache, but it doesn't fit with the backend semantics), but it also reduces the impact of many small reads due to small pipe size. Hence some machines it's slower to have a bigger block size, and it can have weird interactions with CPU load too which I cannot explain. Eg it's a win at -l0, but a penalty at -l1. It needs more head scratching probably. |
I retested this with different block sizes on a few systems. This time it was aggregate wall clock time on 100 trials of the same file, so cached and a lot of reproducability. Bgzip -l5, but our systems all have CPU frequency scaling on and the filesystems are sometimes shared so there are sources of error and randomness too. Even so, the charts are interesting. So 128k is enough on the Intel and 256k is enough on the AMD. I also tried an older Intel machine, but it just flatlined the CPU and made very little difference in buffer sizes. Trying -l1 instead changed it a little, but not significantly. I also tested reading an actual file rather than a pipe from Lustre on the AMD system and basically it was identical performance regardless of buffer size. This backs up my findings in the linked issue. Looking at CPU instead of elapsed time shows a small drop at around 128KB onwards, but not huge and not worth taking into account I suspect. However I think the 256k here is fine, as would 128kb so we should consider this PR again. We could also consider doing an fstat and only applying it on pipes. |
aa6f354
to
fdd6143
Compare
I think 128kB blocks are better, but would it be a good idea to put this in hFILE so all pipe users could benefit? I think it should probably only be for FIFOs though to avoid potential issues with over-reading on index-based jobs. The current limit of 32k in |
Commit e495718 changed bgzip from unix raw POSIX read() calls to hread(). Unfortunately hread gets its buffer size from stat of the input file descriptor, which can be 4kb for a pipe. We're reading 0xff00 bytes, so this ends up being split over two reads mostly, with one or both involving additional memcpys. This makes the buffered I/O worse performing than non-buffered. In the most extreme cases (cat data | bgzip -l0 > /dev/null) this is a two fold slow down. The easy solution is just to increase the buffer size to something sensible. Currently we play it cautiously and only do this on pipes and fifos. Fixes samtools#1767
fdd6143
to
186d21b
Compare
It's not possible to distinguish between normal pipes and named pipes (FIFOs), but I don't think that matters much and it still works OK on named pipes. The real problem is the buffer size is pathetic at 4096 according to stat. This is actually smaller than the standard linux pipe size. It's still internally cachine at 64kb, but we're reading it in small chunks. I suspece that's because it's the size the OS supports for atomic writes, but we're optimise for speed here. I've moved the change to hfile instead. |
On AMD EPYC 7713 aligning to cache size boundaries makes a very significant difference to fp->backend->read performance in the kernel. A modern Intel CPU did not demonstrate this difference. x86 often have cache line size of 64 bytes, and apple Arm chips 128 bytes. I haven't tested if Arm benefits from alignment during read calls, but we can check size with sysconf(_SC_LEVEL1_DCACHE_LINESIZE). However to avoid additional autoconfery I just picked 256 as it gives us headroom and is simple. Speed ups on the AMD EPYC: time bash -c 'for i in `seq 1 30`;do cat < ~/lustre/enwik9| ./bgzip -l5 -@32 > /dev/null;done' Unaligned real 0m45.012s user 10m7.661s sys 0m58.770s Aligned real 0m30.717s user 11m14.004s sys 0m32.921s It is likely this could improve other bits of code too.
63e1a18
to
d1be3c2
Compare
Commit e495718 changed bgzip from unix raw POSIX read() calls to hread(). Unfortunately hread gets its buffer size from stat of the input file descriptor, which can be 4kb for a pipe. We're reading 0xff00 bytes, so this ends up being split over two reads mostly, with one or both involving additional memcpys. This makes the buffered I/O worse performing than non-buffered. In the most extreme cases (cat data | bgzip -l0 > /dev/null) this is a two fold slow down.
The easy solution is just to increase the buffer size to something sensible. It's a little messy as we have to use hfile_internal.h to get hfile_set_blksize, but it works. I'm not sure why we didn't elect to make that API more public. Probably simply out of caution.
Fixes #1767