Reduce system call overhead of `copyFile` by increasing the buffer size to 128 KiB #69

nh2 · 2017-02-08T03:08:48Z

128 KiB is what coreutils cp uses, based on a benchmark provided here:
http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/ioblksize.h?id=c0a79542fb5c2c22cf0a250db94af6f8581ca342#n23

The benchmarks report speedups of factor 5x to 30x depending on
the hardware.

…ze to 128 KiB. 128 KiB is what coreutils `cp` uses, based on a benchmark provided here: http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/ioblksize.h?id=c0a79542fb5c2c22cf0a250db94af6f8581ca342#n23 The benchmarks report speedups of factor 5x to 30x depending on the hardware.

nh2 · 2017-02-08T03:09:11Z

I'd still like somebody to independently verify the gains.

nh2 · 2017-02-08T03:43:58Z

Also, there's some weird thing going on with write() syscalls and handle buffers:

write(3, "", 0)                         = 0 <0.000004>
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072 <0.000050>
read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072 <0.000023>
write(3, "", 0)                         = 0 <0.000005>
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072 <0.000050>
read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072 <0.000022>

This happens as soon as you set bufferSize > 8095.

With 8095 or smaller, it looks right like:

read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8096) = 8096 <0.000005>
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8095) = 8095 <0.000011>
read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8096) = 8096 <0.000005>
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8095) = 8095 <0.000009>

Rufflewind · 2017-02-08T04:35:29Z

Also, there's some weird thing going on with write() syscalls and handle buffers:

Have you tried disabling buffering perhaps?

erikd · 2017-02-08T04:55:43Z

Have you tried disabling buffering perhaps?

I can't seem to figure out whether hGteBuf uses FILE* handles for file descriptors underneath. The former are buffered and the latter are not.

nh2 · 2017-02-08T13:13:55Z

Have you tried disabling buffering perhaps?

@Rufflewind Yes, I had, to no avail; tried with NoBuffering, BlockBuffing Nothing, and BlockBuffering (Just 100000), no difference.

nh2 · 2017-02-08T13:55:03Z

What I find especially disturbing is that the write("") comes before the proper write.

nh2 · 2017-02-08T14:19:38Z

This leads me to the suspicion that this might be going on in wantWritableHandle / checkWriteableHandle.

nh2 · 2017-02-08T14:34:43Z

Minimal example:

import Foreign.Marshal.Alloc (allocaBytes)
import System.IO

main = do
  withBinaryFile "testfile2" WriteMode $ \ hTo -> do
    let bufferSize = 8096 -- write("") is gone when we use 8095 here
    allocaBytes bufferSize $ \buffer -> do
      hPutBuf hTo buffer bufferSize

In strace -f (evil write("") highlighted for emphasis):

open("testfile2", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3 <0.000007>
fstat(3, {st_mode=S_IFREG|0664, st_size=8096, ...}) = 0 <0.000005>
ftruncate(3, 0)                         = 0 <0.000027>
ioctl(3, TCGETS, 0x7fff86ea7f60)        = -1 ENOTTY (Inappropriate ioctl for device) <0.000006>

write(3, "", 0)                         = 0 <0.000004>

write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8096) = 8096 <0.000017>
close(3)                                = 0 <0.000030>

nh2 · 2017-02-08T16:16:04Z

OK, making some progress here. The 8096 comes from https://hackage.haskell.org/package/base-4.9.1.0/docs/src/GHC.IO.FD.html#dEFAULT_FD_BUFFER_SIZE

instance BufferedIO FD where
  newBuffer _dev state = newByteBuffer dEFAULT_FD_BUFFER_SIZE state

-- We used to use System.Posix.Internals.dEFAULT_BUFFER_SIZE, which is
-- taken from the value of BUFSIZ on the current platform.  This value
-- varies too much though: it is 512 on Windows, 1024 on OS X and 8192
-- on Linux.  So let's just use a decent size on every platform:
dEFAULT_FD_BUFFER_SIZE :: Int
dEFAULT_FD_BUFFER_SIZE = 8096

Does anybody know why 8096 was chosen? It's none of the numbers mentioned before, and it's not a power of 2.

nh2 · 2017-02-08T17:14:05Z

OK it's all clear now, it's a typo in the number indeed, will post an update shortly.

The write("")s come from bufWrite, where a > should be >= in if (size - w > count).

There's also an unrelated -- TODO: we should do a non-blocking flush here in there that should probably be fixed to not violate the non-blocking guarantees of hPutBuf'.

nh2 · 2017-02-09T00:25:03Z

OK, I've written this up as GHC bugs:

and filed a third slightly related

hPutBufNonBlocking can block

nh2 · 2017-02-09T00:25:43Z

This PR can now return to its original purpose, increasing the copyFile buffer size to 128 KB.

nh2 · 2017-02-09T02:12:42Z

I've done some quick time based benchmarking between 1024 and 131072 buffer size.

Preparing a 1GB file:

dd if=/dev/zero of=testfile bs=1M count=1000

Code:

main = copyFile "testfile" "testfile2"

(running on a version of base that has the bug hPutBuf issues unnecessary empty write syscalls for large writes fixed -- for benchmarking it without having to recompile base, I used this file for generating the below measurements).

Because of the 8096 handle buffer bug, I provide times for both 1024 and 8096/8 = 1012 (as the latter will be more efficient until that bug is fixed, for bufferSize < 8096; for larger buffer sizes I don't provide multiples of 1012 because it makes no difference as the Handle buffer is not used in that case).

Running with GHC 8.0.2 on Stackage nightly-2017-01-25, using ghc --make -O copytest.hs && time ./copytest.

bufferSize = 1012:

0.18user 0.87system 0:02.69elapsed 39%CPU (0avgtext+0avgdata 4352maxresident)k
0inputs+2048000outputs (0major+314minor)pagefaults 0swaps

0.16user 0.92system 0:02.62elapsed 41%CPU (0avgtext+0avgdata 4572maxresident)k
0inputs+2048000outputs (0major+320minor)pagefaults 0swaps

0.20user 0.86system 0:02.80elapsed 38%CPU (0avgtext+0avgdata 4360maxresident)k
0inputs+2048000outputs (0major+315minor)pagefaults 0swaps

0.19user 0.85system 0:02.61elapsed 40%CPU (0avgtext+0avgdata 4532maxresident)k
0inputs+2048000outputs (0major+319minor)pagefaults 0swaps

0.18user 0.98system 0:02.65elapsed 43%CPU (0avgtext+0avgdata 4472maxresident)k
0inputs+2048000outputs (0major+317minor)pagefaults 0swaps

bufferSize = 1024:

0.20user 0.86system 0:02.82elapsed 37%CPU (0avgtext+0avgdata 4420maxresident)k
0inputs+2048000outputs (0major+314minor)pagefaults 0swaps

0.22user 0.73system 0:02.68elapsed 35%CPU (0avgtext+0avgdata 4468maxresident)k
0inputs+2048000outputs (0major+320minor)pagefaults 0swaps

0.19user 0.88system 0:02.67elapsed 40%CPU (0avgtext+0avgdata 4576maxresident)k
0inputs+2048000outputs (0major+318minor)pagefaults 0swaps

0.17user 0.78system 0:02.60elapsed 36%CPU (0avgtext+0avgdata 4496maxresident)k
0inputs+2048000outputs (0major+318minor)pagefaults 0swaps

0.20user 0.79system 0:02.64elapsed 37%CPU (0avgtext+0avgdata 4400maxresident)k
0inputs+2048000outputs (0major+318minor)pagefaults 0swaps

bufferSize = 1024 * 8:

0.09user 0.83system 0:02.50elapsed 37%CPU (0avgtext+0avgdata 4396maxresident)k
0inputs+2048000outputs (0major+315minor)pagefaults 0swaps

0.08user 0.84system 0:02.97elapsed 31%CPU (0avgtext+0avgdata 4508maxresident)k
0inputs+2048000outputs (0major+320minor)pagefaults 0swaps

0.12user 0.78system 0:02.64elapsed 34%CPU (0avgtext+0avgdata 4408maxresident)k
0inputs+2048000outputs (0major+319minor)pagefaults 0swaps

0.12user 0.78system 0:02.44elapsed 36%CPU (0avgtext+0avgdata 4504maxresident)k
0inputs+2048000outputs (0major+321minor)pagefaults 0swaps

0.10user 0.77system 0:02.47elapsed 35%CPU (0avgtext+0avgdata 4544maxresident)k
0inputs+2048000outputs (0major+321minor)pagefaults 0swaps

bufferSize = 1024 * 128:

0.00user 0.78system 0:02.33elapsed 33%CPU (0avgtext+0avgdata 4656maxresident)k
0inputs+2048000outputs (0major+349minor)pagefaults 0swaps

0.00user 0.82system 0:02.33elapsed 35%CPU (0avgtext+0avgdata 4528maxresident)k
0inputs+2048000outputs (0major+349minor)pagefaults 0swaps

0.00user 0.76system 0:02.36elapsed 32%CPU (0avgtext+0avgdata 4524maxresident)k
0inputs+2048000outputs (0major+348minor)pagefaults 0swaps

0.01user 0.74system 0:02.33elapsed 32%CPU (0avgtext+0avgdata 4548maxresident)k
0inputs+2048000outputs (0major+345minor)pagefaults 0swaps

0.00user 0.71system 0:02.27elapsed 31%CPU (0avgtext+0avgdata 4520maxresident)k
0inputs+2048000outputs (0major+346minor)pagefaults 0swaps

As you can see, user CPU time went from 0.20 to 0.01 seconds, and system CPU time also improved slightly.

These measurements support that bufSize = 1024 * 128 indeed improves performance.

nh2 · 2017-02-09T03:25:17Z

I should also mention, this change makes copyFile as fast as cp on a ramdisk, while before it was twice as slow:

Before: 0.20user 0.38system 0:00.58elapsed 99%CPU (0avgtext+0avgdata 4472maxresident)k
After:  0.00user 0.35system 0:00.35elapsed 99%CPU (0avgtext+0avgdata 4708maxresident)k
cp:     0.00user 0.33system 0:00.33elapsed 99%CPU (0avgtext+0avgdata 2568maxresident)k

The only thing that's faster is sendfile():

sendfile in Haskell: 0.00user 0.28system 0:00.28elapsed 99%CPU (0avgtext+0avgdata 3528maxresident)k
probably even slightly faster in C

nh2 · 2017-02-09T12:42:27Z

To answer the other question:

I can't seem to figure out whether hGteBuf uses FILE* handles for file descriptors underneath.

@erikd It doesn't, Handles are inspired by libc's standard input/output streams, but don't use them. Instead, they implement their own similar logic directly.

Rufflewind · 2017-02-10T02:31:35Z

I'd still like somebody to independently verify the gains.

Tested this on tmpfs (ramdisk) and SSD on my Linux machine, and there was a speedup of ~60%. The results fluctuates a lot on my machine. Generally I expect it to different from OS to OS depending on how well the OS caches the I/O.

In any case, I don't doubt that this should be speed increase in the majority of cases given how large I/O caches have become these days. Will merge shortly.

Preparing a 1GB file:

It's probably better to use /dev/urandom rather than /dev/zero to avoid predictability.

The only thing that's faster is sendfile():

In the future, it'd be nice to use sendfile on platforms that do support it.

Rufflewind · 2017-02-10T02:40:35Z

Thanks for the PR and the benchmarks!

It's a bit slow due to the `copyFile` bug haskell/directory#69 and this GHC bug: https://ghc.haskell.org/trac/ghc/ticket/13246

Rufflewind added a commit to Rufflewind/directory that referenced this pull request Feb 10, 2017

Update changelog for haskell#69

56de0b3

Rufflewind merged commit 51a0f72 into haskell:master Feb 10, 2017

nh2 added a commit to nh2/copier that referenced this pull request Feb 14, 2017

Initial copying version without update check.

ec9d4d3

It's a bit slow due to the `copyFile` bug haskell/directory#69 and this GHC bug: https://ghc.haskell.org/trac/ghc/ticket/13246

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce system call overhead of `copyFile` by increasing the buffer size to 128 KiB #69

Reduce system call overhead of `copyFile` by increasing the buffer size to 128 KiB #69

nh2 commented Feb 8, 2017

nh2 commented Feb 8, 2017

nh2 commented Feb 8, 2017 •

edited

Loading

Rufflewind commented Feb 8, 2017

erikd commented Feb 8, 2017

nh2 commented Feb 8, 2017

nh2 commented Feb 8, 2017

nh2 commented Feb 8, 2017

nh2 commented Feb 8, 2017

nh2 commented Feb 8, 2017

nh2 commented Feb 8, 2017 •

edited

Loading

nh2 commented Feb 9, 2017

nh2 commented Feb 9, 2017

nh2 commented Feb 9, 2017

nh2 commented Feb 9, 2017

nh2 commented Feb 9, 2017

Rufflewind commented Feb 10, 2017

Rufflewind commented Feb 10, 2017

Reduce system call overhead of copyFile by increasing the buffer size to 128 KiB #69

Reduce system call overhead of copyFile by increasing the buffer size to 128 KiB #69

Conversation

nh2 commented Feb 8, 2017

nh2 commented Feb 8, 2017

nh2 commented Feb 8, 2017 • edited Loading

Rufflewind commented Feb 8, 2017

erikd commented Feb 8, 2017

nh2 commented Feb 8, 2017

nh2 commented Feb 8, 2017

nh2 commented Feb 8, 2017

nh2 commented Feb 8, 2017

nh2 commented Feb 8, 2017

nh2 commented Feb 8, 2017 • edited Loading

nh2 commented Feb 9, 2017

nh2 commented Feb 9, 2017

nh2 commented Feb 9, 2017

nh2 commented Feb 9, 2017

nh2 commented Feb 9, 2017

Rufflewind commented Feb 10, 2017

Rufflewind commented Feb 10, 2017

Reduce system call overhead of `copyFile` by increasing the buffer size to 128 KiB #69

Reduce system call overhead of `copyFile` by increasing the buffer size to 128 KiB #69

nh2 commented Feb 8, 2017 •

edited

Loading

nh2 commented Feb 8, 2017 •

edited

Loading