Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce system call overhead of copyFile by increasing the buffer size to 128 KiB #69

Merged
merged 1 commit into from
Feb 10, 2017

Conversation

nh2
Copy link
Member

@nh2 nh2 commented Feb 8, 2017

128 KiB is what coreutils cp uses, based on a benchmark provided here:
http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/ioblksize.h?id=c0a79542fb5c2c22cf0a250db94af6f8581ca342#n23

The benchmarks report speedups of factor 5x to 30x depending on
the hardware.

…ze to 128 KiB.

128 KiB is what coreutils `cp` uses, based on a benchmark provided here:
http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/ioblksize.h?id=c0a79542fb5c2c22cf0a250db94af6f8581ca342#n23

The benchmarks report speedups of factor 5x to 30x depending on
the hardware.
@nh2
Copy link
Member Author

nh2 commented Feb 8, 2017

I'd still like somebody to independently verify the gains.

@nh2
Copy link
Member Author

nh2 commented Feb 8, 2017

Also, there's some weird thing going on with write() syscalls and handle buffers:

write(3, "", 0)                         = 0 <0.000004>
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072 <0.000050>
read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072 <0.000023>
write(3, "", 0)                         = 0 <0.000005>
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072 <0.000050>
read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072 <0.000022>

This happens as soon as you set bufferSize > 8095.

With 8095 or smaller, it looks right like:

read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8096) = 8096 <0.000005>
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8095) = 8095 <0.000011>
read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8096) = 8096 <0.000005>
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8095) = 8095 <0.000009>

@Rufflewind
Copy link
Member

Also, there's some weird thing going on with write() syscalls and handle buffers:

Have you tried disabling buffering perhaps?

@erikd
Copy link
Member

erikd commented Feb 8, 2017

Have you tried disabling buffering perhaps?

I can't seem to figure out whether hGteBuf uses FILE* handles for file descriptors underneath. The former are buffered and the latter are not.

@nh2
Copy link
Member Author

nh2 commented Feb 8, 2017

Have you tried disabling buffering perhaps?

@Rufflewind Yes, I had, to no avail; tried with NoBuffering, BlockBuffing Nothing, and BlockBuffering (Just 100000), no difference.

@nh2
Copy link
Member Author

nh2 commented Feb 8, 2017

What I find especially disturbing is that the write("") comes before the proper write.

@nh2
Copy link
Member Author

nh2 commented Feb 8, 2017

This leads me to the suspicion that this might be going on in wantWritableHandle / checkWriteableHandle.

@nh2
Copy link
Member Author

nh2 commented Feb 8, 2017

Minimal example:

import Foreign.Marshal.Alloc (allocaBytes)
import System.IO

main = do
  withBinaryFile "testfile2" WriteMode $ \ hTo -> do
    let bufferSize = 8096 -- write("") is gone when we use 8095 here
    allocaBytes bufferSize $ \buffer -> do
      hPutBuf hTo buffer bufferSize

In strace -f (evil write("") highlighted for emphasis):

open("testfile2", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3 <0.000007>
fstat(3, {st_mode=S_IFREG|0664, st_size=8096, ...}) = 0 <0.000005>
ftruncate(3, 0)                         = 0 <0.000027>
ioctl(3, TCGETS, 0x7fff86ea7f60)        = -1 ENOTTY (Inappropriate ioctl for device) <0.000006>

write(3, "", 0)                         = 0 <0.000004>

write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8096) = 8096 <0.000017>
close(3)                                = 0 <0.000030>

@nh2
Copy link
Member Author

nh2 commented Feb 8, 2017

OK, making some progress here. The 8096 comes from https://hackage.haskell.org/package/base-4.9.1.0/docs/src/GHC.IO.FD.html#dEFAULT_FD_BUFFER_SIZE

instance BufferedIO FD where
  newBuffer _dev state = newByteBuffer dEFAULT_FD_BUFFER_SIZE state

-- We used to use System.Posix.Internals.dEFAULT_BUFFER_SIZE, which is
-- taken from the value of BUFSIZ on the current platform.  This value
-- varies too much though: it is 512 on Windows, 1024 on OS X and 8192
-- on Linux.  So let's just use a decent size on every platform:
dEFAULT_FD_BUFFER_SIZE :: Int
dEFAULT_FD_BUFFER_SIZE = 8096

Does anybody know why 8096 was chosen? It's none of the numbers mentioned before, and it's not a power of 2.

@nh2
Copy link
Member Author

nh2 commented Feb 8, 2017

OK it's all clear now, it's a typo in the number indeed, will post an update shortly.

The write("")s come from bufWrite, where a > should be >= in if (size - w > count).

There's also an unrelated -- TODO: we should do a non-blocking flush here in there that should probably be fixed to not violate the non-blocking guarantees of hPutBuf'.

@nh2
Copy link
Member Author

nh2 commented Feb 9, 2017

OK, I've written this up as GHC bugs:

and filed a third slightly related

@nh2
Copy link
Member Author

nh2 commented Feb 9, 2017

This PR can now return to its original purpose, increasing the copyFile buffer size to 128 KB.

@nh2
Copy link
Member Author

nh2 commented Feb 9, 2017

I've done some quick time based benchmarking between 1024 and 131072 buffer size.

Preparing a 1GB file:

dd if=/dev/zero of=testfile bs=1M count=1000

Code:

main = copyFile "testfile" "testfile2"

(running on a version of base that has the bug hPutBuf issues unnecessary empty write syscalls for large writes fixed -- for benchmarking it without having to recompile base, I used this file for generating the below measurements).

Because of the 8096 handle buffer bug, I provide times for both 1024 and 8096/8 = 1012 (as the latter will be more efficient until that bug is fixed, for bufferSize < 8096; for larger buffer sizes I don't provide multiples of 1012 because it makes no difference as the Handle buffer is not used in that case).

Running with GHC 8.0.2 on Stackage nightly-2017-01-25, using ghc --make -O copytest.hs && time ./copytest.

bufferSize = 1012:

0.18user 0.87system 0:02.69elapsed 39%CPU (0avgtext+0avgdata 4352maxresident)k
0inputs+2048000outputs (0major+314minor)pagefaults 0swaps

0.16user 0.92system 0:02.62elapsed 41%CPU (0avgtext+0avgdata 4572maxresident)k
0inputs+2048000outputs (0major+320minor)pagefaults 0swaps

0.20user 0.86system 0:02.80elapsed 38%CPU (0avgtext+0avgdata 4360maxresident)k
0inputs+2048000outputs (0major+315minor)pagefaults 0swaps

0.19user 0.85system 0:02.61elapsed 40%CPU (0avgtext+0avgdata 4532maxresident)k
0inputs+2048000outputs (0major+319minor)pagefaults 0swaps

0.18user 0.98system 0:02.65elapsed 43%CPU (0avgtext+0avgdata 4472maxresident)k
0inputs+2048000outputs (0major+317minor)pagefaults 0swaps

bufferSize = 1024:

0.20user 0.86system 0:02.82elapsed 37%CPU (0avgtext+0avgdata 4420maxresident)k
0inputs+2048000outputs (0major+314minor)pagefaults 0swaps

0.22user 0.73system 0:02.68elapsed 35%CPU (0avgtext+0avgdata 4468maxresident)k
0inputs+2048000outputs (0major+320minor)pagefaults 0swaps

0.19user 0.88system 0:02.67elapsed 40%CPU (0avgtext+0avgdata 4576maxresident)k
0inputs+2048000outputs (0major+318minor)pagefaults 0swaps

0.17user 0.78system 0:02.60elapsed 36%CPU (0avgtext+0avgdata 4496maxresident)k
0inputs+2048000outputs (0major+318minor)pagefaults 0swaps

0.20user 0.79system 0:02.64elapsed 37%CPU (0avgtext+0avgdata 4400maxresident)k
0inputs+2048000outputs (0major+318minor)pagefaults 0swaps

bufferSize = 1024 * 8:

0.09user 0.83system 0:02.50elapsed 37%CPU (0avgtext+0avgdata 4396maxresident)k
0inputs+2048000outputs (0major+315minor)pagefaults 0swaps

0.08user 0.84system 0:02.97elapsed 31%CPU (0avgtext+0avgdata 4508maxresident)k
0inputs+2048000outputs (0major+320minor)pagefaults 0swaps

0.12user 0.78system 0:02.64elapsed 34%CPU (0avgtext+0avgdata 4408maxresident)k
0inputs+2048000outputs (0major+319minor)pagefaults 0swaps

0.12user 0.78system 0:02.44elapsed 36%CPU (0avgtext+0avgdata 4504maxresident)k
0inputs+2048000outputs (0major+321minor)pagefaults 0swaps

0.10user 0.77system 0:02.47elapsed 35%CPU (0avgtext+0avgdata 4544maxresident)k
0inputs+2048000outputs (0major+321minor)pagefaults 0swaps

bufferSize = 1024 * 128:

0.00user 0.78system 0:02.33elapsed 33%CPU (0avgtext+0avgdata 4656maxresident)k
0inputs+2048000outputs (0major+349minor)pagefaults 0swaps

0.00user 0.82system 0:02.33elapsed 35%CPU (0avgtext+0avgdata 4528maxresident)k
0inputs+2048000outputs (0major+349minor)pagefaults 0swaps

0.00user 0.76system 0:02.36elapsed 32%CPU (0avgtext+0avgdata 4524maxresident)k
0inputs+2048000outputs (0major+348minor)pagefaults 0swaps

0.01user 0.74system 0:02.33elapsed 32%CPU (0avgtext+0avgdata 4548maxresident)k
0inputs+2048000outputs (0major+345minor)pagefaults 0swaps

0.00user 0.71system 0:02.27elapsed 31%CPU (0avgtext+0avgdata 4520maxresident)k
0inputs+2048000outputs (0major+346minor)pagefaults 0swaps

As you can see, user CPU time went from 0.20 to 0.01 seconds, and system CPU time also improved slightly.

These measurements support that bufSize = 1024 * 128 indeed improves performance.

@nh2
Copy link
Member Author

nh2 commented Feb 9, 2017

I should also mention, this change makes copyFile as fast as cp on a ramdisk, while before it was twice as slow:

Before: 0.20user 0.38system 0:00.58elapsed 99%CPU (0avgtext+0avgdata 4472maxresident)k
After:  0.00user 0.35system 0:00.35elapsed 99%CPU (0avgtext+0avgdata 4708maxresident)k
cp:     0.00user 0.33system 0:00.33elapsed 99%CPU (0avgtext+0avgdata 2568maxresident)k

The only thing that's faster is sendfile():

  • sendfile in Haskell: 0.00user 0.28system 0:00.28elapsed 99%CPU (0avgtext+0avgdata 3528maxresident)k
  • probably even slightly faster in C

@nh2
Copy link
Member Author

nh2 commented Feb 9, 2017

To answer the other question:

I can't seem to figure out whether hGteBuf uses FILE* handles for file descriptors underneath.

@erikd It doesn't, Handles are inspired by libc's standard input/output streams, but don't use them. Instead, they implement their own similar logic directly.

@Rufflewind
Copy link
Member

I'd still like somebody to independently verify the gains.

Tested this on tmpfs (ramdisk) and SSD on my Linux machine, and there was a speedup of ~60%. The results fluctuates a lot on my machine. Generally I expect it to different from OS to OS depending on how well the OS caches the I/O.

In any case, I don't doubt that this should be speed increase in the majority of cases given how large I/O caches have become these days. Will merge shortly.

Preparing a 1GB file:

It's probably better to use /dev/urandom rather than /dev/zero to avoid predictability.

The only thing that's faster is sendfile():

In the future, it'd be nice to use sendfile on platforms that do support it.

Rufflewind added a commit to Rufflewind/directory that referenced this pull request Feb 10, 2017
@Rufflewind Rufflewind merged commit 51a0f72 into haskell:master Feb 10, 2017
@Rufflewind
Copy link
Member

Thanks for the PR and the benchmarks!

nh2 added a commit to nh2/copier that referenced this pull request Feb 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants