Speed up sum by using reasonable read buffer sizes. #3741

Merged 5 commits on Jul 28, 2022
Changes from 2 commits
23 changes: 23 additions & 0 deletions src/uu/sum/BENCHMARKING.md
@@ -0,0 +1,23 @@
## Benchmarking `sum`

<!-- spell-checker:ignore wikidatawiki -->

Large sample files can be found, for example, in the [Wikipedia database dumps](https://dumps.wikimedia.org/wikidatawiki/latest/). These files are usually multiple gigabytes in size and comprise more than 100M lines.

After you have obtained and uncompressed such a file, you need to build `sum` in release mode

```shell
$ cargo build --release --package uu_sum
```

and then you can time how long it takes to checksum the file by running

```shell
$ /usr/bin/time ./target/release/sum wikidatawiki-20211001-pages-logging.xml
```

For more systematic measurements that include warm-ups, repetitions and comparisons, [Hyperfine](https://github.com/sharkdp/hyperfine) can be helpful. For example, to compare this implementation to the one provided by your distribution, run

```shell
$ hyperfine "./target/release/sum wikidatawiki-20211001-pages-logging.xml" "/usr/bin/sum wikidatawiki-20211001-pages-logging.xml"
```
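Besides wall-clock timing, the effect of the buffer size can be illustrated by counting how many `read` calls are needed to drain an input. The following is a minimal sketch, not part of this change: `CountingReader` and `count_reads` are hypothetical helpers, and the input is an in-memory `Cursor`, which always fills the buffer. Real files and pipes may return short reads, so actual counts can be higher.

```rust
use std::io::{self, Cursor, Read};

/// Hypothetical wrapper that counts how often `read` is called on the inner reader.
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

/// Drains `data` through a buffer of `buf_size` bytes and returns the
/// number of `read` calls, including the final call that returns 0.
fn count_reads(data: &[u8], buf_size: usize) -> io::Result<usize> {
    let mut reader = CountingReader {
        inner: Cursor::new(data),
        calls: 0,
    };
    let mut buf = vec![0u8; buf_size];
    loop {
        if reader.read(&mut buf)? == 0 {
            break;
        }
    }
    Ok(reader.calls)
}

fn main() -> io::Result<()> {
    let data = vec![0u8; 1 << 20]; // 1 MiB of in-memory sample data
    for size in [512, 1024, 4096] {
        println!("{size:>4}-byte buffer: {} read calls", count_reads(&data, size)?);
    }
    Ok(())
}
```

With a real file, each `read` call typically maps to a system call, which is why fewer, larger reads tend to be faster. The exact gain depends on the platform and should be measured rather than assumed.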
19 changes: 13 additions & 6 deletions src/uu/sum/src/sum.rs
@@ -23,14 +23,19 @@ static USAGE: &str = "{} [OPTION]... [FILE]...";
static SUMMARY: &str = "Checksum and count the blocks in a file.\n\
With no FILE, or when FILE is -, read standard input.";

+ // This can be replaced with usize::div_ceil once it is stabilized
+ fn div_ceil(a: usize, b: usize) -> usize {
+ (a + b - 1) / b
+ }
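To illustrate how this helper turns a byte count into a block count, here is a small standalone sketch. The function body is copied from the diff; the sample values in `main` are chosen only for illustration. Note that `a + b - 1` could overflow for values of `a` close to `usize::MAX`, which is unlikely to matter for byte counts of real files.

```rust
// Same body as the helper in the diff: integer division rounded up.
fn div_ceil(a: usize, b: usize) -> usize {
    (a + b - 1) / b
}

fn main() {
    // (bytes read, block size) pairs chosen for illustration
    for (bytes, block) in [(0, 1024), (1, 1024), (1024, 1024), (1025, 1024), (513, 512)] {
        println!(
            "{bytes} bytes in {block}-byte blocks -> {} block(s)",
            div_ceil(bytes, block)
        );
    }
}
```

An empty input yields 0 blocks, and any partially filled block counts as a full one.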

fn bsd_sum(mut reader: Box<dyn Read>) -> (usize, u16) {
- let mut buf = [0; 1024];
- let mut blocks_read = 0;
+ let mut buf = [0; 4096];
+ let mut bytes_read = 0;
let mut checksum: u16 = 0;
loop {
match reader.read(&mut buf) {
Ok(n) if n != 0 => {
- blocks_read += 1;
+ bytes_read += n;
for &byte in buf[..n].iter() {
checksum = (checksum >> 1) + ((checksum & 1) << 15);
checksum = checksum.wrapping_add(u16::from(byte));
@@ -40,18 +45,19 @@ fn bsd_sum(mut reader: Box<dyn Read>) -> (usize, u16) {
}
}

+ let blocks_read = div_ceil(bytes_read, 1024);
(blocks_read, checksum)
}
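Counting bytes and dividing afterwards makes the block count independent of how the input happens to be split into reads, whereas counting one block per `read` call ties it to the buffer size and to short reads. The sketch below is a self-contained variant for experimentation, not the code in this change: `bsd_sum_with_buf` and its `buf_size` parameter are hypothetical, and the fallback match arm is an assumption because the diff elides it.

```rust
use std::io::{Cursor, Read};

fn div_ceil(a: usize, b: usize) -> usize {
    (a + b - 1) / b
}

/// BSD checksum read through a buffer of `buf_size` bytes.
/// `buf_size` is a parameter added for this sketch only.
fn bsd_sum_with_buf(mut reader: impl Read, buf_size: usize) -> (usize, u16) {
    let mut buf = vec![0u8; buf_size];
    let mut bytes_read = 0;
    let mut checksum: u16 = 0;
    loop {
        match reader.read(&mut buf) {
            Ok(n) if n != 0 => {
                bytes_read += n;
                for &byte in &buf[..n] {
                    // Rotate right by one bit, then add the byte (wrapping).
                    checksum = (checksum >> 1) + ((checksum & 1) << 15);
                    checksum = checksum.wrapping_add(u16::from(byte));
                }
            }
            // Assumption: stop at end of input or on any error.
            _ => break,
        }
    }
    // BSD block size is 1024 bytes, regardless of the read buffer size.
    (div_ceil(bytes_read, 1024), checksum)
}

fn main() {
    let data = vec![b'a'; 3000];
    let small = bsd_sum_with_buf(Cursor::new(data.as_slice()), 7);
    let large = bsd_sum_with_buf(Cursor::new(data.as_slice()), 4096);
    // Both buffer sizes report the same (blocks, checksum) pair.
    println!("7-byte buffer: {small:?}, 4096-byte buffer: {large:?}");
}
```

Under the previous per-read counting, the 7-byte buffer would have reported 429 blocks for this 3000-byte input instead of 3.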

fn sysv_sum(mut reader: Box<dyn Read>) -> (usize, u16) {
- let mut buf = [0; 512];
- let mut blocks_read = 0;
+ let mut buf = [0; 4096];
+ let mut bytes_read = 0;
let mut ret = 0u32;

loop {
match reader.read(&mut buf) {
Ok(n) if n != 0 => {
- blocks_read += 1;
+ bytes_read += n;
for &byte in buf[..n].iter() {
ret = ret.wrapping_add(u32::from(byte));
}
@@ -63,6 +69,7 @@ fn sysv_sum(mut reader: Box<dyn Read>) -> (usize, u16) {
ret = (ret & 0xffff) + (ret >> 16);
ret = (ret & 0xffff) + (ret >> 16);

+ let blocks_read = div_ceil(bytes_read, 512);
(blocks_read, ret as u16)
}
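The System V variant sums all bytes into a 32-bit accumulator and then folds the upper 16 bits into the lower 16 bits twice, which is what the two `ret = (ret & 0xffff) + (ret >> 16)` lines do. As a rough standalone sketch (again with a hypothetical `buf_size` parameter and an assumed fallback match arm, since the diff elides it), the behavior can be tried out like this:

```rust
use std::io::{Cursor, Read};

fn div_ceil(a: usize, b: usize) -> usize {
    (a + b - 1) / b
}

/// System V checksum read through a buffer of `buf_size` bytes.
/// `buf_size` is a parameter added for this sketch only.
fn sysv_sum_with_buf(mut reader: impl Read, buf_size: usize) -> (usize, u16) {
    let mut buf = vec![0u8; buf_size];
    let mut bytes_read = 0;
    let mut ret = 0u32;
    loop {
        match reader.read(&mut buf) {
            Ok(n) if n != 0 => {
                bytes_read += n;
                for &byte in &buf[..n] {
                    ret = ret.wrapping_add(u32::from(byte));
                }
            }
            // Assumption: stop at end of input or on any error.
            _ => break,
        }
    }
    // Fold the carry bits above 16 bits back into the low 16 bits, twice.
    ret = (ret & 0xffff) + (ret >> 16);
    ret = (ret & 0xffff) + (ret >> 16);
    // System V block size is 512 bytes, regardless of the read buffer size.
    (div_ceil(bytes_read, 512), ret as u16)
}

fn main() {
    // 97 + 98 + 99 = 294 fits in 16 bits, so folding leaves it unchanged.
    println!("{:?}", sysv_sum_with_buf(Cursor::new(&b"abc"[..]), 4096));
    // A larger input whose byte sum exceeds 16 bits and therefore gets folded.
    let data = vec![0xffu8; 70_000];
    println!("{:?}", sysv_sum_with_buf(Cursor::new(data.as_slice()), 4096));
}
```

The second fold matters because the first one can itself produce a value slightly above `0xffff`.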
