-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up Rust #1
base: master
Are you sure you want to change the base?
Conversation
This makes it a little bit nicer to check whether the last thing we added was a digit or not, and also lets us skip allocating an extra string when adding a digit.
It never actually returned any errors; so why return anything at all.
Don't just check if something worked fine, propagate the error otherwise.
This gives us a slight, but noticeable speed-up, since storing a char requires four times as much space.
Instead of cloning the vector to create a new partial_solutions vector, reuse the very same vector, and simply pop the item after the recursive call. This gives us another noticeable speed-up.
Instead of passing around the reference to the whole vector of digits and the start index separately, make use of a Rust idiom: slices. We can pass around a &[u8], which is a slice of the original vector. We can slice it further, such as passing &digits[i+1..] to a recursive invocation.
BigUints are slow. Since we don't actually have to do any arithmetics with them, we can get away with simply using string slices (or actually byte slices). This gives us a *huge* speed-up, because we can now directly slice a digit sequence out of a phone number instead of doing costly BigUint arithmetics.
This was clearly meant to ignore the *files* at the root of the tree, not the whole source directories. In particular, ./src/rust/phone_encoder/ is no longer ignored, while ./phone_encoder (in the project root directory) still is.
It can be easily made 2x faster still by switching the hasher from the default supposedly cryptographically secure one to something like aHash:
|
@bugaevc you could also try using https://docs.rs/grep-cli/0.1.6/grep_cli/ in place of stdout to avoid line-buffering as seen here: rust-lang/rust#60673 Adding the following to Cargo.toml generally gets another 2-5% bump:
|
Using |
I will give it a try later today, but that could just be the case for this! Thanks for trying it at all 👍 |
And yeah, using And second, lazy_static was entirely useless here, even if And yeah, the benchmark runner won't compile because it seems to be using some mac'isms? So that's very odd? Summary
'./phone_encoder_pr1 dictionary.txt phones_1000.txt' ran
1.34 ± 0.29 times faster than 'sbcl --script src/lisp/main.lisp dictionary.txt phones_1000.txt'
1.83 ± 0.37 times faster than './phone_encoder dictionary.txt phones_1000.txt'
5.16 ± 1.08 times faster than 'java -cp build/java Main dictionary.txt phones_1000.txt'
5.91 ± 1.10 times faster than 'java -cp build/java Main2 dictionary.txt phones_1000.txt'
Summary
'./phone_encoder_pr1 dictionary.txt phones_10_000.txt' ran
1.70 ± 0.25 times faster than 'sbcl --script src/lisp/main.lisp dictionary.txt phones_10_000.txt'
2.44 ± 0.34 times faster than './phone_encoder dictionary.txt phones_10_000.txt'
4.09 ± 0.57 times faster than 'java -cp build/java Main dictionary.txt phones_10_000.txt'
5.75 ± 0.79 times faster than 'java -cp build/java Main2 dictionary.txt phones_10_000.txt'
Summary
'./phone_encoder_pr1 dictionary.txt phones_50_000.txt' ran
2.04 ± 0.08 times faster than 'sbcl --script src/lisp/main.lisp dictionary.txt phones_50_000.txt'
2.57 ± 0.21 times faster than 'java -cp build/java Main dictionary.txt phones_50_000.txt'
3.16 ± 0.13 times faster than './phone_encoder dictionary.txt phones_50_000.txt'
4.77 ± 0.23 times faster than 'java -cp build/java Main2 dictionary.txt phones_50_000.txt'
Summary
'./phone_encoder_pr1 dictionary.txt phones_100_000_with_empty.txt' ran
2.16 ± 0.22 times faster than 'java -cp build/java Main dictionary.txt phones_100_000_with_empty.txt'
2.24 ± 0.07 times faster than 'sbcl --script src/lisp/main.lisp dictionary.txt phones_100_000_with_empty.txt'
3.40 ± 0.11 times faster than './phone_encoder dictionary.txt phones_100_000_with_empty.txt'
4.15 ± 0.14 times faster than 'java -cp build/java Main2 dictionary.txt phones_100_000_with_empty.txt' And the hyperfine exported markdown format (unsorted):
The the benchmark script I just added the dictionary file to each command along with a CHECK_FILE="proc_out.txt"
DEFAULT_INPUT="input.txt"
DEFAULT_OUTPUT="output.txt"
DICTIONARY="dictionary.txt"
COMMANDS=(
"java -cp build/java Main ${DICTIONARY} {file}" # Java 1
"java -cp build/java Main2 ${DICTIONARY} {file}" # Java 2
"sbcl --script src/lisp/main.lisp ${DICTIONARY} {file}" # Common Lisp
"./phone_encoder ${DICTIONARY} {file}" # Rust
"./phone_encoder_pr1 ${DICTIONARY} {file}" # Rust PR1
) And updated the correctness block to handle that by changing it to: echo "Checking all programs for correctness"
for CMD in "${COMMANDS[@]}"
do
echo "Checking: $CMD"
${CMD//{file\}/${DEFAULT_INPUT}} > $CHECK_FILE
diff -q <(sort $CHECK_FILE) <(sort $DEFAULT_OUTPUT)
echo "OK"
done And replaced that big benchmarking loop with looping over just the files instead to compare how they do on different inputs: echo "Benchmarking..."
for f in "${INPUTS[@]}"; do
echo "==> $f"
hyperfine --export-markdown "results-$f.md" -L file "$f" "${COMMANDS[@]}"
done And also added And of course there easily can be more optimizations done, such as reducing memory allocations and all such things, jemalloc maybe depending on the use load (or an arena), and other things, etc... |
Which remaining allocations can easily be optimized away? The remaining allocations are:
Also, does it seem to you that the remaining allocations are the bottleneck? From the flamechart, it appears that the majority of time is spent looping in |
That's unfortunate. I would like to keep
I've been doing Rust for a few years.... are you sure it's that bad?? BTW I won't merge the Pull Request, will just keep it open for future reference by people who are interested in see where Rust problems were originally... that might help them know what NOT to do in Rust to not make their programs slow. Thanks for all the effort! Very interesting to see how people manage to squeeze performance out of Rust :)
Where do you find this kind of information? If you look around for BigInt Rust implementation,
The value was not cloned, it was passed as a reference to the n = &n * (&*TEN) + &nth_digit(digits, i);
Wow that's very unexpected, I thought it was hard to get problems like this in Rust. Can you tell me exactly where the problem is?
Wow, you seriously want to win :D what's the point of comparing with other languages when you go to the effort of replacing stdout? In my benchmark_runner I threw away output to stdout, so I think it's fair to keep the programs using |
RE grep_cli, it's a known Rust bug that it will write a line at a time even when not writing to a TTY, so replacing stdout just brings it in line with what other languages are doing. Apparently it's not a slowdown here anyways 🤷 |
I went out of my way to simply improve the algorithm implementation and not change the actual algorithm 🙂 As you can see, the change is very small. The algorithm still works the exact same way, it's just that it now stores digit sequences as
Well... here are a few things I consider unidiomatic:
I don't really understand your motivation here; but it's of course up to you. But please update your blog post with new measurements, charts, and conclusions 🙂 |
If you are willing to use an implementation in unsafe Rust, Edit: I have to correct myself: |
@bugaevc thanks for your review. I will respond to your objections soon, but just want to let you know that the article's intention was never to see how fast a language could be, but how fast it would be when using a certain algorithm and written in a certain amount of time (I had to assume the same algorithm would take roughly the same amount of time in any language, which IMO is pretty close to reality). Using It's great to know there are many library implementations of big int in Rust, but to be honest, this is a weakness in Rust: you don't have a big standard library, and knowning which crate to use is next to impossible. You need to either believe other's suggestions or spend lots of time yourself to figure it out: both of which are not good in a business environment (not to mention that most Rust libraries are 0.x, have dozens of 0.x dependencies, and belong to personal accounts, which is really hard to trust). I don't think my blog post misrepresented Rust at all, in fact it was unfair to other languages in that it was the only one that used libraries (later, I also used libraries in Dart, but in that case it was only for some ASCII constants, not to optimise anything). Your suggestion that my Rust code is not perfect might stand, but do you think every Rust developer is perfect and will write highly optimised code? Do you think the code in the other languages was perfect? That's not in the spirit of my article to assume that is the case. The only purpose of measuring speed and memory was to see how a port of Norvig's solution would perform in other languages, thus discounting the big differences that would certainly arise due to different programs arriving at a solution in completely different ways ... I don't know why some people think it's useful to compare languages that way :) it's just not useful at all... any language can implement any algorithm, so if we want to update the Rust code with optimisations like this, fair enough, but I will also have to optimise the other languages and spend as much effort at least as you have spent on this for Rust. I will also have to update how long it took people to come up with those optimisations... Only then may I update anything on my results. But unfortunately I don't have enough free time (I already spent far too much on this problem) to engage in that right now. But I will definitely update the spreadsheet with the optimisations in all languages once I am able to collect them (there's a few improvements for Julia, and I have several ideas to make Java and Dart much faster as well). Finally, I said clearly in the README that I wouldn't accept pull requests: https://github.com/renatoathaydes/prechelt-phone-number-encoding#adding-a-new-program-to-the-benchmarks Will post them here as soon as I can. Thanks again for taking the time to make your points. |
+1
I actually didn't run across num and num-bigint until long after I'd been using rug.
They aren't really comparable as it is though, num-bigint doesn't do any of those low level performance enhancements that lisp, python, gmp, etc... do in their arbitrary precision integer implementations.
Regardless, having the implementation in the repo using num-bigint is not being fair to the rust implementation as it is not an implementation focused on performance but instead focused on readability and correctness to the detriment of performance.
It clones it internally to make a new extended copy. You can save that cost even within it by just using platform native integers with the num-bigint libary.
It's because it's using
+1
I don't think it was tribal in my case, I just grabbed the first result on crates for my use, which was rug, I only found out about the others (and their performance issues) later... ^.^;
It can be done perfectly fine in rust without libraries however. Using a bigint seems a bit odd to store the individual numbers in.
Not highly optimized no, but more traditionally rust, reaching for an arbitrary precision integer implementation is never something I would have looked for for a problem such as this. In lisp that is more expected due to how it's built, but not every language is built the same and its generally best to compare implementations written as if they were natively written in a given language, not something highly optimized, but something readable and "expected" for the language. I'd agree with the same regardless of the language (like even though java has arbitrary precision integers that's still not something a java program would reach for for a task such as this as it's just not the "java way"). |
I see, thank you for explaining.
That wasn't clear to me from the article at all. While it undoubtedly talks about developer productivity and the time it took for the studies' participants to prepare their submissions, the actual implementations being benchmarked, introduced in the Benchmarks section, are described as:
i.e. it says "port of Norvig’s solution to Rust" "to get a baseline for how fast the algorithm could go", not "as good of a Rust implementation as I managed to write in 3.5 hours, at which point I stopped in order to be fair to other languages". A bit later you even talk about spending time to improve the Rust version:
Of course, it's your post, so you know better; it's just that my understanding was incorrect; and thank you again for explaining it. But that also shows that it's easy to mistake your post for a usual post that compares the performance of idiomatic, reasonably optimized versions of the same program in different languages. So perhaps it wouldn't hurt to add a note along these lines: "Note: This Rust implementation is not as fast as it could be: with some small changes it can be made 3 to 6 times faster. But this article's intention is not to see how fast a language could be, but how fast it would be when using a certain algorithm and written in a certain amount of time." |
I am very curious at looking at the discussion of people that looked at that code as I'm curious how such obvious things were missed, I hopped in the rust discord and searched for "norvig" (last result 2 years ago), "lisp" (waaaay too many results, as well as a few variations of lisp), "bigint" (found a lot of people complaining about num-bigint being slow and that they should use a gmp based library instead like rug, mostly sql related messages though), "phone" (whoa way too many results), "phone number" (lot of complaints about advertisers taking people's phone numbers and how bad it is for 2fa), "prechelt" (no results at all), and a few other things including looking at many messages of accounts of a similar name to the github name here (though none identical), but not finding anything related to anything like this at all? |
This discussion is actually great. Every language community that has discussed my blog post so far has had some similar arguments (which is expected): the code should be optimised, it shouldn't use BigInt, it's not fair to our language because blah blah... But this is the first time anyone has come up with significant improvements. As I mentioned before, removing BigInt is a problem for me, but I am still interested in learning what makes Rust programs fast or slow, and this was very illuminating for that. Here's the results I got after running the Rust code in this Pull Request VS a Rust is now much closer to what everyone would expect: the fastest language and the lowest memory consumption :) But I have to note that CL is still, amazingly, very close. I must say I still do believe that a student in the same circumstances as the other students in the study might write sloppy Rust code (yes, I admit my Rust code was sloppy!) and that The The
What may be obvious to you is not obvious to everyone. The only "obvious" thing, I think, was replacing
This is actually a great comment: it shows that Rust is not magically fast, and not expected to be by its proponents, and that it still requires you to make a tradeoff. It takes effort to make Rust fast. Which is something very important to consider when choosing a language in a professional setting where time is of the essence. I've already published the data in the spreadsheet (linked above) and I will add a note to the original article mentioning a small summary of this discussion as I think readers may indeed arrive at the wrong conclusions based on what the article currently says about Rust (notice that I am a Rust proponent myself and use it at work, although as only a third or fourth language, and a few hobby projects). I will post the update tomorrow as it's late now! Thanks everyone for taking the time to scrutinize my article. |
See https://github.com/philipc/prechelt-phone-number-encoding/blob/rust/src/rust/phone_encoder/src/main.rs if you want a version that still uses |
@philipc that's amazing. It's nearly as performant as the highly optimised code in this PR, but much closer to the approach of the original! Your program:
This PR code:
Very well done... I wish I had used that in the study :( |
Hey guys, here I am again probably going to get in trouble! I decided to lose one more day of my life on this problem, and wrote a whole new article about this: https://renato.athaydes.com/posts/how-to-write-slow-rust-code.html I also edited the original with a note right next to where the Rust chart is shown for the first time. The conclusions are NOT what anyone expected, including me, again?! Did I do something wrong? I hope not... anyway, let me know if you would like me to change anything, or give more credit to you guys or something... Thanks! |
Oh,that's true... If I applied that, I would need to do the same in Java as Java obviously also has faster hash-map implementations. I don't think that doing that would be reasonable of me! There's always going to be something else that coulde be modified in both versions (and let's not forget Common Lisp is still not optimised at all, not really fair to them).
That's fair enough, I am going to update the article with results when both Java and Rust buffer stdout (currently, neither seems to do as once I use buffering, both get a big speed up on the larger duration problems). I will keep you posted once results are in. |
I don't know, to me it looks like the whole thing was just optimized away... there seems to be no plausible way it was spending all the rest of that minute just trying to print (none of the flamegraphs I've seen shows anything like that). |
An acceptable alternative in my opinion, would to consume the output of the program on a separate thread, and let the second thread print the results using the buffer... you can then see if the first thread terminates much earlier than the printer thread does, in which case we can count only the time taken up until the first thread terminates (as we don't really care as much about how long it takes to push to stdout). |
Please note that:
Are you saying that having looked at the generated assembly and the flamechart, or just from the general idea that this much of a speed-up is unexpected?
See, even simply buffering all that output takes a lot of time (apparently most of the 1 minute run time), which was clearly visible on the flamechart when using I'm no JVM expert, but a quick web search turned up Blackhole from JMH as Java's counterpart to Rust's |
@bugaevc if the solution to the problem is not visible, it's useless to measure anything, so not printing, or even at least keeping all found solutions for later printing, is not done, there's no point measuring anything.
I don't care at all about the assembly as even that is impossible to profile in your head. I care about the results being found and used. Also, figuring out the solutions should take a substantial amount of time as the number of combinations becomes enormous with longer phone numbers... so yes, it's very unrealistic the problem is actually being solved in just 1 second. Can you try to make the buffering/printing run on a separate Thread or not? In Java it's trivial to do that, I did it, it appears that even for very large inputs, the actual printing takes up less than 1/3 of the time even without any buffering. That's substantial, but I wouldn't say that's making the benchmark measure the entirely wrong thing because it is absolutely part of the problem that you need to finish outputing all solutions. |
How so? It makes a ton of sense to me to measure the time it takes to compute the solutions, not output them — especially in case outputting the solutions far exceeds the time to compute them. In other words, what you're currently benchmarking seems to be output performance, not the actual Prechelt’s problem or Norvig's solution. The only issue with this is ensuring that the steps to compute the solutions are actually taken if we don't print them out, because the optimizers are so damn smart. That's what the Unless, of course, I'm entirely wrong, and the optimizer did optimize the whole thing away without me noticing. I don't rule out this possibility, but as of now there's way more evidence that it works as I expect it to, than otherwise.
You don't have to profile the assembly in your head; you simply have to verify that the calls to Making the results "used" is precisely what
There seems to be specifically 71 998 979 solutions in the Outputting 72 million values, on the other hand, can be quite slow.
I sure can, but as said, that would mostly measure the time it takes to buffer the results, not the time it takes to compute them. It doesn't matter much that there would be two threads, as it would still be the first thread that does the buffering. So I've done that, tried to run it, and... ran into a little problem: the first thread generated solutions far faster than the second thread is able to consume and print them, so the process keeps allocating memory and eventually gets OOM-killed. Therefore I was unable to run it to the end, but runs for at least 30 seconds for sure. The code, in case you want to try this yourselfuse std::collections::HashMap;
use std::env::args;
use std::fs::File;
use std::io::{self, BufRead};
use std::path::Path;
use std::fmt::{self, Display, Formatter};
use std::sync::mpsc::{Sender, channel};
type Dictionary = HashMap<Vec<u8>, Vec<String>, ahash::RandomState>;
#[derive(Debug, Copy, Clone)]
enum WordOrDigit<'a> {
Word(&'a str),
Digit(u8),
}
impl Display for WordOrDigit<'_> {
fn fmt(&self, formatter: &mut Formatter<'_>) -> fmt::Result {
match self {
WordOrDigit::Word(s) => s.fmt(formatter),
WordOrDigit::Digit(d) => d.fmt(formatter),
}
}
}
/// Port of Peter Norvig's Lisp solution to the Prechelt phone-encoding problem.
///
/// Even though this is intended as a port, it deviates quite a bit from it
/// due to the very different natures of Lisp and Rust.
fn main() -> io::Result<()> {
// drop itself from args
let mut args: Vec<_> = args().skip(1).collect();
let words_file: String = if !args.is_empty() { args.remove(0) } else { "tests/words.txt".into() };
let input_file: String = if !args.is_empty() { args.remove(0) } else { "tests/numbers.txt".into() };
let dict = load_dict(words_file)?;
let (sender, receiver) = channel();
std::thread::spawn(|| {
for value in receiver {
println!("{}", value);
}
});
for line in read_lines(input_file)? {
let num = line?;
let digits: Vec<u8> = num.chars()
.filter(char::is_ascii_digit)
.map(|ch| ch as u8)
.collect();
print_translations(&num, &digits, &mut Vec::new(), &dict, &sender);
}
Ok(())
}
fn print_translations<'a>(
num: &str,
digits: &[u8],
words: &mut Vec<WordOrDigit<'a>>,
dict: &'a Dictionary,
sender: &Sender<String>,
) {
if digits.is_empty() {
print_solution(num, &words, sender);
return;
}
let mut found_word = false;
for i in 0..digits.len() {
let (key, rest_of_digits) = digits.split_at(i + 1);
if let Some(found_words) = dict.get(key) {
for word in found_words {
found_word = true;
words.push(WordOrDigit::Word(word));
print_translations(num, rest_of_digits, words, dict, sender);
words.pop();
}
}
}
if found_word {
return;
}
let last_is_digit = match words.last() {
Some(WordOrDigit::Digit(_)) => true,
_ => false,
};
if !last_is_digit {
let digit = digits[0] - b'0';
words.push(WordOrDigit::Digit(digit));
print_translations(num, &digits[1..], words, dict, sender);
words.pop();
}
}
fn print_solution(num: &str, words: &[WordOrDigit<'_>], sender: &Sender<String>) {
sender.send(num.to_owned()).unwrap();
for word in words {
sender.send(word.to_string()).unwrap();
}
}
fn load_dict(words_file: String) -> io::Result<Dictionary> {
let mut dict: Dictionary = HashMap::with_capacity_and_hasher(
100,
ahash::RandomState::default()
);
for line in read_lines(words_file)? {
let word = line?;
let key = word_to_number(&word);
let words = dict.entry(key).or_default();
words.push(word);
}
Ok(dict)
}
// The output is wrapped in a Result to allow matching on errors
// Returns an Iterator to the Reader of the lines of the file.
fn read_lines<P>(filename: P) -> io::Result<io::Lines<io::BufReader<File>>>
where P: AsRef<Path>, {
let file = File::open(filename)?;
Ok(io::BufReader::new(file).lines())
}
fn word_to_number(word: &str) -> Vec<u8> {
word.chars()
.filter(char::is_ascii_alphabetic)
.map(char_to_digit)
.map(|d| d + b'0')
.collect()
}
fn char_to_digit(ch: char) -> u8 {
match ch.to_ascii_lowercase() {
'e' => 0,
'j' | 'n' | 'q' => 1,
'r' | 'w' | 'x' => 2,
'd' | 's' | 'y' => 3,
'f' | 't' => 4,
'a' | 'm' => 5,
'c' | 'i' | 'v' => 6,
'b' | 'k' | 'u' => 7,
'l' | 'o' | 'p' => 8,
'g' | 'h' | 'z' => 9,
_ => panic!("invalid input: not a digit: {}", ch)
}
} This could be solved by using a bounded buffer, but that again would massively slow down the first thread once it exhausts the buffer size and will have to wait for the second thread to consume more values. Here's a suggestion in case you don't believe in I've implemented that for Rust, here's the code: use std::collections::HashMap;
use std::env::args;
use std::fs::File;
use std::io::{self, BufRead};
use std::path::Path;
use std::fmt::{self, Display, Formatter};
use std::hash::{Hash, Hasher};
type Dictionary = HashMap<Vec<u8>, Vec<String>, ahash::RandomState>;
#[derive(Debug, Copy, Clone, Hash)]
enum WordOrDigit<'a> {
Word(&'a str),
Digit(u8),
}
impl Display for WordOrDigit<'_> {
fn fmt(&self, formatter: &mut Formatter<'_>) -> fmt::Result {
match self {
WordOrDigit::Word(s) => s.fmt(formatter),
WordOrDigit::Digit(d) => d.fmt(formatter),
}
}
}
/// Port of Peter Norvig's Lisp solution to the Prechelt phone-encoding problem.
///
/// Even though this is intended as a port, it deviates quite a bit from it
/// due to the very different natures of Lisp and Rust.
fn main() -> io::Result<()> {
// drop itself from args
let mut args: Vec<_> = args().skip(1).collect();
let words_file: String = if !args.is_empty() { args.remove(0) } else { "tests/words.txt".into() };
let input_file: String = if !args.is_empty() { args.remove(0) } else { "tests/numbers.txt".into() };
let dict = load_dict(words_file)?;
let mut hasher = ahash::AHasher::default();
for line in read_lines(input_file)? {
let num = line?;
let digits: Vec<u8> = num.chars()
.filter(char::is_ascii_digit)
.map(|ch| ch as u8)
.collect();
print_translations(&num, &digits, &mut Vec::new(), &dict, &mut hasher);
}
println!("{}", hasher.finish());
Ok(())
}
fn print_translations<'a>(
num: &str,
digits: &[u8],
words: &mut Vec<WordOrDigit<'a>>,
dict: &'a Dictionary,
hasher: &mut impl Hasher,
) {
if digits.is_empty() {
print_solution(num, &words, hasher);
return;
}
let mut found_word = false;
for i in 0..digits.len() {
let (key, rest_of_digits) = digits.split_at(i + 1);
if let Some(found_words) = dict.get(key) {
for word in found_words {
found_word = true;
words.push(WordOrDigit::Word(word));
print_translations(num, rest_of_digits, words, dict, hasher);
words.pop();
}
}
}
if found_word {
return;
}
let last_is_digit = match words.last() {
Some(WordOrDigit::Digit(_)) => true,
_ => false,
};
if !last_is_digit {
let digit = digits[0] - b'0';
words.push(WordOrDigit::Digit(digit));
print_translations(num, &digits[1..], words, dict, hasher);
words.pop();
}
}
fn print_solution(
num: &str,
words: &[WordOrDigit<'_>],
hasher: &mut impl Hasher
) {
num.hash(hasher);
for word in words {
word.hash(hasher);
}
}
fn load_dict(words_file: String) -> io::Result<Dictionary> {
let mut dict: Dictionary = HashMap::with_capacity_and_hasher(
100,
ahash::RandomState::default()
);
for line in read_lines(words_file)? {
let word = line?;
let key = word_to_number(&word);
let words = dict.entry(key).or_default();
words.push(word);
}
Ok(dict)
}
// The output is wrapped in a Result to allow matching on errors
// Returns an Iterator to the Reader of the lines of the file.
fn read_lines<P>(filename: P) -> io::Result<io::Lines<io::BufReader<File>>>
where P: AsRef<Path>, {
let file = File::open(filename)?;
Ok(io::BufReader::new(file).lines())
}
fn word_to_number(word: &str) -> Vec<u8> {
word.chars()
.filter(char::is_ascii_alphabetic)
.map(char_to_digit)
.map(|d| d + b'0')
.collect()
}
fn char_to_digit(ch: char) -> u8 {
match ch.to_ascii_lowercase() {
'e' => 0,
'j' | 'n' | 'q' => 1,
'r' | 'w' | 'x' => 2,
'd' | 's' | 'y' => 3,
'f' | 't' => 4,
'a' | 'm' => 5,
'c' | 'i' | 'v' => 6,
'b' | 'k' | 'u' => 7,
'l' | 'o' | 'p' => 8,
'g' | 'h' | 'z' => 9,
_ => panic!("invalid input: not a digit: {}", ch)
}
} This runs in 5.7s, which is 6× slower than 1s (clearly, computing additional hashes takes time), but still 10× faster than the original 1 minute (which also confirms that the version with |
You have good points but I think they are mostly of limited interest to the readers of my blog post. I think what you're suggesting is that Java may be much slower to solve the problems, but so much faster to print them that it still wins. I don't see how that matters at all if that's the case. The user of the program wants to use the results of the program, they don't care where the time is being spent. Feel free to leave a criticism of my blog post here if you still feel like I was unfair on my analysis. I don't think I was. The problem was: find and print all solutions. Now you're solving a different problem entirely which I think doesn't make sense. |
@bugaevc I agree that it does not make much sense to use black box, because you still need to output your results somewhere. You could try to push your results to a string (or a |
That's not at all what I'm saying. In fact, I don't think I have made any claims in this thread on relative performance of the different languages. All I've been doing is benchmarking different variants of the Rust implementation against each other, and suggesting that some changes may or may not make sense to do in the Java version too, based on what I have observed about the Rust version. My main finding has been that in the I might be biased by my background in competitive programming, but it's a very well known fact that outputting large amounts of data is slow. Problems that generate a huge number of data ("Find all possible xxx that...") typically require the solutions to only output the number of items found and not the items themselves; this is done specifically to ensure that the solution has to find all the valid items, but does not have to print them. If I was to publish a benchmark that was meant to compare the performance of a few solutions to an algorithmic problem but in fact turned out to mostly measure the time it takes to output the answers, I'd be greatly embarrassed, and would rush to publish a rebuttal & fix the benchmark. That is not to suggest that it never makes sense to benchmark output performance/throughput of some system; such benchmarks can also be useful for their own purposes. It's just that it sounds very wrong to me to claim that you're benchmarking some solution to some problem when in fact what you're measuring is how long it takes to output 2.8 Gb of data. But of course, it's your blog, you make the rules. If you say that
then OK, let's see how fast we can output the 2.8 Gb of data. Increasing the buffer size to a megabyte seems to help a lot, bringing down the run time to ≈29s. (Increasing the buffer size further doesn't seem to improve the run time further — meaning the time is already dominated by buffering and not the actual diff relative to 7f1e5eediff --git a/src/rust/phone_encoder/src/main.rs b/src/rust/phone_encoder/src/main.rs
index 9c755ea..f5b184f 100644
--- a/src/rust/phone_encoder/src/main.rs
+++ b/src/rust/phone_encoder/src/main.rs
@@ -1,7 +1,7 @@
use std::collections::HashMap;
use std::env::args;
use std::fs::File;
-use std::io::{self, BufRead};
+use std::io::{self, Write, BufRead, BufWriter, StdoutLock};
use std::path::Path;
use std::fmt::{self, Display, Formatter};
@@ -35,13 +35,20 @@ fn main() -> io::Result<()> {
let dict = load_dict(words_file)?;
+
+ let stdout = io::stdout();
+ let mut writer = BufWriter::with_capacity(
+ 1024 * 1024,
+ stdout.lock(),
+ );
+
for line in read_lines(input_file)? {
let num = line?;
let digits: Vec<u8> = num.chars()
.filter(char::is_ascii_digit)
.map(|ch| ch as u8)
.collect();
- print_translations(&num, &digits, &mut Vec::new(), &dict);
+ print_translations(&num, &digits, &mut Vec::new(), &dict, &mut writer);
}
Ok(())
}
@@ -51,9 +58,10 @@ fn print_translations<'a>(
digits: &[u8],
words: &mut Vec<WordOrDigit<'a>>,
dict: &'a Dictionary,
+ writer: &mut BufWriter<StdoutLock>,
) {
if digits.is_empty() {
- print_solution(num, &words);
+ print_solution(num, &words, writer);
return;
}
let mut found_word = false;
@@ -63,7 +71,7 @@ fn print_translations<'a>(
for word in found_words {
found_word = true;
words.push(WordOrDigit::Word(word));
- print_translations(num, rest_of_digits, words, dict);
+ print_translations(num, rest_of_digits, words, dict, writer);
words.pop();
}
}
@@ -78,25 +86,25 @@ fn print_translations<'a>(
if !last_is_digit {
let digit = digits[0] - b'0';
words.push(WordOrDigit::Digit(digit));
- print_translations(num, &digits[1..], words, dict);
+ print_translations(num, &digits[1..], words, dict, writer);
words.pop();
}
}
-fn print_solution(num: &str, words: &[WordOrDigit<'_>]) {
+fn print_solution(num: &str, words: &[WordOrDigit<'_>], writer: &mut BufWriter<StdoutLock>) {
// do a little gymnastics here to avoid allocating a big string just for printing it
- print!("{}", num);
+ write!(writer, "{}", num).unwrap();
if words.is_empty() {
- println!(":");
+ writeln!(writer, ":").unwrap();
return;
}
- print!(": ");
+ write!(writer, ": ").unwrap();
let (head, tail) = words.split_at(words.len() - 1);
for word in head {
- print!("{} ", word);
+ write!(writer, "{} ", word).unwrap();
}
for word in tail { // only last word in tail
- println!("{}", word);
+ writeln!(writer, "{}", word).unwrap();
}
}
That is exactly what I mean by buffering (except you don't have to hold off printing all the way till the very end, doing it a few times in the middle is fine), and as you can see this is also very slow compared to actually finding the solutions. If we are indeed interested in the combined time it takes to find the solutions and output them (which would be reasonable if the output part wasn't this disproportionally slow), then yes, this seems to be our best shot. |
Ooooo nice! Java's BigInt is also pretty simple and not at all nor designed to be fast, just accurate for business use but as this was a Rust thread I didn't bother mentioning a few things about Java that I noticed. Unlike Rust, Java doesn't have a good binding with gmp or other such faster bigint libaries because the JNI just has so much overhead and I've never seen a native-java library that focuses on gmp-like performance.
I like to read such conversations because it shows me how the more average people think and reason through things, it's a huge learning opportunity for me. I've very biased from working in C++ for 30+ years so I often can't really see things from a more new person perspective, hence why I like reading articles like yours and discussions like that. :-)
Yeah I had that fear that using it as a dynamic library (which rug does by default) would prevent a lot of inline optimizations that would be a huge help. I've never used ibig before, only found out about it because of this thread, may need to experiment sometime. ^.^ I really like that it got rid of the internal clone, I was thinking that would be one of the biggest speed hits of it and this does seem to corrolate considering its being created via
Oh hey, I quite enjoy reading your articles, awesome!
Java indeed does not have a DoS resistent hasher, Rust is one of the very very few languages I've ever run across that defaults to a DoS resistent hasher, but it's designed to be extremely simple to swap out the hasher as they know that, unlike most languages, Rust programmers often like to optimize for speed, but they want the default to be 'safe' unlike other languages hashers. ^.^
Java doesn't have a good black_box equivilant function though, I've tried very hard and found nothing that is reliable. Even some supposed JVM-specific calls have failed and been optimized out...
Java does have faster hashmaps yes, but it's default hashmap is not DoS safe, so it should in general be faster than DoS resistent hashers regardless. But yeah, using the same hashing algorithm wouldn't be a bad idea, right now Rust and Java (and CL) all use different algorithms.
I'm curious about that as well, but the black_box usage does look correct, and if using nightly rust the black_box should work unless there's been a (very unlikely) regression in LLVM... Hmm, let me run that above code (hyperfine again as before, the pr1 version is as before): Summary
'./phone_encoder dictionary.txt phones_1000.txt' ran
1.32 ± 0.34 times faster than './phone_encoder_pr1 dictionary.txt phones_1000.txt'
1.66 ± 0.38 times faster than 'sbcl --script src/lisp/main.lisp dictionary.txt phones_1000.txt'
6.66 ± 1.25 times faster than 'java -cp build/java Main dictionary.txt phones_1000.txt'
7.77 ± 1.49 times faster than 'java -cp build/java Main2 dictionary.txt phones_1000.txt'
Summary
'./phone_encoder dictionary.txt phones_10_000.txt' ran
1.29 ± 0.26 times faster than './phone_encoder_pr1 dictionary.txt phones_10_000.txt'
2.21 ± 0.39 times faster than 'sbcl --script src/lisp/main.lisp dictionary.txt phones_10_000.txt'
5.68 ± 0.99 times faster than 'java -cp build/java Main dictionary.txt phones_10_000.txt'
7.34 ± 1.22 times faster than 'java -cp build/java Main2 dictionary.txt phones_10_000.txt'
Summary
'./phone_encoder dictionary.txt phones_10_000.txt' ran
1.29 ± 0.26 times faster than './phone_encoder_pr1 dictionary.txt phones_10_000.txt'
2.21 ± 0.39 times faster than 'sbcl --script src/lisp/main.lisp dictionary.txt phones_10_000.txt'
5.68 ± 0.99 times faster than 'java -cp build/java Main dictionary.txt phones_10_000.txt'
7.34 ± 1.22 times faster than 'java -cp build/java Main2 dictionary.txt phones_10_000.txt'
Summary
'./phone_encoder dictionary.txt phones_100_000_with_empty.txt' ran
1.91 ± 0.11 times faster than './phone_encoder_pr1 dictionary.txt phones_100_000_with_empty.txt'
3.69 ± 0.28 times faster than 'java -cp build/java Main dictionary.txt phones_100_000_with_empty.txt'
4.50 ± 0.22 times faster than 'sbcl --script src/lisp/main.lisp dictionary.txt phones_100_000_with_empty.txt'
8.47 ± 0.43 times faster than 'java -cp build/java Main2 dictionary.txt phones_100_000_with_empty.txt' I'm not sure I'm seeing the minute sized difference, here it is with 300k entries: > time ./phone_encoder_pr1 dictionary.txt <(cat ./phones_50000.txt ./phones_50000.txt ./phones_50000.txt ./phones_50000.txt ./phones_50000.txt ./phones_50000.txt) >/dev/null
./phone_encoder_pr1 dictionary.txt > /dev/null 1.27s user 0.03s system 99% cpu 1.299 total
> time ./phone_encoder dictionary.txt <(cat ./phones_50000.txt ./phones_50000.txt ./phones_50000.txt ./phones_50000.txt ./phones_50000.txt ./phones_50000.txt) >/dev/null
./phone_encoder dictionary.txt > /dev/null 0.65s user 0.00s system 99% cpu 0.657 total It's faster than the PR1 as given but not magnitudes slower, so it seems like black_box is working correctly? Looking at the generated ASM (via cargo-asm) they also look expectedly similar except for the actual missing calls to the output stream.
Exactly yeah.
Not reliable, it's one I've tried and I've had it optimized out in many cases... :-( It is the 'most' reliable I've found though since it seems to work sometimes and I haven't found a better one yet short of just outputting data to a JNI call (which has its own overhead, but is doable if you want an actual reliable one). Although... you could just pass the entire calculated data to a single JNI call to black hole the entire thing? It won't be serialized or anything, it's direct memory access over the JNI call.
I do agree, printing should still be done, just need to be sure in every language to lock stdout and buffer the output to try to reduce its overhead as much as possible. I'm not sure I'd start touching threads in programs, the way they synchronize can be extremely different (and honestly Rust would have an unnatural advantage here I would think...). I do see the advantage to only benchmarking the data calculations, but that starts getting into interesting things that are hard to resolve between native and JIT languages. Like if it were all native then you could make dynamic libaries between all and have a program benchmark across them, but instead with java and such you need to do something like trust that they benchmarking themselves accurately and report the results correctly all without optimizing huge chunks out. This could be minimized perhaps by, say, using jni in a rust benchmarker or so to dynamically link in native libaries and use JNI to call across to the java version and so forth and that overhead would be barely measurable (and you can indeed measure it if you so wish), that would prevent things from being optimized out at that point and allow it to be pretty easily reproducible across them all, but that's getting into some more work (although less than one might initially think). But yeah, in general benchmarking is really really really really hard to get right, that's why I dive into the assembly when I benchmark things so often (if anyone recognizes my benchmarking things randomly around the 'net), that way to compare the actual generated machine code, but you have to make sure you are testing what you actually want to test (generated code with identical algorithms? generated code from 'traditionally programmed' code for the given language? Etc...). |
The specific benchmark I have been talking about in the few last posts is
The others output a lot less data, which is why the effect is less pronounced and swapping the output for black box doesn't improve timing as much. |
As an aside, I wonder how the rust code would look when written in a more, at least what I consider, a traditionally rust style, with iterators and so forth rather than return variables, hmmm... I can't really see much if any performance advantage except perhaps in one place, but it would definitely read more clean (in my opinion anyway, but I'm a fan of functional styles)...
Even still it doesn't seem to be that huge, though let me duplicate it to 50 million: > rm -f ./phones_50m.txt; for (( i=0; i<1000; ++i)); do cat ./phones_50000.txt >> ./phones_50m.txt; done
> wc phones_50m.txt # It is indeed 50 million lines, though each line has 1000 duplicates each split by 50k lines
50000000 50000000 1328570000 phones_50m.txt And run those again... > time ./phone_encoder dictionary.txt phones_50m.txt > /dev/null
./phone_encoder dictionary.txt phones_50m.txt > /dev/null 96.31s user 0.28s system 99% cpu 1:36.59 total
? time ./phone_encoder_pr1 dictionary.txt phones_50m.txt > /dev/null
./phone_encoder_pr1 dictionary.txt phones_50m.txt > /dev/null 199.11s user 2.38s system 99% cpu 3:21.50 total So yeah, that took a very long time to run but the black box seems to be working fine, no multiple magnitudes of difference, was twice as fast as the base PR1 version. :-) |
You seem to have missed the |
The only > wc dictionary.txt
73113 73113 938582 dictionary.txt
> wc tests/words.txt
22 23 103 tests/words.txt
> find . -name words.txt
./tests/words.txt Though running it with that file instead of dictionary results in: > time ./phone_encoder tests/words.txt phones_50m.txt > /dev/null
./phone_encoder tests/words.txt phones_50m.txt > /dev/null 36.53s user 0.16s system 99% cpu 36.688 total
> time ./phone_encoder_pr1 tests/words.txt phones_50m.txt > /dev/null
./phone_encoder_pr1 tests/words.txt phones_50m.txt > /dev/null 72.81s user 0.50s system 99% cpu 1:13.32 total So it is still the same roughly double speed of pr1. Or did you generate a new |
prechelt-phone-number-encoding/benchmark.sh Lines 11 to 14 in 7f1e5ee
prechelt-phone-number-encoding/download_english_words.sh Lines 1 to 6 in 7f1e5ee
|
I don't have that in Ah yep, very recently. Cool, grabbed it and other updates: > curl https://raw.githubusercontent.com/dwyl/english-words/master/words.txt | tail -n 100000 > words.txt And I can't get anything to run in a reasonable time (waited at least 10 minutes) on any of the programs... > head -n 50000 words.txt > words-50k.txt And thus with 50k lines of words: > time sbcl --script src/lisp/main.lisp words-50k.txt phones_2000.txt >/dev/null
sbcl --script src/lisp/main.lisp words-50k.txt phones_2000.txt > /dev/null 26.94s user 0.12s system 99% cpu 27.056 total
> time ./phone_encoder_pr1 words-50k.txt phones_2000.txt >/dev/null
./phone_encoder_pr1 words-50k.txt phones_2000.txt > /dev/null 12.79s user 0.12s system 99% cpu 12.903 tota
> time ./phone_encoder words-50k.txt phones_2000.txt >/dev/null
./phone_encoder words-50k.txt phones_2000.txt > /dev/null 4.71s user 0.01s system 99% cpu 4.727 total
> time java -cp build/java Main words-50k.txt phones_2000.txt >/dev/null
java -cp build/java Main words-50k.txt phones_2000.txt > /dev/null 5.51s user 0.82s system 144% cpu 4.393 total
> time java -cp build/java Main2 words-50k.txt phones_2000.txt >/dev/null
java -cp build/java Main2 words-50k.txt phones_2000.txt > /dev/null 34.23s user 1.03s system 107% cpu 32.880 total And these still seem reasonable. I'm guessing some really bad hashing is happening in all languages or something as all of their running times start going exponentially up with a bigger words.txt file (adding another 10k lines to 60k increases the running time by ~10 times, and adding another 10k I wasn't able to complete anything in less than 10 minutes). Nice to see java's |
I wonder if it would run in a reasonable time for you with the black box. |
The EDIT: The PR1 one is this PR's code. |
Also, just for completion, my rust version is: |
So I had some time so I took a look at 'what' is so slow. Well [src/main.rs:43] ("NewSlow", took, num) = (
"NewSlow",
3.122004442s,
"364943-7643536938/--33734-89290451119/2875-949180",
) So yeah, that's incredibly slow for just a single thing lookup, but considering it's exponential with lots of loops for every single character it seems expected. The order of the slowest ones relate directly to the length of the phones_2000.txt line, so that's expected as well. I notice the new benchmark script generates the random 2000 numbers via |
A lot of the slowness comes from using After that, It'd be interesting to see the difference between an iterative vs a recursive approach. |
Great find!
Does the black box cut down the time significantly on your machine? |
DIdn't test with the black_box yet. |
Oh, for clarity I was testing on the black_box'd version, so there was no string formatting cost or flushing cost. The formatting and flushing adds a linear cost, but that's not the exponential cost by far.
On mine it did a little bit, that's basically the difference between the |
As I understand it, you weren't running the full |
Actually I ended up making a PFH version that used the full 100k words.txt version, but I was benchmarking and profiling with the 50k version since it would actually run to completion in less than 10 minutes. I'm still unsure why it's taking so long on this system compared to your timings above, this is not a low end server by any stretch, an AMD Ryzen 7 1700 8/16 with 32gigs of DDR4 3000 ram. The file is running off the HDD and not the SSD but I timed the I/O and it was barely registerable compared to the rest so it didn't seem related. |
For note, the only part about it that takes so long is the |
I finally got the time to write an extensive blog post about this: https://renato.athaydes.com/posts/how-to-write-slow-rust-code-part-2.html Let me know what you think! Summary: yes, you were correct: Rust runs twice as fast as Java when printouts do not become a bottleneck. Java only wins when printouts are the bottleneck, which is something less interesting than actually finding solutions. |
Nice, thanks for the update!
I, for one, think that it's cool that you've included the benchmarks with a trick to reduce output, like I suggested, although you have chosen a different way to do it than what I proposed. Then, you've said:
You're presenting this as a problem that no solution has been suggested for. I think it would be appropriate to mention the black box here, even if you're not convinced it works as I claim it does. For instance, something like the following would make sense to me: "@bugaevc has suggested that Rust's Finally, I hope the "rant" at the end was not directed at me. I never said or implied such things about you 🙂 |
@bugaevc sorry I didn't mention your suggestion :) but I did mention there were many "other ways" to do it and I just did something that I was confident myself that would work well.
No, not you. My post generated a lot of heated conversations, much worse than this one :D on HackerNews specially, of all places, people just didn't bother reading most of what I had to say and just trashed me and my code. |
Hello!
I've done some optimizations on the Rust implementation.
On a data set consisting of 500 000 phone numbers, the running time went from 7.167 s ± 0.239 s down to 2.237 s ± 0.033 s (as measured with Hyperfine — I didn't bother to port
benchmark_runner
to work on non-macOS).The largest speed-up is from dropping
BigUint
s, which are slow. Please see individual commits for details.I also tried to make the code a little more idiomatic Rust, but there's still a long way to go.