Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extractBytesLiteral overhead due to Bytes parameter (and shared_ptr usage) #1910

Closed
awelzel opened this issue Oct 28, 2024 · 3 comments
Closed
Assignees

Comments

@awelzel
Copy link
Contributor

awelzel commented Oct 28, 2024

As discussed on #1909, here's a microbenchmark showing that parsing a bytes literal becomes significantly slower when the literal's size exceeds the threshold for short-string-optimization (15 with clang/gcc on my system). Essentially, there's a std::string initialization in a potentially hot path (any unit containing a bytes literal hits this path). Eve with SSO in effect for small literals, there's overhead due to copying the characters into the Bytes (std::string).

It seems a call to extractBytesLiteral should either pass a string_view for the literal to avoid the initialization of the Bytes instance, or take a reference to a statically allocated Bytes instance to avoid this unneeded overhead.

Summary of the examples: test-20.spicy should parse at least as fast as test-10-10.spicy, not 1.13x slower. Note also they all use skip, so there's even less reason for the difference.

Starting with the following module, test-10 4*10 bytes.

# test-10-10.spicy
module literals;

type Message = unit {
    lit1: skip b"0000000000";
    rem1: skip b"0000000000";
    lit2: skip b"0000000000";
    rem2: skip b"0000000000";
};

public type Messages = unit {
    : Message[];
};

Moving to the following two variants, parsing 2 * (16 + 4) and 2*20 bytes

# test-16-4.spicy
module literals;

type Message = unit {
    lit1: skip b"0000000000000000";
    rem1: skip b"0000";
    lit2: skip b"0000000000000000";
    rem2: skip b"0000";
};

public type Messages = unit {
    : Message[];
};

And further

# test-20.spicy
module literals;

type Message = unit {
    lit1: skip b"00000000000000000000";
    lit2: skip b"00000000000000000000";
};

public type Messages = unit {
    : Message[];
};
$ ./run.sh 
+ spicyc -j test-10-10.spicy -o test-10-10.hlto
+ spicyc -j test-16-4.spicy -o test-16-4.hlto
+ spicyc -j test-20.spicy -o test-20.hlto
+ python3 -c 'print("0000000000" * 4096 * 512, end="")'
+ hyperfine -r 5 -L hlto test-10-10.hlto,test-16-4.hlto,test-20.hlto 'spicy-driver -f test.data {hlto}'
Benchmark 1: spicy-driver -f test.data test-10-10.hlto
  Time (mean ± σ):      3.209 s ±  0.022 s    [User: 3.177 s, System: 0.017 s]
  Range (min … max):    3.179 s …  3.229 s    5 runs
 
Benchmark 2: spicy-driver -f test.data test-16-4.hlto
  Time (mean ± σ):      3.602 s ±  0.005 s    [User: 3.564 s, System: 0.020 s]
  Range (min … max):    3.594 s …  3.606 s    5 runs
 
Benchmark 3: spicy-driver -f test.data test-20.hlto
  Time (mean ± σ):      3.637 s ±  0.027 s    [User: 3.608 s, System: 0.015 s]
  Range (min … max):    3.621 s …  3.685 s    5 runs
 
Summary
  spicy-driver -f test.data test-10-10.hlto ran
    1.12 ± 0.01 times faster than spicy-driver -f test.data test-16-4.hlto
    1.13 ± 0.01 times faster than spicy-driver -f test.data test-20.hlto

Runner:

#!/bin/bash
set -eux

spicyc -j test-10-10.spicy -o test-10-10.hlto
spicyc -j test-16-4.spicy -o test-16-4.hlto
spicyc -j test-20.spicy -o test-20.hlto
python3 -c 'print("0000000000" * 4096 * 512, end="")' > test.data

hyperfine -r 5 -L hlto test-10-10.hlto,test-16-4.hlto,test-20.hlto 'spicy-driver -f test.data {hlto}'
@awelzel
Copy link
Contributor Author

awelzel commented Oct 28, 2024

Uh.. @rsmmr - let me just piggy back. Taking a flamegraph of 10-10.spicy, the View::startsWith() implementation and Bytes::operator!= is full with shared_ptr overhead. The bigger bang for the buck will likely come from not instantiating shared ptrs (which I believe happens for every byte being compared - similar to #1663 )

10-10

@awelzel awelzel changed the title extractBytesLiteral overhead due to Bytes parameter extractBytesLiteral overhead due to Bytes parameter (and shared_ptr usage) Oct 28, 2024
@awelzel
Copy link
Contributor Author

awelzel commented Oct 28, 2024

Uhm, the following patch improves performance of this particularly reproducer ~3-4x. Doesn't seem like there's anything wrong with it, is there? The issue with the extra Bytes allocation continuous to exists, but this should fix some of the obvious shared_ptr overhead.

--- a/hilti/runtime/src/types/stream.cc
+++ b/hilti/runtime/src/types/stream.cc
@@ -413,8 +413,8 @@ bool View::startsWith(const Bytes& b) const {
     _ensureValid();
     auto s1 = unsafeBegin();
     auto e1 = unsafeEnd();
-    auto s2 = b.begin();
-    auto e2 = b.end();
+    auto s2 = b.str().begin();
+    auto e2 = b.str().end();
 
     // If the iterator has no data in it, we cannot dereference it.
     if ( isEmpty() )
Benchmark 1: spicy-driver -f test.data test-10-10.hlto
  Time (mean ± σ):     843.7 ms ±  10.6 ms    [User: 830.4 ms, System: 11.8 ms]
  Range (min … max):   831.6 ms … 858.5 ms    5 runs
 
Benchmark 2: spicy-driver -f test.data test-16-4.hlto
  Time (mean ± σ):      1.036 s ±  0.005 s    [User: 1.024 s, System: 0.011 s]
  Range (min … max):    1.031 s …  1.045 s    5 runs
 
Benchmark 3: spicy-driver -f test.data test-20.hlto
  Time (mean ± σ):      1.029 s ±  0.006 s    [User: 1.016 s, System: 0.011 s]
  Range (min … max):    1.024 s …  1.038 s    5 runs
 
Summary
  spicy-driver -f test.data test-10-10.hlto ran
    1.22 ± 0.02 times faster than spicy-driver -f test.data test-20.hlto
    1.23 ± 0.02 times faster than spicy-driver -f test.data test-16-4.hlto

@rsmmr
Copy link
Member

rsmmr commented Oct 29, 2024

-    auto s2 = b.begin();
-    auto e2 = b.end();
+    auto s2 = b.str().begin();
+    auto e2 = b.str().end();

Oh, good catch! I'll play a bit that, seems to make quite a difference indeed.

@rsmmr rsmmr self-assigned this Oct 29, 2024
rsmmr added a commit that referenced this issue Oct 29, 2024
We now create the bytes instance representing the literal as a global
singleton to avoid instantiating it over and over again.

Closes #1910.
rsmmr added a commit that referenced this issue Oct 30, 2024
We now create the bytes instance representing the literal as a global
singleton to avoid instantiating it over and over again.

Closes #1910.
rsmmr added a commit that referenced this issue Oct 31, 2024
We now create the bytes instance representing the literal as a global
singleton to avoid instantiating it over and over again.

Closes #1910.
@rsmmr rsmmr closed this as completed in d8f8b41 Nov 6, 2024
rsmmr added a commit that referenced this issue Nov 6, 2024
* origin/topic/robin/gh-1910-parse-literal:
  Replace our poor hash function with `std::hash()`.
  Add infrastructure to create and cache global constants.
  Polish code a bit for readability.
  Avoid redundant computation during literal parsing.
  Optimize parsing of literal bytes.
  Introduce `UnsafeConstIterator` for bytes instances.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants