bytes for loop performance #1663

awelzel · 2024-01-25T14:10:37Z

In the context of zeek/zeek#3580 it was observed that a for loop over a bytes object to XOR-decode the content takes ~30% of the total runtime in that setting. This ticket is meant as a generic "speed up bytes iteration".

The test iterates 500k times over a 134-byte instance in %init. The SPICY_THREADED=1 version creates a short-lived thread in spicy-driver's main to keep glibc from using its optimized, non-atomic shared_ptr implementation (#1600). This better reflects performance within an application like Zeek, and it makes performance significantly worse. Python is added as a baseline and executes significantly faster.

Not sure how relevant this is; for-looping over large bytes might not be a common use case.

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `echo "x" \| ../build/bin/spicy-driver loop.hlto` | 4.922 ± 0.022 | 4.903 | 4.945 | 1.76 ± 0.01 |
| `echo "x" \| SPICY_THREADED=1 ../build/bin/spicy-driver loop.hlto` | 7.147 ± 0.015 | 7.132 | 7.161 | 2.56 ± 0.01 |
| `python3 loop.py` | 2.791 ± 0.007 | 2.783 | 2.797 | 1.00 |

The non-dwarf flamegraph (suspect std::shared_ptr above the flat bars like in zeek/zeek#3580).

# loop.spicy
module Loop;

global xs: bytes = b"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";

public type X = unit {

 on %init {
  local i = 0;
  local sum  = 0;
  while ( i < 500000 ) {
    for ( c in xs )
      sum += c;
    i++;
  }
  print "sum", sum;
 }
};

# loop.py
xs = b"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"


def main():
    i = 0
    s = 0
    while i < 500000:
        for c in xs:
            s += c
        i += 1

    print("sum", s)


if __name__ == "__main__":
    main()


bbannier · 2024-01-25T15:13:46Z

I think the more general issue is that iterating over collections of known size can be expensive since we always emit checks at runtime. In general this makes sense, since even with a range-based for loop the underlying collection could be changed under the covers (e.g., imagine if every second iteration emptied the collection).
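To illustrate why the per-step check matters, here is a minimal Python sketch (an analogy only, not Spicy or HILTI runtime code; the function name `count_visited` is made up for this example). The loop body empties the collection partway through, so the iterator has to re-check its bounds on every step to stop safely:

```python
def count_visited(data: bytearray) -> int:
    """Count how many elements the loop visits when the body
    empties the collection during the second iteration."""
    visited = 0
    for _ in data:
        visited += 1
        if visited == 2:
            # Mutating the collection mid-loop: an iterator that had
            # cached the original length would now read out of bounds.
            data.clear()
    return visited


if __name__ == "__main__":
    buf = bytearray(b"AAAAAAAA")
    # The iterator re-checks the length each step, so the loop stops early.
    print(count_visited(buf), len(buf))
```

An iterator that skipped the check would be faster, but it would only be safe if nothing in the loop body (or anything it calls) could change the collection.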

To be faster we would need a way to detect that the collection itself is not modified. This is non-trivial to do automatically, since we would have to make sure that none of the called code modifies the collection (e.g., changes element order or removes elements); this could even happen in called functions (which need not be implemented in Spicy, so we might be unable to inspect their bodies). I think we talked in the past about collection methods like transform or for_each which would either compute a new collection or modify collection elements in place, but the function argument (or hypothetical closure, since Spicy currently doesn't have closures) could still modify the collection itself, so I fear that approach would still not allow faster iteration. It looks like the only way to accomplish this is a very global data flow analysis.

If such iteration is a bottleneck for you right now, you could implement a custom C++ function which uses unchecked iteration, after making sure you can uphold the invariants that requires.
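The shape of that workaround can be sketched in Python as a rough analogy (this is not the Spicy or HILTI C++ API; `xor_decode` and its parameters are hypothetical names for illustration). The invariant is established once, by taking an immutable snapshot, and the loop then runs over data that cannot change underneath it:

```python
def xor_decode(data, key: int) -> bytes:
    """XOR every byte of `data` with a single-byte `key`.

    Hypothetical helper: the immutable snapshot stands in for
    "making sure the collection is not modified during iteration".
    """
    if not 0 <= key <= 0xFF:
        raise ValueError("key must fit in one byte")
    snapshot = bytes(data)  # immutable copy: later edits to `data` cannot affect the loop
    return bytes(b ^ key for b in snapshot)


if __name__ == "__main__":
    encoded = xor_decode(b"hello", 0x5A)
    # XOR with the same key is its own inverse.
    print(xor_decode(encoded, 0x5A))
```

In a real C++ helper the snapshot would presumably be unnecessary if the function is the only code touching the data while it loops; the copy here is just the simplest way to make the invariant explicit.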

rsmmr added Priority Low Optimizer labels Jun 7, 2024

awelzel mentioned this issue Oct 28, 2024

extractBytesLiteral overhead due to Bytes parameter (and shared_ptr usage) #1910

Closed
