
[preview] String parsing improvements #314

Draft: wants to merge 19 commits into master

Conversation

@jrudolph (Member) commented Jul 29, 2019

String parsing dominates parsing times so far for real-world input (that isn't particularly number-heavy). So here's a first attempt at improving performance.

The speed-ups mostly come from

  • no explicit bounds checking
  • removing tricks
  • some code reorganization

Before:

GithubIssuesJsonAST.readCirce                github-akka-issues.json 2465.716 ± 191.574  ops/s
GithubIssuesJsonAST.readCirce                  big-string-array.json 3375.916 ± 317.283  ops/s
GithubIssuesJsonAST.readPlayJson             github-akka-issues.json 2002.679 ± 159.931  ops/s
GithubIssuesJsonAST.readPlayJson               big-string-array.json 2790.919 ±  59.378  ops/s

GithubIssuesJsonAST.readSprayJsonFromBytes   github-akka-issues.json 1048.052 ±  12.170  ops/s
GithubIssuesJsonAST.readSprayJsonFromBytes     big-string-array.json 1353.650 ±  27.423  ops/s
GithubIssuesJsonAST.readSprayJsonFromString  github-akka-issues.json 1202.670 ±  28.346  ops/s
GithubIssuesJsonAST.readSprayJsonFromString    big-string-array.json 1507.546 ±  19.228  ops/s

GithubIssuesCaseClassReading.readCircePartial           2074.334 ± 463.204  ops/s
GithubIssuesCaseClassReading.readJacksonPartial         3393.928 ± 166.386  ops/s
GithubIssuesCaseClassReading.readPlayJsonPartial        1883.332 ± 182.801  ops/s

GithubIssuesCaseClassReading.readSprayJsonViaASTPartial 1141.810 ± 3.432  ops/s

After:

GithubIssuesJsonAST.readSprayJsonFromBytes   github-akka-issues.json 2321.173 ±  74.728  ops/s
GithubIssuesJsonAST.readSprayJsonFromBytes     big-string-array.json 4250.720 ± 187.668  ops/s
GithubIssuesJsonAST.readSprayJsonFromString  github-akka-issues.json 2181.839 ±  60.334  ops/s
GithubIssuesJsonAST.readSprayJsonFromString    big-string-array.json 3447.510 ±  27.173  ops/s

GithubIssuesCaseClassReading.readSprayJsonViaASTPartial  2142.250 ± 99.059  ops/s

I.e. about 2x-3x faster for byte input: string parsing itself has been improved by ~3x, which leads to a 2x improvement for the overall test case (an 88k JSON document from the GitHub API).

As ParserInput needs to be changed, the change won't simply be binary compatible. I'll probably deprecate the current parser infrastructure and complement it with simpler entry points that don't leak as many implementation details for the common cases.

import io.circe.generic.semiauto._
import io.circe.parser._
implicit def circeUserFormat: Decoder[User] = deriveDecoder
implicit def circeLabelFormat: Decoder[LabelsArrayElement] = deriveDecoder


@jrudolph please consider caching the derived decoders in vals for better performance
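A minimal sketch of the suggested change, assuming hypothetical field sets for the benchmark's User and LabelsArrayElement case classes (not the PR's actual model):

import io.circe.Decoder
import io.circe.generic.semiauto.deriveDecoder

// Hypothetical stand-ins for the benchmark's model classes.
case class User(login: String, id: Long)
case class LabelsArrayElement(name: String, color: String)

object CirceCodecs {
  // A `val` derives each decoder once at initialization; an uncached `implicit def`
  // re-runs the derivation on every implicit lookup.
  implicit val circeUserFormat: Decoder[User] = deriveDecoder
  implicit val circeLabelFormat: Decoder[LabelsArrayElement] = deriveDecoder
}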

input.setCursor(cursor) // now at end '"'
advance()
new String(charBuffer, 0, charsRead)
} else if (((b - 32) ^ 60) <= 0) // trick from jsoniter-scala


It is a check for the escape character and for characters that are not allowed to be represented as-is without escaping: b < ' ' || b == '\\'.

jrudolph (Member, Author)


Yep, I know, pretty clever :) I printed out the complete mapping between inputs and outputs to see where it was going. It seems it didn't help much here in the end.
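For reference, a quick self-contained way to check that mapping (my own sketch, not code from the PR), assuming b holds the raw byte value:

// Brute-force check that ((b - 32) ^ 60) <= 0 is equivalent to (b < ' ' || b == '\\')
// over the whole byte range, i.e. it flags control characters and the backslash.
object TrickCheck extends App {
  def trick(b: Int): Boolean = ((b - 32) ^ 60) <= 0
  def plain(b: Int): Boolean = b < ' ' || b == '\\'

  val mismatches = (0 to 255).filter(b => trick(b) != plain(b))
  println(if (mismatches.isEmpty) "equivalent for all byte values 0-255" else s"differs at: ${mismatches.mkString(", ")}")
}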

} else if (b == '\\') {
val skip = escaped(cursor, charsRead)
parseOneSlow(cursor + skip, charsRead + 1)
} else if (input.needsDecoding && b >= 128) {


missing check for control characters that must be escaped

@sirthias (Member)

Awesome stuff, @jrudolph!
These are great improvements!!
The old "tricks" really don't pull their weight, especially on the newer JVMs.

@sirthias (Member)

One thing that might be worthwhile exploring to speed up string parsing even more is unrolling the parseOneFast loop to handle 4 characters in a single loop iteration (see the sketch below).
I could imagine that we get better instruction-level parallelism this way.

The beginning of this blog post here: https://chadaustin.me/2017/05/writing-a-really-really-fast-json-parser/ discusses this effect for JSON parsing.

In the end it's all about reducing the average parsing cycles per character, so any parallelization that we can get will help.

Another thing could be something similar to this line of borer's JSON parser:
https://github.com/sirthias/borer/blob/master/core/src/main/scala/io/bullet/borer/json/JsonParser.scala#L417

Decoupling the character copying from the rest of the loop had an incredible effect on performance in borer's case. I suspect that's because the CPU can then freely interleave those instructions with everything else and make use of otherwise unused silicon capacity, so a lot of the work essentially moves off the critical path and costs "nothing" anymore...
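To make the 4-characters-per-iteration idea concrete, here is a rough, self-contained sketch of a manually unrolled scan loop (my own illustration over a plain byte array, not borer's or this PR's code; isPlain is an assumed helper):

object UnrolledScan {
  // "Plain" bytes need no escaping, decoding or end-of-string handling.
  def isPlain(b: Byte): Boolean = b >= 0x20 && b != '"'.toByte && b != '\\'.toByte

  // Advance the cursor past plain bytes, handling 4 input bytes per iteration.
  @annotation.tailrec
  def scan(input: Array[Byte], cursor: Int, end: Int): Int =
    if (cursor + 4 <= end &&
        isPlain(input(cursor)) && isPlain(input(cursor + 1)) &&
        isPlain(input(cursor + 2)) && isPlain(input(cursor + 3)))
      scan(input, cursor + 4, end)   // all four bytes were plain: skip them in one go
    else if (cursor < end && isPlain(input(cursor)))
      scan(input, cursor + 1, end)   // tail case: fall back to one byte at a time
    else
      cursor                         // first byte that needs special handling (or end of input)

  def main(args: Array[String]): Unit = {
    val bytes = """plain text \ follows""".getBytes("UTF-8")
    println(scan(bytes, 0, bytes.length)) // prints the index of the backslash
  }
}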

@plokhotnyuk

@sirthias I have found that C2 can do the unrolling for some simple loops on its own; in my case only GraalVM required manual unrolling.

@sirthias (Member)

Yeah, the JVM might be able to do it itself.
But that is hard to control, and small changes might break the JVM's ability to do it.
So manual unrolling for this very hot code path appears to be worth the effort.

@plokhotnyuk

Here are throughput scores of benchmarks which parse a string of 128 ASCII chars on the same Intel Core i7 CPU but on different versions of JDK and GraalVM:
[chart: throughput scores across JDK and GraalVM versions]

It is much harder to keep them high on different hardware.

@jrudolph (Member, Author) commented Jul 30, 2019

One thing that might be worthwhile exploring to speed up string parsing even more is unrolling the parseOneFast loop to handle 4 characters in a single loop iteration.
I could imagine that we get better instruction-level parallelism this way.

Tried that in different variations but no positive results so far.

Decoupling the character copying from the rest of the loop had an incredible effect on performance in borer's case.

Tried that as well in different versions, but I didn't find any clear wins there either. What did help was moving the copying of the single character out of the branches.
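Roughly, the shape of that change (my own simplified sketch over a plain String, without escape or unicode handling, not the PR's parser):

object HoistedCopy {
  def parseStringBody(input: String, start: Int): String = {
    val charBuffer = new Array[Char](input.length)

    @annotation.tailrec
    def rec(cursor: Int, charsRead: Int): Int = {
      val c = input.charAt(cursor)
      charBuffer(charsRead) = c           // unconditional copy, hoisted out of the branches
      if (c == '"') charsRead             // closing quote: the copied char is simply not counted
      else rec(cursor + 1, charsRead + 1) // plain character; escapes and unicode omitted here
    }

    new String(charBuffer, 0, rec(start, 0))
  }

  def main(args: Array[String]): Unit =
    println(parseStringBody("hello\" rest", 0)) // prints: hello
}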

I also experimented with using table lookups which also didn't help :)

Regarding string parsing benchmarks in particular, I found that the biggest factor biasing the result is the size of the internal buffer for collecting the characters. If every parser run reallocates that buffer, you only get maximum performance in the benchmark if the buffer is already sized just big enough for the resulting string. You can probably improve a lot in these benchmarks when you bring your own buffer.

There's also the problem that the JIT often doesn't end up in the same state: I've often seen ±20% between runs, with 2 or 3 different steady states across different runs.

Here are the benchmark results from jsoniter-scala when parsing 1000 bytes with a 1024 char buffer:

# VM version: JDK 1.8.0_212, OpenJDK 64-Bit Server VM, 25.212-b04
# VM invoker: /usr/lib/jvm/adoptopenjdk-8-jdk-hotspot/jre/bin/java
# VM options: -server -Xms2g -Xmx2g -XX:NewSize=1g -XX:MaxNewSize=1g -XX:InitialCodeCacheSize=512m -XX:ReservedCodeCacheSize=512m -XX:+UseParallelGC -XX:-UseBiasedLocking -XX:+AlwaysPreTouch

Benchmark                                      (size)      Score       Error  Units
StringOfAsciiCharsReading.avSystemGenCodec       1000 420377.975 ±   3314.706  ops/s
StringOfAsciiCharsReading.borerJson              1000 374747.750 ±   5469.807  ops/s
StringOfAsciiCharsReading.circe                  1000 296227.975 ±   1608.166  ops/s
StringOfAsciiCharsReading.dslJsonScala           1000 747348.589 ±  14188.010  ops/s
StringOfAsciiCharsReading.jacksonScala           1000 668431.496 ±  10057.838  ops/s
StringOfAsciiCharsReading.jsoniterJava           1000 663563.884 ±   4021.098  ops/s
StringOfAsciiCharsReading.jsoniterScala          1000 799233.199 ±   3866.261  ops/s
StringOfAsciiCharsReading.playJson               1000 697148.819 ±  13384.538  ops/s
StringOfAsciiCharsReading.scalikeJackson         1000 383223.384 ±   1653.179  ops/s
StringOfAsciiCharsReading.sprayJson              1000 516299.306 ±  12754.457  ops/s
StringOfAsciiCharsReading.uPickle                1000 396939.986 ±   8720.495  ops/s

StringOfEscapedCharsReading.avSystemGenCodec     1000  40351.801 ±  2746.862  ops/s
StringOfEscapedCharsReading.borerJson            1000  56583.767 ±  3943.939  ops/s
StringOfEscapedCharsReading.circe                1000  33025.217 ±  1741.437  ops/s
StringOfEscapedCharsReading.dslJsonScala         1000  71626.399 ±  2489.666  ops/s
StringOfEscapedCharsReading.jacksonScala         1000  60839.264 ±   687.483  ops/s
StringOfEscapedCharsReading.jsoniterJava         1000  99722.853 ± 14966.104  ops/s
StringOfEscapedCharsReading.jsoniterScala        1000 185767.529 ±  4066.498  ops/s
StringOfEscapedCharsReading.playJson             1000  59031.123 ±  6318.695  ops/s
StringOfEscapedCharsReading.scalikeJackson       1000  49066.981 ±   810.482  ops/s
StringOfEscapedCharsReading.sprayJson            1000  81830.164 ±  3713.948  ops/s
StringOfEscapedCharsReading.uPickle              1000  34547.016 ±   730.916  ops/s

StringOfNonAsciiCharsReading.avSystemGenCodec    1000 126835.981 ±  5900.099  ops/s
StringOfNonAsciiCharsReading.borerJson           1000  81477.660 ±  2066.403  ops/s
StringOfNonAsciiCharsReading.circe               1000 103151.358 ±  3512.299  ops/s
StringOfNonAsciiCharsReading.dslJsonScala        1000 189423.431 ±  1927.495  ops/s
StringOfNonAsciiCharsReading.jacksonScala        1000 101410.377 ± 19933.428  ops/s
StringOfNonAsciiCharsReading.jsoniterJava        1000 267360.452 ±  5517.855  ops/s
StringOfNonAsciiCharsReading.jsoniterScala       1000 193001.124 ±  2849.416  ops/s
StringOfNonAsciiCharsReading.playJson            1000 116526.373 ±  4094.327  ops/s
StringOfNonAsciiCharsReading.scalikeJackson      1000 121490.223 ±  6358.179  ops/s
StringOfNonAsciiCharsReading.sprayJson           1000 215865.902 ± 13995.871  ops/s
StringOfNonAsciiCharsReading.uPickle             1000 103967.342 ±  3001.080  ops/s

These are now very decent results for this PR, given that there are no tricks involved.

@jrudolph (Member, Author)

It is much harder to keep them high on different hardware.

I think I'll keep spray-json reasonably simple, so I'd rather go with a somewhat simple solution even if that costs a bit of performance. That also has the benefit that it won't be over-optimized for any particular architecture and should still show reasonable (though maybe not top) performance.

@jrudolph (Member, Author)

You can probably improve a lot in these benchmarks when you bring-your-own-buffer.

Caching the buffer in a thread local between parsing runs is another 10-20% win indeed.
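For illustration, a minimal sketch of that kind of caching (my own, not the PR's code; the 1024 initial size and the growth policy are arbitrary assumptions):

object CharBufferCache {
  // A per-thread char buffer that survives across parser runs, so the common case
  // avoids reallocation; it is grown (and re-cached) only when a larger one is needed.
  private val cached = new ThreadLocal[Array[Char]] {
    override def initialValue(): Array[Char] = new Array[Char](1024)
  }

  def bufferOfSize(minSize: Int): Array[Char] = {
    val buf = cached.get()
    if (buf.length >= minSize) buf
    else {
      val grown = new Array[Char](math.max(buf.length * 2, minSize))
      cached.set(grown)
      grown
    }
  }
}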

@sirthias (Member)

Great stuff!
Looking forward to seeing this released!

@sirthias (Member)

Caching the buffer in a thread local between parsing runs is another 10-20% win indeed.

Interesting. In my tests with borer, adding a ThreadLocal cache of the char buffer somehow resulted in a slight performance decrease, at least on Graal.
This perf work really is a complete trial and error process... unfortunately.

@jrudolph (Member, Author)

This perf work really is a complete trial and error process... unfortunately.

It is; I'm also still looking for a more predictable workflow. I didn't test extensively on Graal, but from what I've seen, GraalVM CE was slower than HotSpot, and JDK 11 was a bit faster than JDK 8.

((input.byteAt(cursor + 1) & 0x3f) << 6) |
(input.byteAt(cursor + 2) & 0x3f)

if (codePoint < 0xffff) {


The check is redundant; the 2nd branch is unreachable at runtime.

charBuffer(charsRead) = codePoint.toChar
4
} else {
charBuffer(charsRead) = ((0xd7C0 + (codePoint >> 10)).toChar)


I would add a check for broken surrogate pairs here; emoji are so popular these days 😄
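For what it's worth, a sketch of what such a check could look like using the JDK's Character helpers (an assumption about shape, not the PR's code; charBuffer/charsRead mirror the names used in the diff):

// Emit a surrogate pair for a supplementary code point, rejecting values that
// cannot form a valid pair (e.g. the result of a malformed input sequence).
def writeSupplementary(charBuffer: Array[Char], charsRead: Int, codePoint: Int): Int = {
  if (!Character.isSupplementaryCodePoint(codePoint))
    throw new IllegalArgumentException(f"not a supplementary code point: 0x$codePoint%X")
  charBuffer(charsRead)     = Character.highSurrogate(codePoint) // == (0xD7C0 + (codePoint >> 10)).toChar
  charBuffer(charsRead + 1) = Character.lowSurrogate(codePoint)  // == (0xDC00 | (codePoint & 0x3FF)).toChar
  charsRead + 2
}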

def unicodeChar(b: Char, cursor: Int, charsRead: Int): Int = {
if ((b & 0xE0) == 0xC0) { // two byte sequence
charBuffer(charsRead) = (((b & 0x1f) << 6) |
(input.byteAt(cursor + 1) & 0x3f)).toChar


Here and below, all subsequent bytes after the leading one should be checked for well-formedness (they must have 10 as their two highest bits).
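A small sketch of the kind of check meant here (an assumed shape, not the PR's code):

// Every continuation byte of a multi-byte UTF-8 sequence must match the bit
// pattern 10xxxxxx, i.e. its two highest bits must be 1 and 0.
def isContinuationByte(b: Byte): Boolean = (b & 0xC0) == 0x80

// e.g. for the two-byte case above (with a hypothetical `fail` error helper):
//   val b1 = input.byteAt(cursor + 1)
//   if (!isContinuationByte(b1)) fail("malformed UTF-8 continuation byte")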

/* Scan to the end of a simple string as fast as possible, no escapes, no unicode */
// so far unclear results if that is really helpful
/*@tailrec def parseOneFast(cursor: Int, charsRead: Int): String = {
val b = input.byteAt(cursor)
jrudolph (Member, Author)


I had another try and found out that parseOneFast was slower because I wasn't using charAt directly here, as below in parseOneSlow. With that changed, it now makes sense again to keep it, to improve parsing speed of simple strings (without escapes and unicode).

Comparing to jsoniter-scala, I found another weird thing: surrounding the whole of parseOneFast with a useless (always true) if clause (if (i < minLim) in jsoniter, and here e.g. if (charsRead <= cursor)) is another 20-25% improvement.

Those kinds of things seem to be rather random findings that cannot really be explained in terms of the original code, only by looking at all stages of the compilation/execution pipeline (Scala compiler -> JIT compiler -> µops). I tried Intel's VTune for the first time to get some insights, and while it looks promising on the surface, I didn't find any actionable explanations for the performance cliffs (and in any case those could be quite specific to my particular processor). E.g. for charAt vs. byteAt it showed bad core port usage ratios (>50% of cycles where no port made progress), which (AFAIU) may be caused by too many dependencies between instructions.

(So, kudos again to jsoniter-scala for managing to be fast in various environments, but as I said before, that's not my goal here.)

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crazy stuff... and a huge time sink, unfortunately
