-
Notifications
You must be signed in to change notification settings - Fork 190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[preview] String parsing improvements #314
base: master
Are you sure you want to change the base?
Conversation
Otherwise, when running the benchmark project it will use the wrong version from sprayJsonJVM (yeah, weird).
(cherry picked from commit bf6c20065d9833b011fbefe9c418f72b5f46cd4f) (cherry picked from commit 28c8da725426403e5fedfdc49d096ef5e9a1585e) # Conflicts: # spray-json/shared/src/main/scala/spray/json/JsonParser.scala
(cherry picked from commit e1e7a8bfae1015e55999b6d9c57c0a978d4e9956) # Conflicts: # spray-json/shared/src/main/scala/spray/json/JsonParser.scala
import io.circe.generic.semiauto._ | ||
import io.circe.parser._ | ||
implicit def circeUserFormat: Decoder[User] = deriveDecoder | ||
implicit def circeLabelFormat: Decoder[LabelsArrayElement] = deriveDecoder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jrudolph please consider to cache derived decoders using val
for better performance
input.setCursor(cursor) // now at end '"' | ||
advance() | ||
new String(charBuffer, 0, charsRead) | ||
} else if (((b - 32) ^ 60) <= 0) // trick from jsoniter-scala |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is a check for escaping character and for characters that are not allowed to be represented as is without encoding: b < ' ' || b = '\'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I know, pretty clever :) I printed out the complete mapping between inputs and outputs to see where it was going. It seems it didn't help much here in the end.
} else if (b == '\\') { | ||
val skip = escaped(cursor, charsRead) | ||
parseOneSlow(cursor + skip, charsRead + 1) | ||
} else if (input.needsDecoding && b >= 128) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing check for control characters that must be escaped
Awesome stuff, @jrudolph! |
One thing that might be worthwhile exploring to speed up string parsing even more is unrolling the The beginning of this blog post here: https://chadaustin.me/2017/05/writing-a-really-really-fast-json-parser/ discusses this effect for JSON parsing. In the end it's all about reducing the average parsing cycles per character, so any parallelization that we can get will help. Another thing could be sth similar to this line of borer's JSON parser: Decoupling the character copying from the rest of the loop had an incredible effect on performance in borer's case. I suspect because the CPU can then freely move those instructions between everything else and make use of otherwise unused silicon capacity, so a lot of it gets essentially moved out of the critical path and costs "nothing" any more... |
Yeah, the JVM might be able to do it itself. |
Tried that in different variations but no positive results so far.
Tried that as well in different versions but I also didn't find any clear wins. What helped indeed was moving the copying of the single character out of the branches. I also experimented with using table lookups which also didn't help :) Regarding string parsing benchmarks in particular, I found that the biggest factor that will bias the result is the size of the internal buffer for collecting the characters. If every parser run will reallocate that buffer you can only get max performance in the benchmark if the buffer is already sized just big enough for the resulting string. You can probably improve a lot in these benchmarks when you bring-your-own-buffer. There's also the problem, that JIT often doesn't end up in the same state and I've often seen +/- 20% between different runs with 2 or 3 different steady states over different runs. Here are the benchmark results from jsoniter-scala when parsing 1000 bytes with a 1024 char buffer:
Which are now very decent results for this PR given that there are no tricks involved. |
I think, I'll keep spray-json reasonably simple, so I'll rather go with a somewhat simple solution even if that costs a bit of performance. That also has the benefit that it hasn't been optimized to any architecture and should still show reasonable (but maybe not top) performance. |
Caching the buffer in a thread local between parsing runs is another 10-20% win indeed. |
Great stuff! |
Interesting. In my tests with borer adding a ThreadLocal cache of the char buffer somehow resulted in a slight performance decrease, at least on graal. |
It is, I'm also still looking for a more predictable workflow. I didn't test extensively on Graal, from what I've seen, GraalCE was slower than hotspot and JDK11 was a bit faster than JDK8. |
((input.byteAt(cursor + 1) & 0x3f) << 6) | | ||
(input.byteAt(cursor + 2) & 0x3f) | ||
|
||
if (codePoint < 0xffff) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the check is redundant, 2nd branch is unreachable in runtime
charBuffer(charsRead) = codePoint.toChar | ||
4 | ||
} else { | ||
charBuffer(charsRead) = ((0xd7C0 + (codePoint >> 10)).toChar) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add checking for broken surrogate pairs here, emoji so popular these days 😄
def unicodeChar(b: Char, cursor: Int, charsRead: Int): Int = { | ||
if ((b & 0xE0) == 0xC0) { // two byte sequence | ||
charBuffer(charsRead) = (((b & 0x1f) << 6) | | ||
(input.byteAt(cursor + 1) & 0x3f)).toChar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here and bellow all subsequent bytes after leading one should be checked if they are well formed (have 10
highest bits)
/* Scan to the end of a simple string as fast as possible, no escapes, no unicode */ | ||
// so far unclear results if that is really helpful | ||
/*@tailrec def parseOneFast(cursor: Int, charsRead: Int): String = { | ||
val b = input.byteAt(cursor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had another try and found out that parseOneFast
was slower because I wasn't using charAt
here directly as below in parseOneSlow
. Changing that it now makes sense again to keep it to improve parsing speed of simple strings (without escapes and unicode).
Comparing to jsoniter-scala I found another weird thing which is that surrounding the whole parseOneFast
with a useless (always true) if
clause (if (i < minLim)
in jsoniter and here e.g. if(charsRead <= cursor)
) is another 20-25% improvement.
Those kind of things seem to be rather random findings that cannot really be explained in terms of the original code but only by looking at all stages of the compilation/execution pipeline (Scala compiler -> JIT compiler -> µops). I tried Intel's VTune for the first time to get some insights and while it looks promising on the surface I didn't find any actionable explanations for the performance cliffs (and in any case it those could be quite specific to my particular processor). E.g. for the charAt
vs. byteAt
it showed bad core port usage ratios (>50% of cycles where no port made progress) which (AFAIU) may be caused by too many dependencies between instructions.
(So, kudos again to jsonitor-scala for managing to be fast in various environments but as I said before that's not my goal here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
crazy stuff... and a huge time sink, unfortunately
String parsing dominates parsing times so far for real world input (that isn't particularly number heavy). So, here's a first attempt to improve performance.
The speed-ups mostly come from
Before:
After:
I.e. about 2x-3x faster for byte input. So, string parsing has been improved by ~ 3x which leads to a 2x improvement for the test case (88k json from github API).
As
ParserInput
needs to be changed, the change won't be simply binary compatible. I'll probably deprecate the current parser infrastructure and duplicate it by simpler entry-points that don't leak so many implementation details for common cases.