[RFC] Builder: Efficiently handle literal strings #132
Good catch, @phadej!
Btw, how does this code handle surrogate code points in literals? Does it handle them liberally according to WTF-8?
I'm not sure I understand the question. What handling of surrogates do you propose is necessary here? This code doesn't attempt to do any decoding beyond what is necessary to handle the modified UTF-8 encoding of the U+0 code point.
@bgamari the question I was basically asking is what happens for a string-literal like
and whether it gets encoded as
For the record, currently:
Prelude Data.ByteString> unpack "Z\xd800Z"
[90,0,90]
Prelude Data.ByteString> unpack "Z\x02fcZ"
[90,252,90]
Prelude Data.ByteString Data.Word> fromIntegral (0x02fc :: Int) :: Word8
252
i.e. the thing you would expect from
@phadej doesn't this PR affect the code-paths for e.g.
I see the question now. With this patch we have this,
which is the invalid UTF-8 sequence that you point out in your question. It is also the same thing that we would produce today. I really don't think we are in a position where we can change this. In general we are in a bit of a tight spot here since we don't have
@bgamari it's not a big deal; it'd just be good to warn about this in the documentation (and maybe at some point GHC could implement a warning about text literals containing suspicious code points, mostly U+D800 through U+DFFF)
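To make the surrogate concern concrete, here is a minimal sketch using only `base`. The helpers `truncate8`, `encodeNaive` and `isSurrogate` are hypothetical names introduced for illustration and are not part of bytestring or GHC. `truncate8` models the 8-bit truncation visible in the GHCi session above, and `encodeNaive` models an encoder that, by assumption, does not reject surrogates.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Keep only the low 8 bits of a character (models Char8-style truncation).
truncate8 :: Char -> Word8
truncate8 = fromIntegral . ord

-- Hypothetical UTF-8 encoder that performs no surrogate check.
encodeNaive :: Char -> [Word8]
encodeNaive c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading payload bits, shifted down to the low end of the lead byte.
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte carrying six payload bits.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- The kind of check a documentation note or compiler warning could rely on.
isSurrogate :: Char -> Bool
isSurrogate c = c >= '\xD800' && c <= '\xDFFF'

main :: IO ()
main = do
  print (map truncate8 "Z\xd800Z")          -- [90,0,90]
  print (concatMap encodeNaive "Z\xd800Z")  -- [90,237,160,128,90]
  print (any isSurrogate "Z\xd800Z")        -- True
```

The middle three bytes of the second result, `0xED 0xA0 0x80`, are what a strict UTF-8 decoder rejects, which is why a warning on literals containing U+D800 through U+DFFF would be useful.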
50a4705 to 0bcb435
Curiously, Travis seems to fail reliably yet I don't see any of these failures locally. Hmmmm.
@dcoutts, ping.
Pinging @dcoutts.
Ping @dcoutts, this would simplify my library
Ping
@bgamari It looks like there's not very much left to do before this can be merged. Do you intend to put the finishing touches on this PR soon?
LGTM
Data/ByteString/Builder/Prim.hs
Outdated
IO $ \s -> case writeWord8OffAddr# op0# 0# 0## s of
s' -> (# s', () #)
let br' = BufferRange (op0 `plusPtr` 1) ope
step (addr `plusAddr#` 1#) k br'
Shouldn't it be `plusAddr# 2#`? We've read two bytes.
Nice! It is interesting that tests were too weak to catch it.
@hsyl20 I improved tests and changed the increment to `2#`. Could you please take another look?
LGTM. Thanks for the fix!
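The off-by-one discussed in this thread can be modelled on plain lists. This is only a sketch of the idea and not the PR's code: `decodeModifiedNul` and `decodeBuggy` are hypothetical names, and a list of `Word8` stands in for the `Addr#` the real code walks. After recognizing the two-byte sequence `0xC0 0x80` the reader has to advance past both bytes; advancing by one leaves the `0x80` to be copied as a stray byte.

```haskell
import Data.Word (Word8)

-- Copy bytes up to the terminating 0x00, rewriting the modified UTF-8
-- encoding of U+0 (0xC0 0x80) to a single 0x00 byte.
decodeModifiedNul :: [Word8] -> [Word8]
decodeModifiedNul (0xC0 : 0x80 : rest) = 0x00 : decodeModifiedNul rest -- skip TWO bytes
decodeModifiedNul (0x00 : _)           = []                            -- real terminator
decodeModifiedNul (b : rest)           = b : decodeModifiedNul rest    -- copy verbatim
decodeModifiedNul []                   = []

-- Same loop, but advancing by only one byte after the match, in the
-- spirit of the `plusAddr# 1#` that was flagged in review.
decodeBuggy :: [Word8] -> [Word8]
decodeBuggy (0xC0 : rest@(0x80 : _)) = 0x00 : decodeBuggy rest -- 0x80 is seen again
decodeBuggy (0x00 : _)               = []
decodeBuggy (b : rest)               = b : decodeBuggy rest
decodeBuggy []                       = []

main :: IO ()
main = do
  let literal = [0x61, 0xC0, 0x80, 0x62, 0x00] -- "a\NULb" plus terminator
  print (decodeModifiedNul literal) -- [97,0,98]
  print (decodeBuggy literal)       -- [97,0,128,98]
```

A test only catches the difference if its input contains an embedded NUL, which is consistent with the remark that the original tests were too weak.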
Previously `String`s would be handled with `P.primMapListBounded P.charUtf8`. In the case that the `String` was a literal, we would decode UTF-8 from the primitive string and then reencode each character as we wrote it to the target buffer. Not only was this inefficient to run, it was also inefficient to compile as we would be forced to inline and simplify large swathes of the builder machinery (see GHC #13960).

The obvious solution here is to do what we should have done all along: `strcpy` directly out of the primitive string into the target buffer. In the case of UTF-8 things are slightly trickier as we must recognize NULL characters, which GHC encodes as `0xc0 0x80`.

Fixing this is a win in several respects: code size of a trivial `main = print $ BSL.length $ B.toLazyByteString $ B.string "hello world"` program is roughly cut in half. Moreover, the new approach is about twice as fast as the previous according to the provided benchmarks.
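
For anyone who wants to reproduce the code-size comparison, the one-liner above can be expanded into a complete module along these lines. This is a sketch under assumptions: `B.string` in the description is taken here to mean `B.stringUtf8` from `Data.ByteString.Builder`, and `encodedLength` is a hypothetical helper added only so the result is easy to check. Whether the literal fast path actually fires will depend on the bytestring version and on compiling with optimization.

```haskell
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy as BSL

-- Build the UTF-8 encoding of a String and report its length in bytes.
encodedLength :: String -> Int
encodedLength = fromIntegral . BSL.length . B.toLazyByteString . B.stringUtf8

main :: IO ()
main = print (encodedLength "hello world") -- 11
```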