Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Builder: Efficiently handle literal strings #132

Merged
merged 9 commits into from
Aug 26, 2020

Conversation

bgamari
Copy link
Contributor

@bgamari bgamari commented Jul 13, 2017

Previously Strings would be handled with P.primMapListBounded P.charUtf8. In the case that the String was a literal, we would decode UTF-8 from the primitive string and then reencode each character as we wrote it to the target buffer. Not only was this inefficient to run, it was also inefficient to compile as we would be forced to inline and simplify large swathes of the builder machinery (see GHC #13960).

The obvious solution here is to do what we should have done all along: strcpy directly out of the primitive string into the target buffer. In the case of UTF-8 things are slightly trickier as we must recognize NULL characters, which GHC encodes as 0xc0 0x80.

Fixing this is a win in several respects: code size of a trivial main = print $ BSL.length $ B.toLazyByteString $ B.string "hello world" program is roughly cut in half. Moreover, the new approach is about twice as fast as the previous according to the provided benchmarks.

Data/ByteString/Builder.hs Outdated Show resolved Hide resolved
@bgamari
Copy link
Contributor Author

bgamari commented Jul 14, 2017

Good catch, @phadej!

@hvr
Copy link
Member

hvr commented Jul 19, 2017

Btw, how does this code handle surrogate-codepoints in literals? Does it handle them liberally according to WTF-8?

@bgamari
Copy link
Contributor Author

bgamari commented Jul 20, 2017

Btw, how does this code handle surrogate-codepoints in literals? Does it handle them liberally according to WTF-8?

I'm not sure I understand the question. What handling of surrogates do you propose is necessary here? This code doesn't attempt to do any decoding beyond what is necessary to handle the modified UTF-8 encoding of the U+0 codepoint.

@hvr
Copy link
Member

hvr commented Jul 20, 2017

@bgamari the question I was basically asking is what happens for a string-literal like

"Z\xd800Z"

and whether it gets encoded as

  • (valid UTF8 stream) [90,239,191,189,90] (replacement char, this is what e.g. Data.Text.encodeUtf8 does), or as
  • the invalid UTF8 stream [90,237,160,128,90] (which would have Data.Text.decodeUtf8 choke) or
  • something else happens

@phadej
Copy link
Contributor

phadej commented Jul 20, 2017

For the record, currently:

Prelude Data.ByteString> unpack "Z\xd800Z"
[90,0,90]
Prelude Data.ByteString> unpack "Z\x02fcZ"
[90,252,90]

Prelude Data.ByteString Data.Word> fromIntegral (0x02fc :: Int) :: Word8
252

i.e. the thing you would expect from fromIntegral :: Int -> Word8

@hvr
Copy link
Member

hvr commented Jul 20, 2017

@phadej doesn't this PR affect the code-paths for e.g. stringUtf8 "Z\xd800Z" :: Builder?

@bgamari
Copy link
Contributor Author

bgamari commented Jul 20, 2017

I see the question now. With this patch we have this,

>>> print $ BSL.unpack $ B.toLazyByteString $ B.stringUtf8 "Z\xd800Z"
[90,237,160,128,90]

which is the invalid UTF-8 sequence that you point out in your question. It is also the same thing that we would produce today. I really don't think we are in a position where we can change this.

In general we are in a bit of a tight spot here since we don't have ByteArray literals, therefore we abuse CString for this. Moreover, we use modified UTF-8 to encode strings containing code-point 0, so it's nearly impossible to distinguish between "strings" and "chunks of bytes". I'm working on resuscitating the length-annotated string patch which should help a bit here since we will no longer need UTF-8 encoding for plain ASCII strings.

@hvr
Copy link
Member

hvr commented Jul 20, 2017

@bgamari it's not a big deal; it'd just be good to warn about this in the documentation (and maybe at some point GHC could implement a warning about text literals containing suspicious code-points, mostly U+D800 through U+DFFF)

@bgamari bgamari force-pushed the build-cstring branch 4 times, most recently from 50a4705 to 0bcb435 Compare July 24, 2017 23:08
@bgamari
Copy link
Contributor Author

bgamari commented Jul 24, 2017

Curiously, Travis seems to fail reliably yet I don't see any of these failures locally. Hmmmm.

@bgamari
Copy link
Contributor Author

bgamari commented Jan 22, 2018

@dcoutts, ping.

@bgamari
Copy link
Contributor Author

bgamari commented Feb 18, 2018

Pinging @dcoutts.

@knupfer
Copy link

knupfer commented Apr 28, 2018

Ping @dcoutts, this would simplify my library

@chessai
Copy link
Member

chessai commented Oct 13, 2019

Ping

@sjakobi sjakobi added this to the 0.10.12.0 milestone Jul 3, 2020
@sjakobi
Copy link
Member

sjakobi commented Jul 14, 2020

@bgamari It looks like there's not very much left to do before this can be merged. Do you intend to put the finishing touches on this PR soon?

@sjakobi sjakobi modified the milestones: 0.10.12.0, Soon Aug 19, 2020
@Bodigrim Bodigrim modified the milestones: Soon, 0.11.0.0 Aug 21, 2020
Copy link
Contributor

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Bodigrim Bodigrim requested review from hsyl20 and sjakobi August 23, 2020 19:13
IO $ \s -> case writeWord8OffAddr# op0# 0# 0## s of
s' -> (# s', () #)
let br' = BufferRange (op0 `plusPtr` 1) ope
step (addr `plusAddr#` 1#) k br'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be 'plusAddr# 2#` ? We've read two bytes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! It is interesting that tests were too weak to catch it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hsyl20 I improved tests and changed the increment to 2#. Could you please take another look?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the fix!

@Bodigrim Bodigrim merged commit 155bf8a into haskell:master Aug 26, 2020
@Bodigrim Bodigrim mentioned this pull request May 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants