-
Notifications
You must be signed in to change notification settings - Fork 47k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[Flight] Optimize Large Strings by Not Escaping Them (#26932)
This introduces a Text row (T) which is essentially a string blob and refactors the parsing to now happen at the binary level. ``` RowID + ":" + "T" + ByteLengthInHex + "," + Text ``` Today, we encode all row data in JSON, which conveniently never has newline characters and so we use newline as the line terminator. We can't do that if we pass arbitrary unicode without escaping it. Instead, we pass the byte length (in hexadecimal) in the leading header for this row tag followed by a comma. We could be clever and use fixed or variable-length binary integers for the row id and length but it's not worth the more difficult debuggability so we keep these human readable in text. Before this PR, we used to decode the binary stream into UTF-8 strings before parsing them. This is inefficient because sometimes the slices end up having to be copied so it's better to decode it directly into the format. The follow up to this is also to add support for binary data and then we can't assume the entire payload is UTF-8 anyway. So this refactors the parser to parse the rows in binary and then decode the result into UTF-8. It does add some overhead to decoding on a per row basis though. Since we do this, we need to encode the byte length that we want decode - not the string length. Therefore, this requires clients to receive binary data and why I had to delete the string option. It also means that I had to add a way to get the byteLength from a chunk since they're not always binary. For Web streams it's easy since they're always typed arrays. For Node streams it's trickier so we use the byteLength helper which may not be very efficient. Might be worth eagerly encoding them to UTF8 - perhaps only for this case.
- Loading branch information
1 parent
ce6842d
commit db50164
Showing
11 changed files
with
267 additions
and
42 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.