UTF-8 Encoded Text #1
looks good to me
It's so great to have all these details in one place. Textual encoding is not an area I know much about, and so I haven't been able to follow this track well. Now, I see it all laid out. Thank you!
While I have a bunch of comments here, please do not see this as trying to defeat this proposal. I just want the text to be clear about the concrete wins for Haskell that we should expect after this is complete.
The proposal should mention what happens with …. If my memory is correct, HVR intended to add …. However, for use as "text symbols", e.g. representing identifiers in the programming language, I could argue that e.g. …
Also, I'd like to have comments on why we think UTF-8 is better, when e.g. …
This link might be useful: http://utf8everywhere.org
I can add an example motivation. The aeson package parses from a ByteString assumed to be UTF-8. Then for each string you get a Text. If you're using the fastest parsers you can, they will typically be e.g. attoparsec or flatparse, which work on a ByteString. These often take advantage of the byte-by-byte format, using fast memchr and similar functions from C. In order to use such parsers, you're going from UTF-8 to UTF-16, doubling the size, and then you have to encode back into UTF-8 again. I've measured the performance of encodeUtf8 and found the cost to be quite small: about 1 microsecond per kilobyte of Unicode on my laptop. So doing this extra step is "OK", but that depends on your app. I'll attach my benchmarks here. Another example: I had a client who was referring to genes, which are all ASCII, but was quite logically using Text. After loading millions of them from disk, this adds up to double the space, plus a decode step (from Latin-1 to UTF-16). A UTF-8 representation would require no decode step and no doubling of memory, and in fact would often let you alias pointers into the original buffer, avoiding a copy/alloc. My advice was to switch to …
-- Reconstructed to be self-contained: the imports and main wrapper are assumptions, matching the criterion-style bgroup/env/bench/whnf calls used below.
import Criterion.Main (bench, bgroup, defaultMain, env, whnf)
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Encoding as T

sampleUnicode :: Text
sampleUnicode = "! \" # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \\ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ Ā ā Ă ă Ą ą Ć ć Ĉ ĉ Ċ ċ Č č Ď ď Đ đ Ē ē Ĕ ĕ Ė ė Ę ę Ě ě Ĝ ĝ Ğ ğ Ġ ġ Ģ ģ Ĥ ĥ Ħ ħ Ĩ ĩ Ī ī Ĭ ĭ Į į İ ı IJ ij Ĵ ĵ Ķ ķ ĸ Ĺ ĺ Ļ ļ Ľ ľ Ŀ ŀ Ł ł Ń ń Ņ ņ Ň ň ʼn Ŋ ŋ Ō ō Ŏ ŏ Ő ő Œ œ Ŕ ŕ Ŗ ŗ Ř ř Ś ś Ŝ ŝ Ş ş Š š Ţ ţ Ť ť Ŧ ŧ Ũ ũ Ū ū Ŭ ŭ Ů ů Ű ű Ų ų Ŵ ŵ Ŷ ŷ Ÿ Ź ź Ż ż Ž ž ſ ƀ Ɓ Ƃ ƃ Ƅ ƅ Ɔ Ƈ ƈ Ɖ Ɗ Ƌ ƌ ƍ Ǝ Ə Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ ƙ ƚ ƛ Ɯ Ɲ ƞ Ɵ Ơ ơ Ƣ ƣ Ƥ ƥ Ʀ Ƨ ƨ Ʃ ƪ ƫ Ƭ ƭ Ʈ Ư ư Ʊ Ʋ Ƴ ƴ Ƶ ƶ Ʒ Ƹ ƹ ƺ ƻ Ƽ ƽ ƾ ƿ ǀ ǁ ǂ ǃ DŽ Dž dž LJ Lj lj NJ Nj nj Ǎ ǎ Ǐ ǐ Ǒ ǒ Ǔ ǔ Ǖ ǖ Ǘ ǘ Ǚ ǚ Ǜ ǜ ǝ Ǟ ǟ Ǡ ǡ Ǣ ǣ Ǥ ǥ Ǧ ǧ Ǩ ǩ Ǫ ǫ Ǭ ǭ Ǯ ǯ ǰ DZ Dz dz Ǵ ǵ Ǻ ǻ Ǽ ǽ Ǿ ǿ Ȁ ȁ Ȃ ȃ ɐ ɑ ɒ ɓ ɔ ɕ ɖ ɗ ɘ ə ɚ ɛ ɜ ɝ ɞ ɟ ɠ ɡ ɢ ɣ ɤ ɥ ɦ ɧ ɨ ɩ ɪ ɫ ɬ ɭ ɮ ɯ ɰ ɱ ɲ ɳ ɴ ɵ ɶ ɷ ɸ ɹ ɺ ɻ ɼ ɽ ɾ ɿ ʀ ʁ ʂ ʃ ʄ ʅ ʆ ʇ ʈ ʉ ʊ ʋ ʌ ʍ ʎ ʏ ʐ ʑ ʒ ʓ ʔ ʕ ʖ ʗ ʘ ʙ ʚ ʛ ʜ ʝ ʞ ʟ ʠ ʡ ʢ ʣ ʤ ʥ ʦ ʧ ʨ"

main :: IO ()
main = defaultMain
  [ bgroup "encodeUtf8"
      [ env (pure (T.replicate i sampleUnicode)) $ \t ->
          bench ("T.encodeUtf8: " ++ show (i * T.length sampleUnicode) ++ " chars")
                (whnf T.encodeUtf8 t)
      | i <- [1, 10, 100]
      ]
  ]

Numbers:
So about 1 µs per kilobyte, pretty consistently, for a small subset of Unicode. For Chinese, the cost would be higher, but I haven't tested. Anyway, my point was just to bring some numbers to give people ballpark figures. I'd love to remove this redundant decode/encode step, and I fully support this initiative.
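For concreteness, here is a minimal sketch of the decode/encode detour described above, assuming the bytestring and text packages with today's UTF-16 internals; roundTrip is a hypothetical name, not code from the thread:

import qualified Data.ByteString as BS
import qualified Data.Text.Encoding as TE

-- Input arrives as UTF-8 bytes (a socket, a file, aeson input).
roundTrip :: BS.ByteString -> BS.ByteString
roundTrip utf8Input =
  let t = TE.decodeUtf8 utf8Input -- UTF-8 -> UTF-16: allocates, roughly doubles ASCII-heavy data
  in TE.encodeUtf8 t              -- UTF-16 -> UTF-8: a second pass and allocation on output
-- With a UTF-8-backed Text, both steps would reduce to validation plus at most a copy.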
Apologies if this was spelled out somewhere I haven't seen: was there a consideration and decision on having a package flag like "-futf8" or "-futf16", allowing authors to pick based on benchmarking their app? Web apps will naturally perform better with UTF-8, as most of everything web is UTF-8, whereas anything talking to ODBC or Windows may benefit from direct UTF-16 use. If I could just pick with a flag in my stack.yaml for the text package, I'd at least get to make a decision rather than having it hard-coded into the ecosystem, like it is in Rust. 🤔 Any opinions on that?
It will be tricky. It is possible to generate a header …
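For illustration, a minimal sketch of what such a flag could select at compile time; the TEXT_UTF8 define, the flag wiring, and the module name are all hypothetical:

{-# LANGUAGE CPP #-}
-- Hypothetical sketch: a Cabal flag would pass -DTEXT_UTF8 (via a generated
-- header or ghc-options), and internal modules would pick the code-unit type.
module Data.Text.Internal.CodeUnit (CodeUnit) where

import Data.Word (Word16, Word8)

#ifdef TEXT_UTF8
type CodeUnit = Word8   -- UTF-8 code units
#else
type CodeUnit = Word16  -- UTF-16 code units
#endif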
If …
Good points. Inclusion with GHC would be a good reason not to bother with a flag, assuming it'll be hard to build GHC packages for a long time. You're probably right about maintenance too: it's a lot to ask to maintain text already. Just moving it to use UTF-8 internally seems more achievable and long-term sustainable. 👍
I'm not sure how relevant this is, but this project is very thought-provoking: simdjson: Parsing gigabytes of JSON per second. As well as parsing JSON, it builds on earlier work using SIMD to validate UTF-8. The performance of both projects is astonishing, and the associated academic papers are very readable. I wonder how feasible it would be to use these libraries through the FFI, and whether that would be a justification for using an FFI-friendly memory representation. If it were possible to read/mmap an entire file and validate the UTF-8 in a fraction of a second, I think that would be a big win.
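As a rough sketch of the FFI-friendliness point: a contiguous UTF-8 buffer can be handed to C directly. The C symbol validate_utf8 below is hypothetical, standing in for a SIMD validator such as the one behind simdjson:

{-# LANGUAGE ForeignFunctionInterface #-}
import qualified Data.ByteString as BS
import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
import Foreign.C.Types (CChar, CInt (..), CSize (..))
import Foreign.Ptr (Ptr)

-- Hypothetical C-side validator returning nonzero for valid UTF-8.
foreign import ccall unsafe "validate_utf8"
  c_validateUtf8 :: Ptr CChar -> CSize -> IO CInt

-- Hand the ByteString's bytes to C with no copying and no transcoding.
isValidUtf8 :: BS.ByteString -> IO Bool
isValidUtf8 bs = unsafeUseAsCStringLen bs $ \(ptr, len) -> do
  r <- c_validateUtf8 ptr (fromIntegral len)
  pure (r /= 0)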
Those uses of UTF-16 are mostly for historical reasons: the implementations were done before UTF-8 was common, when UCS-2 was still cool (and it was easiest just to upgrade them from UCS-2 to UTF-16 to keep them Unicode-compliant).
Review comment on proposals/002-text-utf-default.md (outdated):
- Ben Gamari: integration with GHC
- The Cabal maintainers (fgaz, emilypi, mikolaj): integration with Cabal
Neither Cabal nor cabal-install uses Text at the moment. It does use a (vendored) version of ShortText, though. IMO you can just drop Cabal from the list; it won't be affected unless there is some need to use Text in Cabal (which there isn't, as far as I can tell).
I don't know much about Cabal internals, but Cabal.cabal seems to include text: https://github.com/haskell/cabal/blob/4f8aeb2c8a0a3638e1af887dc869a17e291c8329/Cabal/Cabal.cabal#L272
It defines a few instances for Text; it's not really used.
cabal master % git grep 'Data.Text'
-- tests:
Cabal-tests/tests/UnitTests/Distribution/Utils/Generic.hs:import qualified Data.Text as T
Cabal-tests/tests/UnitTests/Distribution/Utils/Generic.hs:import qualified Data.Text.Encoding as T
example (note in comment)
Cabal/src/Distribution/Backpack.hs:-- >>> eitherParsec "foo[Str=text-1.2.3:Data.Text.Text]" :: Either String OpenUnitId
Cabal/src/Distribution/Backpack.hs:-- Right (IndefFullUnitId (ComponentId "foo") (fromList [(ModuleName "Str",OpenModule (DefiniteUnitId (DefUnitId {unDefUnitId = UnitId "text-1.2.3"})) (ModuleName "Data.Text.Text"))]))
-- CharParsing class has
-- text :: Text -> m Text
-- method, as it's in the upstream parsers package
Cabal/src/Distribution/Compat/CharParsing.hs:import Data.Text (Text, unpack)
-- these are used for debug, not part of proper build
Cabal/src/Distribution/Fields/Lexer.hs:import qualified Data.Text as T
Cabal/src/Distribution/Fields/Lexer.hs:import qualified Data.Text.Encoding as T
Cabal/src/Distribution/Fields/Lexer.hs:import qualified Data.Text.Encoding.Error as T
-- ditto
Cabal/src/Distribution/Fields/LexerMonad.hs:import qualified Data.Text as T
Cabal/src/Distribution/Fields/LexerMonad.hs:import qualified Data.Text.Encoding as T
Cabal/src/Distribution/Fields/Parser.hs:import qualified Data.Text as T
Cabal/src/Distribution/Fields/Parser.hs:import qualified Data.Text.Encoding as T
Cabal/src/Distribution/Fields/Parser.hs:import qualified Data.Text.Encoding.Error as T
-- instances
Cabal/src/Distribution/Utils/Structured.hs:import qualified Data.Text as T
Cabal/src/Distribution/Utils/Structured.hs:import qualified Data.Text.Lazy as LT
-- some dev stuff
bootstrap/src/Main.hs:import qualified Data.Text as T
cabal-dev-scripts/src/GenSPDX.hs:import Data.Text (Text)
cabal-dev-scripts/src/GenSPDX.hs:import qualified Data.Text as T
cabal-dev-scripts/src/GenSPDXExc.hs:import Data.Text (Text)
cabal-dev-scripts/src/GenSPDXExc.hs:import qualified Data.Text as T
cabal-dev-scripts/src/GenUtils.hs:import Data.Text (Text)
cabal-dev-scripts/src/GenUtils.hs:import qualified Data.Text as T
-- this is really fun, the usage is
-- let command' = command { commandName = T.unpack . T.replace "v1-" "" . T.pack . commandName $ command }
cabal-install/src/Distribution/Client/CmdLegacy.hs:import qualified Data.Text as T
cabal-testsuite/src/Test/Cabal/Plan.hs:import qualified Data.Text as Text
-- docs
doc/cabal-package.rst:a requirement ``Str`` and an implementation ``Data.Text``, you can
doc/cabal-package.rst: ``mixins: parametrized requires (Str as Data.Text)``
doc/cabal-package.rst: ``mixins: text (Data.Text as Str)``
doc/cabal-package.rst: parametrized (MyModule as MyModule.Text) requires (Str as Data.Text),
-- lexer debugging (same as above)
templates/Lexer.x:import qualified Data.Text as T
templates/Lexer.x:import qualified Data.Text.Encoding as T
templates/Lexer.x:import qualified Data.Text.Encoding.Error as T
If any change in text affects Cabal performance, that will be interesting.
TL;DR: the text dependency could be dropped quite easily, but there is really no point, as it's there via parsec.
Thinking of Cabal made me remember pandoc. It can be called the Haskell app, and it uses Text extensively. So if you want to have some real project on the list, I'd nominate it. The promise that the API doesn't change too much should make building pandoc, even with its dependency footprint, only a matter of CPU time.
Addition: pandoc is also as much a text-processing tool as a tool (written in Haskell) can be, so if it clearly benefits, that could be enough to declare the success of the whole proposal.
Ah, interesting, thanks.
@emilypi shall we remove Cabal from stakeholders?
This may be a little premature, but I'd like to make a plea for removing the BOM (byte order mark) that a lot of Windows software places at the beginning of UTF-8 files. It is used to declare the encoding of the file and isn't part of the meaning of the file. The logical place to do this would be in IO functions such as …. As a motivating example, the JSON spec explicitly says that a BOM should not be included in documents, but implementations may choose to skip one if it is present. A surprising number of systems do incorrectly include a BOM in JSON documents (e.g. Azure). It makes sense to elide it when reading a file or a stream that's already known or required to be UTF-8, because it doesn't contribute anything. Ban the BOM!
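A minimal sketch of what BOM elision could look like at the decoding boundary, assuming the bytestring and text packages; stripBom and decodeUtf8NoBom are hypothetical names, not a proposed API:

import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- The UTF-8 encoding of U+FEFF is the byte sequence 0xEF 0xBB 0xBF.
stripBom :: BS.ByteString -> BS.ByteString
stripBom bs
  | BS.pack [0xEF, 0xBB, 0xBF] `BS.isPrefixOf` bs = BS.drop 3 bs
  | otherwise = bs

-- Decode a stream already known to be UTF-8, eliding any leading BOM.
decodeUtf8NoBom :: BS.ByteString -> T.Text
decodeUtf8NoBom = TE.decodeUtf8 . stripBom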
@phadej Re: …
@emilypi the … I repeat: …
Very glad to see this happening!! I really hope the effort doesn't get stymied by endless bikeshedding. 😬 Can anyone explain the relationship with the text-utf8 project? (Edit: disregard, I just read the attached proposal!)
@pchiusano I'll clarify relations to prior art in an upcoming update. While elaborating the proposal, we came to a conclusion different from the current statement.
CC @jgm, any comments on behalf of pandoc?
No substantive comments, but I fully support this proposal! (Note added later: pandoc stores lots and lots of very short Texts in AST nodes. I've sometimes thought about using ShortText or ByteString instead to reduce memory use -- especially now that I know how big the constant overhead of Text is -- but the expense of converting between these formats and Text has made this seem unattractive. If Text used UTF-8, this sort of thing would be much more feasible.)
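For concreteness, a small sketch of the conversions in question, using fromText/toText from the text-short package; NodeLabel is a hypothetical AST type. With a UTF-8 Text these conversions become plain byte copies instead of transcoding passes:

import qualified Data.Text as T
import qualified Data.Text.Short as TS

-- Hypothetical compact AST field backed by ShortText.
newtype NodeLabel = NodeLabel TS.ShortText

mkLabel :: T.Text -> NodeLabel
mkLabel = NodeLabel . TS.fromText -- transcodes under UTF-16 Text; a copy under UTF-8

labelText :: NodeLabel -> T.Text
labelText (NodeLabel s) = TS.toText s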
Glad to see this taking shape.
I should have asked this at the very beginning: what is the goal of this PR? Is it to facilitate some discussion, which may or may not be incorporated into the proposal text, on the basis of which (alone, or including the discussion) the proposal will be brought to the HF technical group to decide on clearing funds for the implementation? On the other hand, this project is already on the https://haskell.foundation/projects/ page, …
@phadej I think that's a meta question that shouldn't be answered here, but I'm happy to have an out-of-band discussion about it elsewhere in more depth. Please focus on the proposal as a project specification for work that will happen if approved by the HF. The other proposals will operate similarly.
The critical point is if approved. Thank you. What is the status of the other projects on the https://haskell.foundation/projects/ page: are haskup, GHC Platform CI, Project Matchmaker, Performance Tuning Book, Vector Types Proposal, and GHC Performance Dashboards all also awaiting HF approval (and therefore not receiving funds)? The page gives the impression that all of these projects are in the same state, and the impression isn't that they are "awaiting HF approval", but rather that they are already ongoing and "blessed" by HF.
And as a separate comment: for me the … I hope that it's clear to everyone that it's not up to the HF technical board to decide (before or after the experiment is done) whether … Again, just give @Bodigrim what he needs to write the patch and run the benchmarks. Let's not make this more difficult than it needs to be.
Thank you all for the extremely productive feedback on this proposal; it has been great to see people interested and highly involved. As of the HF Board Meeting on May 20th, 2021, this proposal has been approved.
...or indirectly, via https://hackage.haskell.org/package/ghc-9.2.1 ...but removed by Matt Pickering for us when we hit that wall (https://hackage.haskell.org/package/ghc-9.2.2)
However, GHC is now reinstallable!
@Ericson2314 Where/when did this happen?
https://gitlab.haskell.org/ghc/ghc/-/merge_requests/5965. GHC itself, I suppose I should disclaim :). Stuff like https://gitlab.haskell.org/ghc/ghc/-/merge_requests/6803 is still needed if we are to be able to nicely rebuild all the libraries.
thanks!
This is old news, but the proposal has been completed successfully in haskell/text#365 and released as …
This proposal outlines a project plan for the migration of the text package from its current default encoding (UTF-16) to a new default of UTF-8.

The lack of UTF-8 as a default in the text package is a pain point raised by the Haskell Community and many of our industry partners for many years. We have done our homework in soliciting feedback from the broader community and industry, and have received positive affirmation of the following: … the text package name, and providing alternatives in the case that users require UTF-16 text.