UTF-8 Encoded Text #1

Merged on May 20, 2021 (11 commits)
156 changes: 156 additions & 0 deletions proposals/002-text-utf-default.md
@@ -0,0 +1,156 @@
This proposal is for the migration of the `text` package from its
default UTF-16 encoding to UTF-8. The lack of UTF-8 as a default in the
`text` package is a pain point raised by the Haskell Community and
industry partners for many years. The Haskell Foundation is uniquely
positioned to effect a change like this, provided there is an appetite
for breakage and the project is well socialized and planned.
During the meetings of the Haskell Foundation Tech Agenda Track, we
identified the pros and cons of such a migration against historical
attempts, and agreed that it was within our power to do.

Further, we solicited both community and industry feedback to gauge the
appetite for breakage, and what stakeholders would be affected by the
changes. Andrew Lelechenko has offered to lead the implementation of
this project.

# Motivation

- With UTF-16 as the default, every `Text` value pays a premium for serialization, because it must be transcoded whenever it crosses a UTF-8 boundary. Arguably, the performance trade-off is upside-down: most text is UTF-8, and Haskell developers pay an undue cost for working with the wrong default.

- UTF-8 is the industry standard and by far the most common text encoding, with roughly 97% of web pages encoded in UTF-8. The existing UTF-16 default imposes an additional hurdle to working with the vast majority of web content.

- Many systems in Haskell are UTF-8 by default (e.g. Haddock)
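
To make the serialization point concrete, here is a minimal sketch of the I/O boundary that most programs cross. It uses only `decodeUtf8'` and `encodeUtf8` from `Data.Text.Encoding`; the helper names `fromWire`, `toWire`, and `roundTrips` are hypothetical and not part of `text`. With a UTF-16 internal representation, both directions transcode the whole string. With a UTF-8 representation, they could in principle reduce to validation plus a copy.

```haskell
import qualified Data.ByteString          as BS
import           Data.ByteString          (ByteString)
import qualified Data.Text                as T
import           Data.Text                (Text)
import qualified Data.Text.Encoding       as TE
import           Data.Text.Encoding.Error (UnicodeException)

-- | Decode UTF-8 bytes arriving from a file or socket (hypothetical helper).
fromWire :: ByteString -> Either UnicodeException Text
fromWire = TE.decodeUtf8'

-- | Encode a Text value back to UTF-8 for output (hypothetical helper).
toWire :: Text -> ByteString
toWire = TE.encodeUtf8

-- | True when the input is valid UTF-8 and survives decode-then-encode unchanged.
roundTrips :: ByteString -> Bool
roundTrips bytes = either (const False) ((== bytes) . toWire) (fromWire bytes)

main :: IO ()
main = do
  -- "héllo" encoded as UTF-8: six bytes, five characters.
  let bytes = BS.pack [0x68, 0xC3, 0xA9, 0x6C, 0x6C, 0x6F]
  print (fmap T.length (fromWire bytes))
  print (roundTrips bytes)
```

The user-facing calls are the same under either representation, which is why the proposal can aim to keep the API stable while changing what happens underneath.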

# Goals

- Solicit feedback from community members and industry to gauge appetite for such a change.

- Provide an implementation, migration, and delivery plan for changing the default encoding of `text` from UTF-16 to UTF-8.

- Ensure stakeholders (e.g. GHC, Cabal, Stack, boot libs) have ample time to migrate and address any bugs.

- The implementation should not alter the performance characteristics of the base `text` library beyond an agreed tolerance threshold.

# People

- Performers:

- Leader: Andrew Lelechenko (bodigrim)

- Support: Emily Pillmore (emilypi)

- Reviewers:

- The text maintainers

- Xia Li-Yao (lysxia)

- Emily Pillmore (emilypi)

- Dan Cartwright (chessai)

- Callan McGill (boarders)

- Stakeholders:

- Edward Kmett: has been vocal about his use of `text-icu` and requires it not be broken.

- Ben Gamari: integration with GHC

- The Cabal maintainers (fgaz, emilypi, mikolaj): integration with Cabal

Neither Cabal nor cabal-install uses Text at the moment. It does use a (vendored) version of ShortText, though.

IMO you can just drop Cabal from the list, it won't be affected unless there is some need to use Text in Cabal (which there isn't as far as I can tell).

Collaborator

I don't know much about Cabal internals, but Cabal.cabal seems to include text:
https://github.com/haskell/cabal/blob/4f8aeb2c8a0a3638e1af887dc869a17e291c8329/Cabal/Cabal.cabal#L272

@phadej, May 14, 2021

It defines a few instances for Text; it's not really used.

cabal master % git grep 'Data.Text'

-- tests:
Cabal-tests/tests/UnitTests/Distribution/Utils/Generic.hs:import qualified Data.Text as T
Cabal-tests/tests/UnitTests/Distribution/Utils/Generic.hs:import qualified Data.Text.Encoding as T

-- example (note in comment)
Cabal/src/Distribution/Backpack.hs:-- >>> eitherParsec "foo[Str=text-1.2.3:Data.Text.Text]" :: Either String OpenUnitId
Cabal/src/Distribution/Backpack.hs:-- Right (IndefFullUnitId (ComponentId "foo") (fromList [(ModuleName "Str",OpenModule (DefiniteUnitId (DefUnitId {unDefUnitId = UnitId "text-1.2.3"})) (ModuleName "Data.Text.Text"))]))

-- CharParsing class has
-- text :: Text -> m Text
-- method as it's in upstream parsers -package version
Cabal/src/Distribution/Compat/CharParsing.hs:import Data.Text (Text, unpack)

-- these are used for debug, not part of proper build
Cabal/src/Distribution/Fields/Lexer.hs:import qualified Data.Text   as T
Cabal/src/Distribution/Fields/Lexer.hs:import qualified Data.Text.Encoding as T
Cabal/src/Distribution/Fields/Lexer.hs:import qualified Data.Text.Encoding.Error as T

-- ditto
Cabal/src/Distribution/Fields/LexerMonad.hs:import qualified Data.Text          as T
Cabal/src/Distribution/Fields/LexerMonad.hs:import qualified Data.Text.Encoding as T
Cabal/src/Distribution/Fields/Parser.hs:import qualified Data.Text                as T
Cabal/src/Distribution/Fields/Parser.hs:import qualified Data.Text.Encoding       as T
Cabal/src/Distribution/Fields/Parser.hs:import qualified Data.Text.Encoding.Error as T

-- instances
Cabal/src/Distribution/Utils/Structured.hs:import qualified Data.Text                    as T
Cabal/src/Distribution/Utils/Structured.hs:import qualified Data.Text.Lazy               as LT

-- some dev stuff
bootstrap/src/Main.hs:import qualified Data.Text as T
cabal-dev-scripts/src/GenSPDX.hs:import Data.Text        (Text)
cabal-dev-scripts/src/GenSPDX.hs:import qualified Data.Text            as T
cabal-dev-scripts/src/GenSPDXExc.hs:import Data.Text        (Text)
cabal-dev-scripts/src/GenSPDXExc.hs:import qualified Data.Text            as T
cabal-dev-scripts/src/GenUtils.hs:import Data.Text    (Text)
cabal-dev-scripts/src/GenUtils.hs:import qualified Data.Text           as T

-- this is really fun, the usage is
-- let command' = command { commandName = T.unpack . T.replace "v1-" "" . T.pack . commandName $ command }
cabal-install/src/Distribution/Client/CmdLegacy.hs:import qualified Data.Text as T
cabal-testsuite/src/Test/Cabal/Plan.hs:import qualified Data.Text as Text

-- docs
doc/cabal-package.rst:a requirement ``Str`` and an implementation ``Data.Text``, you can
doc/cabal-package.rst:  ``mixins: parametrized requires (Str as Data.Text)``
doc/cabal-package.rst:  ``mixins: text (Data.Text as Str)``
doc/cabal-package.rst:        parametrized (MyModule as MyModule.Text) requires (Str as Data.Text),

-- lexer debugging (same as above)
templates/Lexer.x:import qualified Data.Text   as T
templates/Lexer.x:import qualified Data.Text.Encoding as T
templates/Lexer.x:import qualified Data.Text.Encoding.Error as T

If any change in text affects Cabal performance, that will be interesting.

TL;DR: the text dependency could be dropped quite easily, but there is really no point, as it's there via parsec.

@phadej, May 14, 2021

Thinking of Cabal made me remember pandoc. It can be called the Haskell app, and it uses Text extensively. So if you want to have a real project on the list, I'd nominate it.

The promise that the API doesn't change too much should make building pandoc, even with its dependency footprint, only a matter of CPU time.

Addition: pandoc is also as much a text-processing tool as a tool (written in Haskell) can be, so if it clearly benefits, that could be enough to declare the whole proposal a success.

Collaborator

Ah, interesting, thanks.

Collaborator

@emilypi shall we remove Cabal from stakeholders?


Progress will be reported on a weekly basis to the HF Technical Agenda
Track, with Emily as support for Andrew.

# Timeline

We expect that this project will take roughly 6 months to fully
complete: 3-4 months for the code implementation, performance
testing, and unit testing, and another 1-2 months to integrate with
stakeholders and diagnose any potential issues with the migration.

**Preparation:**

Using HVR's existing [`text-utf8`](https://github.com/text-utf8) as
a starting point, the following must be done before implementation
starts:

- Modernize the codebase and clear out the bitrot

- Establish a baseline for performance and any related issues.

- Update testing and performance benchmarks to make use of `inspection-testing` to ensure fusion is not broken in
subsequent UTF-8 related changes.

An MVP should completely preserve the standard user-facing API and must
not break fusion. Performance should not significantly diverge from the
existing UTF-16 `text` package. The exposed `Text` internals are
expected to change, so breakage should be assessed by circulating a git
commit reference to a release candidate as soon as possible. This
candidate should be publicized openly and loudly.

**Implementation:**

- TBD: There is a straightforward implementation, but this one is left up to Andrew for comment.

**Stakeholders:**

- Library authors will need to be made aware of changes and adjust accordingly. HF will provide a git reference to a complete MVP as soon as possible, and produce a migration guide.

- In the case of GHC, Cabal, and other core infrastructure, we will work closely with the maintainers of these packages to help migrate and to assess and fix breakages.

While we do not expect many authors to experience significant changes,
some will need help bumping Hackage bounds, since this migration will
be a major version bump. HF will need to coordinate with the Hackage
Trustees to help move along packages that go out of date.

# Deliverables

- text-2.0.0.0, which will provide a UTF-8 encoding for Text as a default for all versions going forward.

- A `text-utf16` package, which is a preservation of the current UTF-16 encoded text, for backwards compatibility.

- Updates to the Text Haddocks that reflect the UTF-8 changes

- Announcements and updates across all Haskell channels covering the following:
  - Significant dates and milestones

  - Expected code impact

  - Release candidates

  - Delivery

- Migration instructions and design documentation

# Outcomes

- Addresses a recognized need and want from both Industry and the Haskell Community

- Better UX for Haskell's text story

- Establishes the Haskell Foundation's ability to Get Things Done on issues that were blocked in the community for 10+ years.

- Better interoperability and web story

# Risks

- HF must minimize the cost to migrate, or people will just get mad (and rightfully so).

- `text-icu` will need a bespoke UTF-8 conversion function. In general, the Unicode story must be tracked to make sure it does not break.
  - Recommendation: make this a high-priority deliverable when project planning

- The old UTF-16 text package will need to be preserved, and will require a maintainer.

- Performance expectations should be managed: UTF-8 text is \*not\* a panacea. It will not result in a 2x or even significant performance increase in many cases. In fact, performance may regress in some programs.
  - Recommendation: we must set expectations early and often throughout the implementation process. We expect there to be improvements to performance \*on the margins\*.

  - Note: We have made this argument, and the appetite for change did not shift, but it is still important to track.
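
The `text-icu` risk above comes down to a re-encoding step at the library boundary. The following is a rough sketch of that step only, assuming the ICU side keeps expecting UTF-16 code units. It is not `text-icu`'s API: the helper names `utf8ToUtf16LE` and `utf16LEToUtf8` are hypothetical, and a real conversion function would presumably work on the underlying buffers at the FFI boundary rather than going through `ByteString`.

```haskell
import qualified Data.ByteString          as BS
import           Data.ByteString          (ByteString)
import qualified Data.Text.Encoding       as TE
import           Data.Text.Encoding.Error (UnicodeException)

-- | Re-encode UTF-8 bytes as little-endian UTF-16 (hypothetical helper).
-- Invalid UTF-8 is reported as a Left value instead of throwing.
utf8ToUtf16LE :: ByteString -> Either UnicodeException ByteString
utf8ToUtf16LE = fmap TE.encodeUtf16LE . TE.decodeUtf8'

-- | Re-encode little-endian UTF-16 bytes as UTF-8 (hypothetical helper).
-- Note: decodeUtf16LE throws on malformed input, so this sketch assumes
-- the bytes come from a trusted UTF-16 producer.
utf16LEToUtf8 :: ByteString -> ByteString
utf16LEToUtf8 = TE.encodeUtf8 . TE.decodeUtf16LE

main :: IO ()
main = do
  -- U+00E9 followed by U+1D11E, which needs a surrogate pair in UTF-16.
  let utf8 = BS.pack [0xC3, 0xA9, 0xF0, 0x9D, 0x84, 0x9E]
  print (fmap BS.unpack (utf8ToUtf16LE utf8))
  print (fmap utf16LEToUtf8 (utf8ToUtf16LE utf8) == Right utf8)
```

Characters outside the Basic Multilingual Plane are the case worth testing first, since they take four bytes in both encodings but are laid out differently.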