Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable implicit fusion rules #348

Merged
merged 2 commits into from
Jun 21, 2021
Merged

Conversation

Bodigrim
Copy link
Contributor

A tale of two tails

What is the asymptotic complexity of Data.Text.tail? Internally Text is a byte array with offset and length counters, so taking its tail involves just bumping an offset by 1 or 2 code units, depending on UTF-16 encoding of the first character. This should be O(1) time/memory and is indeed so.

text/src/Data/Text.hs

Lines 532 to 536 in f1b4fc0

tail :: Text -> Text
tail t@(Text arr off len)
| len <= 0 = emptyError "tail"
| otherwise = text arr (off+d) (len-d)
where d = iter_ t 0

But what is the complexity of Data.Text.tail . Data.Text.tail? O(1), isn't it? Lo and behold: it is O(n) time and O(n) memory! Here is why:

text/src/Data/Text.hs

Lines 539 to 544 in f1b4fc0

{-# RULES
"TEXT tail -> fused" [~1] forall t.
tail t = unstream (S.tail (stream t))
"TEXT tail -> unfused" [1] forall t.
unstream (S.tail (stream t)) = tail t
#-}

When we compile a single tail, these rules cancel each other and we end up with a usual definition, cited above, which has O(1) complexity. However, if there is tail . tail, things get interesting because of stream . unstream = id rewrite rule:

tail . tail
  = unstream . S.tail . stream . unstream . S.tail . stream
  = unstream . S.tail . S.tail . stream

Here S.tail comes from Data.Text.Internal.Fusion.Common and just skips the first element of a stream, so it is O(1). But nevertheless stream must process the materialized Text in whole, so O(n) time, and unstream materializes the whole stream into Text, so O(n) memory.

This fusion issue plagues not only tail . tail, but also any other combination of slicing functions. See Bottlenecked on Haskell's text library for the similar issue about take . drop. This particular one was patched in #301 with one more rewrite rule to prevent fusion of take . drop. But we cannot reasonably add rules for every possible combination of take, drop, tail, init, head, last!

How important is fusion?

Fusion frameworks shine for polymorphic containers such as vector, where long chains of transformations are ubiquitous. They are much less useful for monomorphic containers. That's basically why ByteString abandoned its fusion framework 13 years ago. Chains of operations like map of filter are even less likely to occur for Text than for ByteString: processing Unicode data in such way is almost certainly unsound.

Instead there are two operations, which dominate Text applications:

  • Slicing. Once we received a Text input, we need to parse it one way or another, and parsing means slicing. As explained above, the fusion framework in Text has all chances to make your performance unpredictably worse and no chances to make it better.

  • Concatenation. To produce a Text output we usually concatenate smaller pieces of Text. There is a robust, fast and predictable mechanism to do so via Builder. Relying on the fusion framework for concatenation is an antipattern, because it leads to slow compilation times and unpredictable runtime. Instead of neatly copying two bytearrays together (as Builder does), the fusion framework decodes both of them to streams and then encodes them back.

I'm not, however, arguing here for removing stream fusion framework entirely. Partly because there still could be legitimate cases to use it. Mostly because many functions in text do not have a materialized counterpart and are expressed in terms of stream transformations. Rewriting them to operate on byte arrays is a non-trivial enterprise.

However, I argue for scraping implicit fusion rules. This is exactly what gets us into trouble with tail above. If users have a particularly long sequence of transformations, they can reach out to stream API explicitly. But incurring this choice upon them implicitly, in an unmitigatable way is unacceptable.

This PR removes rewrite rules for implicit fusion. It also removes haddock annotations: some function may still happen to fuse by chance (if they have no materialized implementation at all), but this is no longer promised. There are no visible API changes, so in certain sense this is not even a major change. Anyways, we are heading to switching internal representation to UTF8, so time is right to make all major changes in one go.

Further reading

@phadej
Copy link
Contributor

phadej commented Jun 11, 2021

I think this is the right thing to do. Implicit fusion is unpredictable, and you explain, doesn't even work in simple cases.

@jberryman
Copy link

jberryman commented Jun 11, 2021

So we have the RULE

tail t = unstream (S.tail (stream t))

which we only ever want to do when t is (unstream ...), relying on the general rule:

stream (unstream s) = s

Could we not instead make the rule something like?:

 tail (unstream t) = unstream (S.tail (stream (unstream t)))
 -- or simplified:
 tail (unstream t) = unstream (S.tail  t)

It seems like if you could do something like this for all the O(1) slice functions you'd end up only fusing when there was at least one non-problematic function in the composition. (Maybe that doesn't make sense; I don't have a good understanding of fusion)

I also wonder if you've tried to measure the impact of this? Coming up with a benchmark set that seems realistic and reasonable would also be work towards the utf-8 rewrite

@Bodigrim
Copy link
Contributor Author

Bodigrim commented Jun 11, 2021

So we have the RULE

tail t = unstream (S.tail (stream t))

which we only ever want to do when t is (unstream ...)

We may also wish to trigger this rule when tail t is an argument of stream. But not when tail is at the beginning or at the end of pipeline: there is no benefit to fuse tail . map f or map f . tail. Potentially we could have a rule of form stream . tail . unstream = S.tail. But now you lost an ability to fuse stream . tail . tail . unstream to S.tail . S.tail.

The thing is that other functions have a choice between materialized and streaming versions as well. E. g., concatenation can just physically copy bytearrays, or convert them to streams and sequence results. isPrefixOf can either operate on arrays, or on streams. One can compute the length of Text from its materialized representation by counting continuation bytes, or convert it to a stream and measure its size, etc.

The choice of fusion strategy is a non-local optimization. I think that nowadays one could get a pretty good mileage from a compiler plugin for fusion, able to perform global analysis. But rewrite rules are unsuitable to do so.

I also wonder if you've tried to measure the impact of this?

One can run cabal bench, but honestly I do not think it is relevant to this PR. The goal here is to make performance predictable, not necessarily better for a given set of benchmarks. Any routine which happens to get slower can be restored to its former glory by using explicit stream API.

@jberryman
Copy link

The goal here is to make performance predictable, not necessarily better for a given set of benchmarks.

I think rewrite rules should probably never be the cause of worse than advertised performance, and if the only way to achieve that in this case is to remove fusion that makes sense to me.

It sounds like you're going further though and making an argument that applies just as well to rewrite rule -based optimizations generally, and even most of ghc's optimization passes, right?

@Bodigrim
Copy link
Contributor Author

It sounds like you're going further though and making an argument that applies just as well to rewrite rule -based optimizations generally, and even most of ghc's optimization passes, right?

I would not go that far. I'm quite happy when things magically become better than advertised. But not when they become worse and I'm out of control.

Rewrite rules are a very brittle tool. They are pain to debug and pain to test. Rewrite rules with phase control are exponentially worse. There still could be rare cases when they bear their weight, but I don't think text is one of them for the reasons described above.

@cartazio
Copy link

Yeah, rewrite rules are pretty non-compositional to engineer and this is a great example. At some point I hope ghc or a successor to it works out having some sort of equality saturation based scheme because then a lot of the complexity properties should be better. But that’s not today. Predictable characteristics is step -1 of good quality enginering. And a library like text deserves to have predictable good performance.

Copy link

@parsonsmatt parsonsmatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this, though I do think we should note in the CHANGELOG that such a thing is happening in a relatively loud way. I would expect a lot of naive uses of Text to become slow from this change. We would ideally point folks to a means of using the streaming interface directly. Right now, a Ctrl+F on the Hackage docs for Data.Text only pull up the lazy text type as an option. But even the Data.Text.Lazy docs say that a lazy Text computation will allocate two lazy Text values. Searching for streaming again in this module only pulls up the fusion rules.

Searching streaming pulls up nothing on the main page. To use the fusion framework directly, you need to look at Data.Text.Internal.Fusion, which carries a Warning: , minimal documentation, and no examples.

Perhaps we could avoid the documentation/de-internalization effort on the Fusion modules by instead linking to one of the many streaming frameworks, like conduit, pipes, streaming, etc.

I'm approving the PR based on technical merits and overall impact. But I would like to see affected users given a clear warning and a well-documented and easily discoverable migration path to fixing these issues.

emilypi
emilypi previously approved these changes Jun 14, 2021
Copy link
Member

@emilypi emilypi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no code problems @Bodigrim, but I think @parsonsmatt has a point in the sense of documenting these changes as we go along. Unless you want to do one bigger push at a later time, it'd probably be good to record as we go.

@Bodigrim
Copy link
Contributor Author

@parsonsmatt thanks, I mentioned the change in haddocks. Sorry for the brevity, I'll elaborate on migration path later, after UTF8 switch. I hope it is acceptable for now.

IMHO streaming and fusion are not quite interchangeable: streaming is about not materializing data in full, while fusion is about not materializing data at all. So strictly speaking conduit and pipes are not a replacement for Data.Text.Internal.Fusion. I'm happy to raise an issue about renaming it to Data.Text.Fusion and adding proper examples/documentation.

Copy link
Member

@emilypi emilypi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the docs @Bodigrim - approved.

@Boarders
Copy link

Very good detective work @Bodigrim! Given the nature of how people use text I agree that automatic fusion is a bad idea and that we should disable it by default (it is unclear when it is a good idea though it seems to work reasonably for typical tasks with vector). I'm going to work on some docs for the stream fusion framework to enable opt-in fusion for users and ease migration paths for this change.

@Bodigrim
Copy link
Contributor Author

@parsonsmatt would you like me to improve anything else?

Copy link

@parsonsmatt parsonsmatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great, thanks!

@Bodigrim Bodigrim merged commit 321c10a into haskell:master Jun 21, 2021
@Bodigrim Bodigrim deleted the no-implicit-fusion branch June 21, 2021 17:48
@jberryman
Copy link

For the record it looks like this change nicely improves performance on the hasura benchmark suite. No insight as to whether that's from our own code, or deeper in a library somewhere. Also this is comparing 1.2.3.2 with 321c10a , so maybe there were other performance-related changes that account for this

Results copied below; let me know if you want any interpretation


The regression report below shows, for each benchmark, the percent change for different metrics, between the merge base (the changes from PR 2186) and this PR. For advice on interpreting benchmarks, please see benchmarks/README.md.

More significant regressions or improvements will be colored with #b31d28 or #22863a, respectively.

You can view graphs of the full reports here:

Click here for a detailed report.
#       ┌────────────────┐
#       │  chinook.json  │
#       └────────────────┘
#                                          
#       ᐉ  Memory Residency (RTS-reported):
#           live_bytes               :    -0.1   (BEFORE benchmarks ran; baseline for schema)
#           live_bytes               :    -0.1   (AFTER benchmarks ran)
*           mem_in_use               :     2.2   (BEFORE benchmarks ran; baseline for schema)
#           mem_in_use               :    -1.6   (AFTER benchmarks ran)
#                                          
#       ᐅ simple query low load:
*           bytes_alloc_per_req      :    -2.1
*           min                      :     2.3
#           p50                      :     2.0
#                                          
#       ᐅ simple query high load:
*           bytes_alloc_per_req      :    -2.1
#           min                      :    -1.1
#           p50                      :     0.1
#                                          
#       ᐅ complex query low load small result:
+           bytes_alloc_per_req      :   -14.3
+           min                      :    -6.3
+           p50                      :    -5.0
#                                          
#       ᐅ complex query high load small result:
+           bytes_alloc_per_req      :   -14.3
+           min                      :    -5.0
+           p50                      :    -6.5
#                                          
#       ᐅ complex query low load large result:
+           bytes_alloc_per_req      :   -13.9
+           min                      :    -4.7
+           p50                      :    -4.7
#                                          
#       ᐅ complex query high load large result:
+           bytes_alloc_per_req      :   -13.9
#           min                      :    -0.2
#           p50                      :    -0.8
#                                          
#       ┌────────────────────┐
#       │  huge_schema.json  │
#       └────────────────────┘
#                                          
#       ᐉ  Memory Residency (RTS-reported):
#           live_bytes               :    -0.0   (BEFORE benchmarks ran; baseline for schema)
#           live_bytes               :    -0.0   (AFTER benchmarks ran)
#           mem_in_use               :     0.1   (BEFORE benchmarks ran; baseline for schema)
*           mem_in_use               :    -3.2   (AFTER benchmarks ran)
#                                          
#       ᐅ small query low load:
*           bytes_alloc_per_req      :    -2.4
#           min                      :    -1.3
*           p50                      :     2.0
#                                          
#       ᐅ huge query low load:
+++         bytes_alloc_per_req      :   -31.0
++          min                      :   -17.8
++          p50                      :   -18.6
#                                          

@abooij
Copy link

abooij commented Aug 26, 2021

To add to @jberryman's results, the Hasura code has a lot of interleavings of <> and pack, along these lines:

pack "Hello there " <> userName <> pack ", the temperature today is " <> degreesCelcius <> pack " degrees celcius."

However, through the use of OverloadedStrings, these applications of pack become invisible.

Before this PR, append and pack would interact particularly badly. Here's a Haskell program that, under -O2, generates compilation products exceeding 2 gigabytes:

{-# LANGUAGE OverloadedStrings #-}
module Test where
import Data.Text
g :: Text -> String -> Text
g a b = a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " "
  <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a
  <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " "
  <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a
  <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " "
  <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a
  <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " "
  <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a
  <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " "
  <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a
  <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " "
  <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a
  <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " "
  <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a
  <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " "
  <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a
  <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " "
  <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a <> " " <> a
  <> " " <> a <> " " <> a <> " " <> a <> " " <> a

@jkachmar jkachmar mentioned this pull request Aug 26, 2021
jberryman pushed a commit to hasura/text that referenced this pull request Sep 1, 2021
A cherry-pick of haskell#348
onto 1.2.3.2, with some quick conflict fixes
@phadej phadej mentioned this pull request Sep 9, 2021
awjchen pushed a commit to awjchen/text that referenced this pull request Jan 7, 2022
awjchen pushed a commit to awjchen/text that referenced this pull request Jan 7, 2022
jberryman pushed a commit to hasura/text that referenced this pull request Jul 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants