Disable implicit fusion rules #348
Conversation
I think this is the right thing to do. Implicit fusion is unpredictable and, as you explain, doesn't even work in simple cases.
So we have the
which we only ever want to do when
Could we not instead make the rule something like:
It seems like if you could do something like this for all the O(1) slice functions, you'd end up only fusing when there was at least one non-problematic function in the composition. (Maybe that doesn't make sense; I don't have a good understanding of fusion.) I also wonder whether you've tried to measure the impact of this? Coming up with a benchmark set that seems realistic and reasonable would also be work towards the UTF-8 rewrite.
We may also wish to trigger this rule when

The thing is that other functions have a choice between materialized and streaming versions as well. E.g., concatenation can just physically copy bytearrays, or convert them to streams and sequence the results. The choice of fusion strategy is a non-local optimization. I think that nowadays one could get pretty good mileage from a compiler plugin for fusion, able to perform global analysis. But rewrite rules are unsuitable for this.
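To illustrate the materialized route for concatenation mentioned above, here is a small sketch using the text package's `Data.Text.Lazy.Builder` API (`assemble` is a hypothetical helper name, not something from the library):

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import Data.Text.Lazy.Builder (fromText, toLazyText)

-- Materialized concatenation: each piece is copied once into a
-- growing buffer, with no decoding to streams and re-encoding.
-- `assemble` is a hypothetical helper for illustration.
assemble :: [T.Text] -> TL.Text
assemble = toLazyText . foldMap fromText
```

For example, `assemble (map T.pack ["foo", "bar", "baz"])` produces the lazy text `"foobarbaz"` with a single copy per piece.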
One can run
I think rewrite rules should probably never be the cause of worse-than-advertised performance, and if the only way to achieve that in this case is to remove fusion, that makes sense to me. It sounds like you're going further, though, and making an argument that applies just as well to rewrite-rule-based optimizations generally, and even to most of GHC's optimization passes, right?
I would not go that far. I'm quite happy when things magically become better than advertised, but not when they become worse and I'm out of control. Rewrite rules are a very brittle tool: they are a pain to debug and a pain to test. Rewrite rules with phase control are exponentially worse. There could still be rare cases where they pull their weight, but I don't think
Yeah, rewrite rules are pretty non-compositional to engineer, and this is a great example. At some point I hope GHC or a successor to it works out some sort of equality-saturation-based scheme, because then a lot of the complexity properties should be better. But that's not today. Predictable characteristics are step -1 of good quality engineering, and a library like text deserves to have predictably good performance.
I like this, though I do think we should note in the CHANGELOG, in a relatively loud way, that such a thing is happening. I would expect a lot of naive uses of `Text` to become slow from this change. We would ideally point folks to a means of using the streaming interface directly.

Right now, a Ctrl+F on the Hackage docs for `Data.Text` only pulls up the lazy text type as an option. But even the `Data.Text.Lazy` docs say that a lazy `Text` computation will allocate two lazy `Text` values. Searching for streaming again in this module only pulls up the fusion rules. Searching streaming pulls up nothing on the main page. To use the fusion framework directly, you need to look at `Data.Text.Internal.Fusion`, which carries a Warning:, minimal documentation, and no examples.

Perhaps we could avoid the documentation/de-internalization effort on the `Fusion` modules by instead linking to one of the many streaming frameworks, like `conduit`, `pipes`, `streaming`, etc.
I'm approving the PR based on technical merits and overall impact. But I would like to see affected users given a clear warning and a well-documented and easily discoverable migration path to fixing these issues.
I see no code problems @Bodigrim, but I think @parsonsmatt has a point about documenting these changes as we go along. Unless you want to do one bigger push at a later time, it'd probably be good to record them as we go.
@parsonsmatt thanks, I mentioned the change in the haddocks. Sorry for the brevity; I'll elaborate on the migration path later, after the UTF-8 switch. I hope that is acceptable for now. IMHO streaming and fusion are not quite interchangeable: streaming is about not materializing data in full, while fusion is about not materializing data at all. So strictly speaking
Thanks for updating the docs @Bodigrim - approved.
Very good detective work @Bodigrim! Given the nature of how people use
@parsonsmatt would you like me to improve anything else?
looks great, thanks!
For the record, it looks like this change nicely improves performance on the Hasura benchmark suite. No insight as to whether that's from our own code or deeper in a library somewhere. Also this is comparing

Results copied below; let me know if you want any interpretation.

The regression report below shows, for each benchmark, the percent change for different metrics between the merge base (the changes from PR 2186) and this PR. For advice on interpreting benchmarks, please see benchmarks/README.md. More significant regressions or improvements will be colored with

You can view graphs of the full reports here:
Click here for a detailed report.

# ┌────────────────┐
# │ chinook.json │
# └────────────────┘
#
# ᐉ Memory Residency (RTS-reported):
# live_bytes : -0.1 (BEFORE benchmarks ran; baseline for schema)
# live_bytes : -0.1 (AFTER benchmarks ran)
* mem_in_use : 2.2 (BEFORE benchmarks ran; baseline for schema)
# mem_in_use : -1.6 (AFTER benchmarks ran)
#
# ᐅ simple query low load:
* bytes_alloc_per_req : -2.1
* min : 2.3
# p50 : 2.0
#
# ᐅ simple query high load:
* bytes_alloc_per_req : -2.1
# min : -1.1
# p50 : 0.1
#
# ᐅ complex query low load small result:
+ bytes_alloc_per_req : -14.3
+ min : -6.3
+ p50 : -5.0
#
# ᐅ complex query high load small result:
+ bytes_alloc_per_req : -14.3
+ min : -5.0
+ p50 : -6.5
#
# ᐅ complex query low load large result:
+ bytes_alloc_per_req : -13.9
+ min : -4.7
+ p50 : -4.7
#
# ᐅ complex query high load large result:
+ bytes_alloc_per_req : -13.9
# min : -0.2
# p50 : -0.8
#
# ┌────────────────────┐
# │ huge_schema.json │
# └────────────────────┘
#
# ᐉ Memory Residency (RTS-reported):
# live_bytes : -0.0 (BEFORE benchmarks ran; baseline for schema)
# live_bytes : -0.0 (AFTER benchmarks ran)
# mem_in_use : 0.1 (BEFORE benchmarks ran; baseline for schema)
* mem_in_use : -3.2 (AFTER benchmarks ran)
#
# ᐅ small query low load:
* bytes_alloc_per_req : -2.4
# min : -1.3
* p50 : 2.0
#
# ᐅ huge query low load:
+++ bytes_alloc_per_req : -31.0
++ min : -17.8
++ p50 : -18.6
#
To add to @jberryman's results, the Hasura code has a lot of interleavings of

However, through the use of

Before this PR,
A cherry-pick of haskell#348 onto 1.2.3.2, with some quick conflict fixes
A cherry-pick of haskell#348 onto 1.2.5.0
A tale of two tails
What is the asymptotic complexity of `Data.Text.tail`? Internally `Text` is a byte array with offset and length counters, so taking its tail involves just bumping the offset by 1 or 2 code units, depending on the UTF-16 encoding of the first character. This should be O(1) time/memory, and indeed it is.

text/src/Data/Text.hs, lines 532 to 536 in f1b4fc0
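The bookkeeping can be sketched with a toy model (a hypothetical `Text'` type over a `Word16` list, standing in for the real packed array):

```haskell
import Data.Word (Word16)

-- Toy model for illustration only: the real Text wraps a packed
-- array of UTF-16 code units; a plain list stands in for it here.
data Text' = Text' [Word16] Int Int  -- buffer, offset, length

-- O(1) tail: bump the offset past the first character.
-- A lead surrogate (0xD800..0xDBFF) means the character occupies
-- two code units; otherwise it occupies one.
tail' :: Text' -> Text'
tail' (Text' buf off len) = Text' buf (off + d) (len - d)
  where
    d = if w >= 0xD800 && w <= 0xDBFF then 2 else 1
    w = buf !! off
```

(`!!` on a list is not O(1), unlike real array indexing; the point here is only the constant-size offset/length update.)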
But what is the complexity of `Data.Text.tail . Data.Text.tail`? O(1), isn't it? Lo and behold: it is O(n) time and O(n) memory! Here is why:

text/src/Data/Text.hs, lines 539 to 544 in f1b4fc0
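The effect of the `stream . unstream = id` rule can be modeled with a toy sketch (hypothetical names; plain lists stand in for both the materialized buffer and the fusion `Stream` type):

```haskell
-- Toy model for illustration: a Stream is just a wrapped list here.
newtype Stream a = Stream [a]

-- stream walks the whole materialized input: O(n) time.
stream :: [a] -> Stream a
stream = Stream

-- unstream materializes the whole stream: O(n) memory.
unstream :: Stream a -> [a]
unstream (Stream xs) = xs

-- The stream-side tail just skips one element: O(1).
streamTail :: Stream a -> Stream a
streamTail (Stream xs) = Stream (drop 1 xs)

-- After the "stream/unstream" rule fires, tail . tail rewrites to
-- one big decode/re-encode pipeline instead of two O(1) offset bumps:
fusedTailTail :: [a] -> [a]
fusedTailTail = unstream . streamTail . streamTail . stream
```

The composed pipeline is correct, but it pays the O(n) cost of `stream` and `unstream` once, where two direct tails would each have been O(1).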
When we compile a single `tail`, these rules cancel each other and we end up with the usual definition, cited above, which has O(1) complexity. However, if there is `tail . tail`, things get interesting because of the `stream . unstream = id` rewrite rule:

Here `S.tail` comes from `Data.Text.Internal.Fusion.Common` and just skips the first element of a stream, so it is O(1). But nevertheless `stream` must process the materialized `Text` in whole, so O(n) time, and `unstream` materializes the whole stream into a `Text`, so O(n) memory.

This fusion issue plagues not only `tail . tail` but any other combination of slicing functions. See Bottlenecked on Haskell's text library for a similar issue about `take . drop`. That particular one was patched in #301 with one more rewrite rule to prevent fusion of `take . drop`. But we cannot reasonably add rules for every possible combination of `take`, `drop`, `tail`, `init`, `head`, `last`!

How important is fusion?
Fusion frameworks shine for polymorphic containers such as `vector`, where long chains of transformations are ubiquitous. They are much less useful for monomorphic containers. That's basically why `ByteString` abandoned its fusion framework 13 years ago. Chains of operations like `map` of `filter` are even less likely to occur for `Text` than for `ByteString`: processing Unicode data in such a way is almost certainly unsound.

Instead there are two operations which dominate `Text` applications:

Slicing. Once we receive a `Text` input, we need to parse it one way or another, and parsing means slicing. As explained above, the fusion framework in `Text` has every chance to make your performance unpredictably worse and no chance to make it better.

Concatenation. To produce a `Text` output we usually concatenate smaller pieces of `Text`. There is a robust, fast and predictable mechanism to do so via `Builder`. Relying on the fusion framework for concatenation is an antipattern, because it leads to slow compilation and unpredictable runtime: instead of neatly copying two bytearrays together (as `Builder` does), the fusion framework decodes both of them to streams and then encodes them back.

I'm not, however, arguing here for removing the stream fusion framework entirely. Partly because there could still be legitimate cases for using it. Mostly because many functions in `text` do not have a materialized counterpart and are expressed in terms of stream transformations; rewriting them to operate on byte arrays is a non-trivial enterprise.

However, I do argue for scrapping the implicit fusion rules. They are exactly what gets us into trouble with `tail` above. If users have a particularly long sequence of transformations, they can reach out to the stream API explicitly. But imposing this choice on them implicitly, in an unmitigable way, is unacceptable.

This PR removes the rewrite rules for implicit fusion. It also removes the haddock annotations: some functions may still happen to fuse by chance (if they have no materialized implementation at all), but this is no longer promised. There are no visible API changes, so in a certain sense this is not even a major change. Anyway, we are heading towards switching the internal representation to UTF-8, so the time is right to make all major changes in one go.
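As a sketch of what reaching for the stream API explicitly might look like, using the internal `Data.Text.Internal.Fusion` modules (whose exact exports may vary between text versions, and which carry no stability guarantees):

```haskell
import Data.Char (toUpper)
import qualified Data.Text as T
import qualified Data.Text.Internal.Fusion as F
import qualified Data.Text.Internal.Fusion.Common as S

-- One explicit stream/unstream round trip: the input is decoded once
-- and the result materialized once, regardless of how many stream
-- transformers are composed in between.
shout :: T.Text -> T.Text
shout = F.unstream . S.map toUpper . S.filter (/= ' ') . F.stream
```

Here the long chain costs a single O(n) pass by construction, rather than depending on rewrite rules firing.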
Further reading