Add offsets to loads and stores #1474
base: master
Conversation
Previously `load` and `store` both took SSA pointer variables directly. This commit adds an offset relative to those variables. In other words, we can now turn:

```
%7: ptr = ...
%8: ptr = ptr_add %7, 8
%9: i8 = load %8
```

into:

```
%7: ptr = ...
%9: i8 = load [%7 + 8]
```

This is effective on x64 (and should be on, at least, Arm too) because "load from register + offset" can generally be encoded in a single instruction.

We previously had a hack for the specific case above, but only when the `ptr_add` was immediately followed by a `load`/`store`. This commit generalises that: the optimisation pass "inlines" the `ptr_add` no matter how far back in the trace it is. So this:

```
%7: ptr = ...
%8: ptr = ptr_add %7, 8
...
%5432: i8 = load %8
```

generates the same, efficient, code as if there was nothing between the `ptr_add` and the `load`. We do the same optimisation for `store`s too. This gives a 2.5% speed-up for big_loop.

There are, however, downsides:

1. We complicate the JIT IR syntax with `[%val + off]`. Currently I haven't added JIT IR parser support for this, because it doesn't seem very useful, but you can observe it when traces are printed.

2. We can no longer fit instructions in a single machine word: there isn't space in `StoreInst` (and only 7 bits free in `LoadInst`). TBH, I now think -- and this is entirely on my head! -- that we were going to have to break this constraint at some point soon anyway, in order to allow bigger `InstIdx`s. Especially because we compile in a thread, I think the practical effect of this is low (I can still observe a speedup after all!), and if we had to go above two machine words, I'm not sure it would make much difference.

3. Philosophically speaking, we're extending the JIT IR simply to express an optimisation. Frankly, I think the right long-term thing to do is -- probably -- to introduce another IR that has these details in it, but I have no desire to introduce such a complication now.

4. This doesn't help quite as much as one would hope, because a surprising number of `PtrAdd`s are only left around because they're needed by a guard. Optimising this is a job for later.

I have -- and obviously this isn't enforceable! -- hacked around the consequences of ykjit#3 by only allowing the optimiser to set `off` to a non-zero value. That means most of the pipeline can avoid thinking about this, but we shouldn't kid ourselves that this means there is no additional complexity from this change. There is, and the complexity is permanent, though in my judgement it is probably worthwhile. I also doubt this is the last change along these lines (e.g. should we try handling `DynPtrAdd`?).
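To make the shape of the pass concrete, here is a minimal, hypothetical sketch in Rust of the "inline the `ptr_add`, however far back it is" rewrite. The `Inst`, `InstIdx` and `fold_ptradds` names and the unpacked enum are inventions for illustration only -- they are not yk's actual `LoadInst`/`StoreInst` -- and the sketch ignores both the packed encoding and the "only the optimiser sets `off` to non-zero" rule discussed above:

```rust
#[derive(Debug, Clone, Copy)]
struct InstIdx(usize);

#[derive(Debug)]
enum Inst {
    /// Produces a pointer somehow; the details don't matter for this sketch.
    PtrSource,
    /// `%res: ptr = ptr_add %ptr, off`
    PtrAdd { ptr: InstIdx, off: i32 },
    /// `%res = load [%ptr + off]`
    Load { ptr: InstIdx, off: i32 },
    /// `store %val, [%ptr + off]`
    Store { val: InstIdx, ptr: InstIdx, off: i32 },
}

/// For every `Load`/`Store` whose pointer operand is a `PtrAdd`, point it at the
/// `PtrAdd`'s own operand and fold the constant offset into the instruction.
fn fold_ptradds(insts: &mut [Inst]) {
    for i in 0..insts.len() {
        // Read the pointer operand and current offset of this load/store, if any.
        let (ptr, off) = match insts[i] {
            Inst::Load { ptr, off } | Inst::Store { ptr, off, .. } => (ptr, off),
            _ => continue,
        };
        // If that operand is itself a `ptr_add`, compute the folded base/offset.
        let (base, new_off) = match insts[ptr.0] {
            Inst::PtrAdd { ptr: base, off: add_off } => (base, add_off + off),
            _ => continue,
        };
        // Rewrite in place: `load %ptradd` becomes `load [%base + new_off]`.
        match &mut insts[i] {
            Inst::Load { ptr: p, off: o } | Inst::Store { ptr: p, off: o, .. } => {
                *p = base;
                *o = new_off;
            }
            _ => unreachable!(),
        }
    }
}

fn main() {
    // %0: ptr = ...
    // %1: ptr = ptr_add %0, 8
    // %2: i8  = load %1
    let mut trace = vec![
        Inst::PtrSource,
        Inst::PtrAdd { ptr: InstIdx(0), off: 8 },
        Inst::Load { ptr: InstIdx(1), off: 0 },
    ];
    fold_ptradds(&mut trace);
    // After the pass the load reads directly from `[%0 + 8]`.
    println!("{:?}", trace);
}
```

In a real pipeline the now-orphaned `ptr_add` would then be cleaned up by dead-code elimination, unless something else still uses it (e.g. a guard, as noted in point 4 above).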
This all seems sensible to me. If we are to allow instructions to be larger than a word, I wonder if we'd still strive to pack operands? I recall you saying that the indices are likely to be too small for long-term use, and if they were full-on usizes then we could get rid of a load of …

Note a typo in the PR title.
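Purely as an illustration of the packing question (these types are made up, not yk's actual instruction layouts): it is the extra offset field that pushes the representation past one 64-bit word.

```rust
use std::mem::size_of;

// Hypothetical layouts: a sub-word operand index plus some packed metadata
// fits in one 64-bit word; adding an `off` field spills into a second word.
#[allow(dead_code)]
struct InstIdx(u32);

#[allow(dead_code)]
struct PackedLoad {
    ptr: InstIdx, // 4 bytes
    meta: u32,    // metadata bits squeezed in alongside
}

#[allow(dead_code)]
struct LoadWithOffset {
    ptr: InstIdx, // 4 bytes
    meta: u32,    // 4 bytes
    off: i32,     // the new offset no longer fits in one word
}

fn main() {
    assert_eq!(size_of::<PackedLoad>(), 8); // one machine word on x64
    assert!(size_of::<LoadWithOffset>() > 8); // exceeds a single word
    println!(
        "packed: {} bytes, with offset: {} bytes",
        size_of::<PackedLoad>(),
        size_of::<LoadWithOffset>()
    );
}
```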
An (accidental but nice) advantage of …
Oops, fixed!
Looks good. Are there other instructions that could now be simplified?
The main commit here is 04c879d, which (hopefully) documents the motivation and consequences of this PR extensively. I do consider this PR an RFC of sorts: I definitely come down on the side of "the pros outweigh the cons", but I'm open to arguments the other way!