Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

minimal pass at getting cfunction to always run in the newest world #20167

Merged
merged 1 commit into from
Jan 27, 2017

Conversation

vtjnash
Copy link
Member

@vtjnash vtjnash commented Jan 20, 2017

We are not really striving for accuracy or speed here (on the method invalidation path), just validity. That's also why there's no tests added here. I'm just trying to get some packages working again as quickly as possible, then will need to go back and make this much better.

@kshyatt kshyatt added the compiler:codegen Generation of LLVM IR and native code label Jan 21, 2017
@yuyichao
Copy link
Contributor

I'm just trying to get some packages working again as quickly as possible

Packages can already be trivially fixed by adding eval. I think they are (at least I am) only waiting for a long term solution.

@vtjnash
Copy link
Member Author

vtjnash commented Jan 21, 2017

Getting eval right is not trivial (lots of poky corner cases), and it's unnecessary churn to add it to packages to fix a bug in base.

@yuyichao
Copy link
Contributor

Most of the package won't hit those corner cases and #19784 should be enough and seems like it should be useful to have as a convenient function even if a different underlying implementation is used.

In both short term (before unmanaged thread is supported) and long term, this causes a inconsistency of what code is running when a function is called on a managed vs unmanaged thread and also a race with only managed threads.

@vtjnash
Copy link
Member Author

vtjnash commented Jan 21, 2017

I'm still against adding invokelatest

In both the short and long term this makes it solidly deterministic and simple for what code gets run. The only inconsistency is with what accidentally went into my original PR, where it incorrectly captures world age and pretends that it is a first class value.

@yuyichao
Copy link
Contributor

How is it consistent in short term? It turns into capture behavior on unmanaged thread.

@vtjnash
Copy link
Member Author

vtjnash commented Jan 21, 2017

Only if it notices that loading pTLS would cause a segfault. It's invalid anyways, and not the first hack to keep it working.

@yuyichao
Copy link
Contributor

not the first hack to keep it working.

That's exactly where the inconsistency is.

I'm still against adding invokelatest

Although that's exactly what this does.

@vtjnash
Copy link
Member Author

vtjnash commented Jan 21, 2017

And it doesn't actually turn into capture, it just picks a random newer age that it knows has already been inferred. This is actually more legal and consistent than capture, due to the constraint that incurring a race condition in function definition must inherently imply a potential for it to use an older age than the actual newest (e.g. when the capture happens is permitted to be non-deterministic. In the future, we will tighten this further to permit external synchronization, but do I mention that this mode of execution is invalid?

@vtjnash
Copy link
Member Author

vtjnash commented Jan 21, 2017

Although that's exactly what this does.

True. I'm aware that it can be used to emulate it, but it's intentionally a different model.

@yuyichao
Copy link
Contributor

And it doesn't actually turn into capture, it just picks a random newer age that it knows has already been inferred.

What's "it"?

when the capture happens is permitted to be non-deterministic

The world age or more precisely the set of available method is already user visible so capturing exactly that when cfunction is called should be as deterministic as running any other normal code?

I'm aware that it can be used to emulate it, but it's intentionally a different model.

So it's an implicit way to change world which I think is bad in general since the user shouldn't do it by accident for most cases and it also requires dynamic dispatches to a case where the performance is generally believed to be predictably good (essentially the @threadcall problem but in more valid use cases).

@vtjnash
Copy link
Member Author

vtjnash commented Jan 21, 2017

With the new foreigncall work, it also now possible to write backedge-free ccalls. I don't consider it a priority to rewrite threadcall right now, but just pointing out that codegen does have support now to emit a true dependency-free ccall thunk.

@vtjnash
Copy link
Member Author

vtjnash commented Jan 21, 2017

Most of the package won't hit those corner cases and #19784 should be enough and

Yes (confirmed by PkgEval), but that doesn't discount the importance of packages that need this (PyCall, Gtk, LibExpat are the big ones I know of)

@vtjnash
Copy link
Member Author

vtjnash commented Jan 21, 2017

Yes, both must be visible, for inference, as well as decent error message. But that's not the same as being able to capture it. You can inspect back to any world age and compute the set of methods at that point. Confusingly, though, this historical view may not actually represent a past state of the system (usually due to compaction, or incremental compilation), so I don't recommend attempting too much with this.

No, I must disagree about needing world changes to be explicit. While I expect the user will eventually need to be aware of how they can change, nothing else in the current implementation requires explicit changes, while there are many implicit changes. In the manual, I only highlighted the couple as the most likely example should to be encounted.

@vtjnash
Copy link
Member Author

vtjnash commented Jan 21, 2017

Just to reiterate one of the primary design points: the world counter is present to compute correctness. It is not there to stop you from shooting your self in the foot by monkey patching something wrong. Thus, it gets captured by inference (and therefore also staged functions), and must get captured by Tasks (to make them inferrable & multi-threaded). It doesn't need to be captured by anything else. So that's it. Nothing else should be capturing them.

@yuyichao
Copy link
Contributor

the world counter is present to compute correctness

And it allows optimizations, which this is undoing.

@vtjnash
Copy link
Member Author

vtjnash commented Jan 22, 2017

nah, it just allows different optimizations (some of which haven't been written yet, but are required for closing other issues). keeping the current behavior, otoh, requires destroying the basic assumption underlying the-efficiency-of-compilation-in-the-presence-of-the-world-computation that storing a closure over a particular world value is impossible. I don't want to have to keep chasing down the correctness issues to maintain that.

we are not really striving for accuracy or speed here (on the method invalidation path), just validity
#endif

#if defined(USE_ORCJIT) || defined(USE_MCJIT)
// make the alias name is valid for the current session
jl_ExecutionEngine->addGlobalMapping(GA, (void*)(uintptr_t)Addr);
// make the alias name is valid for the current session
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're just changing the indentation here, but is this supposed to say "make sure the alias name is valid" or "make the alias name valid" ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think in this case that either would be valid corrections

@tkelman
Copy link
Contributor

tkelman commented Jan 24, 2017

Is it a fair summary of what this does to say it's equivalent to automating some of the eval-wrapping for you in the generated C callback function to fall back to dynamic dispatch when there would otherwise be world age issues?

@vtjnash
Copy link
Member Author

vtjnash commented Jan 24, 2017

Yes. With a plan of later optimizing the dynamic fallback.

@yuyichao
Copy link
Contributor

nah, it just allows different optimizations (some of which haven't been written yet, but are required for closing other issues).

I think it's reasonable to say that the function pointer will be used much more often than cfunction. The final form of both likely requires some sort of lazy lookup with work counter and it's better if it is done at the construction time and not in the callback function. Also, the optimization that you are planing to do in cfunction should also be equally applicable to invokelatest.

keeping the current behavior, otoh, requires destroying the basic assumption underlying the-efficiency-of-compilation-in-the-presence-of-the-world-computation that storing a closure over a particular world value is impossible.

This is only the case for the naive stopgap implementation.

I don't want to have to keep chasing down the correctness issues to maintain that.

What correctness issue?

@vtjnash
Copy link
Member Author

vtjnash commented Jan 24, 2017

I'm not sure what you're arguing in the first paragraph. You seem to be simply agreeing that they both require some amount of lazy lookup and later optimization over the naive version. And as you point, the optimization and behavior of cfunction here should fairly closely match that required by an invokelatest-type functionality. But while JIT-patching or just pre-populating a method fptr when a backedge gets invalidated, how would you optimize the current behavior?

What correctness issue?

How would you infer PyCall.__init__? That is already currently a very slow and expensive call because it contains a cfunction and we aren't doing enough to infer that ahead-of-time. I also punted (for now) on some of the correctness of computing precompile file backedges by assuming that it is unusual for any inferred method to have an upper bound.

@yuyichao
Copy link
Contributor

I'm not sure what you're arguing in the first paragraph.

The optimizations that can be implemented with this approach isn't specific to this and is not an argument for doing this.

how would you optimize the current behavior?

Only emit a cache and don't limit world range at all?

How would you infer PyCall.__init__?

Inferring it should be fine? The cfunction check in inference in my PR should only be temporary. For static compilation the captured age can be relocated too.

@vtjnash
Copy link
Member Author

vtjnash commented Jan 24, 2017

static compilation the captured age can be relocated too.

With this PR, the optimization of runtime age is simply a static relocation to point to the method for the current world whenever that it changed (very rare). With the proposed current PR, it can't be statically relocated, since the runtime world isn't known during deserialization / static compilation, it can only get discovered at runtime. How do you optimize that?

@yuyichao
Copy link
Contributor

With the proposed current PR, it can't be statically relocated, since the runtime world isn't known during deserialization / static compilation, it can only get discovered at runtime. How do you optimize that?

It can be done when the cfunction is called at runtime to make the cached cfunction to belong to that world. As I said, I think we can expect the function pointer to be used much more often than the call to cfunction so given that ccall(cfunction(...), ...) can't be made to a single call (unless we recognize this pattern specifically) I'm much more willing to put the cost on in the cfunction part rather than the ccall (i.e. function pointer) part.

@vtjnash
Copy link
Member Author

vtjnash commented Jan 24, 2017

Where does the design goal that ccall(cfunction(...), ...) be equivalent to f() come from? I can't see a use case for that, or reason to encourage believing that it will be made equivalent.

Duplicating the work into cfunction doesn't reduce the cost at runtime. It still has to check for the world constraints, update the pointers, rebuild the argument list, transition back to managed state, etc.
But it does add the cost and complexity of making and using a trampoline / nest. (the net result of calling into LLVM to make a new function pointer is the same as making a trampoline via nest intrinsics; the implementation doesn't change the model.)

@yuyichao
Copy link
Contributor

Where does the design goal that ccall(cfunction(...), ...) be equivalent to f() come from? I can't see a use case for that, or reason to encourage believing that it will be made equivalent.

It's not the design goal. It's just a different way of saying the total cost of construct a cfunction and calling it.

world constraints

Can be skipped since it can know if the callee doesn't need it.

update the pointers,

It's calling a method in a known world so no need to update pointer?

rebuild the argument list

Sure. I think most C libraries uses C callback that does not pass structs around so it's mostly register movs.

transition back to managed state

Never needed. Not now since it's not implemented yet and not in the future unless the callee needs to allocate.

But it does add the cost and complexity of making and using a trampoline / nest.

The cache doesn't need a trampoline. Using trampoline in the slow path will be another optimization.

@vtjnash
Copy link
Member Author

vtjnash commented Jan 24, 2017

It's calling a method in a known world so no need to update pointer?

Can be skipped since it can know if the callee doesn't need it.

yep, that optimization is the same in either case.

rebuild the argument list

yes, my point is that it's never going to be zero-cost, so it's just a choice between "load from jl_world_counter" vs. "load from trampoline". And one of those is a trampoline...

unless the callee needs to allocate.

the callee is certainly allowed to allocate

The cache doesn't need a trampoline.

What do you call a runtime thunk if you aren't going to call it a trampoline?

@yuyichao
Copy link
Contributor

What do you call a runtime thunk if you aren't going to call it a trampoline?

In that sense sure. I'm just saying that for PyCall.__init__ case we can live with a single function with a gv and while it has the same number of load, I won't really call it a trampoline (in that it doesn't have to use llvm.*.trampoline).....

Can be skipped since it can know if the callee doesn't need it.

yep, that optimization is the same in either case.

Not entirely. The cfunction in this PR still always need to read the world counter. With a world capturing cfunction, this is not necessary in some cases.

yes, my point is that it's never going to be zero-cost, so it's just a choice between "load from jl_world_counter" vs. "load from trampoline". And one of those is a trampoline...

Loading and updating the world counter isn't my concern at all. The difference is that if it is to be executed in the latest world, the cfunction always need to do a (cached) lookup.

the callee is certainly allowed to allocate

Right. The point though is that this is always done in the callee and is not part of the cfunction overhead. We'll always enter the callee unmanaged.

@vtjnash
Copy link
Member Author

vtjnash commented Jan 24, 2017

I won't really call it a trampoline (in that it doesn't have to use llvm.*.trampoline).....

You have a function pointer that captures a runtime value. Whether you do this via LLVM intrinsics or codegen, it's the same feature.

The cfunction in this PR still always need to read the world counter.

No, just like the rest of the system, it can assume that if the world counter invalidated a back-edge, it also updated the pointers to them. It only has to check and update the world counter if codegen reports that there is a dynamic user of it (i.e. a dynamic dispatch).

The difference is that if it is to be executed in the latest world, the cfunction always need to do a (cached) lookup.

I don't follow. I'm proposing that the cfunction cache be world-independent so that it doesn't do a lookup, just a pointer load at a fixed location.

We'll always enter the callee unmanaged

At the very least, we'll need to verify that a managed context exists for the thread. The cfunction thunk itself also does allocation in some cases, so it's not always going to enter the callee unmanged.

@yuyichao
Copy link
Contributor

No, just like the rest of the system, it can assume that if the world counter invalidated a back-edge, it also updated the pointers to them. It only has to check and update the world counter if codegen reports that there is a dynamic user of it (i.e. a dynamic dispatch).

I'm talking about the dispatch within the function returned by cfunction. Since a single pointer is now supposed to be always switching to the latest world how do you avoid emitting a world age or some kind of invalidation check when you emit that function?

I don't follow. I'm proposing that the cfunction cache be world-independent so that it doesn't do a lookup, just a pointer load at a fixed location.

Same as above, I'm talking about the lookup needed in the function returned by cfunction. I assume that lookup will be world dependent?

At the very least, we'll need to verify that a managed context exists for the thread.

Only if the cfunction wrapper needs it, which I'd expect to be a uncommon (e.g. passing mutable type as value)/slow (e.g. actual dynamic dispatch) case anyway. This is also different from the @threadcall case where there's something reasonable that we can do. If a cfunction has to allocate on a unmanaged thread, there's nothing we can do other than abort anyway so it's not the job of the cfunction wrapper to do all sort of verification.

@vtjnash
Copy link
Member Author

vtjnash commented Jan 24, 2017

I'm talking about the dispatch within the function returned by cfunction. Since a single pointer is now supposed to be always switching to the latest world how do you avoid emitting a world age or some kind of invalidation check when you emit that function?

JIT patching is probably the cheapest if we need to really optimize. It's a rare occurrence, so shouldn't happen very often. Most likely though, I would just go with a simple PLT.

I assume that lookup will be world dependent?

Yes, but it's important to look at which world it's dependent on, since it's actually highly asymmetric. Except for return-type and inlining, it's cheaper and easier to depending on the latest world than to depend on the runtime world. I'm considering seeing if I reflect this better in the layout of TypeMap, since I realize that currently the API looks like it would be symmetric, while in fact is it computed as an half-open interval.

Only if the cfunction wrapper needs it, which I'd expect to be a uncommon

Agreed, this is in essence the same code-gen computation of whether the world-age state was needed. Once the cfunction thunk has picked a world (regardless of mechanism), it's effectively indistinguishable how we then execute that code, since it is at that point that all choices converge into running compiled code for a fixed world age.

there's nothing we can do

... except for initializing a TLS context and and seamlessly handling the request?

@yuyichao
Copy link
Contributor

JIT patching is probably the cheapest if we need to really optimize. It's a rare occurrence, so shouldn't happen very often. Most likely though, I would just go with a simple PLT.

OK, I think this should be good enough to clear my performance concern and it sounds reasonably easy to implement. I assume you still need to load the latest world to be switched to (if needed by the function) in the function loaded from the PLT. It seems quite tricky to fix the race caused by

Thread 1 Thread 2
in plt: ptr = load slot0
global: invalidate slot0
jl_world_counter++
call ptr
in real callback: load jl_world_counter

without atomic ops in the callback/plt, (if not possible an acquire load followed by a branch on the slot load again can probably work) (~10 cycles net overhead on aarch64, free on x86).

I'm still not entirely convinced that always seeing the latest world in cfunction is the right semantics (partially because of the race) but that's a much smaller concern.

@vtjnash
Copy link
Member Author

vtjnash commented Jan 25, 2017

I agree that the race there is quite tricky. We likely don't handle that correctly anywhere right now, since we lack the lock needed to ensure the invalidation finishes before the world-counter is globally incremented. I'm figuring the PLT can load the world counter first, then the slot (ensuring this ordering is free on x86, not necessarily true elsewhere), and then select max slot->min_world, current_world. This can usually be done without a branch via a conditional mov instruction, and slot->min_world is a compile-time constant. A naive (i.e. still racy since it doesn't use atomics) implementation of the latter approach is part of this PR (primarily as a means of supporting threadcall).

Another option for avoiding that conditional move might be to do a two-stage update, first inserting a guard entry (a small assembly thunk which just wait for the world lock to be released then restarts), then does all the world invalidation computations and updates the counter, then puts in the real entry which allows the assembly thunk to resume.

@martinholters
Copy link
Member

What's the status here? @yuyichao, could @vtjnash dispel your reservations? Is this good to go?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:codegen Generation of LLVM IR and native code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants