Make cfunction optmization in codegen valid #19801

yuyichao · 2016-12-31T23:49:53Z

The backedge in inference is obviously very pessimistic, but I doubt it will cause too much trouble. Also need tests.

This makes sure that cfunction (or jl_function_ptr) always returns a function pointer that captures the calling world, and makes sure that both the runtime implementation and codegen agree on this.

Add to 0.6 milestone since this should be ready as is (missing tests and further optimizations) and should make it easier to handle cases like #19790 in packages.

yuyichao · 2017-01-01T00:09:42Z

Since cfunction overwrites the world age, I guess the condition can't be loosened by much, since AFAICT we don't track whether the callee might have dynamic dispatch.

yuyichao · 2017-01-01T00:32:16Z

Tests added. I'll probably leave the condition as is for now....

vtjnash · 2017-01-01T06:13:37Z

src/ccall.cpp

@@ -1635,7 +1635,7 @@ static jl_cgval_t emit_ccall(jl_value_t **args, size_t nargs, jl_codectx_t *ctx)
                frt = jl_tparam0(frt);
                Value *llvmf = NULL;
                JL_TRY {
-                    llvmf = jl_cfunction_object((jl_function_t*)f, frt, (jl_tupletype_t*)fargt);
+                    llvmf = jl_cfunction_object((jl_function_t*)f, frt, (jl_tupletype_t*)fargt, ctx->world);


Can also check here that min_age == max_age, to see if the backedge is right

vtjnash · 2017-01-01T06:17:01Z

src/codegen.cpp

    JL_GC_PUSH1(&argt);
    if (jl_is_tuple(argt)) {
        // TODO: maybe deprecation warning, better checking
        argt = (jl_value_t*)jl_apply_tuple_type_v((jl_value_t**)jl_data_ptr(argt), jl_nfields(argt));
    }
-    Function *llvmf = jl_cfunction_object(f, rt, (jl_tupletype_t*)argt);
+    Function *llvmf = jl_cfunction_object(f, rt, (jl_tupletype_t*)argt,
+                                          ptls->world_age);


Hm. Is this actually right? Now that I think back on some of my original work on cfunction, I think I may have contemplated they should always appear to operate in the newest world. And that it would update the JIT code as needed to make that true. Clearly I forgot about that plan as I finished up the PR though last year. But it's a new year here now!

yuyichao · 2017-01-01

I think it's very important to make sure that we can still do the same cfunction optimization that we are doing now, i.e. cfunction on known types/functions should return a constant (or relocated) pointer. How can you do it if cfunction always returns the function in the latest world?

StefanKarpinski · 2017-01-03T13:47:30Z

Merge?

vtjnash

I don't think we should do it this way. (see comments)

yuyichao · 2017-01-03T22:05:37Z

I disagree. I think it is agreed on that trying to observe a change after eval immediately (i.e. without going back to toplevel) is bad. Making cfunction always run in the latest world will do just that. The only drawback of this approach I can see is the restriction of valid world on functions that use cfunction. However, this is not a user visible API change and it can be optimized by emitting an array of cfunctions (possibly lazily) and loading with the world index.

vtjnash · 2017-01-04T18:34:41Z

> I think it is agreed on that trying to observe a change after eval immediately (i.e. without going back to toplevel) is bad.

That is usually true, which makes it a good heuristic, but I think that's just an artifact, not a principal design goal. I would instead state this principle in the reverse as "world age should not change within a dynamic scope, due to eval, or when adding a method definition, etc." However, I would also point out that cfunction creates a new dynamic scope upon entry, so it would fall into this category. I think we both agree that clearly cfunction must create a new dynamic scope, since it must do some world-age / task management (it has no idea what task it is getting called from, if any). Indeed, any transition from C -> Julia needs to configure the environment. (I wasn't entirely aware of this when making the PR, so there's a number of jl_apply calls in Julia's src that, upon reflection and review, are missing a preceding world-sync.) This further means that ccall(cfunction(f)) cannot be the same as f(), so that option isn't possible.

To see why it should be preferred to use the newest age, consider some properties of the static compilation case:

- We will only guarantee that we have code for running in the newest world (+ the type-inference world) at startup
- We want to be able to precompile all cfunctions
- Holding a reference to an existing world will "pin" that world in the function cache, preventing us from erasing its inference results
- When running code that started in an older world, we may want to disable inference / compilation and just do pure interpretation (under the assumption that it won't be called again)

One corollary of all this is that it should probably be hard to "capture" the world counter as a first-class value. It is just simpler to not have any constructs for explicitly capturing world age (other than a task).

JeffBezanson · 2017-01-04T19:02:36Z

I would find it surprising if, in

function foo()
    f()   # gives `x`
    cb = cfunction(f)
    ccall(:call_callback, T, (...), cb)   # gives `y`
end

results x and y were different. I want C to call the same function f that I see. Intuitively, a function call is factored into a lookup and an actual call. cfunction simply exposes the result of the lookup step (plus calling convention differences of course). That's especially true if we want the call to cfunction replaced with a constant pointer at compile time. IIUC, that seems to argue in favor of this PR?

vtjnash · 2017-01-04T19:19:48Z

not entirely. that's why I wrote: "This further means that ccall(cfunction(f)) cannot be the same as f(), so that option isn't possible."

JeffBezanson · 2017-01-04T19:22:48Z

Maybe it can't be exactly the same, but can it at least have the property described in my comment?

vtjnash · 2017-01-04T19:45:40Z

> That's especially true if we want the call to cfunction replaced with a constant pointer at compile time.

Nope, that actually derives the opposite result. If cfunction is merely a function of its argument (as I'm proposing), then it's relatively easy to set up a cache. What I implemented instead (and now regret) is that it is also a function of the dynamic scope captured from the Task that it is called from. For correctness, that requires ensuring that all of the interacting code being run on that Task was uniquely compiled for that exact dynamic state (this PR).

> Intuitively, a function call is factored into a lookup and an actual call. cfunction simply exposes the result of the lookup step (plus calling convention differences of course).

Yes, but to be more pedantic, it's factored as (lookup and an actual call) from a world. You can't split that into two operations without introducing a new primitive (actual call + world switch) (to go along with the basic (lookup from world)). I would prefer not to add that new primitive as a side-effect of the implementation of cfunction. (although I'm happy to consider whether we should add that primitive independent of cfunction: cf. https://discourse.julialang.org/t/proposal-for-a-first-class-dispatch-wrapper/1127)

JeffBezanson · 2017-01-04T19:55:50Z

> If cfunction is merely a function of its argument

But the world number is implicitly an argument to method lookup itself, and therefore must also be an argument to cfunction. I don't see how cfunction is different from jl_apply_generic, which is a function of world age, plus compile-time method lookup, which is a function of world age in the same way.

vtjnash · 2017-01-04T23:45:25Z

> But the world number is implicitly an argument to method lookup itself, and therefore must also be an argument to cfunction.

I'm not sure what you mean by "implicitly" in this context. Method lookup takes an explicit argument of the world number. It is explicitly passed to jl_apply via the current-task argument (sure, it's via TLS, but that's merely an optimization). Similarly to the argument types themselves, jl_apply simply knows that these arguments can't change between the lookup and apply step, so it can avoid the re-verification step. If these steps were separated, it would have to verify that the arguments to the lookup and apply steps were compatible. That further exemplifies that the world is an argument to the apply stage, and not a separate action. Here, we see that cfunction is instead defining apply very differently by capturing the world-age in the lookup step, and changing the apply world to match.
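
To make the "lookup and call from a world" factoring concrete, here is a minimal C++ sketch of a method table in which every definition carries a range of valid world ages. It is a toy model only: `ToyMethod`, `ToyMethodTable`, and `toy_apply` are hypothetical names invented for illustration and are not the structures in Julia's src. It also assumes, for simplicity, a single generic function whose newest definition replaces the previous one.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Toy model (hypothetical, not Julia's real data structures): each method
// definition is visible for a closed range of world ages.
struct ToyMethod {
    std::string body;       // stands in for the compiled code
    std::size_t min_world;  // first world in which this definition is visible
    std::size_t max_world;  // last world in which it is visible
};

// Marks a definition that has not been replaced yet.
constexpr std::size_t kOpenWorld = std::numeric_limits<std::size_t>::max();

struct ToyMethodTable {
    std::vector<ToyMethod> defs;
    std::size_t world_counter = 1;

    // Defining (or redefining) the method bumps the counter and closes the
    // validity range of the definition it replaces.
    void define(const std::string &body) {
        ++world_counter;
        if (!defs.empty())
            defs.back().max_world = world_counter - 1;
        defs.push_back({body, world_counter, kOpenWorld});
    }

    // Lookup takes the world explicitly; the result depends on it.
    const ToyMethod *lookup(std::size_t world) const {
        for (const ToyMethod &m : defs)
            if (m.min_world <= world && world <= m.max_world)
                return &m;
        return nullptr;
    }
};

// "Lookup and call from a world" as a single step: the same world number
// feeds the lookup and is the world the callee is considered to run in.
inline std::string toy_apply(const ToyMethodTable &mt, std::size_t world) {
    const ToyMethod *m = mt.lookup(world);
    assert(m != nullptr && "no method visible in this world");
    return m->body;
}
```

In this model, a world number saved before a redefinition keeps resolving to the old body through `toy_apply`, while a lookup at the current `world_counter` sees the new one. That is the sense in which the world acts as an argument to both stages.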

However, cfunction does not have a current-task argument. The values in it are therefore simply undefined. Since we don't specialize the compilation on them (current_module, current_task / tls / thread-id, & similar – although the others aren't particularly interesting), this doesn't generally cause any problems to just ignore them and hope for the best.

The world age counter is a bit different: since we specialize on it very heavily, it's not really realistic to hope for the best (too many places need to assert if it isn't set correctly). So we need to deal with it, and there are three general categories of options:

1. We could deal with it like the other dynamic scope state (e.g. assume that the C-code dealt with it, convert to dynamic dispatch if it isn't inbounds when we arrive in the cfunction). This would be the best fix for your example, it matches the behavior of jl_apply_generic, and is also very simple. But I think over time we will want to reduce our reliance on this undefined behavior, not encourage it.
2. We could deal with it by making cfunction a closure over the task that it was called from. That's what this PR does.
3. We could deal with it by making cfunction behave equivalently to the other entry points found in libjulia, namely, by creating a new dynamic context for it upon entry. This is what I want to do.
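
The three options differ only in which world the callback body runs in once C code calls the pointer. A rough C++ sketch of that decision follows. All names here (`EntryPolicy`, `ToyCallback`, `ToyEntry`, `entry_world`) are hypothetical and exist only for this illustration. The `bool` flag stands in for "the foreign code left usable task/TLS state behind", which is an assumption of the model and not something the runtime exposes in this form.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the three options; not code from Julia's runtime.
enum class EntryPolicy {
    TrustCaller,        // option 1: use whatever world the calling context carries
    CapturedAtCreation, // option 2: closure over the world where cfunction was called
    NewestOnEntry       // option 3: new dynamic scope in the newest world
};

struct ToyCallback {
    EntryPolicy policy;
    std::size_t captured_world; // world at the time the pointer was created
};

struct ToyEntry {
    bool defined;       // false: nothing tells us which world to run in
    std::size_t world;  // meaningful only when `defined` is true
};

// Decide which world the callback body runs in when C code invokes it.
inline ToyEntry entry_world(const ToyCallback &cb, bool caller_has_world,
                            std::size_t caller_world, std::size_t newest_world) {
    // Model assumption: the world counter only grows, so a captured world
    // can never be newer than the newest one.
    assert(cb.captured_world <= newest_world);
    switch (cb.policy) {
    case EntryPolicy::TrustCaller:
        if (!caller_has_world)
            return ToyEntry{false, 0};
        return ToyEntry{true, caller_world};
    case EntryPolicy::CapturedAtCreation:
        return ToyEntry{true, cb.captured_world};
    case EntryPolicy::NewestOnEntry:
        return ToyEntry{true, newest_world};
    }
    return ToyEntry{false, 0};
}
```

Only the first policy can come back undefined, which corresponds to the "hope for the best" behavior described above. The second fixes the answer at creation time, and the third ignores both the caller and the creation world.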

> I don't see how cfunction is different from jl_apply_generic, which is a function of world age, plus compile-time method lookup, which is a function of world age in the same way.

Inference does a complex little dance to build equivalence classes of world ages. I don't see this argument really going anywhere though. I think this is sort of an empty argument, since it quickly becomes a tautology between the choice to do static inference vs. dynamic de-optimization (e.g. given that's how we want to get efficient, statically compiled code, we specialize code on world age in this way. And because that makes it efficient, we use it to statically compile code before running.) I already mentioned, however, that the difference here is between whether the lookup result is a closure over the world age (e.g. apply should call convert(TT, args)) or an argument to both (e.g. apply should call typeassert(args, TT))

JeffBezanson · 2017-01-05T01:31:08Z

Ok, I think we're getting somewhere: the difference is that with cfunction, the lookup and call are not atomic. First, looking only at my specific example:

function foo()
    f()   # line 1
    cb = cfunction(f)   # line 2
    ccall(:call_callback, T, (...), cb)   # line 3
end

is there anything that can happen between lines 2 and 3 that renders it invalid to look up f in the world foo is executing in? I don't believe so.

Next, I'll grant that you might call cfunction at a random time (e.g. load time), save the pointer, and actually pass it to C much later (possibly in a different world). But since the address to call is fixed, it's not easy to avoid mismatches between the compiled code and the current world when it executes. It seems the address would have to point to a trampoline that re-compiles the function, or calls an unoptimized version of it, if the world has changed. Is that what you propose? I'm almost inclined not to support that style of use at all, and basically mandate the pattern I have above where cfunction is always called (semantically) at the point where a function is passed to C. We could potentially do that e.g. by allowing cfunction to appear only as an argument to ccall.

You seem to be saying that a callback can't see the world counter in the TLS. I don't see why --- can't a callback still look at the TLS just like any other code?

yuyichao · 2017-01-05T01:49:56Z

> It seems the address would have to point to a trampoline that re-compiles the function, or calls an unoptimized version of it, if the world has changed. Is that what you propose? I'm almost inclined not to support that style of use at all, and basically mandate the pattern I have above where cfunction is always called (semantically) at the point where a function is passed to C.

I'm not against such a pattern, but it's worth pointing out that such an implementation will almost certainly break the usage of running the function on an unmanaged thread, which is the current recommended/documented way to interact with threaded callbacks and is used by @threadcall. I'd like to not break that before we have proper unmanaged thread support.

JeffBezanson · 2017-01-05T01:53:19Z

> such an implementation will almost certainly break the usage of running the function on an unmanaged thread

Which part are you referring to? The trampoline, or the second part of the bit you quoted?

yuyichao · 2017-01-05T02:00:29Z

The trampoline

I was also trying to reply to whether the usage should be supported so I copied too much.....

For the usage pattern, I feel like it might be useful to have a version of cfunction that does not change the world and does trampoline/dynamic dispatch, but I'm not sure if it's worth having two versions of cfunction.

vtjnash · 2017-01-05T02:02:35Z

> is there anything that can happen between lines 2 and 3 that renders it invalid to look up f in the world foo is executing in? I don't believe so.

No, in this case the trouble would be outside foo. And it's only a problem for case 2 in my list above (this PR). The trouble being that you don't know what the world counter will be at runtime, so you can't have cfunction return a pre-generated function pointer during optimization.

> Next, I'll grant that you might call cfunction at a random time (e.g. load time), save the pointer, and actually pass it to C much later (possibly in a different world). But since the address to call is fixed, it's not easy to avoid mismatches between the compiled code and the current world when it executes.

You say "not easy"; I say that means we have to write our IO scheduler to create a dedicated Task to run all callbacks.

> It seems the address would have to point to a trampoline that re-compiles the function, or calls an unoptimized version of it, if the world has changed. Is that what you propose?

That's probably true for case 3. Although it's pretty easy to optimize, since the code that will get run is a function of the cfunction signature and not the runtime context, that state is easy to track with a backedge and rewrite (dynamic patching) when required.

For case 1, you either have to hard abort() if the world-age mismatches, or dynamic dispatch to the right function.

> I'm almost inclined not to support that style of use at all, and basically mandate the pattern I have above where cfunction is always called (semantically) at the point where a function is passed to C. We could potentially do that e.g. by allowing cfunction to appear only as an argument to ccall.

Again, would have to rewrite the scheduler, and it'd be tricky to use cfunction as an FFI for arbitrary callbacks, but sure, who was using those (*cough* Gtk *cough* PyCall *cough*)?

> You seem to be saying that a callback can't see the world counter in the TLS. I don't see why --- can't a callback still look at the TLS just like any other code?

TLS is sometimes not present (@threadcall), or just generally may not be managed by the foreign code. It's not really a big deal – it just happens to mean that supporting case 1 (making your specific example work) means potentially dropping inference inside the cfunction (it's impossible for inference to know what the value of the world-age will be, unlike for normal jl_apply, where it knows exactly). We likely can do optimistic optimization (assume the world won't change, and branch as needed), but that's obviously not completely ideal.

vtjnash · 2017-01-05T02:16:38Z

> I'm not against such a pattern, but it's worth pointing out that such an implementation will almost certainly break the usage of running the function on an unmanaged thread, which is the current recommended/documented way to interact with threaded callbacks and is used by @threadcall. I'd like to not break that before we have proper unmanaged thread support.

I think we currently only allow ccall on bitstypes loaded from constants? I think we can depend upon that not getting invalidated under all of the three above scenarios. We already have to detect this situation in the cfunction to work around it, so it's not a huge deal to continue handling it that way.

To have proper unmanaged thread support, we will probably need some way of realizing a new Task on that stack. It's probably possible we could fold that into cfunction's responsibilities. Option 1 is the only option that isn't compatible with that, but it's not really compatible with threads (or tasks, or eval) in general anyways (as you noted above, it would be mandated that the reference to the cfunction becomes invalid when your function goes out of scope).

JeffBezanson · 2017-01-05T03:11:43Z

Isn't it true that either with or without this PR, we have the problem of the world changing between getting a pointer from cfunction and some C code calling it? In that sense, there is no "correct" world to pass to cfunction. AFAICT, this PR doesn't change the fact that cfunction "captures" a world, it just changes which world, which still might prove to be the wrong one. I just want to check my understanding here.

vtjnash · 2017-01-05T03:40:56Z

That's correct, this just tries to better check that we have optimized the code to capture the expected world. That's also why I'm arguing that cfunction shouldn't take a world argument at all. That's how I initially wrote the implementation in #17057 (also with a TODO comment to sort it out better later), but then changed it at the last minute. The original version (case 3) worked much better on PkgEval, but was running into very rare assertion failures (since I wasn't managing the world correctly on entry). I went with case 2, even though it seemed to not work as well, simply because it had a simple mental model that worked consistently (although not necessarily working well).

JeffBezanson · 2017-01-05T04:05:08Z

Ok, I think I'm starting to get a handle on this then. My understanding of your preferred approach:

1. Entering a cfunction callback enters a new dynamic scope.
2. Those dynamic scopes are always in the newest world.
3. To make that work, when something changes, update all affected cfunctions by patching their code, so that they're always correct for the newest world (could be done lazily, but same idea).

If the first point is assumed, I think the rest follows pretty naturally. If you're going to enter a new dynamic scope, it might as well be in the newest world.

The first point is the surprising part though. It's not immediately intuitive that entering a cfunction would be a new dynamic scope. I think in a perfect world we wouldn't want that behavior, but the argument seems to be that staying in the same dynamic scope is impossible since there's no reliable way to pass the world number through. If we could get at the world number, the cfunction could contain code to check the world number and branch out if it needs to be recompiled. But that check is a bit expensive, so we'd rather not bother. Is that a fair summary?
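
The approach summarized above might look roughly like the following C++ sketch: each cfunction gets one stable slot (standing in for the fixed address handed to C), and a method definition patches the affected slot so that a call through it always reaches the newest definition. `PatchableRegistry`, `Slot`, `make_callback`, and `call_from_c` are hypothetical names for this illustration, and keying slots by function name is a simplification of "the cfunction signature". The real runtime would patch generated code, not a `std::function`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical model of "patch every affected cfunction when the world
// changes"; none of these names exist in Julia's runtime.
struct PatchableRegistry {
    std::size_t newest_world = 1;

    // Current definition of each function (stand-in for the method table).
    std::map<std::string, std::function<int(int)>> methods;

    // One slot per cfunction. The slot's address stays fixed for the life
    // of the registry; only the target stored in it is replaced.
    struct Slot {
        std::function<int(int)> target;
        std::size_t resolved_world;
    };
    std::map<std::string, Slot> slots;

    // Hand out a stable slot, standing in for the C function pointer.
    // Throws std::out_of_range if `name` has no definition yet.
    Slot *make_callback(const std::string &name) {
        Slot &s = slots[name];
        s.target = methods.at(name);
        s.resolved_world = newest_world;
        return &s;
    }

    // Defining a method advances the world and eagerly patches the slot
    // that depends on it (the role a backedge would play).
    void define(const std::string &name, std::function<int(int)> body) {
        ++newest_world;
        methods[name] = std::move(body);
        auto it = slots.find(name);
        if (it != slots.end()) {
            it->second.target = methods[name];
            it->second.resolved_world = newest_world;
        }
    }
};

// What the C side does: call through the slot with no knowledge of worlds.
inline int call_from_c(const PatchableRegistry::Slot *slot, int x) {
    assert(slot != nullptr);
    return slot->target(x);
}
```

Patching eagerly in `define` keeps the call path free of any world check. The lazy variant mentioned above would instead compare `resolved_world` against `newest_world` on entry and re-resolve on a mismatch, which is the check described as a bit expensive.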

vtjnash · 2017-01-05T04:58:27Z

Yes, that's a good summary.

> The first point is the surprising part though. It's not immediately intuitive that entering a cfunction would be a new dynamic scope. I think in a perfect world we wouldn't want that behavior

This is the bit I would actually disagree with (well, except the "surprising" and "non intuitive" bits). After recently reviewing all of the places in base where we call back into Julia code, I discovered that we should always change to the newest dynamic scope. I honestly wasn't expecting that. But it looks like anywhere I didn't do this was simply an omission due to starting from the wrong assumption about this (and because those places are generally only reached while handling toplevel expressions, it's harder to construct a case where you could notice the difference).

I don't know that I have a full mental model of why this seems to be the case so consistently. The best notion I have right now to explain it is that it ends up being highly desirable to distinguish between call-backs vs. call-forwards (cf. the invokelatest PR & related issues). I think that cfunction clearly is intended for creating a callback, so we want it to have the new-scope behavior.

I think that just leaves the behavior of the ccall(cfunction()) example as a bit of an oddity, because it makes a callback, then immediately tries to call-forward to it. I think the main observation here is just that other languages don't distinguish between call-backs and call-forwards (or call-now?), so we don't have prior-art to reflect upon.

If this was a perfect world, I would propose a dual system, where a call (any call, not just cfunction) automatically transformed from call-forward to call-back depending on whether the function argument was derived from below or above on the call-stack. But I think that's probably mostly nonsense.

yuyichao · 2017-01-05T06:11:38Z

> If the first point is assumed, I think the rest follows pretty naturally. If you're going to enter a new dynamic scope, it might as well be in the newest world.

It could also be the world on construction (or whatever captured world) since that's the only solution I can think of that doesn't break @threadcall.

yuyichao · 2017-01-05T06:29:38Z

> I don't know that I have a full mental model of why this seems to be the case so consistently.

I believe it's because that's what every dynamic language does, so doing that and not doing any static optimization would seem to be "correct" and is what most people expect the fix of #265 to do. In other words, it seems correct since most people won't want to run the old method when a newer one is available (I won't be surprised if they do in the future though).

In this sense, I think all three versions could match what one expects in terms of the world the callback is executed in (basically anything newer than the one on construction should be fine). However, since most people also expect the cfunction to not have any dynamic dispatch and this is also a documented feature used in Base and packages, I think we should use the version in this PR.

> I think that cfunction clearly is intended for creating a callback, so we want it to have the new-scope behavior.

What's "call-forward" vs "call-back"? What's the difference between the lt kwarg in sort and the function pointer argument in qsort? Why should one execute in the current world whereas the other executes in the latest one?

vtjnash · 2017-01-05T18:06:39Z

> that's the only solution I can think of that doesn't break @threadcall.

@threadcall was a broken hack when it was merged. What we do here shouldn't be predicated on the requirement of reusing the existing hacks.

> However, since most people also expect the cfunction to not have any dynamic dispatch and this is also a documented feature used in Base and packages I think we should use the version in this PR

That's not a good expectation, since it's neither documented nor accurate.

vtjnash · 2017-01-05T20:54:16Z

x-ref my proposal for adding a Callback type with approximately identical correspondence such that ccall : call :: cfunction : Callback, as motivated by the discussion here : https://discourse.julialang.org/t/proposal-for-a-first-class-dispatch-wrapper/1127/2?u=jameson

StefanKarpinski · 2017-01-26T02:49:57Z

@vtjnash: what's going on with this?

vtjnash · 2017-01-26T07:20:30Z

someone should merge #20167 and close this

StefanKarpinski · 2017-01-30T18:39:01Z

@vtjnash, #20167 has been merged, please close this if it's no longer relevant.

yuyichao added the compiler:codegen Generation of LLVM IR and native code label Dec 31, 2016

yuyichao added this to the 0.6.0 milestone Dec 31, 2016

yuyichao requested a review from vtjnash December 31, 2016 23:49

yuyichao mentioned this pull request Dec 31, 2016

Fix for 0.6 JuliaMath/Cubature.jl#23

Merged

Make cfunction optmization in codegen valid

37c85de

yuyichao force-pushed the yyc/codegen/cfunction branch from c904726 to 37c85de Compare January 1, 2017 00:31

yuyichao changed the title ~~[WIP] Make cfunction optmization in codegen valid~~ Make cfunction optmization in codegen valid Jan 1, 2017

vtjnash reviewed Jan 1, 2017

vtjnash requested changes Jan 3, 2017

vtjnash closed this Jan 30, 2017

yuyichao deleted the yyc/codegen/cfunction branch January 30, 2017 18:51

vtjnash mentioned this pull request Feb 17, 2017

Dispatch issues with size() on v0.6-dev JuliaArrays/StaticArrays.jl#106

Closed
