[RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead #17097
Comments
Hi @reminisce , I have several questions.
To my knowledge, if we use pybind11 or Python's C extension, the interfaces need to be rebuilt to support other versions of Python. |
I have a good idea, I think we can build our own custom ctypes system or some other middleware to solve it.
|
@wkcn that can be done as part of |
How much engineering effort does it take to do (1)? |
I think we can first replace the code-gen here with pybind11. https://github.com/apache/incubator-mxnet/blob/521c477ad32864d887481abf6c53acae3b717cf6/python/mxnet/ndarray/register.py#L115-L269 |
Here is another candidate that I highly recommend: adopt TVM's FFI convention.
The historical problem of the MXNet FFI was the ballooning number of C API bindings as we added new features. This created a huge maintenance burden.
The real problem was not which FFI system to adopt (Cython and pybind are both fine in that regard, except for the cost of compilation), but the cost of maintaining the FFI. MXNet used to have a fast Cython binding, but it was abandoned because we kept adding new APIs and could not keep both the ctypes and Cython bindings up to date.
When developing TVM we learned from that lesson and restricted the API to a limited set of runtime APIs that do not change, with stable Cython and ctypes bindings for them. The runtime supports a type-erased function (PackedFunc), which can be called efficiently from any frontend language, and all the APIs are exposed through PackedFunc. On the Python side, an additional wrapper is created for better documentation and calls into the PackedFunc. See https://docs.tvm.ai/dev/runtime.html for more. The system has worked well for a few years now.
Of course, I understand there are legacy issues in MXNet, which is why I did not bring this proposal up before. But given that this is a proposal for 2.0, I would encourage everyone to give this possibility serious thought. |
@tqchen thanks for sharing this! Is there any reference benchmark result that stress tests the ffi overhead? |
I don't have any benchmarks at hand, but it would be great if someone could help create one with Cython enabled. |
OK, here is a script for a quick check. Note that it is important to compile TVM with Cython by running make cython3 in the root folder. TVM uses Cython by default if it is available, but we use the TVM_FFI environment variable to make sure that is the case.
import timeit
setup = """
import tvm
x = tvm.nd.array([0])
y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
stmt='nop(x, y)')
timer.timeit(1)
num_repeat = 1000
print(timer.timeit(num_repeat) / num_repeat)
Results on my laptop (MacBook Pro 13-inch)
|
I think interested folks can follow the last post in #14883 (comment). The general takeaway seems to be that the overhead is quite close to NumPy's. |
@tqchen Thanks for sharing the benchmark results. We did consider TVM FFI as a candidate, and I strongly agree with your suggestion of keeping a limited set of runtime APIs for sustainable maintainability. The Python op API overhead is generally more noticeable when passing Python native data structures, such as lists and tuples, to the backend. I modified your script to pass a Python tuple as an argument, and the overhead is around 19us with Cython enabled, while pybind is normally less than 400ns and NumPy comes with even lower overhead. Did I make any mistakes in coding up the test script, or is it something that can be addressed in TVM FFI? Thanks.
import timeit
setup = """
import tvm
#x = tvm.nd.array([0])
#y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
stmt='nop((1, 2, 3, 4))')
timer.timeit(1)
num_repeat = 1000
print(timer.timeit(num_repeat) / num_repeat) |
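For readers without a TVM build, the measurement pattern used in the two scripts above can be reproduced with a stand-in. The sketch below is an assumption-laden simplification: `nop` is a pure-Python function, not `tvm._api_internal._nop`, so the numbers only reflect interpreter call cost and say nothing about any FFI. `per_call_seconds` is a hypothetical helper name.

```python
import timeit


def nop(*args):
    """Stand-in for a backend no-op; it ignores its arguments."""
    return None


def per_call_seconds(stmt, repeat=1000):
    """Average wall time of one execution of `stmt`, after one warm-up run."""
    timer = timeit.Timer(stmt=stmt, globals={"nop": nop})
    timer.timeit(1)  # warm-up call, excluded from the measurement
    return timer.timeit(repeat) / repeat


scalar_cost = per_call_seconds("nop(1, 2, 3, 4)")
tuple_cost = per_call_seconds("nop((1, 2, 3, 4))")
print("scalars:", scalar_cost)
print("tuple:  ", tuple_cost)
```

Swapping the `globals` entry for a real binding would give the comparison discussed in the thread; the harness itself stays the same.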
First of all, it would have been great if the issue of native data structures had been brought up in the RFC in the first place :) Every design has tradeoffs, and most of the TVM FFI's design started as an artifact of Amdahl's law. In particular, it is interesting to ask how much native Python data structure support we need. For example, most fast-path operators (e.g. add/sub) and common operators do not take a tuple as an argument. As a layman's example, I don't care if my save-to-JSON function takes 100us to finish, because that is not going to be the bottleneck of my program. When tuples and strings are involved, the typical operators (such as conv2d) are usually large. Moreover, when such operators are constructed in a two-phase manner, or passed through a partially specializing annotator that translates the program, the constant tuple is constructed only once rather than in a loop (as in the following program).
class Module:
    def __init__(self):
        # constructed once
        self.conv2d = conv2d(kernel=(2,2))
    def forward(self, x):
        # run multiple times
        return self.conv2d(x)
# hybridize can transform the code below to the above form.
@hybridize
def myfunc(x):
    y = conv2d(x, kernel=(2,2))
    return y
Of course, if we really say "hey, we want the int tuple case to be fast", just like the copy case in #14883, we can introduce native tuple objects into the TVM FFI to improve the performance of passing tuples as arguments. The general design philosophy is that we only need to make fast the things that need to be fast. The FFI can efficiently handle the objects it recognizes, while having a reasonably slow fallback for cases that are not necessarily the bottleneck. As I said at the beginning of the post, the spectrum of "fast arguments" should really have been discussed in the RFC in the first place. It would be great if we could have a more constructive discussion about these use cases, then add technical reasoning, before reaching a verdict. |
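The two-phase argument above can be checked with a small counting experiment. This is only a sketch: `marshal_shape`, `Conv2D`, and `conv2d_eager` are hypothetical stand-ins, and the "operator" just returns its inputs. What it shows is how often the tuple conversion runs in each style.

```python
conversions = 0  # counts how often the (hypothetical) tuple marshaling runs


def marshal_shape(shape):
    """Stand-in for the Python-to-backend tuple conversion."""
    global conversions
    conversions += 1
    return tuple(int(v) for v in shape)


class Conv2D:
    def __init__(self, kernel):
        self.kernel = marshal_shape(kernel)  # converted once, at construction

    def __call__(self, x):
        return (x, self.kernel)  # placeholder for the real operator call


def conv2d_eager(x, kernel):
    return (x, marshal_shape(kernel))  # converted on every call


layer = Conv2D(kernel=(2, 2))
for step in range(100):
    layer(step)
two_phase = conversions

conversions = 0
for step in range(100):
    conv2d_eager(step, kernel=(2, 2))
eager = conversions

print(two_phase, eager)  # 1 100
```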
I think we need to support |
@tqchen Thanks for explaining things inside out. Please know that I'm not against the TVM FFI design. In fact, it's great to know that you think passing Python native data structures can be accelerated through engineering in TVM FFI. This is vital for keeping the future MXNet runtime API to a limited set for scalable and sustainable maintainability. Putting the design decision aside, I want to share that there has been an extremely strong motivation and need to squeeze out the latency of passing Python native data structures in the op interface. Since MXNet is embracing NumPy compatibility, we want to get the op invocation performance on par with NumPy to be appealing to the classic machine learning community. We have compared the performance of MXNet and NumPy on a number of classic machine learning models and found that optimizing the passing of Python native data structures is critical for MXNet to be on par with NumPy. Even for deep learning itself, this is also important in some applications. In #16716,
With that being said, please allow me to summarize what we have reached here. I think we are aligned on exploring TVM FFI to get a clear engineering view of accelerating the passing of Python native data structures as arguments. We can start with tuples, and extend the findings to lists and strings later. Thanks, everyone, for a great discussion, and sorry for the late responses, since I have been on vacation this week. I didn't expect this PoC task item to become a full-fledged RFC involving this many interested folks. I will be sure to make the post more descriptive and self-explanatory next time. Have a nice holiday! :) |
To follow up on this thread about bringing native support to the TVM FFI, I will first discuss ways to handle tuples, and then some of the pros and cons. First of all, we all know that at some point we have to translate the Python data structure into C++. The main question is where that translation happens. In the pybind case, the translation happens on the C++ side, by passing the PyObject directly to C++. In the case of Cython, the translation happens at the C API level. In the case of the TVM FFI, the translation can happen in a Python wrapper. The following code gives two examples of such translation (myempty0 and myempty1), the support for
In the first approach (myempty0), we directly unpack the tuple into positional arguments, encoding the data structure as flattened arguments. In the second approach (myempty1), we first create an IntTuple object that the C++ side can recognize and then pass it to the C++ side. Note that at the moment all the operations are done in Python; if there is a concern about the wrapping cost, we can certainly move some of them into Cython. We could introduce a third approach (myempty2), which directly passes a
import timeit
import tvm
nop = tvm._api_internal._nop
setup = """
import tvm
x = tvm.nd.array([0])
y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
stmt='nop((1,2,1))')
timer.timeit(1)
num_repeat = 1000
print("tvm.nowrap:", timer.timeit(num_repeat) / num_repeat)
setup = """
import numpy as np
"""
timer = timeit.Timer(setup=setup,
stmt='np.empty((1,2,1))')
timer.timeit(1)
print("numpy.empty:", timer.timeit(num_repeat) / num_repeat)
def myempty0(shape):
    return nop(*shape)
def myempty1(shape):
    return nop(tvm.container.IntTuple(*shape))
setup = """
import numpy as np
import tvm
from __main__ import myempty0, myempty1
"""
timer = timeit.Timer(setup=setup,
stmt='myempty0((1,2,1))')
timer.timeit(1)
print("tvm.myempty0:", timer.timeit(num_repeat) / num_repeat)
timer = timeit.Timer(setup=setup,
stmt='myempty1((1,2,1))')
timer.timeit(1)
print("tvm.myempty1:", timer.timeit(num_repeat) / num_repeat)
setup = """
import tvm
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
stmt='nop("mystr")')
timer.timeit(1)
num_repeat = 1000
print("tvm.str_arg:", timer.timeit(num_repeat) / num_repeat)
Here are results on my computer:
As we can see, the
Discussion
As explained at the beginning, the real question is where the wrapping should happen. In the case of TVM, the wrapping usually happens at the native language (Python) level, because we know a Python-side wrapper is needed anyway for better code, type checking and docs. The translation turns the Python arguments into arguments that the runtime can recognize. In the case of pybind, the translation happens at the C++ level, by calling into the Python C API (the myempty2 approach is similar to this one). The advantage of exposing PyObject and related operations at the C++ level is certainly the deferred marshaling of data structures. On the other hand, such an approach ties the FFI directly to Python. It means other language frontends can no longer benefit from the new FFI. Similarly, if we want to package some of the functions into a minimal runtime that is independent of Python, we can no longer do that. This is why, although in theory we could bring PyObject (or a related proxy) into the TVM runtime, we have not done so. Of course this is an interesting tradeoff, and everyone is welcome to share their thoughts. Here is a summary of points that can be discussed:
|
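The two wrapping strategies discussed above (flattening into positional arguments versus boxing into a container the backend recognizes) can be sketched without TVM. Everything here is hypothetical: this `IntTuple` is a plain Python class, not `tvm.container.IntTuple`, and `backend_empty` merely stands in for a packed backend function that accepts only ints or an `IntTuple`.

```python
class IntTuple:
    """Hypothetical container the backend is assumed to recognize."""
    __slots__ = ("values",)

    def __init__(self, *values):
        self.values = tuple(int(v) for v in values)


def backend_empty(*args):
    """Stand-in for the packed backend function: accepts either encoding."""
    if len(args) == 1 and isinstance(args[0], IntTuple):
        return args[0].values
    if all(isinstance(a, int) for a in args):
        return tuple(args)
    raise TypeError("backend only understands ints or IntTuple")


def empty_flat(shape):
    # Approach 0: unpack the tuple into flattened positional arguments.
    return backend_empty(*shape)


def empty_boxed(shape):
    # Approach 1: box the tuple into an object the backend recognizes.
    return backend_empty(IntTuple(*shape))


print(empty_flat((1, 2, 1)), empty_boxed((1, 2, 1)))
```

In both variants the raw Python tuple never reaches the backend, which is the property that keeps the backend independent of Python in this sketch.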
Some additional thoughts along the line:
|
@reminisce Do you have data on what is the time split between the FFI itself and engine scheduling inside MXNet backend? |
@tqchen For the "fast path" structures that we need to support, I'm considering the following:
import mxnet.numpy as mnp
import numpy as np
a = mnp.array(np.array(1))
b = a + np.array(1)
Also, there are two scenarios when list will be involved:
|
@sxjscience here are some quick thoughts (of course passing pyobject kind of "solves" the problem, so I am discussing the wrapping that can be done through the tvm ffi).
The py_slice case is the trickiest. My guess is that it could be accelerated through a Cython layer that translates the slice into a flattened representation. Of course that is not ideal, and pybind may be better for this case if we only want to handle it in C++. |
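As an illustration of what a "flattened representation" of an index could look like, here is a pure-Python sketch. The type codes and the `MISSING` sentinel are assumptions made up for this example, not an encoding used by TVM or MXNet, and a Cython layer would do the same work with typed variables.

```python
# Hypothetical type codes for a flat index encoding.
K_INT, K_SLICE, K_ELLIPSIS, K_NONE = 0, 1, 2, 3
MISSING = -(2 ** 62)  # sentinel for an omitted slice bound (assumption)


def flatten_index(index):
    """Encode an index expression as a flat list of integers."""
    if not isinstance(index, tuple):
        index = (index,)
    flat = []
    for item in index:
        if item is None:
            flat.append(K_NONE)
        elif item is Ellipsis:
            flat.append(K_ELLIPSIS)
        elif isinstance(item, slice):
            bounds = [MISSING if b is None else int(b)
                      for b in (item.start, item.stop, item.step)]
            flat.extend([K_SLICE, *bounds])
        elif isinstance(item, int):
            flat.extend([K_INT, item])
        else:
            raise TypeError(f"unsupported index item: {type(item).__name__}")
    return flat


print(flatten_index((None, ..., slice(0, 100, 2))))  # [3, 2, 1, 0, 100, 2]
```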
@ptrendx I benchmarked |
OK, so it is about a 50:50 split. Is there also work underway to profile where the time in the engine is spent? |
@ptrendx Yes, there is an effort to profile the engine code flow using VTune. We hope the exercise can pinpoint the hotspots that contribute most to the latency. The further time split within the pure C++ part, between setup code (shape/type inference, memory allocation, dependency setup) and op scheduling, is also around 50% vs. 50%. For the "fast path" data structures, I'm summarizing the items as follows (including the ones suggested by @sxjscience):
|
Thanks for the discussion so far. To clarify the technical questions and discuss the tradeoffs further: the following fast paths can be addressed in the TVM FFI:
The following items need to be discussed:
Of course, all of the above cases can be solved by pybind, or by a mix of pybind and the TVM FFI. It would certainly be interesting to discuss the possible engineering paths.
Technical Choices and Tradeoffs
The main technical tradeoff points that influence the decision are as follows:
|
After some more thought in this direction, I found a better and more fun answer to the question above: how to support tuple/ellipsis/slice in the TVM FFI effectively. I hacked up a POC in https://github.com/tqchen/tvm/tree/poc-pyffi (latest commit) that supports the following benchmark script (disclaimer: it is only a POC, so it is not intended for use and not fully optimized, but it demonstrates all the technical flows necessary to make a fully functioning FFI).
import timeit
import tvm
nop = tvm._api_internal._nop
setup = """
import tvm
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
stmt='nop((None,..., slice(0, 100, 2)))')
timer.timeit(1)
num_repeat = 1000
print("tvm.tuple_slice_ellipsis_combo:", timer.timeit(num_repeat) / num_repeat)
setup = """
import numpy as np
"""
timer = timeit.Timer(setup=setup,
stmt='np.empty((1,2,1))')
timer.timeit(1)
print("numpy.empty:", timer.timeit(num_repeat) / num_repeat)
setup = """
import tvm
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
stmt='nop("mystr")')
timer.timeit(1)
num_repeat = 1000
print("tvm.str_arg:", timer.timeit(num_repeat) / num_repeat)
On my laptop (MacBook 13-inch), the results are as follows
What is Implemented in the POC
In the POC, we introduced specific objects for Ellipsis, Slice and Tuple (already supported in ADT). During a PackedFunc call, a Python tuple/ellipsis/slice is converted into the object that is supported by the backend. We implemented a Cython version of the conversion (the previous recursive conversion was in Python) to back it up. The reason we are able to create Objects on the Cython side is that all TVM objects have recently been made POD-C compatible, so an object can be created on the Cython side without crossing the DLL boundary and then passed to the C++ backend. We can see from the benchmark that the cost of such a deep copy is at a reasonable level. We also only used the default memory allocator, so there could be room for further improvement.
Technical Choices and Tradeoffs
Please also see the tradeoff discussion in the last post. As we can see, the main difference here is where to do the conversion, and whether we do a lazy or a deep copy:
The laziness certainly avoids a copy in cases where we do not necessarily need to book-keep the created argument. On the other hand, supporting a common data structure in the c++ side means the binding can potentially be reused by other language frontends. |
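The recursive deep-copy conversion described for the POC can be sketched in plain Python. The classes below (`TupleObj`, `SliceObj`, `EllipsisObj`) and the function `to_backend` are hypothetical names for illustration; the actual POC does this in Cython against TVM's object system, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SliceObj:
    start: Optional[int]
    stop: Optional[int]
    step: Optional[int]


@dataclass(frozen=True)
class EllipsisObj:
    pass


@dataclass(frozen=True)
class TupleObj:
    fields: Tuple[object, ...]


def to_backend(value):
    """Recursively deep-copy a Python argument into backend-style objects."""
    if isinstance(value, tuple):
        return TupleObj(tuple(to_backend(v) for v in value))
    if isinstance(value, slice):
        return SliceObj(value.start, value.stop, value.step)
    if value is Ellipsis:
        return EllipsisObj()
    if value is None or isinstance(value, (int, float, str)):
        return value  # primitives pass through unchanged
    raise TypeError(f"unsupported argument type: {type(value).__name__}")


converted = to_backend((None, ..., slice(0, 100, 2)))
print(converted)
```

Because the result holds no reference to the original Python objects, a backend written in another language could consume it, which is the reuse argument made in the paragraph above; the price is the eager copy.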
Thanks @tqchen for sharing the PoC code within such a short timeframe. :) The numbers look promising even with Python native objects deeply copied. Pybind performs a deep copy by default unless the receiving object on the C++ end is marked as |
Following this branch I made a simple POC on MXNet side (code here). It turns out that passing a python |
@hzfan thanks for implementing a POC :) However, there is a subtle but important difference which is worth discussing here :) I will use cython-ffi to refer to the above POC, and tvm-ffi to refer to TVM's POC.
The difference again boils down to the design point of what is a clear cut of FFI conventions. Ideally, it would be: a stable set of C API and object structures that does not change over time. |
What's the point of having an API if you type erase it? Then you might as
well have a single function API with a type erased callback name to select
the function to call. In the end you move the burden away from the API to
the callers and inside the API to the dispatchers. For going this route of
uber-clever template tricks to generate code, I think it's better to just
put in place proper code generation for maintainability. Could you provide
a bit more details about tradeoffs? Everything has tradeoffs, I don't
believe any solution which is sold as a panacea, there's no silver bullet.
|
Pybind is nice. I used Boost.Python many years ago, which I think pybind is based on. The problem with this is the hourglass C bindings: you have to go from Python to C++/pybind, down to C, and then to the engine, which seems like a lot of boilerplate.
|
@larroy indeed, every solution has tradeoffs. These tradeoffs are discussed in the posts above where we compare solutions, and they are backed by benchmarks :) It would be great if you could also suggest potential tradeoffs here.
When you expose an API from a typed language (C++) to a dynamic language (Python), you have to type-erase it, given that Python functions don't carry types, and you have to pass the type information along. The only difference is where you do the type checking (that the Python type corresponds to the right C++ type) and the translation (to the C++ type).
For example, in the case of pybind, the erasure is done implicitly when you call the Python function; checking and translation then happen when you call into the C++ function. In the case of creating a C API for each feature and wrapping things on the Python side, both the type checking and the translation are done on the Python side. In the case of the TVM FFI, the type translation is done on the Python/Cython side, while the type checking is done in C++.
To dive deeper into the tradeoffs of the PackedFunc calling convention: the convention erases the type by storing a type code alongside each argument. This brings the additional cost of passing arguments through the heap, as opposed to registers. So it might not suit inline functions that need to run on the order of 1e-9s; however, for API functions that need to run at around the 1e-7s or even 1e-8s level, this convention is pretty good.
In terms of the calling cost, it really depends on whether the caller and callee are strongly typed.
- If the caller is strongly typed, then assigning the type code is O(1).
- If the caller is dynamically typed (like Python), then we need a dispatcher to select the right type code.
- If the callee is strongly typed, then the cost of checking is O(1): just check that the code is the correct one.
- If the callee is dynamically typed, then a dispatch needs to happen, which adds another level of hashtable lookup, also O(1).
As we can see, the only place where dispatching is necessary is the dynamic type handling case. Even in these cases, if there is a strong need for specialization, we can force the type directly by running the check on the caller side and passing in the right type code (the engineering burden is the same as wrapping the C API). However, the benchmark suggests that the dynamic dispatching cost is reasonable and satisfies the API speed requirement.
Coming back to the tradeoff, the main one here is the engineering burden of keeping an hourglass design (with a fixed set of APIs) vs. efficiency. While my post did not suggest that TVM's FFI is a silver bullet, it does work pretty well for our use cases. Hope it helps. |
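The caller/callee cost cases above can be illustrated with a small sketch of type-code tagging. The `TypeCode` values, `pack`, `expect`, and `scale` are hypothetical and do not mirror TVM's actual codes or layout; the sketch only shows a dynamic caller selecting a code with one dictionary lookup and a strongly typed callee checking it with one comparison.

```python
from enum import IntEnum


class TypeCode(IntEnum):  # hypothetical codes, not TVM's actual values
    INT = 0
    FLOAT = 1
    STR = 2


# Dynamic caller: one hashtable lookup on the Python type selects the code.
_DISPATCH = {int: TypeCode.INT, float: TypeCode.FLOAT, str: TypeCode.STR}


def pack(value):
    """Attach a type code to a value, as a dynamically typed caller would."""
    try:
        return (_DISPATCH[type(value)], value)
    except KeyError:
        raise TypeError(f"no type code for {type(value).__name__}") from None


def expect(packed, code):
    """Strongly typed callee: checking is a single comparison on the code."""
    actual, value = packed
    if actual != code:
        raise TypeError(f"expected {code.name}, got {TypeCode(actual).name}")
    return value


def scale(packed_x, packed_factor):
    """Callee with a fixed signature (FLOAT, INT)."""
    return expect(packed_x, TypeCode.FLOAT) * expect(packed_factor, TypeCode.INT)


print(scale(pack(1.5), pack(2)))  # 3.0
```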
Thanks for the explanation. I'm not so concerned about the complexity of dispatching. If I understood you correctly, the main benefit you describe for the TVM project is not having to change the C API, but you still need to do type checking on both ends, or at least on the receiving end of the API, correct? I think we have discussed similar things in the past, and we might have different views on strongly typed vs. dynamically typed. A priori I prefer an API which can be evolved and changed; I find it more explicit and clearer than what I think you do with PackedFunc, which I have looked at briefly but not used extensively. If one is going to call into the C API using pybind, does it make sense to layer a C++ API on top of the C API for this?
Also, these microbenchmarks are nice, but we also need to consider the overhead in typical workloads and see if it's still significant.
CFFI is another alternative.
I couldn't access your pointers, like:
https://github.com/tqchen/tvm/tree/pyffi
|
LOL, the last one was my comment, not @szha :-D |
Re: the need for explicit type checking code in the TVM FFI. Actually, there is no explicit code for type checking, as it is generated automatically via template expansion (on the receiving end). We also have a "strongly typed" signature that wraps the packed function interface and gives you compile-time type checking: https://github.com/apache/incubator-tvm/blob/master/include/tvm/runtime/packed_func.h#L191 On the dynamic language side (Python), the exposed function is still type-erased (as Python is a dynamic language). Note that the dynamic vs. statically typed language debate does not really apply to this case, because the main goal (exposing to Python) implies type erasure. The main goal should be to reduce the number of abstraction layers.
If we apply reasoning, most API cost is going to be FFI cost + exec cost, and I think the conclusion so far is that we want the FFI cost to be around 1e-7s to 1e-6s, which is the limit of any cost. |
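The idea of generating the receiving-end checks from a typed signature, rather than writing them by hand, has a rough runtime analogue in Python. This sketch is not how TVM does it (there the checks come from C++ template expansion at compile time); `typed` and `axpy` are hypothetical names, and the checks here are derived from annotations via `inspect.signature`.

```python
import inspect


def typed(fn):
    """Wrap `fn` so a packed (list-of-args) call is checked against its annotations."""
    params = list(inspect.signature(fn).parameters.values())

    def packed(args):
        if len(args) != len(params):
            raise TypeError(
                f"{fn.__name__} expects {len(params)} arguments, got {len(args)}")
        for param, arg in zip(params, args):
            annotation = param.annotation
            if annotation is not inspect.Parameter.empty and not isinstance(arg, annotation):
                raise TypeError(
                    f"{fn.__name__}: argument '{param.name}' must be "
                    f"{annotation.__name__}, got {type(arg).__name__}")
        return fn(*args)

    return packed


@typed
def axpy(a: float, x: float, y: float) -> float:
    return a * x + y


print(axpy([2.0, 3.0, 1.0]))  # 7.0
```

The function author writes only the typed signature; the argument-count and type checks on the packed call are derived from it, which is the property the comment above describes.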
I created a follow-up design proposal on cwiki. TVM FFI works well with MXNet and the overhead for |
Thanks @hzfan. I would also highly recommend taking a close look at TVM's object protocol and trying to push most things through Object eventually. (Creating temporary support for legacy cases like TShape is fine, but eventually pushing things through as Objects will give greater uniformity and bring benefits such as being able to put everything into a container.) |
@tqchen Thanks for sharing this. I don’t know if I understand correctly. For now arguments except primitives pass through FFI via Object (like ADTObj). It is then converted to TShape in backend and TShape is not involved in FFI directly. As you said, Object allows me to conveniently put various kinds of things into a container (ADTObj), without losing their types. For example, now a tuple of tuples like ((2, 2), (2, 2)) is allowed. Also sorry for the late reply. I have been on a vacation this week :) |
MXNet imperative operator invocation overhead is as large as 30-60us, which is significant compared to the official NumPy operators with ~600ns overhead. This has negatively impacted the performance of applying MXNet to the models where many operators' kernel runtime duration is short, especially in the area of classic machine learning. We plan to address the problem in two steps:
1. Short term: Use pybind11 to replace Python op API and ctypes/c api. Preliminary experiments show that the pure Python-C++ turnaround time by using Pybind is between 400-600ns, while the current Python op API using ctypes/c api costs more than 10us. We believe with the correct implementation, we can reduce the op invocation overhead to 2us including the time on FFI and engine.
2. Long term: Adopt Python's C extension interface. NumPy did this by developing its own C API. This provides considerably less overhead compared to other solutions. However, it would cost much more engineering efforts by integrating this with our existing operator workflow in C++.
@hzfan @hgt312