[PROFILER] Theoretical roofline models #11066
Conversation
(force-pushed from d38b5b4 to 8a7c2d7)
I agree that the `analysis` namespace is confusing due to overloading. I personally would prefer something like either …

I would also recommend opening a discuss thread for posterity when introducing a top-level namespace, if we think that is a better path.
Didn't read the actual beefy stuff, but will later today. For now, some nits
Hmm, overall looks fine to me. But I'd suggest someone who knows this roofline/TIR stuff better than I do also take a look.
Cool stuff @tkonolige
python/tvm/utils/__init__.py (Outdated)

    def _vec_width_registers(target, vec_width, num_vector_registers):
        if vec_width is None:
            if target.device_name == "":  # indicates x86
not necessarily, could be arm, hexagon, etc., right?
My understanding is that the `llvm` target (which is where `device_name == ""`) implies x86. If we look at all `llvm` strategies in topi (called `cpu` there), they assume x86.
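This is easy to check interactively; a quick sketch using TVM's Python `Target` API (`device_name` is empty when `-device` is unset):

```python
import tvm

# A bare llvm target carries no -device flag, so device_name is "".
print(tvm.target.Target("llvm").device_name)  # ""
# A specialized builtin target sets it explicitly (see the listing below).
print(tvm.target.arm_cpu().device_name)       # "arm_cpu"
```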
Yeah, there is some oddness going on (#11066 (comment)); perhaps this assumption is not entirely accurate, or maybe llvm acts as a nice compatibility layer.
Yeah, having an empty device name is more just a special case for llvm.

    target = tvm.target.Target(
        "llvm -keys=hexagon,cpu -link-params=0 "
        "-mattr=+hvxv68,+hvx-length128b,+hvx-qfloat,-hvx-ieee-fp "
        "-mcpu=hexagonv68 -mtriple=hexagon"
    )

will also have an empty device name, similar to Andrew's comment. I think we need a different way to dispatch here.
In general I'd like to see this information queried as attrs from the target. When not provided, the attrs can be filled in via an attr preprocessor for the target kind, which queries the local or remote DeviceAPI for the necessary information.
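To make the suggestion concrete, here is a minimal sketch of that dispatch order, assuming a hypothetical `vector-width` target attr (this is the shape of the idea, not an existing TVM API):

```python
import tvm

def vector_width_bits(target):
    # Prefer an attribute carried on the target itself.
    if "vector-width" in target.attrs:
        return int(target.attrs["vector-width"])
    # Otherwise an attr preprocessor for the target kind would fill this in
    # by querying the local or remote DeviceAPI; the constant below only
    # stands in for that query.
    return 128
```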
I don't think the hexagon target string you have provided there is correct. If I look at every "builtin" target provided by tvm, they all set the `-device` flag to something (with the exception of hexagon).
    In [2]: tvm.target.arm_cpu()
    Out[2]: llvm -keys=arm_cpu,cpu -device=arm_cpu -link-params=0 -model=unknown
    In [3]: tvm.target.hexagon()
    Out[3]: hexagon -keys=hexagon -link-params=0 -mattr=+hvxv66,+hvx-length128b -mcpu=hexagonv66 -mtriple=hexagon
    In [4]: tvm.target.cuda()
    Out[4]: cuda -keys=cuda,gpu -arch=sm_86 -max_num_threads=1024 -model=unknown -thread_warp_size=32
    In [5]: tvm.target.bifrost()
    Out[5]: opencl -keys=bifrost,opencl,gpu -device=bifrost -max_num_threads=256 -model=unknown -thread_warp_size=1
    In [6]: tvm.target.intel_graphics()
    Out[6]: opencl -keys=intel_graphics,opencl,gpu -device=intel_graphics -max_num_threads=256 -model=unknown -thread_warp_size=16
    In [7]: tvm.target.mali()
    Out[7]: opencl -keys=mali,opencl,gpu -device=mali -max_num_threads=256 -model=unknown -thread_warp_size=1
    In [8]: tvm.target.rasp()
    Out[8]: llvm -keys=arm_cpu,cpu -device=arm_cpu -link-params=0 -mattr=+neon -model=bcm2837 -mtriple=armv7l-linux-gnueabihf
    In [9]: tvm.target.riscv_cpu()
    Out[9]: llvm -keys=arm_cpu,cpu -device=arm_cpu -link-params=0 -mabi=lp64d -mcpu=sifive-u54 -model=sifive-u54 -mtriple=riscv64-unknown-linux-gnu
    In [10]: tvm.target.rocm()
    Out[10]: rocm -keys=rocm,gpu -max_num_threads=256 -mcpu=gfx900 -model=unknown -mtriple=amdgcn-amd-amdhsa-hcc -thread_warp_size=64
(I have been unable to find any documentation on what `device` actually means/does.)

If you have a better way to detect x86, I'd be happy to use it. The only way I've seen this detection done in the codebase is by looking at `mcpu`.

I'd love to have this information available from the targets, but all the x86 code in topi appears to do it the way I do it here.
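For reference, that topi-style detection looks roughly like this (a sketch: the table is intentionally partial and the helper name is made up):

```python
# Hypothetical helper mirroring how topi/x86 keys hardware lookups on -mcpu.
_X86_MCPUS = {"core-avx2", "skylake-avx512", "cascadelake", "icelake-client"}

def _looks_like_x86(target):
    # Target.mcpu is "" when -mcpu is not set, so unknown targets fail closed.
    return target.mcpu in _X86_MCPUS
```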
`-device` is used by TOPI to provide a layer of specialization when the target is generic. So with mali, for example, we can have separate topi impls for opencl. And I agree that the way TOPI does hardware-specific attribute lookup is how you are doing it here; I just don't like it :).

It's essentially a side channel for information to flow into the compiler from outside of the target, and it assumes that the hardware we are compiling for is locally accessible, which only works in a limited set of cases. In my view, any device-specific attribute that the compiler needs should be retrievable from the target directly, and we now have a path to do this with the attribute preprocessor I mentioned above, which directly queries the DeviceAPI for this information.
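The key lists in the listing above show the mechanism: `-device` prepends a specialization key that TOPI's strategy lookup can match before the generic ones.

```python
import tvm

# mali's -device flag adds "mali" ahead of the generic opencl/gpu keys,
# which is what lets TOPI pick mali-specific schedules.
print(tvm.target.mali().keys)            # ['mali', 'opencl', 'gpu']
print(tvm.target.Target("opencl").keys)  # ['opencl', 'gpu']
```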
> In my view, any device-specific attribute that the compiler needs should be retrievable from the target directly, and we now have a path to do this with the attribute preprocessor I mentioned above, which directly queries the DeviceAPI for this information.
I agree on this point, but I think this change is out of scope for this PR; I'd have to change a bunch of stuff within topi.
I agree there are lots of places where this is assumed in TOPI, and I don't want you to be burdened with that in this PR, but for the cases in which `target.device_name` is not set and the machine is not x86, the profiler will silently do the wrong thing. This applies to hexagon, cuda, opencl, rocm, and others.

At minimum I guess we need to check that the target is llvm and the device name is empty.

Can we also make it more strict, such that `get_simd_32bit_lanes` does not return a default if the mcpu name is not found in its knowledge base, or is that default behavior being relied on? If it's being relied on, maybe we simply make `get_simd_32bit_lanes` take a parameter that defaults to True and indicates whether we should guess the number of lanes; when False, an error is thrown. Then here we can set that param to False so that, for the case of profiling, we don't make target-specific assumptions.
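Something like this sketch of the proposed signature (illustrative only; the lookup table is abbreviated and the real function does not take these arguments):

```python
# Hypothetical variant of get_simd_32bit_lanes with an opt-out for guessing.
_LANES_BY_MCPU = {"core-avx2": 8, "skylake-avx512": 16}  # 32-bit lanes per vector

def get_simd_32bit_lanes(mcpu, guess=True):
    lanes = _LANES_BY_MCPU.get(mcpu)
    if lanes is None:
        if not guess:
            raise RuntimeError(f"Unknown mcpu '{mcpu}'; refusing to guess SIMD lanes")
        lanes = 4  # the existing default discussed below
    return lanes
```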
I've changed this check to make sure the target is llvm and that there aren't any keys besides "cpu". I think that should cover our bases for now.

`get_simd_32bit_lanes` defaults to 4 if it can't figure out what the target is. I'm not sure this is a reasonable default, but I think we should leave changes to `get_simd_32bit_lanes` to a separate PR.
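Concretely, the new check might look like this (a sketch against TVM's `Target` API, not the PR's exact code):

```python
def _is_generic_llvm_cpu(target):
    # Accept only a plain llvm target whose key set is exactly {"cpu"};
    # this rejects e.g. "llvm -keys=hexagon,cpu" from the example above.
    return target.kind.name == "llvm" and set(target.keys) == {"cpu"}
```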
    # don't have an operator that uses a specific number of flops.
    assert call["Percent of Theoretical Optimal"].ratio >= 0
A couple of classification tests would be rather impressive, e.g. asserting that dense (as above) of sufficient size is compute bound, and that some injective op is memory bound.
The default (untuned) schedules for dense seem to always be memory bound instead of compute bound, so I can't really test for that without tuning. I'll try some other operators and see if it is true for them too.
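A classification test along the suggested lines might look like this (hypothetical: the "Bound" column name is an assumption about the report schema, and as noted, untuned dense may well classify as memory bound):

```python
def check_bound_classification(report):
    # Every roofline-annotated call should be classified one way or the other.
    for call in report.calls:
        if "Bound" in call:
            assert str(call["Bound"]) in ("compute", "memory")
```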
(force-pushed from 63b8746 to 4286aa3)
LGTM, thanks for the great feature and discussions. I look forward to using this.
Thank you @tkonolige for the contribution, and @csullivan @AndrewZhaoLuo for reviewing. As the current proposal now stands, we introduce a new top-level namespace (`tvm.utils`). After a new namespace is introduced, it is usually less common to have implementations sit directly in `__init__.py`.
(force-pushed from 564cc1b to 241043e)
@tqchen I've moved the implementation out of `__init__.py` and into its own module. I've also posted to discuss as requested: https://discuss.tvm.apache.org/t/new-top-level-python-namespace-tvm-utils/12683.
Thanks @tkonolige. I have not yet approved the change explicitly because of the suggestions in the comment part of this review. Given that the suggestions are not specific to a particular line of code (but architectural), I did not use the review part, and instead chose to comment explicitly in the comment thread. Hopefully that gives enough context to align with the guideline without being too rigid.
`tvm.analysis.roofline_analysis` adds estimated roofline performance to a profiling report. The roofline model measures how close an operator gets to the best possible memory bandwidth or FLOP/s, depending on whether it is memory or compute bound. This computation uses the runtime of the operator along with two numbers extracted from the TIR code: bytes of memory touched and number of floating point operations. Because these numbers are extracted from TIR, they may not be 100% accurate. The best possible memory bandwidth and FLOP/s are measured by running small programs that are memory and compute bound respectively. For now, this function only works with llvm cpu targets, but it should be possible to extend it to GPU targets.
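Based on that description, usage would look roughly like this (a sketch: the entry point moved between `tvm.analysis` and `tvm.utils` during review, and the exact signature is assumed):

```python
import tvm
from tvm.relay.testing import mlp
from tvm.utils import roofline_analysis  # final namespace per the discuss thread

mod, params = mlp.get_workload(batch_size=1)
report = roofline_analysis(mod, params, tvm.target.Target("llvm"), tvm.cpu())
print(report)  # rows now include "Percent of Theoretical Optimal"
```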
(force-pushed from 241043e to 12cbf0e)
I'm going to merge; it looks like there is little controversy in the discussion thread.
@AndrewZhaoLuo @mbrookhart @masahi