Enable batch mode benchmarking on iree-android-benchmark #5106
Conversation
This depends on #5082
LGTM. Just one question.
# The batch numbers are roughly computed to let it benchmark more than 3
# seconds.
# Do not set batch size on Pixel 4 for GPU targets, because it will get killed
We'll see fluctuation for Pixel 4 then. Is there a way to disable the Mako upload failure for the Pixel 4 GPU?
Do you mean disabling the regression analyzer but still uploading the numbers? Or disabling the whole benchmark for the Pixel 4 GPU?
I can make the former work by tweaking the uploader program; that should not be hard. The latter is super easy: removing the target is enough.
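For the record, a minimal sketch of the former option (all of the names below are hypothetical stand-ins, not the actual uploader or Mako API):

```python
# Hypothetical sketch only; none of these names come from the real uploader.
SKIP_ANALYSIS = {("Pixel4", "vulkan")}  # noisy targets: upload, but don't analyze


def upload_samples(benchmark_key, samples):
  # Stand-in for the Mako upload step.
  print(f"uploaded {len(samples)} samples to {benchmark_key}")


def run_regression_analyzer(benchmark_key, samples):
  # Stand-in for the regression-analysis step that fails the run on regressions.
  print(f"analyzed {benchmark_key}")


def report(benchmark_key, phone, driver, samples):
  upload_samples(benchmark_key, samples)
  if (phone, driver) in SKIP_ANALYSIS:
    return  # keep the numbers, but don't flag fluctuation as a regression
  run_regression_analyzer(benchmark_key, samples)
```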
@@ -47,6 +47,11 @@ def get_driver(self) -> str:
    """ Returns a string indicates the driver of the target."""
    return self.name.split("-")[0]

  def add_batch_flag(self, size):
    self.compilation_flags.append(
        f"--iree-hal-benchmark-dispatch-repeat-count={size}")
Hmmm, so this seems like kind of a hacky way to make dispatch kernels dominate the benchmarking time. We're no longer benchmarking MobileBert or whatever model; we're benchmarking specifically the dispatches of that model, which is a different thing. I'm not saying that isn't a good thing to benchmark, but it's a different thing to benchmark for sure. That means this is no longer recording "how fast is IREE at running model X". Are we setting up new benchmark reports (or whatever Mako calls them) to record this new metric?
Yes, this is to make dispatch kernels dominate the benchmarking time. I think this is what we are looking for at this stage.
I feel this is also what we are measuring with the TFLite benchmark tools. I dug into the implementation and found that it excludes session initialization and only measures generated_code::Inference. What IREE excludes is something similar, e.g. VM stack allocation. There is more, but I can't list it all now.
Initialized session in 609.945ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=64 first=13039 curr=8455 min=6052 max=13039 avg=7632.81 std=1138
Ideally we would like the benchmark to report something like ex.submit_and_wait or iree_hal_semaphore_wait_with_deadline. E.g., #4686 (comment)
We can also have other metrics that include overhead or initialization time, but I think we are looking for the real execution time at this stage, and this is an easy way to get it.
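To make the amortization concrete (illustrative numbers only, not from any of these benchmarks):

```python
# Without batching, every measured iteration pays the fixed per-invocation
# overhead (VM stack allocation, etc.); with
# --iree-hal-benchmark-dispatch-repeat-count=N the overhead is paid once and
# amortized over N repeats of the dispatches.
overhead_ms = 30.0
dispatch_ms = 250.0

per_iter_unbatched = overhead_ms + dispatch_ms          # 280.0 ms
N = 16
per_iter_batched = (overhead_ms + N * dispatch_ms) / N  # 251.875 ms
```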
I totally agree that we sometimes want end-to-end time, including overheads, the initialization stage, etc.
When I worked on accelerators, I struggled with performance measurement with and without runtime overheads. The numbers from the simulator were awesome, but the actual end-to-end execution time was worse in the beginning. Then we broke the performance down with something like Tracy and identified the issues. There were also things that only cause overhead in the first run; e.g., some resources have to be allocated in the beginning, and in the second and later runs that cost is gone. It is case by case: sometimes initialization happens at the system level, must be there, and shouldn't count toward application performance; sometimes you run the model many times and don't care about the first run.
And yes, again, this is different from what we've measured. Shall we redefine the focus of the benchmarks to be dispatches?
Are we setting up new benchmark reports (or whatever mako calls them) to record this new metric?
The answer is no. I think within IREE's scope we can focus on dispatch kernel performance. This also matches what the competition (like TFLite) measures. We can revisit benchmarking what we had before when there is a need.
Yeah, so my main issue here is that we're basically changing the definition of a metric while we're measuring it: the performance graph now shows a "performance improvement", but that's really just a change to what we're measuring. It also seems a bit disingenuous to call this a "mobile bert" benchmark (or whatever model); this doesn't represent how fast IREE could run a MobileBert inference under even ideal conditions.
I'm not disagreeing that it may be the most useful metric for us right now.
(side note: we already do have initial runs that are not measured and run the whole model multiple times and amortize over that)
This produces more stable numbers because batch mode amortizes the overhead across repeated dispatches.
Due to issues on the Pixel 4 GPU, we can't let a benchmark run for more than two seconds, otherwise it gets killed. Thus, batch mode is not enabled for the Pixel 4 GPU.
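For reference, a rough sketch of how that exception might be expressed (the model names, batch sizes, and helper below are hypothetical and reuse the BenchmarkTarget sketch from the earlier comment; only add_batch_flag comes from this PR):

```python
# Hypothetical per-target configuration with the Pixel 4 GPU exception.
BATCH_SIZES = {"mobile-bert": 8, "mobilenet-v2": 16}
NO_BATCH_TARGETS = {("Pixel4", "vulkan")}  # would exceed the ~2 s limit and get killed


def maybe_add_batch_flag(target, phone, driver, model):
  if (phone, driver) in NO_BATCH_TARGETS:
    return  # leave the Pixel 4 GPU benchmarks unbatched
  target.add_batch_flag(BATCH_SIZES[model])
```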
Pixel 4
MobileBert - CPU: ~660 ms
MobileBert - GPU: ~952 ms (unchanged)
MobileNetV2 - CPU: ~380 ms
MobileNetV2 - GPU: ~1024 ms (unchanged)
S20
MobileBert - CPU: ~750 ms
MobileBert - GPU: ~270 ms
MobileNetV2 - CPU: ~266 ms
MobileNetV2 - GPU: ~55 ms