
Enable batch mode benchmarking on iree-android-benchmark #5106

Merged: 4 commits merged into iree-org:main from batch-bench on Mar 17, 2021

Conversation

@hanhanW (Contributor) commented on Mar 15, 2021

Running the benchmark in batch mode produces more stable numbers because the fixed per-run overhead is amortized across the batch (see the sketch after the results below).

Due to issues on the Pixel 4 GPU, a benchmark cannot run for more than about two seconds or it gets killed, so batch mode is not enabled for the Pixel 4 GPU targets.

Pixel 4

MobileBert - CPU: ~660 ms
MobileBert - GPU: ~952 ms (unchanged)
MobileNetV2 - CPU: ~380 ms
MobileNetV2 - GPU: ~1024 ms (unchanged)

S20

MobileBert - CPU: ~750 ms
MobileBert - GPU: ~270 ms
MobileNetV2 - CPU: ~266 ms
MobileNetV2 - GPU: ~55 ms
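
For intuition, here is a minimal sketch of why batching stabilizes the numbers; the overhead and dispatch times below are made-up values for illustration, not measurements from this PR:

# Hypothetical model: each benchmark run pays a fixed overhead (submission,
# setup, etc.) plus the time of the actual dispatches. Repeating the dispatches
# batch_size times spreads the fixed part across the batch.
FIXED_OVERHEAD_MS = 50.0  # assumed per-run overhead
DISPATCH_TIME_MS = 10.0   # assumed time in the model's dispatches per inference

def reported_latency_ms(batch_size):
  """Average per-inference latency when dispatches repeat batch_size times."""
  return (FIXED_OVERHEAD_MS + DISPATCH_TIME_MS * batch_size) / batch_size

print(reported_latency_ms(1))    # 60.0 -> overhead dominates and fluctuates
print(reported_latency_ms(100))  # 10.5 -> overhead is mostly amortized away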

@google-cla google-cla bot added the cla: yes label Mar 15, 2021
@hanhanW hanhanW marked this pull request as draft March 15, 2021 10:28
@hanhanW hanhanW changed the title from "Batch bench" to "Enable batch mode benchmarking on iree-android-benchmark" Mar 15, 2021
@hanhanW hanhanW force-pushed the batch-bench branch 2 times, most recently from 3815d56 to 93daa9d March 15, 2021 15:38
@hanhanW hanhanW changed the base branch from main to hanhan-pr-5082 March 15, 2021 16:12
@hanhanW hanhanW marked this pull request as ready for review March 15, 2021 16:12
@hanhanW (Contributor, Author) commented on Mar 15, 2021

This depends on #5082

@antiagainst (Contributor) left a comment:

LGTM. Just one question.



# The batch numbers are roughly computed to let it benchmark more than 3
# seconds.
# Do not set batch size on Pixel 4 for GPU targets, because it will get killed
Contributor:

We'll see fluctuation for Pixel 4 then. Is there a way to disable mako upload failure for Pixel 4 GPU?

@hanhanW (Contributor, Author):

Do you mean disabling the regression analyzer while still uploading the numbers, or disabling benchmarking for the Pixel 4 GPU entirely?

I can make the former work by tweaking the uploader program; that should not be hard. The latter is very easy: removing the target is enough.

@@ -47,6 +47,11 @@ def get_driver(self) -> str:
    """ Returns a string indicates the driver of the target."""
    return self.name.split("-")[0]

  def add_batch_flag(self, size):
    self.compilation_flags.append(
        f"--iree-hal-benchmark-dispatch-repeat-count={size}")
Contributor:

Hmmm, so this seems like kind of a hacky way to make the dispatch kernels dominate the benchmarking time. We're no longer benchmarking MobileBert (or whatever model); we're benchmarking specifically the dispatches of that model, which is a different thing. I'm not saying that isn't a good thing to benchmark, but it's definitely a different thing to benchmark. That means this is no longer recording "how fast is IREE at running model X". Are we setting up new benchmark reports (or whatever Mako calls them) to record this new metric?

@hanhanW (Contributor, Author):

Yes, this is to make the dispatch kernels dominate the benchmarking time. I think this is what we are looking for at this stage.

I feel this is also what we are measuring with the TFLite benchmark tools. I dug into the implementation and found that it excludes session initialization and only measures generated_code::Inference. What IREE excludes is similar, e.g. VM stack allocation. There is more, but I can't enumerate it right now.

Initialized session in 609.945ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=64 first=13039 curr=8455 min=6052 max=13039 avg=7632.81 std=1138

Ideally we would like the benchmark to report something like ex.submit_and_wait or iree_hal_semaphore_wait_with_deadline, e.g. #4686 (comment).

We could also add other metrics that include overhead or initialization time, but I think we are looking for the real execution time at this stage, and this is an easy way to get it.

@hanhanW (Contributor, Author):

I totally agree that we sometimes want end-to-end time including overheads, the initialization stage, etc.

When I worked on accelerators, I struggled with performance measurement with and without runtime overheads. The numbers from the simulator were great, but the actual end-to-end execution time was initially worse. We then broke the performance down with something like Tracy and identified the issues. There were also costs that only show up in the first run, e.g. resources that have to be allocated up front; in the second and later runs they disappear. It is case by case: sometimes initialization happens at the system level, has to be there, and should not count toward application performance; sometimes you run the workload many times and don't care about the first run.

And yes, again, this is different from what we've measured so far. Shall we redefine the focus of the benchmarking to be the dispatches?

Are we setting up new benchmark reports (or whatever mako calls them) to record this new metric?

The answer is no. I think within IREE's scope we can focus on dispatch kernel performance. This also matches what competitors (like TFLite) do. We can revisit benchmarking what we had before when there is a need.

Contributor:

Yeah so my main issue here is that we're basically changing the definition of a metric while we're measuring it. Like the performance graph now shows a "performance improvement":

[screenshot: performance graph, "fake_performance_improvement"]

but that's really just a change in what we're measuring. It also seems a bit disingenuous to call this a "MobileBert" benchmark (or whatever model). This doesn't represent how fast IREE could run a MobileBert inference even under ideal conditions.

I'm not disagreeing that it may be the most useful metric for us right now.

(Side note: we already have initial warm-up runs that are not measured, and we already run the whole model multiple times and amortize over that.)

@hanhanW hanhanW changed the base branch from hanhan-pr-5082 to main March 16, 2021 10:56
@hanhanW hanhanW requested a review from GMNGeoffrey March 16, 2021 16:13
@hanhanW hanhanW merged commit 2eededf into iree-org:main Mar 17, 2021
@hanhanW hanhanW deleted the batch-bench branch March 17, 2021 04:18
@not-jenni not-jenni mentioned this pull request Mar 17, 2021