profiling_based_partitioner doesn't divide evenly the time of the segments. #23

Open
HanChangHun opened this issue Jun 3, 2022 · 7 comments
Labels: Hardware:M.2 Accelerator with dual Edge TPU, subtype:Mendel Linux, type:bug

Comments

@HanChangHun

HanChangHun commented Jun 3, 2022

Description

The diff_threshold_ns option of profiling_based_partitioner does not work as expected.

It does not compare the difference (in ns) between the slowest segment (upper bound) and the fastest segment (lower bound).
Instead, it compares last_segment_latency with target_latency.

As a result, I got partitions in which the slowest segment is much slower than the fastest one.

Maybe the check in the source code (last_segment_latency - target_latency) should be changed.
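
To make the difference between the two checks concrete, here is a minimal sketch. It is not the partitioner's source: the function names are hypothetical, the use of an absolute difference is an assumption, and the target latency is an invented value, because the thread does not log the target for this run. The segment latencies are the two-segment values from the log in this report, converted from ms to ns.

```python
DIFF_THRESHOLD_NS = 1_000_000  # 1 ms, the value the reporter sets later in the thread


def last_segment_gap_ns(latencies_ns, target_latency_ns):
    """Check as described in the report: last segment versus target latency.

    Taking the absolute value is an assumption made for this sketch.
    """
    return abs(latencies_ns[-1] - target_latency_ns)


def spread_ns(latencies_ns):
    """Check the reporter expected: slowest segment minus fastest segment."""
    return max(latencies_ns) - min(latencies_ns)


# Two-segment latencies from the report's log (2.9966 ms and 6.0864 ms).
latencies_ns = [2_996_600, 6_086_400]
# HYPOTHETICAL target latency, chosen only for illustration.
target_latency_ns = 6_000_000

print(last_segment_gap_ns(latencies_ns, target_latency_ns) <= DIFF_THRESHOLD_NS)  # True
print(spread_ns(latencies_ns) <= DIFF_THRESHOLD_NS)  # False
```

Under these assumptions the last-segment check passes while the spread is about 3.09 ms, which is the kind of mismatch the report describes.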


Issue Type

Bug

Operating System

Mendel Linux, Linux

Coral Device

M.2 Accelerator with dual Edge TPU

Other Devices

No response

Programming Language

C++

Relevant Log Output

segment latencies
[1.4403, 1.3686, 0.354, 0.161]  # difference is bigger than 1ms
[1.2178, 1.3376, 0.9702, 0.0683, 0.1601]
[1.1891, 1.3306, 0.8717, 0.1092, 0.0511, 0.1604]
[2.9966, 6.0864]  # difference is so big!
[2.6992, 1.9771, 2.9165]
[2.5653, 1.9029, 1.7592, 1.1261]
[2.3772, 1.5753, 1.7227, 1.4968, 0.6235]
google-coral-bot added the Hardware:M.2 Accelerator with dual Edge TPU, subtype:Mendel Linux, and type:bug labels Jun 3, 2022
@hjonnala

hjonnala commented Jun 3, 2022

Hello @HanChangHun, this could be due to input/output latency. Can you please share the single model benchmark latency results for each segment file in a txt file? Thanks! google-coral/edgetpu#593 (comment)

@HanChangHun
Author

HanChangHun commented Jun 3, 2022

I changed the profiling-based partitioner to perform the partitioning on a single Edge TPU and to share that Edge TPU's SRAM.
So the latencies differ from those of the usual profiling-based partitioning of the Inception v2 example.

Still, it is hard to believe that such a large gap is caused by input/output data transfer time.

The logs are as follows:

2022-06-03 23:43:41
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 512K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.14, 0.88, 0.86
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         2.94 ms        0.268 ms         1000 inception_v2_224_quant_segment_0_of_2_edgetpu.tflite

2022-06-03 23:43:53
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 512K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.11, 0.85, 0.85
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         6.04 ms        0.223 ms         1000 inception_v2_224_quant_segment_1_of_2_edgetpu.tflite

Another example is Inception v2 split into 4 segments. The gap between the slowest and the fastest segment latency is greater than 1 ms. (I set diff_threshold_ns to 1000000.)

2022-06-03 23:53:31
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 512K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.41, 0.27, 0.50
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         2.51 ms        0.269 ms         2698 inception_v2_224_quant_segment_0_of_4_edgetpu.tflite

2022-06-03 23:53:41
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 512K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.35, 0.26, 0.50
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         1.89 ms        0.299 ms         2507 inception_v2_224_quant_segment_1_of_4_edgetpu.tflite

2022-06-03 23:53:48
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 512K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.32, 0.25, 0.49
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         1.72 ms        0.239 ms         2811 inception_v2_224_quant_segment_2_of_4_edgetpu.tflite

2022-06-03 23:53:55
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 512K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.27, 0.24, 0.49
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         1.09 ms        0.168 ms         3925 inception_v2_224_quant_segment_3_of_4_edgetpu.tflite
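
The gap claimed for the four-segment split can be checked directly from the result rows above. The sketch below assumes the row layout shown in these logs (benchmark name, wall time in ms, CPU time in ms, iterations, model file); the regular expression and function name are illustrative and not part of any Coral tool.

```python
import re

# Matches result rows such as:
# "BM_Model  2.51 ms  0.269 ms  2698 <model file>"
ROW = re.compile(r"^BM_Model\s+([\d.]+) ms\s+[\d.]+ ms\s+\d+\s+(\S+)$")


def wall_times_ms(log_text):
    """Return {model file name: wall-clock time in ms} for each result row."""
    times = {}
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        if match:
            times[match.group(2)] = float(match.group(1))
    return times


# The four result rows copied from the logs above.
LOG = """\
BM_Model         2.51 ms        0.269 ms         2698 inception_v2_224_quant_segment_0_of_4_edgetpu.tflite
BM_Model         1.89 ms        0.299 ms         2507 inception_v2_224_quant_segment_1_of_4_edgetpu.tflite
BM_Model         1.72 ms        0.239 ms         2811 inception_v2_224_quant_segment_2_of_4_edgetpu.tflite
BM_Model         1.09 ms        0.168 ms         3925 inception_v2_224_quant_segment_3_of_4_edgetpu.tflite
"""

times = wall_times_ms(LOG)
gap_ms = max(times.values()) - min(times.values())
print(f"slowest - fastest = {gap_ms:.2f} ms")  # slowest - fastest = 1.42 ms
```

A gap of 1.42 ms is above the 1 ms that diff_threshold_ns = 1000000 would allow if it bounded the spread between segments.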

Thank you for the fast response.

@hjonnala

hjonnala commented Jun 3, 2022

Hmm, the profiling-based partitioner is not intended to perform partitioning on a single Edge TPU. Please check this page for details and the requirements for using this tool: https://github.com/google-coral/libcoral/blob/master/coral/tools/partitioner/README.md. Thanks!

@HanChangHun
Author

Thank you for your response.

My goal was to use the existing code with only one Edge TPU, so I changed the code that uses multiple Edge TPUs into code that uses only one.
There was no other modification, so I concluded that the partitioning logic in the existing code does not consider the latencies of the slowest and fastest segments.

Since I modified the existing code, I understand that it is difficult for you to answer.
Thank you for your kind reply!

@hjonnala

hjonnala commented Jun 4, 2022

Can you please try this code with two TPUs and two segments on the Inception v3 model, and share the logs and the single model benchmark results for the output models? Thanks!

@HanChangHun
Author

This code doesn't contain the lower bound and upper bound update code, so I changed a few places and ran it with co-compilation.

The results are below: Inception V3 with 2 segments, 3 segments, and 4 segments.
The model appears to be split evenly, and the gap between the slowest and fastest segments does not exceed diff_threshold_ns (= 1000000).

# Inception V3 with 2 segments
# 24.1ms and 24.9ms
target_latency:  24704940.8, num_ops: [84 48], latencies: [24120441 24918119]

# Inception V3 with 3 segments
# 16ms, 16.4ms, 16.7ms
target_latency:  16749807.1125, num_ops: [65 38 29], latencies: [16061107 16405574 16782598]

# Inception V3 with 4 segments
# 12.8ms, 13.1ms, 13.3ms and 13.7ms
target_latency:  13378520.2, num_ops: [56 27 25 24], latencies: [12896665 13151985 13315528 13781784]
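
The three result lines above can be checked mechanically. The sketch below parses them and reports, for each split, the total op count and the slowest-minus-fastest spread in ns. The regular expression and helper name are illustrative and assume exactly the line format printed above.

```python
import re

DIFF_THRESHOLD_NS = 1_000_000  # the diff_threshold_ns value quoted in this thread

# The three result lines copied from the comment above.
RESULT_LINES = """\
target_latency:  24704940.8, num_ops: [84 48], latencies: [24120441 24918119]
target_latency:  16749807.1125, num_ops: [65 38 29], latencies: [16061107 16405574 16782598]
target_latency:  13378520.2, num_ops: [56 27 25 24], latencies: [12896665 13151985 13315528 13781784]
"""

LINE = re.compile(
    r"target_latency:\s+([\d.]+), num_ops: \[([\d ]+)\], latencies: \[([\d ]+)\]"
)


def parse(line):
    """Split one result line into (target_ns, ops per segment, latencies in ns)."""
    match = LINE.match(line)
    target_ns = float(match.group(1))
    num_ops = [int(x) for x in match.group(2).split()]
    latencies_ns = [int(x) for x in match.group(3).split()]
    return target_ns, num_ops, latencies_ns


summary = []
for line in RESULT_LINES.splitlines():
    _, num_ops, latencies_ns = parse(line)
    spread = max(latencies_ns) - min(latencies_ns)
    summary.append((len(latencies_ns), sum(num_ops), spread))
    print(f"{len(latencies_ns)} segments, {sum(num_ops)} ops, "
          f"spread {spread} ns, within threshold: {spread <= DIFF_THRESHOLD_NS}")
```

For these logs the spreads come out to 797678 ns, 721491 ns, and 885119 ns, all under 1000000 ns, and every split covers the same 132 ops.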

Your code is very helpful!
Thank you!

@hjonnala

hjonnala commented Jun 5, 2022

The diff_threshold_ns that is profiling_based_partitioner's option is not working well.

It doesn't compare the difference (in ns) between the slowest segment (upper bound) and the fastest segment (lower bound). But it compares last_segment_latency and target_latency.

So I was able to get the result with the slowest segment is too slow and the fastest segment is speedy.

Maybe the source code(last_segment_latency - target_latency) should be changed.

Awesome, feel free to submit a pull request for this bug for the developers' review. Thanks!
