profiling_based_partitioner doesn't divide evenly the time of the segments. #23
Comments
Hello @HanChangHun, it could be due to input/output latency. Can you please share the single-model benchmark latency results for each segment file, as a txt file? Thanks! google-coral/edgetpu#593 (comment)
I changed the profiling-based partitioner to partition for a single Edge TPU, with the segments sharing that Edge TPU's SRAM. However, it is hard to see how such a large time gap could be caused by input/output data transfer time. The logs are as follows:
2022-06-03 23:43:41
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
L1 Data 48K (x8)
L1 Instruction 32K (x8)
L2 Unified 512K (x8)
L3 Unified 16384K (x1)
Load Average: 0.14, 0.88, 0.86
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
BM_Model 2.94 ms 0.268 ms 1000 inception_v2_224_quant_segment_0_of_2_edgetpu.tflite
2022-06-03 23:43:53
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
L1 Data 48K (x8)
L1 Instruction 32K (x8)
L2 Unified 512K (x8)
L3 Unified 16384K (x1)
Load Average: 0.11, 0.85, 0.85
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
BM_Model 6.04 ms 0.223 ms 1000 inception_v2_224_quant_segment_1_of_2_edgetpu.tflite
Another example is Inception V2 split into 4 segments. The gap between the slowest and fastest latencies is greater than 1 ms. (I set
2022-06-03 23:53:31
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
L1 Data 48K (x8)
L1 Instruction 32K (x8)
L2 Unified 512K (x8)
L3 Unified 16384K (x1)
Load Average: 0.41, 0.27, 0.50
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
BM_Model 2.51 ms 0.269 ms 2698 inception_v2_224_quant_segment_0_of_4_edgetpu.tflite
2022-06-03 23:53:41
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
L1 Data 48K (x8)
L1 Instruction 32K (x8)
L2 Unified 512K (x8)
L3 Unified 16384K (x1)
Load Average: 0.35, 0.26, 0.50
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
BM_Model 1.89 ms 0.299 ms 2507 inception_v2_224_quant_segment_1_of_4_edgetpu.tflite
2022-06-03 23:53:48
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
L1 Data 48K (x8)
L1 Instruction 32K (x8)
L2 Unified 512K (x8)
L3 Unified 16384K (x1)
Load Average: 0.32, 0.25, 0.49
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
BM_Model 1.72 ms 0.239 ms 2811 inception_v2_224_quant_segment_2_of_4_edgetpu.tflite
2022-06-03 23:53:55
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
L1 Data 48K (x8)
L1 Instruction 32K (x8)
L2 Unified 512K (x8)
L3 Unified 16384K (x1)
Load Average: 0.27, 0.24, 0.49
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
BM_Model 1.09 ms 0.168 ms 3925 inception_v2_224_quant_segment_3_of_4_edgetpu.tflite
Thank you for the fast response.
Hmm, the profiling-based partitioner is not intended to partition for a single Edge TPU. Please check this page for details and requirements for using this tool: https://github.com/google-coral/libcoral/blob/master/coral/tools/partitioner/README.md. Thanks!
Thank you for your response. I wanted to use the existing code with only one Edge TPU, so I changed the code that uses multiple Edge TPUs into code that uses only one. Since I modified the existing code, I understand it would be difficult for you to answer.
Can you please try this code with two TPUs and two segments on the Inception V3 model, and share the logs and single-model benchmark results for the output models? Thanks!
This code doesn't contain the lower bound and upper bound updates, so I changed it in a few places and ran it with co-compilation. The results for Inception V3 with 2, 3, and 4 segments are as follows:
# Inception V3 with 2 segments
# 24.1ms and 24.9ms
target_latency: 24704940.8, num_ops: [84 48], latencies: [24120441 24918119]
# Inception V3 with 3 segments
# 16ms, 16.4ms, 16.7ms
target_latency: 16749807.1125, num_ops: [65 38 29], latencies: [16061107 16405574 16782598]
# Inception V3 with 4 segments
# 12.8ms, 13.1ms, 13.3ms and 13.7ms
target_latency: 13378520.2, num_ops: [56 27 25 24], latencies: [12896665 13151985 13315528 13781784]
Your code is very helpful!
Awesome, feel free to submit a Pull Request for this bug for the developer's review. Thanks!
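For reference, the spread between the slowest and fastest segment in each of the runs above can be computed with a short helper. This is a minimal sketch and not part of the partitioner: `LatencySpreadNs` is a hypothetical name introduced here, and the code assumes a C++17 compiler.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from libcoral): returns the difference in
// nanoseconds between the slowest and the fastest segment.
int64_t LatencySpreadNs(const std::vector<int64_t>& latencies_ns) {
  assert(!latencies_ns.empty());
  const auto [min_it, max_it] =
      std::minmax_element(latencies_ns.begin(), latencies_ns.end());
  return *max_it - *min_it;
}
```

Called on the three `latencies` lists reported above, it gives 797678 ns for 2 segments, 721491 ns for 3 segments, and 885119 ns for 4 segments, so each of these runs stays under 1 ms of imbalance.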
Description
The diff_threshold_ns option of profiling_based_partitioner is not working well. It doesn't compare the difference (in ns) between the slowest segment (upper bound) and the fastest segment (lower bound).
Instead, it compares last_segment_latency and target_latency.
As a result, I got a partition in which the slowest segment is too slow and the fastest segment is very fast.
Maybe the source code (last_segment_latency - target_latency) should be changed.
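The difference between the two checks can be sketched as follows. This is an illustration of the behaviour as the description reads it, not the partitioner's actual source: both function names are hypothetical, the use of an absolute difference in the first check is an assumption, and the code assumes C++17.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical: the check as the description reads the current behaviour.
// Only the last segment is compared against the target latency.
// Taking the absolute difference is an assumption made for this sketch.
bool LastSegmentNearTarget(int64_t last_segment_latency_ns,
                           int64_t target_latency_ns,
                           int64_t diff_threshold_ns) {
  const int64_t diff = last_segment_latency_ns - target_latency_ns;
  return (diff < 0 ? -diff : diff) <= diff_threshold_ns;
}

// Hypothetical: the check the description asks for. The slowest segment
// (upper bound) is compared against the fastest segment (lower bound).
bool SpreadWithinThreshold(const std::vector<int64_t>& segment_latencies_ns,
                           int64_t diff_threshold_ns) {
  assert(!segment_latencies_ns.empty());
  const auto [min_it, max_it] = std::minmax_element(
      segment_latencies_ns.begin(), segment_latencies_ns.end());
  return (*max_it - *min_it) <= diff_threshold_ns;
}
```

Taking the 4-segment Inception V3 numbers quoted in the comments purely as sample values (last segment 13781784 ns, target rounded to 13378520 ns) and an arbitrary threshold of 500000 ns, the first check passes because the last segment is about 0.4 ms from the target, while the second fails because the slowest and fastest segments are about 0.89 ms apart. A check of the first kind can therefore accept a partition that the documented meaning of diff_threshold_ns would reject.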
Issue Type: Bug
Operating System: Mendel Linux, Linux
Coral Device: M.2 Accelerator with dual Edge TPU
Other Devices: No response
Programming Language: C++
Relevant Log Output