GH-40357: [C++] Add benchmark for ToTensor conversions #40358
Conversation
Can you show the result of running them? We might also want to use more data to get a more reliable result.
This was the result output:
Will use
The result from running
Current output when running
Output from running the benchmarks on the latest commit:
cpp/src/arrow/tensor_benchmark.cc (outdated)
RegressionArgs args(state);
std::shared_ptr<DataType> ty = TypeTraits<ValueType>::type_singleton();

const int64_t kNumRows = args.size;
I would maybe still do the / 8 (or division by sizeof(CType)), because the reported "Time" of some of the benchmarks is still above a second.
New result, dividing size by sizeof(CType):
Running /var/folders/gw/q7wqd4tx18n_9t4kbkd0bj1m0000gn/T/arrow-archery-wyoew3d4/WORKSPACE/build/release/arrow-tensor-benchmark
Run on (8 X 24 MHz CPU s)
CPU Caches:
L1 Data 64 KiB
L1 Instruction 128 KiB
L2 Unified 4096 KiB (x8)
Load Average: 19.13, 18.37, 12.07
-----------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
BatchToTensorSimple<UInt8Type>/65536 440965 ns 434790 ns 1628 bytes_per_second=143.748Mi/s items_per_second=15.073G/s null_percent=0 size=65.536k
BatchToTensorSimple<UInt8Type>/4194304 52116387 ns 39301000 ns 18 bytes_per_second=101.779Mi/s items_per_second=10.6723G/s null_percent=0 size=4.1943M
BatchToTensorSimple<UInt16Type>/65536 422252 ns 421368 ns 1663 bytes_per_second=148.326Mi/s items_per_second=7.77658G/s null_percent=0 size=65.536k
BatchToTensorSimple<UInt16Type>/4194304 39602325 ns 36205053 ns 19 bytes_per_second=110.482Mi/s items_per_second=5.79243G/s null_percent=0 size=4.1943M
BatchToTensorSimple<UInt32Type>/65536 411546 ns 411012 ns 1696 bytes_per_second=152.064Mi/s items_per_second=3.98625G/s null_percent=0 size=65.536k
BatchToTensorSimple<UInt32Type>/4194304 37668923 ns 35941842 ns 19 bytes_per_second=111.291Mi/s items_per_second=2.91742G/s null_percent=0 size=4.1943M
BatchToTensorSimple<UInt64Type>/65536 409912 ns 409266 ns 1772 bytes_per_second=152.712Mi/s items_per_second=2.00163G/s null_percent=0 size=65.536k
BatchToTensorSimple<UInt64Type>/4194304 40266224 ns 36517789 ns 19 bytes_per_second=109.536Mi/s items_per_second=1.43571G/s null_percent=0 size=4.1943M
BatchToTensorSimple<Int8Type>/65536 404307 ns 403876 ns 1709 bytes_per_second=154.75Mi/s items_per_second=16.2268G/s null_percent=0 size=65.536k
BatchToTensorSimple<Int8Type>/4194304 37406713 ns 35309316 ns 19 bytes_per_second=113.285Mi/s items_per_second=11.8787G/s null_percent=0 size=4.1943M
BatchToTensorSimple<Int16Type>/65536 414663 ns 414136 ns 1649 bytes_per_second=150.916Mi/s items_per_second=7.91237G/s null_percent=0 size=65.536k
BatchToTensorSimple<Int16Type>/4194304 37432355 ns 35457526 ns 19 bytes_per_second=112.811Mi/s items_per_second=5.91455G/s null_percent=0 size=4.1943M
BatchToTensorSimple<Int32Type>/65536 413986 ns 413420 ns 1706 bytes_per_second=151.178Mi/s items_per_second=3.96304G/s null_percent=0 size=65.536k
BatchToTensorSimple<Int32Type>/4194304 47971980 ns 37791471 ns 17 bytes_per_second=105.844Mi/s items_per_second=2.77464G/s null_percent=0 size=4.1943M
BatchToTensorSimple<Int64Type>/65536 415919 ns 415559 ns 1691 bytes_per_second=150.4Mi/s items_per_second=1.97132G/s null_percent=0 size=65.536k
BatchToTensorSimple<Int64Type>/4194304 36665862 ns 35319650 ns 20 bytes_per_second=113.251Mi/s items_per_second=1.48441G/s null_percent=0 size=4.1943M
BatchToTensorSimple<HalfFloatType>/65536 422161 ns 421677 ns 1685 bytes_per_second=148.218Mi/s items_per_second=7.77088G/s null_percent=0 size=65.536k
BatchToTensorSimple<HalfFloatType>/4194304 35648150 ns 34911650 ns 20 bytes_per_second=114.575Mi/s items_per_second=6.00703G/s null_percent=0 size=4.1943M
BatchToTensorSimple<FloatType>/65536 407051 ns 406626 ns 1702 bytes_per_second=153.704Mi/s items_per_second=4.02925G/s null_percent=0 size=65.536k
BatchToTensorSimple<FloatType>/4194304 35324888 ns 34521250 ns 20 bytes_per_second=115.871Mi/s items_per_second=3.03748G/s null_percent=0 size=4.1943M
BatchToTensorSimple<DoubleType>/65536 411345 ns 410348 ns 1740 bytes_per_second=152.31Mi/s items_per_second=1.99635G/s null_percent=0 size=65.536k
BatchToTensorSimple<DoubleType>/4194304 36834741 ns 35409211 ns 19 bytes_per_second=112.965Mi/s items_per_second=1.48065G/s null_percent=0 size=4.1943M
bytes_per_second=143.748Mi/s items_per_second=15.073G/s doesn't make sense, does it?
I guess not. What I understand, at least, is that the number for items_per_second should be approximately bytes_per_second divided by the size of the type. Joris advised me on what I could try to debug this, but I am not finding anything I can grasp. I am not really sure whether it makes a difference if I only use state.SetBytesProcessed without state.SetItemsProcessed. It also looks OK if I just leave both of them out:
Running /var/folders/gw/q7wqd4tx18n_9t4kbkd0bj1m0000gn/T/arrow-archery-vd706e0e/WORKSPACE/build/release/arrow-tensor-benchmark
Run on (8 X 24 MHz CPU s)
CPU Caches:
L1 Data 64 KiB
L1 Instruction 128 KiB
L2 Unified 4096 KiB (x8)
Load Average: 15.15, 15.26, 13.19
-----------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
BatchToTensorSimple<UInt8Type>/65536 429847 ns 429439 ns 1582 bytes_per_second=145.539Mi/s null_percent=0 size=65.536k
BatchToTensorSimple<UInt8Type>/4194304 56283753 ns 44952231 ns 13 bytes_per_second=88.9833Mi/s null_percent=0 size=4.1943M
BatchToTensorSimple<UInt16Type>/65536 470726 ns 462170 ns 1607 bytes_per_second=135.232Mi/s null_percent=0 size=65.536k
BatchToTensorSimple<UInt16Type>/4194304 44393589 ns 37141214 ns 14 bytes_per_second=107.697Mi/s null_percent=0 size=4.1943M
BatchToTensorSimple<UInt32Type>/65536 440997 ns 439951 ns 1260 bytes_per_second=142.061Mi/s null_percent=0 size=65.536k
BatchToTensorSimple<UInt32Type>/4194304 43955912 ns 36447556 ns 18 bytes_per_second=109.747Mi/s null_percent=0 size=4.1943M
BatchToTensorSimple<UInt64Type>/65536 432952 ns 431213 ns 1369 bytes_per_second=144.94Mi/s null_percent=0 size=65.536k
BatchToTensorSimple<UInt64Type>/4194304 40377762 ns 36827529 ns 17 bytes_per_second=108.614Mi/s null_percent=0 size=4.1943M
BatchToTensorSimple<Int8Type>/65536 583566 ns 561105 ns 1667 bytes_per_second=111.387Mi/s null_percent=0 size=65.536k
BatchToTensorSimple<Int8Type>/4194304 69477871 ns 51189900 ns 10 bytes_per_second=78.1404Mi/s null_percent=0 size=4.1943M
BatchToTensorSimple<Int16Type>/65536 466828 ns 460938 ns 1379 bytes_per_second=135.593Mi/s null_percent=0 size=65.536k
BatchToTensorSimple<Int16Type>/4194304 53699115 ns 43646833 ns 12 bytes_per_second=91.6447Mi/s null_percent=0 size=4.1943M
BatchToTensorSimple<Int32Type>/65536 510174 ns 489199 ns 1380 bytes_per_second=127.76Mi/s null_percent=0 size=65.536k
BatchToTensorSimple<Int32Type>/4194304 59453215 ns 43936000 ns 13 bytes_per_second=91.0415Mi/s null_percent=0 size=4.1943M
BatchToTensorSimple<Int64Type>/65536 449931 ns 446273 ns 1581 bytes_per_second=140.049Mi/s null_percent=0 size=65.536k
BatchToTensorSimple<Int64Type>/4194304 44797259 ns 38353000 ns 19 bytes_per_second=104.294Mi/s null_percent=0 size=4.1943M
BatchToTensorSimple<HalfFloatType>/65536 501073 ns 470337 ns 1660 bytes_per_second=132.884Mi/s null_percent=0 size=65.536k
BatchToTensorSimple<HalfFloatType>/4194304 57234822 ns 40693467 ns 15 bytes_per_second=98.2959Mi/s null_percent=0 size=4.1943M
BatchToTensorSimple<FloatType>/65536 420881 ns 419577 ns 1389 bytes_per_second=148.96Mi/s null_percent=0 size=65.536k
BatchToTensorSimple<FloatType>/4194304 41806079 ns 37133778 ns 18 bytes_per_second=107.719Mi/s null_percent=0 size=4.1943M
BatchToTensorSimple<DoubleType>/65536 424610 ns 423430 ns 1346 bytes_per_second=147.604Mi/s null_percent=0 size=65.536k
BatchToTensorSimple<DoubleType>/4194304 37983824 ns 35989222 ns 18 bytes_per_second=111.144Mi/s null_percent=0 size=4.1943M
It might be caused by using RegressionArgs, which also calls SetBytesProcessed in its destructor (if that's the case, then we have some other benchmarks reporting the wrong number as well).
Co-authored-by: Joris Van den Bossche <[email protected]>
Force-pushed from 8f7fc4f to 9b5eb40.
Latest output:
+1, just a minor suggestion
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit fc87fd7. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
We should add benchmarks to make sure we do not cause regressions while working on additional implementations of RecordBatch::ToTensor and Table::ToTensor.
What changes are included in this PR?
New cpp/src/arrow/to_tensor_benchmark.cc file.