
Speed up Decimal256 validation based on bytes comparison and add benchmark test #2360

Merged
16 commits merged into apache:master on Aug 12, 2022

Conversation

liukun4515
Contributor

Which issue does this PR close?

Closes #2320

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 8, 2022
@@ -479,6 +563,76 @@ pub(crate) fn validate_decimal_precision(value: i128, precision: usize) -> Resul
}
}

// duplicate code
Contributor Author

Copied from the decimal.rs file.

Ordering::Equal
}

pub(crate) fn validate_decimal_precision_with_bytes(lt_value: &[u8], precision: usize) -> Result<i128> {
Contributor Author

Will change this to return Result<()>.

@liukun4515
Contributor Author

@alamb @viirya @tustvold This is just a draft PR to benchmark the idea I brought up in #2320.

@liukun4515
Contributor Author

liukun4515 commented Aug 8, 2022

I just added a benchmark for Decimal128Array validation. The results are below:

validate_decimal128_array_slow 20000
                        time:   [125.08 us 125.62 us 126.22 us]
                        change: [-7.0125% -6.4283% -5.8267%] (p = 0.00 < 0.05)
                        Performance has improved.

validate_decimal128_array_fast 20000
                        time:   [107.54 us 107.72 us 107.88 us]
                        change: [-13.320% -13.002% -12.692%] (p = 0.00 < 0.05)
                        Performance has improved.

Compared with the current code, there is a 10~20% performance improvement.

@tustvold tustvold marked this pull request as draft August 8, 2022 09:06
@liukun4515 liukun4515 requested review from viirya, alamb and tustvold and removed request for viirya and alamb August 8, 2022 09:20
@liukun4515 liukun4515 marked this pull request as ready for review August 8, 2022 09:26
@liukun4515 liukun4515 marked this pull request as draft August 8, 2022 10:07
@@ -336,6 +336,31 @@ impl BasicDecimalArray<Decimal128, Decimal128Array> for Decimal128Array {
}
Ok(())
}

fn validate_decimal_with_bytes(&self, precision: usize) -> Result<()> {
@tustvold (Contributor) Aug 8, 2022

Unless I'm missing something, the performance benefit here comes from the additional work being performed by the iterator, in particular BasicDecimal::new. Perhaps we should fix that instead, as it would have benefits all over the codebase? For one thing, BasicDecimal could easily take a slice with a known size.

Contributor Author

Yes, I think the overhead comes from xxDecimal::new() and the conversion to i128.

The method in this PR improves performance by avoiding that overhead.

Currently, iterating or indexing a decimal array yields Decimal128 or Decimal256 values. In many cases we don't need the decimal struct, just the little-endian bytes.

Could we add methods that index or iterate a decimal array and yield the values as &[u8]?
Is that what you mean?
@tustvold

Contributor

Yes, I think the overhead comes from xxDecimal::new() and the conversion to i128.

LLVM should be smart enough to elide this. It likely isn't, because there are bounds checks on the byte array length (which are completely unnecessary) that it is not smart enough to elide, and that prevents eliding the rest.

It should definitely be possible to make these methods perform the same; it just might require a bit of hand-holding to help LLVM. This would then benefit all call sites that use the iterator.

Contributor Author


Do you have any suggestions or a plan for that?

Contributor

I can add it to my list of things to do this week

Contributor Author


Do you think this PR is on the right path? If so, I will continue this work: refactor the Decimal256Array validation and add an iterator over the raw bytes of the decimal array.
@tustvold

Contributor Author

@tustvold @alamb From the benchmark results, I think this refactor is workable and reasonable.

Contributor

If you don't mind, I would like to take some time in the next couple of days to have a holistic look at what is going on with decimals, and then get back to you.

Member

It should definitely be possible to make these methods perform the same; it just might require a bit of hand-holding to help LLVM. This would then benefit all call sites that use the iterator.

This sounds promising.

Contributor Author

@tustvold I removed the changes to Decimal128 validation and only refactored the Decimal256 validation logic.
PTAL.

@liukun4515
Contributor Author

I expect we can get an even bigger improvement for the Decimal256Array type.

@alamb
Contributor

alamb commented Aug 8, 2022

Hi @liukun4515 -- I don't fully understand this approach for validating decimal values, though I have not studied it as carefully as I would like. Most other arrays have two paths:

  1. Create from known good data (used in Builders where each individual value was validated individually) -- ArrayData::new_unchecked typically
  2. Create from an array of arbitrary user input and then do a vectorized validation (ArrayData::validate)

Since this PR seems to do something a bit different, I need to think about it some more, but I don't think I'll have time until later this week or this weekend (I got caught up in #2369 for longer than I expected and am now behind in some other areas).

It sounds like @tustvold has some thoughts about the "big picture" of Decimal in general so perhaps that will help

@alamb
Contributor

alamb commented Aug 8, 2022

Possibly also related: #2362

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 9, 2022
@liukun4515
Contributor Author

liukun4515 commented Aug 9, 2022

Refactored the validation for Decimal256, @viirya:

validate_decimal256_array_slow 2000
                        time:   [160.63 ms 161.90 ms 163.14 ms]
                        change: [-27.571% -25.268% -23.024%] (p = 0.00 < 0.05)
                        Performance has improved.

validate_decimal256_array_fast 2000
                        time:   [3.0104 ms 3.0427 ms 3.0766 ms]

Almost 50x faster than before.
You can find the benchmark code in the decimal_validate.rs file.

@liukun4515
Contributor Author

I will try to replace the i128- and str-based validation with byte-array comparison. @alamb @viirya @tustvold

@alamb alamb mentioned this pull request Aug 9, 2022
@HaoYang670
Contributor

HaoYang670 commented Aug 11, 2022

I found some weird things about Decimal128/i128, so I removed the Decimal128 refactor and kept only Decimal256. PTAL

What weird things did you run into? @liukun4515

@liukun4515
Contributor Author

liukun4515 commented Aug 11, 2022

Different Decimal128Array sizes give different benchmark results.
For example, when the array size is 2000, comparing values in the [u8; 16] byte format is faster than using i128 directly.
But when the array size is 20000, the result is the opposite.

I'm still trying to find out the reason, but I don't want it to block this work.
Removing the confusing part keeps the PR easier to review.
@HaoYang670

Contributor

@alamb alamb left a comment


This PR looks like a nice improvement to me. I realize it doesn't affect Decimal128 yet, but validating the min/max values of decimals against precomputed tables of binary representations rather than &str seems like a very reasonable performance improvement to me.

I realize there is still an outstanding question on #2387 about whether we should be doing decimal value validation at all, but I think we can improve the current implementation even as we refine the overall semantics.

@tustvold @HaoYang670 @viirya any concern with merging this PR as is?

@alamb alamb changed the title speed up the decimal256 validation based on bytes comparison and add benchmark test Speed up Decimal256 validation based on bytes comparison and add benchmark test Aug 11, 2022
Member

@viirya viirya left a comment


A few typos.

if precision < self.precision {
for v in self.iter().flatten() {
validate_decimal256_precision(&v.to_big_int(), precision)?;
fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
Contributor

Personally, I would prefer this function to be a standalone fn rather than a method of Decimal256Array, because:

  1. DecimalArray has its own precision; providing another precision to its method is somewhat weird.
  2. This method is only for Decimal256Array, not the generic decimal array. Moving it out of impl Decimal256Array would make the code cleaner.

@liukun4515 (Contributor Author) Aug 12, 2022

Personally, I would prefer this function to be a standalone fn rather than a method of Decimal256Array, because:

If this method is only for Decimal256Array, why should we move it out of the impl? I am confused by this suggestion.
Besides, it's a private method.
If we move it out of impl Decimal256Array, how would it access the data of Self?

  1. DecimalArray has its own precision; providing another precision to its method is somewhat weird.

with_precision_and_scale also takes a precision and a scale; is that also weird?

  2. This method is only for Decimal256Array, not the generic decimal array. Moving it out of impl Decimal256Array would make the code cleaner.

I am confused by this reason. The method is only used by Decimal256Array, so why not keep it as a private method?

if precision < self.precision {
for v in self.iter().flatten() {
validate_decimal256_precision(&v.to_big_int(), precision)?;
fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
Contributor

Do we need this check if precision >= self.precision?

Contributor Author

good catch

} else {
let offset = current + data.offset();
current += 1;
let raw_val = unsafe {
Contributor

This could be extracted as a separate method DecimalArray::raw_value.

Contributor Author

value_unchecked also uses it.
Maybe you can extract them in a follow-up PR.

Comment on lines 335 to 356
let current_end = self.data.len();
let mut current: usize = 0;
let data = &self.data;

while current != current_end {
if self.is_null(current) {
current += 1;
continue;
} else {
let offset = current + data.offset();
current += 1;
let raw_val = unsafe {
let pos = self.value_offset_at(offset);
std::slice::from_raw_parts(
self.raw_value_data_ptr().offset(pos as isize),
Self::VALUE_LENGTH as usize,
)
};
validate_decimal256_precision_with_lt_bytes(raw_val, precision)?;
}
}
Ok(())
Contributor

Suggested change
let current_end = self.data.len();
let mut current: usize = 0;
let data = &self.data;
while current != current_end {
if self.is_null(current) {
current += 1;
continue;
} else {
let offset = current + data.offset();
current += 1;
let raw_val = unsafe {
let pos = self.value_offset_at(offset);
std::slice::from_raw_parts(
self.raw_value_data_ptr().offset(pos as isize),
Self::VALUE_LENGTH as usize,
)
};
validate_decimal256_precision_with_lt_bytes(raw_val, precision)?;
}
}
Ok(())
(0..self.len())
.filter(|idx| self.is_valid(*idx))
.try_for_each(|idx| {
let raw_val = unsafe {
let pos = self.value_offset(idx);
std::slice::from_raw_parts(
self.raw_value_data_ptr().offset(pos as isize),
Self::VALUE_LENGTH as usize,
)
};
validate_decimal256_precision_with_lt_bytes(raw_val, precision)
})

Contributor Author

I did not apply the suggestion because of a performance regression.
The performance of your version:

validate_decimal256_array 20000
                        time:   [393.73 us 402.59 us 412.64 us]
                        change: [+0.0864% +1.9579% +3.8640%] (p = 0.05 < 0.05)
                        Change within noise threshold.

My version:

validate_decimal256_array 20000
                        time:   [282.54 us 289.68 us 297.22 us]
                        change: [-29.976% -27.993% -25.897%] (p = 0.00 < 0.05)
                        Performance has improved.

I guess the reason is that it loops twice and creates an intermediate iterator.

Contributor Author

@HaoYang670 I think this is an interesting result.

Contributor

Could you please try

  (0..self.len())
      .try_for_each(|idx| {
          if self.is_valid(idx) { ... } else { Ok(()) }
      })

?
I am afraid the filter function introduces some overhead.

Contributor Author

I think this change makes sense.
I will try it and post the benchmark results.

Contributor Author

There is no performance regression without the filter, and I will apply the change to the codebase.

@liukun4515
Contributor Author

If the changes look good to you, please approve. @HaoYang670
Once CI passes, I will merge it.

@liukun4515
Contributor Author

[05:20:36] 'compile:es2015:umd' errored after 36 s
[05:20:36] Error: gulp-google-closure-compiler: java.util.zip.ZipException: invalid entry CRC (expected 0x4e1f14a4 but got 0xb1e0eb5b)
	at java.util.zip.ZipInputStream.readEnd(ZipInputStream.java:410)
	at java.util.zip.ZipInputStream.read(ZipInputStream.java:199)
	at java.util.zip.ZipInputStream.closeEntry(ZipInputStream.java:143)
	at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:121)
	at com.google.javascript.jscomp.AbstractCommandLineRunner.getBuiltinExterns(AbstractCommandLineRunner.java:500)
	at com.google.javascript.jscomp.CommandLineRunner.createExterns(CommandLineRunner.java:2084)
	at com.google.javascript.jscomp.AbstractCommandLineRunner.doRun(AbstractCommandLineRunner.java:1187)
	at com.google.javascript.jscomp.AbstractCommandLineRunner.run(AbstractCommandLineRunner.java:551)
	at com.google.javascript.jscomp.CommandLineRunner.main(CommandLineRunner.java:2246)
Error writing to stdin of the compiler. write EPIPE

CustomError: gulp-google-closure-compiler: Compilation errors occurred
    at CompilationStream._compilationComplete (/arrow/js/node_modules/google-closure-compiler/lib/gulp/index.js:238:28)
    at /arrow/js/node_modules/google-closure-compiler/lib/gulp/index.js:208:14
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

    at formatError (/arrow/js/node_modules/gulp-cli/lib/versioned/^4.0.0/format-error.js:21:10)
    at Gulp.<anonymous> (/arrow/js/node_modules/gulp-cli/lib/versioned/^4.0.0/log/events.js:33:15)
    at Gulp.emit (node:events:538:35)
    at Gulp.emit (node:domain:475:12)
    at Object.error (/arrow/js/node_modules/undertaker/lib/helpers/createExtensions.js:61:10)
    at handler (/arrow/js/node_modules/now-and-later/lib/mapSeries.js:47:14)
    at f (/arrow/js/node_modules/once/once.js:25:25)
    at f (/arrow/js/node_modules/once/once.js:25:25)
    at tryCatch (/arrow/js/node_modules/bach/node_modules/async-done/index.js:24:15)
    at done (/arrow/js/node_modules/bach/node_modules/async-done/index.js:40:12)
[05:20:36] 'build:es2015:umd' errored after 2.35 min
[05:20:36] 'build:apache-arrow' errored after 2.35 min
[05:20:36] 'build' errored after 2.37 min
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
1
Error: `docker-compose --file /home/runner/work/arrow-rs/arrow-rs/docker-compose.yml run --rm -e ARCHERY_INTEGRATION_WITH_RUST=1 conda-integration` exited with a non-zero exit code 1, see the process log above.

The integration test failed, but it is unrelated to this PR.

@liukun4515 liukun4515 merged commit b235173 into apache:master Aug 12, 2022
@liukun4515 liukun4515 deleted the valid_decimal_with_bytes branch August 12, 2022 06:02
@ursabot

ursabot commented Aug 12, 2022

Benchmark runs are scheduled for baseline = e7dcfbc and contender = b235173. b235173 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@alamb
Contributor

alamb commented Aug 12, 2022

Thanks everyone for getting this through. Good team work 👍

@tustvold tustvold mentioned this pull request Sep 26, 2022
Labels: arrow (Changes to the arrow crate), parquet (Changes to the parquet crate)

Linked issue: Optimize the validation of Decimal256
6 participants