Speed up Decimal256 validation based on bytes comparison and add benchmark test #2360

Merged
merged 16 commits into from
Aug 12, 2022

Conversation

liukun4515
Copy link
Contributor

Which issue does this PR close?

Closes #2320

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 8, 2022
@@ -479,6 +563,76 @@ pub(crate) fn validate_decimal_precision(value: i128, precision: usize) -> Resul
}
}

// duplicate code
Copy link
Contributor Author

Copied from the decimal.rs file.

Ordering::Equal
}

pub(crate) fn validate_decimal_precision_with_bytes(lt_value: &[u8], precision: usize) -> Result<i128> {
Copy link
Contributor Author

I will change this to Result<()>.

@liukun4515
Copy link
Contributor Author

@alamb @viirya @tustvold This is just a draft PR to benchmark the ideas I raised in #2320.

@liukun4515
Copy link
Contributor Author

liukun4515 commented Aug 8, 2022

I just added a validation benchmark for Decimal128Array and got the results below:

validate_decimal128_array_slow 20000
                        time:   [125.08 us 125.62 us 126.22 us]
                        change: [-7.0125% -6.4283% -5.8267%] (p = 0.00 < 0.05)
                        Performance has improved.

validate_decimal128_array_fast 20000
                        time:   [107.54 us 107.72 us 107.88 us]
                        change: [-13.320% -13.002% -12.692%] (p = 0.00 < 0.05)
                        Performance has improved.

With the current code, there is a 10~20% performance improvement.

@tustvold tustvold marked this pull request as draft August 8, 2022 09:06
@liukun4515 liukun4515 requested review from viirya, alamb and tustvold and removed request for viirya and alamb August 8, 2022 09:20
@liukun4515 liukun4515 marked this pull request as ready for review August 8, 2022 09:26
@liukun4515 liukun4515 marked this pull request as draft August 8, 2022 10:07
@@ -336,6 +336,31 @@ impl BasicDecimalArray<Decimal128, Decimal128Array> for Decimal128Array {
}
Ok(())
}

fn validate_decimal_with_bytes(&self, precision: usize) -> Result<()> {
Copy link
Contributor

@tustvold tustvold Aug 8, 2022

Unless I'm missing something, the performance benefit from this is derived from the additional work being performed by the iterator. In particular BasicDecimal::new. Perhaps we should fix this, as this will have benefits all over the codebase? For one thing BasicDecimal could easily take a slice with a known size.

Copy link
Contributor Author

Yes, I think the overhead comes from xxDecimal::new() and the conversion to i128.

With the method in this PR, we can improve performance by removing that overhead.

Currently, iterating over or indexing a decimal array returns a Decimal128 or Decimal256 value.
In many cases we don't need the decimal struct, only the little-endian bytes.

Can we add methods to index or iterate over a decimal array whose value type is &[u8]?
Is this what you mean?
@tustvold
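
The idea of handing out raw little-endian bytes instead of decimal structs could look roughly like the sketch below. This is only an illustration under stated assumptions: `RawDecimals`, `iter_bytes`, the `Vec<bool>` validity mask, and `sample` are hypothetical names invented here, not part of the arrow-rs API.

```rust
/// Toy stand-in for a decimal array: fixed-width little-endian values stored
/// in one flat buffer, plus a validity mask. All names are hypothetical.
struct RawDecimals {
    width: usize,     // bytes per value, e.g. 16 or 32
    data: Vec<u8>,    // data.len() == width * valid.len()
    valid: Vec<bool>, // false marks a null slot
}

impl RawDecimals {
    /// Yields `Some(bytes)` for valid slots and `None` for nulls, without
    /// constructing any intermediate decimal value.
    fn iter_bytes(&self) -> impl Iterator<Item = Option<&[u8]>> + '_ {
        self.data
            .chunks_exact(self.width)
            .zip(self.valid.iter())
            .map(|(chunk, &is_valid)| if is_valid { Some(chunk) } else { None })
    }
}

/// Three 16-byte values, the middle one marked as null.
fn sample() -> RawDecimals {
    let mut data = Vec::new();
    for v in [1i128, -2, 300] {
        data.extend_from_slice(&v.to_le_bytes());
    }
    RawDecimals { width: 16, data, valid: vec![true, false, true] }
}
```

A caller that only needs to compare bytes can consume `iter_bytes()` directly and never pay for building a decimal value.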

Copy link
Contributor

> yes, i think the overhead is from the xxDecimal::new() and convert to i128.

LLVM should be smart enough to elide this, it likely isn't because there are bound checks on the byte array length (which are completely unnecessary) and it isn't smart enough to elide these, which prevents eliding the rest.

It should definitely be possible to make these methods perform the same, it just might require a bit of hand-holding to help LLVM. This would then benefit all call-sites that use the iterator
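
As a rough illustration of the "slice with a known size" point: when the length is part of the type, the length check disappears from the decode step itself. The function names below are hypothetical, and whether the optimizer actually removes the remaining checks in any given build is an assumption to verify with a benchmark, not something this sketch proves.

```rust
use std::convert::TryInto;

/// Decode from a slice: the length is only known at run time, so the
/// conversion to a 16-byte array has to be checked.
fn decode_from_slice(bytes: &[u8]) -> Option<i128> {
    let fixed: [u8; 16] = bytes.try_into().ok()?;
    Some(i128::from_le_bytes(fixed))
}

/// Decode from a fixed-size array: the length is part of the type, so there
/// is nothing left to check here.
fn decode_from_array(bytes: &[u8; 16]) -> i128 {
    i128::from_le_bytes(*bytes)
}

/// Sum every 16-byte value in a flat buffer. `chunks_exact(16)` only yields
/// chunks of exactly 16 bytes, so the conversion below cannot fail.
fn sum_values(buffer: &[u8]) -> i128 {
    buffer
        .chunks_exact(16)
        .map(|chunk| {
            let fixed: &[u8; 16] = chunk.try_into().expect("chunk is 16 bytes");
            decode_from_array(fixed)
        })
        .sum()
}
```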

Copy link
Contributor Author

> yes, i think the overhead is from the xxDecimal::new() and convert to i128.

> LLVM should be smart enough to elide this, it likely isn't because there are bound checks on the byte array length (which are completely unnecessary) and it isn't smart enough to elide these, which prevents eliding the rest.

> It should definitely be possible to make these methods perform the same, it just might require a bit of hand-holding to help LLVM. This would then benefit all call-sites that use the iterator

Do you have any suggestions or a plan for that?

Copy link
Contributor

I can add it to my list of things to do this week

Copy link
Contributor Author

> I can add it to my list of things to do this week

Do you think this PR is on the right path? If so, I will continue this work:
refactor the Decimal256Array validation and add a way to iterate over the raw bytes of a decimal array.
@tustvold

Copy link
Contributor Author

@tustvold @alamb From the benchmark results, I think this refactor is workable and reasonable.

Copy link
Contributor

If you don't mind, I would like to take some time at some point in the next couple of days to take a holistic look at what is going on with decimals, and then get back to you

Copy link
Member

> It should definitely be possible to make these methods perform the same, it just might require a bit of hand-holding to help LLVM. This would then benefit all call-sites that use the iterator

This sounds promising.

Copy link
Contributor Author

@tustvold I removed the changes to the Decimal128 validation and only refactored the Decimal256 validation logic.
PTAL

@liukun4515
Copy link
Contributor Author

I guess we can get a bigger improvement for the Decimal256Array type.

@alamb
Copy link
Contributor

alamb commented Aug 8, 2022

Hi @liukun4515 -- I don't fully understand this approach for validating Decimal values though I have not studied it as carefully as I would like to. Most other arrays have two paths:

  1. Create from known good data (used in Builders where each individual value was validated individually) -- ArrayData::new_unchecked typically
  2. Create from an array of arbitrary user input and then do a vectorized validation (ArrayData::validate)
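
The two paths can be pictured with a toy type. This is a simplified sketch, not the real `ArrayData` API: `ToyDecimals`, `try_new`, `max_for`, and the `String` error type are hypothetical, and the real `ArrayData::new_unchecked` is an `unsafe fn`, which this toy does not model.

```rust
/// Toy decimal column. Names are hypothetical, not the arrow-rs types.
#[derive(Debug)]
struct ToyDecimals {
    values: Vec<i128>,
    precision: u32, // assumed to be in 1..=38 so that 10^precision fits in i128
}

impl ToyDecimals {
    /// Largest magnitude that fits in `precision` decimal digits.
    fn max_for(precision: u32) -> i128 {
        10i128.pow(precision) - 1
    }

    /// Path 1: the caller promises every value was already checked
    /// (for example by a builder), so nothing is validated here.
    fn new_unchecked(values: Vec<i128>, precision: u32) -> Self {
        Self { values, precision }
    }

    /// Path 2: arbitrary input, validated in a single pass after construction.
    fn try_new(values: Vec<i128>, precision: u32) -> Result<Self, String> {
        let out = Self::new_unchecked(values, precision);
        out.validate()?;
        Ok(out)
    }

    /// Whole-column validation: report the first value outside the range.
    fn validate(&self) -> Result<(), String> {
        let max = Self::max_for(self.precision);
        match self.values.iter().find(|v| **v > max || **v < -max) {
            Some(v) => Err(format!("{} does not fit precision {}", v, self.precision)),
            None => Ok(()),
        }
    }
}
```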

Since this PR seems to do something else a bit different I need to think about it some more, but I don't think I'll have time until later this week or this weekend to do so (I got caught up in #2369 for longer than I expected and now I am behind in some other areas)

It sounds like @tustvold has some thoughts about the "big picture" of Decimal in general so perhaps that will help

@alamb
Copy link
Contributor

alamb commented Aug 8, 2022

Possibly also related: #2362

@liukun4515 liukun4515 force-pushed the valid_decimal_with_bytes branch from 6ebf1da to f03830b Compare August 9, 2022 10:05
@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 9, 2022
@liukun4515
Copy link
Contributor Author

liukun4515 commented Aug 9, 2022

I refactored the validation for Decimal256. @viirya

validate_decimal256_array_slow 2000
                        time:   [160.63 ms 161.90 ms 163.14 ms]
                        change: [-27.571% -25.268% -23.024%] (p = 0.00 < 0.05)
                        Performance has improved.

validate_decimal256_array_fast 2000
                        time:   [3.0104 ms 3.0427 ms 3.0766 ms]

That is roughly 50x faster than before.
The benchmark code is in the decimal_validate.rs file.
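
For readers unfamiliar with the technique, comparing two signed 256-bit values directly on their little-endian two's-complement bytes can be sketched as follows. This is a minimal illustration, not the code merged in this PR: `compare_signed_le`, `sign_extend_to_256`, and `in_range` are hypothetical names.

```rust
use std::cmp::Ordering;

/// Compare two 256-bit two's-complement integers stored as little-endian bytes.
fn compare_signed_le(a: &[u8; 32], b: &[u8; 32]) -> Ordering {
    // The sign bit is the top bit of the most significant (last) byte.
    let a_negative = a[31] & 0x80 != 0;
    let b_negative = b[31] & 0x80 != 0;
    match (a_negative, b_negative) {
        (true, false) => Ordering::Less,
        (false, true) => Ordering::Greater,
        // Same sign: an unsigned comparison starting from the most significant
        // byte gives the correct order for two's-complement values.
        _ => a.iter().rev().cmp(b.iter().rev()),
    }
}

/// Helper for building inputs: widen an i128 to 32 little-endian bytes.
fn sign_extend_to_256(v: i128) -> [u8; 32] {
    let fill = if v < 0 { 0xFF } else { 0x00 };
    let mut out = [fill; 32];
    out[..16].copy_from_slice(&v.to_le_bytes());
    out
}

/// A value passes validation when min <= value <= max, all in byte form.
fn in_range(value: &[u8; 32], min: &[u8; 32], max: &[u8; 32]) -> bool {
    compare_signed_le(value, min) != Ordering::Less
        && compare_signed_le(value, max) != Ordering::Greater
}
```

With the bounds for a given precision stored as bytes, validation needs no big-integer or string conversion per value.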

@liukun4515
Copy link
Contributor Author

I will try to replace the i128- or str-based validation with byte-array-based validation. @alamb @viirya @tustvold

@alamb alamb mentioned this pull request Aug 9, 2022
@HaoYang670
Copy link
Contributor

HaoYang670 commented Aug 11, 2022

> I found some weird things about decimal128/i128, so I removed the refactor of decimal128 and left only decimal256. PTAL

What weird thing did you run into? @liukun4515

@liukun4515
Copy link
Contributor Author

liukun4515 commented Aug 11, 2022

> I found some weird things about decimal128/i128, so I removed the refactor of decimal128 and left only decimal256. PTAL

> What weird thing did you run into? @liukun4515

Different sizes of Decimal128Array give different benchmark results.
For example, when the array size is 2000, comparing values in the byte-array ([u8; 16]) format is faster than using i128 directly.
But when the array size is 20000, the result is the opposite.

I am working on finding the reason, but I don't want to block this work.
I removed the part that confused us to make the PR easier to review.
@HaoYang670

Copy link
Contributor

@alamb alamb left a comment

This PR looks like a nice improvement to me -- I realize it doesn't affect but the improvement to validate min/max values of Decimals using precomputed tables of binary representations rather than &str seems like a very reasonable performance improvement to me
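
A precomputed table of bounds in binary form can be sketched as below. To stay within the standard library the example uses 128-bit values, whereas the PR targets Decimal256; `MAX_PRECISION`, `max_value_table`, and `max_bytes` are hypothetical names, and building the table at run time is a simplification.

```rust
/// Highest precision whose bound (10^38 - 1) still fits in an i128.
const MAX_PRECISION: usize = 38;

/// Build, once, the little-endian bytes of 10^p - 1 for p = 1..=38.
/// Entry `p - 1` holds the largest value with `p` decimal digits.
fn max_value_table() -> [[u8; 16]; MAX_PRECISION] {
    let mut table = [[0u8; 16]; MAX_PRECISION];
    let mut power: i128 = 1;
    for entry in table.iter_mut() {
        power *= 10;
        *entry = (power - 1).to_le_bytes();
    }
    table
}

/// Look up the upper bound for `precision`; `None` if it is out of range.
fn max_bytes(table: &[[u8; 16]; MAX_PRECISION], precision: usize) -> Option<&[u8; 16]> {
    precision.checked_sub(1).and_then(|index| table.get(index))
}
```

Each validation call then reduces to a table lookup plus a byte comparison, with no string parsing on the hot path.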

I realize there is still an outstanding question on #2387 of if we should be doing decimal value validation at all, but I think we can improve the current implementation even as we refine the overall semantics

@tustvold @HaoYang670 @viirya any concern with merging this PR as is?

@alamb alamb changed the title speed up the decimal256 validation based on bytes comparison and add benchmark test Speed up Decimal256 validation based on bytes comparison and add benchmark test Aug 11, 2022
Copy link
Member

@viirya viirya left a comment

A few typos.

liukun4515 and others added 2 commits August 12, 2022 09:17
if precision < self.precision {
for v in self.iter().flatten() {
validate_decimal256_precision(&v.to_big_int(), precision)?;
fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
Copy link
Contributor

Personally, I would prefer this function to be an independent fn rather than a method of Decimal256Array, because:

  1. DecimalArray has its own precision, so passing another precision to its method is somewhat weird.
  2. This method is only for Decimal256Array, not the generic decimal array. Moving it out of the impl Decimal256Array can make the code cleaner.

Copy link
Contributor Author

@liukun4515 liukun4515 Aug 12, 2022

> Personally, I prefer this function to be an independent fn, but not a method of Decimal256Array, because

If this method is only for Decimal256Array, why should we move it out of the impl? I am confused by this suggestion.
Besides, it's a private method.
If we move it out of the impl Decimal256Array, how do we get the data of Self?

> 1. DecimalArray has its own precision, providing another precision to its method is somewhat weird.

with_precision_and_scale also takes a precision and scale. Is that also weird?

> 2. This method is just for Decimal256Array, but not the generic decimal array. Moving it out the impl Decimal256Array can make the code cleaner.

I am confused by this reason. The method is only used by Decimal256Array, so why not treat it as a private function?

if precision < self.precision {
for v in self.iter().flatten() {
validate_decimal256_precision(&v.to_big_int(), precision)?;
fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
Copy link
Contributor

Do we need this check if precision >= self.precision?

Copy link
Contributor Author

good catch
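
The point of the question can be shown with a toy type: if every value already fits the array's current precision, an equal or wider target precision cannot fail, so the scan can be skipped. `ToyDecimalArray` and its fields are hypothetical, and the sketch assumes the stored values really do respect `self.precision`.

```rust
/// Toy array; names are hypothetical and precision is assumed to be in 1..=38.
struct ToyDecimalArray {
    values: Vec<i128>,
    precision: u32,
}

impl ToyDecimalArray {
    fn validate_precision(&self, new_precision: u32) -> Result<(), String> {
        // Assumption: all values already fit `self.precision`, so an equal or
        // wider target precision needs no per-value check.
        if new_precision >= self.precision {
            return Ok(());
        }
        let max = 10i128.pow(new_precision) - 1;
        for v in &self.values {
            if *v > max || *v < -max {
                return Err(format!("{} does not fit precision {}", v, new_precision));
            }
        }
        Ok(())
    }
}
```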

} else {
let offset = current + data.offset();
current += 1;
let raw_val = unsafe {
Copy link
Contributor

This could be extracted as a separate method DecimalArray::raw_value.

Copy link
Contributor Author

value_unchecked also uses it.
Maybe you can extract them in a follow-up PR.
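
A shared helper of the kind suggested here might look like the sketch below. It is a safe, bounds-checked illustration rather than the unsafe pointer arithmetic used in the PR, and `FixedWidthBuffer`, its fields, and this `raw_value` signature are assumptions made for the example.

```rust
/// Bytes per value for a 256-bit decimal.
const VALUE_LENGTH: usize = 32;

/// Toy stand-in for the array's backing storage. Names are hypothetical.
struct FixedWidthBuffer {
    data: Vec<u8>, // flat buffer of VALUE_LENGTH-byte values
    offset: usize, // index of the first logical value in `data`
    len: usize,    // number of logical values
}

impl FixedWidthBuffer {
    /// Returns the little-endian bytes of value `idx`, or `None` if `idx` is
    /// out of range or the buffer is too short.
    fn raw_value(&self, idx: usize) -> Option<&[u8]> {
        if idx >= self.len {
            return None;
        }
        let start = (self.offset + idx) * VALUE_LENGTH;
        self.data.get(start..start + VALUE_LENGTH)
    }
}
```

Both the validation loop and a `value_unchecked`-style accessor could then call one helper instead of repeating the offset computation.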

Comment on lines 335 to 356
let current_end = self.data.len();
let mut current: usize = 0;
let data = &self.data;

while current != current_end {
    if self.is_null(current) {
        current += 1;
        continue;
    } else {
        let offset = current + data.offset();
        current += 1;
        let raw_val = unsafe {
            let pos = self.value_offset_at(offset);
            std::slice::from_raw_parts(
                self.raw_value_data_ptr().offset(pos as isize),
                Self::VALUE_LENGTH as usize,
            )
        };
        validate_decimal256_precision_with_lt_bytes(raw_val, precision)?;
    }
}
Ok(())
Copy link
Contributor

Suggested change
(0..self.len())
    .filter(|idx| self.is_valid(*idx))
    .try_for_each(|idx| {
        let raw_val = unsafe {
            let pos = self.value_offset(idx);
            std::slice::from_raw_parts(
                self.raw_value_data_ptr().offset(pos as isize),
                Self::VALUE_LENGTH as usize,
            )
        };
        validate_decimal256_precision_with_lt_bytes(raw_val, precision)
    })

Copy link
Contributor Author

I did not apply the suggestion because it causes a performance regression.
The performance of your version:

validate_decimal256_array 20000
                        time:   [393.73 us 402.59 us 412.64 us]
                        change: [+0.0864% +1.9579% +3.8640%] (p = 0.05 < 0.05)
                        Change within noise threshold.

My version:

validate_decimal256_array 20000
                        time:   [282.54 us 289.68 us 297.22 us]
                        change: [-29.976% -27.993% -25.897%] (p = 0.00 < 0.05)
                        Performance has improved.

I guess the reason is that it loops twice and creates an intermediate iterator.

Copy link
Contributor Author

@HaoYang670 I think this is an interesting result.

Copy link
Contributor

Could you please try

  (0..self.len())
      .try_for_each(|idx| {
          if self.is_valid(idx) { ... }
      })

?
I am afraid the filter function introduces some overhead.

Copy link
Contributor Author

I think this change makes sense.
I will try it and post the benchmark result.

Copy link
Contributor Author

There is no performance regression without the filter, so I will apply the change to the codebase.
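
For reference, the two loop shapes discussed in this thread look like this on a toy nullable column. The sketch only shows that the two forms are equivalent in behaviour; it makes no performance claim, and `check`, `validate_with_filter`, and `validate_with_branch` are hypothetical names.

```rust
/// Per-value check used by both variants.
fn check(value: i64, max: i64) -> Result<(), String> {
    if value > max || value < -max {
        Err(format!("{} is out of range", value))
    } else {
        Ok(())
    }
}

/// Variant 1: skip nulls with `filter`, then validate.
fn validate_with_filter(values: &[Option<i64>], max: i64) -> Result<(), String> {
    (0..values.len())
        .filter(|idx| values[*idx].is_some())
        .try_for_each(|idx| check(values[idx].unwrap(), max))
}

/// Variant 2: branch on validity inside `try_for_each`, with no `filter` adapter.
fn validate_with_branch(values: &[Option<i64>], max: i64) -> Result<(), String> {
    (0..values.len()).try_for_each(|idx| match values[idx] {
        Some(value) => check(value, max),
        None => Ok(()), // nulls are always accepted
    })
}
```

Both variants stop at the first invalid value because `try_for_each` short-circuits on `Err`.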

@liukun4515
Copy link
Contributor Author

If the changes look good to you, please approve them. @HaoYang670
Once CI passes, I will merge it.

@liukun4515 liukun4515 force-pushed the valid_decimal_with_bytes branch from 2b353a3 to 3aab001 Compare August 12, 2022 05:15
@liukun4515
Copy link
Contributor Author

[05:20:36] 'compile:es2015:umd' errored after 36 s
[05:20:36] Error: gulp-google-closure-compiler: java.util.zip.ZipException: invalid entry CRC (expected 0x4e1f14a4 but got 0xb1e0eb5b)
	at java.util.zip.ZipInputStream.readEnd(ZipInputStream.java:410)
	at java.util.zip.ZipInputStream.read(ZipInputStream.java:199)
	at java.util.zip.ZipInputStream.closeEntry(ZipInputStream.java:143)
	at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:121)
	at com.google.javascript.jscomp.AbstractCommandLineRunner.getBuiltinExterns(AbstractCommandLineRunner.java:500)
	at com.google.javascript.jscomp.CommandLineRunner.createExterns(CommandLineRunner.java:2084)
	at com.google.javascript.jscomp.AbstractCommandLineRunner.doRun(AbstractCommandLineRunner.java:1187)
	at com.google.javascript.jscomp.AbstractCommandLineRunner.run(AbstractCommandLineRunner.java:551)
	at com.google.javascript.jscomp.CommandLineRunner.main(CommandLineRunner.java:2246)
Error writing to stdin of the compiler. write EPIPE

CustomError: gulp-google-closure-compiler: Compilation errors occurred
    at CompilationStream._compilationComplete (/arrow/js/node_modules/google-closure-compiler/lib/gulp/index.js:238:28)
    at /arrow/js/node_modules/google-closure-compiler/lib/gulp/index.js:208:14
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

    at formatError (/arrow/js/node_modules/gulp-cli/lib/versioned/^4.0.0/format-error.js:21:10)
    at Gulp.<anonymous> (/arrow/js/node_modules/gulp-cli/lib/versioned/^4.0.0/log/events.js:33:15)
    at Gulp.emit (node:events:538:35)
    at Gulp.emit (node:domain:475:12)
    at Object.error (/arrow/js/node_modules/undertaker/lib/helpers/createExtensions.js:61:10)
    at handler (/arrow/js/node_modules/now-and-later/lib/mapSeries.js:47:14)
    at f (/arrow/js/node_modules/once/once.js:25:25)
    at f (/arrow/js/node_modules/once/once.js:25:25)
    at tryCatch (/arrow/js/node_modules/bach/node_modules/async-done/index.js:24:15)
    at done (/arrow/js/node_modules/bach/node_modules/async-done/index.js:40:12)
[05:20:36] 'build:es2015:umd' errored after 2.35 min
[05:20:36] 'build:apache-arrow' errored after 2.35 min
[05:20:36] 'build' errored after 2.37 min
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
1
Error: `docker-compose --file /home/runner/work/arrow-rs/arrow-rs/docker-compose.yml run --rm -e ARCHERY_INTEGRATION_WITH_RUST=1 conda-integration` exited with a non-zero exit code 1, see the process log above.

The CI failed, but the failure has nothing to do with this PR.

@liukun4515 liukun4515 merged commit b235173 into apache:master Aug 12, 2022
@liukun4515 liukun4515 deleted the valid_decimal_with_bytes branch August 12, 2022 06:02
@ursabot
Copy link

ursabot commented Aug 12, 2022

Benchmark runs are scheduled for baseline = e7dcfbc and contender = b235173. b235173 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@alamb
Copy link
Contributor

alamb commented Aug 12, 2022

Thanks everyone for getting this through. Good team work 👍

@tustvold tustvold mentioned this pull request Sep 26, 2022
Labels
arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize the validation of Decimal256
6 participants