[SPARK-36730][SQL] Use V2 Filter in V2 file source #36332

Closed
wants to merge 1 commit

Conversation

huaxingao
Contributor

Co-Authored-By: DB Tsai [email protected]
Co-Authored-By: Huaxin Gao [email protected]

What changes were proposed in this pull request?

Updated the V2 file source to use V2 Filters. ParquetFilters hasn't been updated to use V2 Filters yet and will be changed in a follow-up PR.
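
For context, a minimal sketch of the difference between the two filter representations, assuming Spark's public connector expression factory (Expressions.column / Expressions.literal); the column name "a" and the literal 1 are illustrative only:

import org.apache.spark.sql.connector.expressions.Expression;
import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.expressions.filter.Predicate;
import org.apache.spark.sql.sources.EqualTo;
import org.apache.spark.sql.sources.Filter;

public class FilterVersions {
  public static void main(String[] args) {
    // V1 filter: the column is referenced by a plain string name.
    Filter v1 = new EqualTo("a", 1);

    // Roughly equivalent V2 predicate: a named expression tree built from
    // a column reference and a typed literal.
    Predicate v2 = new Predicate("=", new Expression[] {
        Expressions.column("a"), Expressions.literal(1)});

    System.out.println(v1 + " vs " + v2.describe());
  }
}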

Why are the changes needed?

V2 Filter migration

Does this PR introduce any user-facing change?

no

How was this patch tested?

New and existing test suites

@huaxingao
Contributor Author

cc @cloud-fan @viirya @beliefer

Comment on lines 73 to 74
val actualFilters = pushedFilters.map(_.toV1)
  .filterNot(_.references.contains(parsedOptions.columnNameOfCorruptRecord))
Member

Hmm, why can't we work on v2 directly here and in other places? It feels verbose and redundant to see toV1 and toV2 there.

Contributor Author

Currently OrcFilters, ParquetFilters, JacksonParser, and UnivocityParser only take v1 filters. There is actually quite a bit of work involved in refactoring these to also work with v2 filters. I'd prefer to do that in separate PRs later on.

Member

I see. Maybe we can create JIRAs for these planned changes and put the JIRA numbers into code comments.

Contributor Author

Added comments. Thanks

Contributor

@beliefer beliefer left a comment

Uh, I just want to know: why don't we refactor FileScanBuilder by replacing SupportsPushDownCatalystFilters with SupportsPushDownV2Filters?

if (children()[1] instanceof LiteralValue) {
  // e.g. a = 1
  return new EqualTo(children()[0].describe(),
    CatalystTypeConverters.convertToScala(((LiteralValue) children()[1]).value(),
Contributor

Two indents

Contributor Author

I had changed my IDE's Java continuation indent to 4 for another project. Changed it back to 2 spaces. All the indentation should be good now.

} else if (children()[0] instanceof LiteralValue) {
  // e.g. 1 = a
  return new EqualTo(children()[1].describe(),
    CatalystTypeConverters.convertToScala(((LiteralValue) children()[0]).value(),
Contributor

ditto

@@ -59,7 +59,7 @@ case class AvroScan(
      readDataSchema,
      readPartitionSchema,
      parsedOptions,
-     pushedFilters)
+     pushedFilters.map(_.toV1))
Contributor

It seems we won't need toV1 in the future?

Contributor Author

right

public org.apache.spark.sql.sources.Filter toV1() {
  String expressionStr = "";
  for (Expression e : children()) {
    expressionStr += e.describe() + ", ";
Contributor

@LuciferYang LuciferYang Apr 25, 2022

should use StringBuilder here

Contributor Author

Changed. Thanks
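
For reference, a minimal sketch of the suggested fix, factored into a standalone helper (the class and method names here are hypothetical):

import org.apache.spark.sql.connector.expressions.Expression;

final class DescribeUtil {
  // Accumulate the child descriptions with a StringBuilder instead of
  // repeated String concatenation inside the loop.
  static String describeChildren(Expression[] children) {
    StringBuilder expressionStr = new StringBuilder();
    for (Expression e : children) {
      expressionStr.append(e.describe()).append(", ");
    }
    return expressionStr.toString();
  }
}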

@huaxingao
Contributor Author

Uh, I just want to know: why don't we refactor FileScanBuilder by replacing SupportsPushDownCatalystFilters with SupportsPushDownV2Filters?

We actually intentionally push down catalyst Expressions instead of filters in the file sources, because the file sources need to do partition pruning, which uses catalyst Expressions.

@@ -146,4 +149,210 @@ public class Predicate extends GeneralScalarExpression {
  public Predicate(String name, Expression[] children) {
    super(name, children);
  }

  public org.apache.spark.sql.sources.Filter toV1() {
Contributor

We shouldn't add this public API. Can we have a private internal util function to do it?
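
A rough sketch of what such an internal helper might look like, handling only the "=" case (the class name V2FilterUtils is hypothetical, and real code would also convert the literal value, e.g. via CatalystTypeConverters as in the diff above):

import org.apache.spark.sql.connector.expressions.Expression;
import org.apache.spark.sql.connector.expressions.Literal;
import org.apache.spark.sql.connector.expressions.NamedReference;
import org.apache.spark.sql.connector.expressions.filter.Predicate;
import org.apache.spark.sql.sources.EqualTo;
import org.apache.spark.sql.sources.Filter;

// Hypothetical package-private helper: keeps the V2 -> V1 translation out of
// the public Predicate API, as suggested above.
final class V2FilterUtils {
  private V2FilterUtils() {}

  static Filter toV1(Predicate predicate) {
    Expression[] children = predicate.children();
    if ("=".equals(predicate.name())
        && children.length == 2
        && children[0] instanceof NamedReference
        && children[1] instanceof Literal) {
      return new EqualTo(children[0].describe(), ((Literal<?>) children[1]).value());
    }
    return null; // unsupported shape: caller falls back to no pushdown
  }
}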

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 23, 2022
@github-actions github-actions bot closed this Oct 24, 2022
dongjoon-hyun pushed a commit that referenced this pull request Nov 4, 2023
### What changes were proposed in this pull request?
This PR upgrades Apache Arrow from 13.0.0 to 14.0.0.

### Why are the changes needed?
The Apache Arrow 14.0.0 release brings a number of enhancements and bug fixes.

In terms of bug fixes, the release addresses several critical issues that were causing failures in integration jobs with Spark ([GH-36332](apache/arrow#36332)) and problems with importing empty data arrays ([GH-37056](apache/arrow#37056)). It also optimizes the process of appending variable-length vectors ([GH-37829](apache/arrow#37829)) and includes C++ libraries for macOS AARCH64 in Java-Jars ([GH-38076](apache/arrow#38076)).

The new features and improvements focus on enhancing the handling and manipulation of data. This includes the introduction of DefaultVectorComparators for large types ([GH-25659](apache/arrow#25659)), support for extended expressions in ScannerBuilder ([GH-34252](apache/arrow#34252)), and the exposure of the VectorAppender class ([GH-37246](apache/arrow#37246)).

The release also brings enhancements to the development and testing process, with the CI environment now using JDK 21 ([GH-36994](apache/arrow#36994)). In addition, the release introduces vector validation consistent with C++, ensuring consistency across different languages ([GH-37702](apache/arrow#37702)).

Furthermore, the usability of VarChar writers and binary writers has been improved with the addition of extra input methods ([GH-37705](apache/arrow#37705)), and VarCharWriter now supports writing from `Text` and `String` ([GH-37706](apache/arrow#37706)). The release also adds typed getters for StructVector, improving the ease of accessing data ([GH-37863](apache/arrow#37863)).

The full release notes are as follows:
- https://arrow.apache.org/release/14.0.0.html

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43650 from LuciferYang/arrow-14.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>