[SPARK-36730][SQL] Use V2 Filter in V2 file source #36332

Closed
wants to merge 1 commit

Conversation

huaxingao
Contributor

Co-Authored-By: DB Tsai [email protected]
Co-Authored-By: Huaxin Gao [email protected]

What changes were proposed in this pull request?

Updated the V2 file source to use V2 Filters. ParquetFilters hasn't been updated to use V2 Filters yet and will be changed in a follow-up PR.
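
For context, a minimal sketch of the difference between the two filter representations, assuming Spark's public connector expression factory (Expressions.column / Expressions.literal); the column name "a" and the literal 1 are illustrative only:

import org.apache.spark.sql.connector.expressions.Expression;
import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.expressions.filter.Predicate;
import org.apache.spark.sql.sources.EqualTo;
import org.apache.spark.sql.sources.Filter;

public class FilterVersions {
  public static void main(String[] args) {
    // V1 filter: the column is referenced by a plain string name.
    Filter v1 = new EqualTo("a", 1);

    // Roughly equivalent V2 predicate: a named expression tree built from
    // a column reference and a typed literal.
    Predicate v2 = new Predicate("=", new Expression[] {
        Expressions.column("a"), Expressions.literal(1)});

    System.out.println(v1 + " vs " + v2.describe());
  }
}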

Why are the changes needed?

V2 Filter migration

Does this PR introduce any user-facing change?

no

How was this patch tested?

New and existing test suites

@huaxingao
Contributor Author

cc @cloud-fan @viirya @beliefer

Comment on lines 73 to 74
val actualFilters = pushedFilters.map(_.toV1)
  .filterNot(_.references.contains(parsedOptions.columnNameOfCorruptRecord))
Member

Hmm, why can't we work on v2 directly here and in other places? It feels verbose and redundant to see toV1 and toV2 there.

Contributor Author

Currently OrcFilters, ParquetFilters, JacksonParser, and UnivocityParser only take v1 filters. There is actually quite a bit of work involved in refactoring these to also work with v2 filters. I'd prefer to do that in separate PRs later on.

Member

I see. Maybe we can create JIRAs for these planned changes and put the JIRA numbers into code comments.

Contributor Author

Added comments. Thanks

Contributor

@beliefer beliefer left a comment

Uh, I just want to know: why don't we refactor FileScanBuilder by replacing SupportsPushDownCatalystFilters with SupportsPushDownV2Filters?

if (children()[1] instanceof LiteralValue) {
  // e.g. a = 1
  return new EqualTo(children()[0].describe(),
    CatalystTypeConverters.convertToScala(((LiteralValue) children()[1]).value(),
Contributor

Two indents

Contributor Author

I had changed my IDE's Java continuation indent to 4 for another project. Changed it back to 2 spaces. All the indentation should be good now.

} else if (children()[0] instanceof LiteralValue) {
  // e.g. 1 = a
  return new EqualTo(children()[1].describe(),
    CatalystTypeConverters.convertToScala(((LiteralValue) children()[0]).value(),
Contributor

ditto

@@ -59,7 +59,7 @@ case class AvroScan(
      readDataSchema,
      readPartitionSchema,
      parsedOptions,
-     pushedFilters)
+     pushedFilters.map(_.toV1))
Contributor

It seems we won't need toV1 in the future?

Contributor Author

right

public org.apache.spark.sql.sources.Filter toV1() {
  String expressionStr = "";
  for (Expression e : children()) {
    expressionStr += e.describe() + ", ";
Contributor

@LuciferYang LuciferYang Apr 25, 2022

should use StringBuilder here

Contributor Author

Changed. Thanks
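
For reference, a minimal sketch of the suggested fix, factored into a standalone helper (the class and method names here are hypothetical):

import org.apache.spark.sql.connector.expressions.Expression;

final class DescribeUtil {
  // Accumulate the child descriptions with a StringBuilder instead of
  // repeated String concatenation inside the loop.
  static String describeChildren(Expression[] children) {
    StringBuilder expressionStr = new StringBuilder();
    for (Expression e : children) {
      expressionStr.append(e.describe()).append(", ");
    }
    return expressionStr.toString();
  }
}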

@huaxingao
Contributor Author

Uh, I just want to know: why don't we refactor FileScanBuilder by replacing SupportsPushDownCatalystFilters with SupportsPushDownV2Filters?

We actually intentionally push down catalyst Expressions instead of filters in the file sources, because the file sources need to do partition pruning, which uses catalyst Expressions.

@@ -146,4 +149,210 @@ public class Predicate extends GeneralScalarExpression {
  public Predicate(String name, Expression[] children) {
    super(name, children);
  }

  public org.apache.spark.sql.sources.Filter toV1() {
Contributor

We shouldn't add this public API. Can we have a private internal util function to do it?
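
A rough sketch of what such an internal helper might look like, handling only the "=" case (the class name V2FilterUtils is hypothetical, and real code would also convert the literal value, e.g. via CatalystTypeConverters as in the diff above):

import org.apache.spark.sql.connector.expressions.Expression;
import org.apache.spark.sql.connector.expressions.Literal;
import org.apache.spark.sql.connector.expressions.NamedReference;
import org.apache.spark.sql.connector.expressions.filter.Predicate;
import org.apache.spark.sql.sources.EqualTo;
import org.apache.spark.sql.sources.Filter;

// Hypothetical package-private helper: keeps the V2 -> V1 translation out of
// the public Predicate API, as suggested above.
final class V2FilterUtils {
  private V2FilterUtils() {}

  static Filter toV1(Predicate predicate) {
    Expression[] children = predicate.children();
    if ("=".equals(predicate.name())
        && children.length == 2
        && children[0] instanceof NamedReference
        && children[1] instanceof Literal) {
      return new EqualTo(children[0].describe(), ((Literal<?>) children[1]).value());
    }
    return null; // unsupported shape: caller falls back to no pushdown
  }
}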

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 23, 2022
@github-actions github-actions bot closed this Oct 24, 2022
dongjoon-hyun pushed a commit that referenced this pull request Nov 4, 2023
### What changes were proposed in this pull request?
This PR upgrades Apache Arrow from 13.0.0 to 14.0.0.

### Why are the changes needed?
The Apache Arrow 14.0.0 release brings a number of enhancements and bug fixes.

In terms of bug fixes, the release addresses several critical issues that were causing failures in integration jobs with Spark ([GH-36332](apache/arrow#36332)) and problems with importing empty data arrays ([GH-37056](apache/arrow#37056)). It also optimizes the process of appending variable-length vectors ([GH-37829](apache/arrow#37829)) and includes C++ libraries for macOS AARCH64 in Java-Jars ([GH-38076](apache/arrow#38076)).

The new features and improvements focus on enhancing the handling and manipulation of data. This includes the introduction of DefaultVectorComparators for large types ([GH-25659](apache/arrow#25659)), support for extended expressions in ScannerBuilder ([GH-34252](apache/arrow#34252)), and the exposure of the VectorAppender class ([GH-37246](apache/arrow#37246)).

The release also brings enhancements to the development and testing process, with the CI environment now using JDK 21 ([GH-36994](apache/arrow#36994)). In addition, the release introduces vector validation consistent with C++, ensuring consistency across different languages ([GH-37702](apache/arrow#37702)).

Furthermore, the usability of VarChar writers and binary writers has been improved with the addition of extra input methods ([GH-37705](apache/arrow#37705)), and VarCharWriter now supports writing from `Text` and `String` ([GH-37706](apache/arrow#37706)). The release also adds typed getters for StructVector, improving the ease of accessing data ([GH-37863](apache/arrow#37863)).

The full release notes are as follows:
- https://arrow.apache.org/release/14.0.0.html

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43650 from LuciferYang/arrow-14.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>