KE-35827 new second storage sql pushdown (apache#453)
* [SPARK-36556][SQL] Add DSV2 filters

Co-Authored-By: DB Tsai <d_tsai@apple.com>
Co-Authored-By: Huaxin Gao <huaxin_gao@apple.com>

### What changes were proposed in this pull request?
Add DSV2 Filters and use these in V2 codepath.

### Why are the changes needed?
The motivation of adding DSV2 filters:
1. The values in V1 filters are Scala types. When translating a catalyst `Expression` to a V1 filter, we have to call `convertToScala` to convert from the Catalyst types used internally in rows to standard Scala types, and later convert the Scala types back to Catalyst types. This is very inefficient. In V2 filters, we use `Expression` for filter values, so the conversions from Catalyst types to Scala types and back are avoided (see the sketch after this list).
2. Improve nested column filter support.
3. Make the filters work better with the rest of the DSV2 APIs.
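
A rough illustration of point 1, assuming the V2 filter classes follow the shape described above (the V2 `GreaterThan` constructor is shown only as a comment and is hypothetical):

```scala
import org.apache.spark.sql.sources.{GreaterThan => V1GreaterThan}
import org.apache.spark.sql.connector.expressions.{FieldReference, LiteralValue}
import org.apache.spark.sql.types.IntegerType

// V1: the filter value is `Any` (a Scala type), so Catalyst values must be
// converted to Scala types and later converted back.
val v1Filter = V1GreaterThan("age", 18)

// V2: the value keeps its Catalyst representation together with its data type,
// so no Catalyst <-> Scala conversion is needed.
val v2Column = FieldReference("age")
val v2Value  = LiteralValue(18, IntegerType)
// e.g. new GreaterThan(v2Column, v2Value) in the V2 filter API added by this PR
```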

### Does this PR introduce _any_ user-facing change?
Yes. The new V2 filters

### How was this patch tested?
new test

Closes #33803 from huaxingao/filter.

Lead-authored-by: Huaxin Gao <[email protected]>
Co-authored-by: DB Tsai <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>

* [SPARK-36760][SQL] Add interface SupportsPushDownV2Filters

Co-Authored-By: DB Tsai <d_tsai@apple.com>
Co-Authored-By: Huaxin Gao <huaxin_gao@apple.com>
### What changes were proposed in this pull request?
This is the 2nd PR for V2 Filter support. This PR does the following:

- Add interface SupportsPushDownV2Filters

Future work:
- refactor `OrcFilters`, `ParquetFilters`, `JacksonParser`, `UnivocityParser` so both V1 file source and V2 file source can use them
- For the V2 file source: implement v2 filter -> parquet/orc filter. CSV and JSON don't have real filters, but we also need to change the current code to have v2 filter -> `JacksonParser`/`UnivocityParser`
- For V1 file source, keep what we currently have: v1 filter -> parquet/orc filter
- We don't need v1filter.toV2 and v2filter.toV1 since we have two separate paths

The reasons that we have reached the above conclusion:
- The major motivation to implement V2Filter is to eliminate the unnecessary conversion between Catalyst types and Scala types when using Filters.
- We provide this `SupportsPushDownV2Filters` in this PR so V2 data source (e.g. iceberg) can implement it and use V2 Filters
- There is a lot of work to implement v2 filters in the V2 file sources, for the following reasons:

possible approaches for implementing V2Filter:
1. keep what we have for file source v1: v1 filter -> parquet/orc filter
    file source v2 we will implement v2 filter -> parquet/orc filter
    We don't need v1->v2 and v2->v1
    problem with this approach: there is a lot of code duplication

2.  We will implement v2 filter -> parquet/orc filter
     file source v1: v1 filter -> v2 filter -> parquet/orc filter
     We will need V1 -> V2
     This is the approach I am using in https://github.com/apache/spark/pull/33973
     In that PR, I have
     v2 orc: v2 filter -> orc filter
     V1 orc: v1 -> v2 -> orc filter

     v2 csv: v2->v1, new UnivocityParser
     v1 csv: new UnivocityParser

    v2 Json: v2->v1, new JacksonParser
    v1 Json: new JacksonParser

    csv and Json don't have real filters; they just use filter references, so it should be OK to use either v1 or v2. It's easier to use
    v1 because nothing needs to change.

    I haven't finished parquet yet. The PR doesn't have the parquet V2Filter implementation, but I plan to have
    v2 parquet: v2 filter -> parquet filter
    v1 parquet: v1 -> v2 -> parquet filter

    Problem with this approach:
    1. It's not easy to implement V1 -> V2 because V2 filters have `LiteralValue` and need type info. We have already lost the type information when converting the Expression filter to a v1 filter.
    2. parquet is OK
        Use Timestamp as example, parquet filter takes long for timestamp
        v2 parquet: v2 filter -> parquet filter
       timestamp
       Expression (Long) -> v2 filter (LiteralValue  Long)-> parquet filter (Long)

       V1 parquet: v1 -> v2 -> parquet filter
       timestamp
       Expression (Long) -> v1 filter (timestamp) -> v2 filter (LiteralValue  Long)-> parquet filter (Long)

       but we have a problem for orc because the orc filter takes java Timestamp
       v2 orc: v2 filter -> orc filter
       timestamp
       Expression (Long) -> v2 filter (LiteralValue Long) -> orc filter (Timestamp)

       V1 orc: v1 -> v2 -> orc filter
       Expression (Long) -> v1 filter (timestamp) -> v2 filter (LiteralValue Long) -> orc filter (Timestamp)
      This defeats the purpose of implementing v2 filters.
3.  keep what we have for file source v1: v1 filter -> parquet/orc filter
     file source v2: v2 filter -> v1 filter -> parquet/orc filter
     We will need V2 -> V1
     we have a similar problem to approach 2.

So the conclusion is: approach 1 (keep what we have for file source v1: v1 filter -> parquet/orc filter; for file source v2, implement v2 filter -> parquet/orc filter) is better, but there is a lot of code duplication. We will need to refactor `OrcFilters`, `ParquetFilters`, `JacksonParser` and `UnivocityParser` so that both the V1 file source and the V2 file source can use them.

### Why are the changes needed?
Use V2Filters to eliminate the unnecessary conversion between Catalyst types and Scala types.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Added new UT

Closes #34001 from huaxingao/v2filter.

Lead-authored-by: Huaxin Gao <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-37020][SQL] DS V2 LIMIT push down

### What changes were proposed in this pull request?
Push down limit to data source for better performance

### Why are the changes needed?
For LIMIT, e.g. `SELECT * FROM table LIMIT 10`, Spark retrieves all the data from the table and then returns 10 rows. If we can push the LIMIT down to the data source side, the data transferred to Spark will be dramatically reduced.

### Does this PR introduce _any_ user-facing change?
Yes. new interface `SupportsPushDownLimit`
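
A minimal sketch (not the actual JDBC implementation) of a `ScanBuilder` that accepts a pushed LIMIT through the new interface; the class and field names are made up:

```scala
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownLimit}

class MyScanBuilder extends ScanBuilder with SupportsPushDownLimit {
  private var pushedLimit: Int = -1

  // Returning true tells Spark the LIMIT has been pushed to the source,
  // e.g. a JDBC implementation can append "LIMIT <n>" to the generated query.
  override def pushLimit(limit: Int): Boolean = {
    pushedLimit = limit
    true
  }

  override def build(): Scan = ??? // build a Scan that honors pushedLimit
}
```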

### How was this patch tested?
new test

Closes #34291 from huaxingao/pushdownLimit.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Huaxin Gao <[email protected]>

* [SPARK-37038][SQL] DSV2 Sample Push Down

### What changes were proposed in this pull request?

Push down Sample to the data source for better performance. If Sample is pushed down, it will be removed from the logical plan, so it will no longer be applied by Spark.

Current Plan without Sample push down:
```
== Parsed Logical Plan ==
'Project [*]
+- 'Sample 0.0, 0.8, false, 157
   +- 'UnresolvedRelation [postgresql, new_table], [], false

== Analyzed Logical Plan ==
col1: int, col2: int
Project [col1#163, col2#164]
+- Sample 0.0, 0.8, false, 157
   +- SubqueryAlias postgresql.new_table
      +- RelationV2[col1#163, col2#164] new_table

== Optimized Logical Plan ==
Sample 0.0, 0.8, false, 157
+- RelationV2[col1#163, col2#164] new_table

== Physical Plan ==
*(1) Sample 0.0, 0.8, false, 157
+- *(1) Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$16dde4769 [col1#163,col2#164] PushedAggregates: [], PushedFilters: [], PushedGroupby: [],  ReadSchema: struct<col1:int,col2:int>
```
after Sample push down:
```
== Parsed Logical Plan ==
'Project [*]
+- 'Sample 0.0, 0.8, false, 187
   +- 'UnresolvedRelation [postgresql, new_table], [], false

== Analyzed Logical Plan ==
col1: int, col2: int
Project [col1#163, col2#164]
+- Sample 0.0, 0.8, false, 187
   +- SubqueryAlias postgresql.new_table
      +- RelationV2[col1#163, col2#164] new_table

== Optimized Logical Plan ==
RelationV2[col1#163, col2#164] new_table

== Physical Plan ==
*(1) Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$165b57543 [col1#163,col2#164] PushedAggregates: [], PushedFilters: [], PushedGroupby: [], PushedSample: TABLESAMPLE  0.0 0.8 false 187, ReadSchema: struct<col1:int,col2:int>
```
The new interface is implemented for JDBC as a POC and for end-to-end testing. TABLESAMPLE is not supported by all databases; this PR implements it for PostgreSQL.

### Why are the changes needed?
Reduce IO and improve performance. For SAMPLE, e.g. `SELECT * FROM t TABLESAMPLE (1 PERCENT)`, Spark retrieves all the data from the table and then returns 1% of the rows. If we can push Sample down to the data source side, it will dramatically reduce the transferred data size and improve performance.

### Does this PR introduce any user-facing change?
Yes. new interface `SupportsPushDownTableSample`
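
A sketch of a `ScanBuilder` accepting a pushed TABLESAMPLE through the new interface (the class name is made up; the JDBC/PostgreSQL POC renders the pushed sample as a `TABLESAMPLE` clause):

```scala
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownTableSample}

class SamplingScanBuilder extends ScanBuilder with SupportsPushDownTableSample {
  private var sample: Option[(Double, Double, Boolean, Long)] = None

  override def pushTableSample(
      lowerBound: Double,
      upperBound: Double,
      withReplacement: Boolean,
      seed: Long): Boolean = {
    sample = Some((lowerBound, upperBound, withReplacement, seed))
    true // the source will sample, so Spark removes its own Sample node
  }

  override def build(): Scan = ??? // produce a Scan that applies `sample`
}
```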

### How was this patch tested?
New test

Closes #34451 from huaxingao/sample.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-37286][SQL] Move compileAggregates from JDBCRDD to JdbcDialect

### What changes were proposed in this pull request?
Currently, the method `compileAggregates` is a member of `JDBCRDD`. This is not reasonable, because the JDBC source knows best how to compile aggregate expressions into its own dialect.

### Why are the changes needed?
The JDBC source knows best how to compile aggregate expressions into its own dialect.
After this PR, we can extend the pushdown (e.g. aggregates) based on the different dialects of different JDBC databases.

For example, database A and database B may implement different subsets of the SQL-standard aggregate functions.

### Does this PR introduce _any_ user-facing change?
'No'. Just change the inner implementation.

### How was this patch tested?
Jenkins tests.

Closes #34554 from beliefer/SPARK-37286.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-37286][DOCS][FOLLOWUP] Fix the wrong parameter name for Javadoc

### What changes were proposed in this pull request?

This PR fixes an issue that the Javadoc generation fails due to the wrong parameter name of a method added in SPARK-37286 (#34554).
https://github.com/apache/spark/runs/4409267346?check_suite_focus=true#step:9:5081

### Why are the changes needed?

To keep the build clean.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA itself.

Closes #34801 from sarutak/followup-SPARK-37286.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Sean Owen <[email protected]>

* [SPARK-37262][SQL] Don't log empty aggregate and group by in JDBCScan

### What changes were proposed in this pull request?
Currently, the empty pushed aggregate and pushed group by are logged in Explain for JDBCScan
```
Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$172e75786 [NAME#1,SALARY#2] PushedAggregates: [], PushedFilters: [IsNotNull(SALARY), GreaterThan(SALARY,100.00)], PushedGroupby: [], ReadSchema: struct<NAME:string,SALARY:decimal(20,2)>
```

After the fix, the JDBCScan will be
```
Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$172e75786 [NAME#1,SALARY#2] PushedFilters: [IsNotNull(SALARY), GreaterThan(SALARY,100.00)], ReadSchema: struct<NAME:string,SALARY:decimal(20,2)>
```

### Why are the changes needed?
address this comment https://github.com/apache/spark/pull/34451#discussion_r740220800

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
existing tests

Closes #34540 from huaxingao/aggExplain.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-37483][SQL] Support push down top N to JDBC data source V2

### What changes were proposed in this pull request?
Currently, Spark supports pushing down LIMIT to the data source.
However, in users' scenarios a LIMIT usually comes with an ORDER BY, because limit and order by are much more valuable together.

On the other hand, pushing down top N (the same as ORDER BY ... LIMIT N) returns data to Spark in a basic order, so Spark's own sort may also see some performance improvement. A sketch of such a top-N hook follows.
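
A rough sketch, assuming a `SupportsPushDownTopN`-style interface with `pushTopN(orders, limit)` (the builder class name is made up):

```scala
import org.apache.spark.sql.connector.expressions.SortOrder
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownTopN}

class TopNScanBuilder extends ScanBuilder with SupportsPushDownTopN {
  private var pushedOrders: Array[SortOrder] = Array.empty
  private var pushedLimit: Int = -1

  // The source receives both the sort order and the limit, so a JDBC implementation
  // can generate "SELECT ... ORDER BY <orders> LIMIT <limit>".
  override def pushTopN(orders: Array[SortOrder], limit: Int): Boolean = {
    pushedOrders = orders
    pushedLimit = limit
    true
  }

  override def build(): Scan = ??? // produce a Scan that applies the pushed top N
}
```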

### Why are the changes needed?
1. Pushing down top N is very useful in user scenarios.
2. Pushing down top N can improve the performance of the sort.

### Does this PR introduce _any_ user-facing change?
'No'. Just changes the physical execution.

### How was this patch tested?
New tests.

Closes #34918 from beliefer/SPARK-37483.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-37644][SQL] Support datasource v2 complete aggregate pushdown

### What changes were proposed in this pull request?
Currently, Spark pushes down aggregates with a partial aggregate and a final aggregate. For some data sources (e.g. JDBC), we can avoid the partial and final aggregates by running the aggregation completely in the database.

### Why are the changes needed?
Improve performance for aggregate pushdown.

### Does this PR introduce _any_ user-facing change?
'No'. Just changes the inner implementation.

### How was this patch tested?
New tests.

Closes #34904 from beliefer/SPARK-37644.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-37627][SQL] Add sorted column in BucketTransform

### What changes were proposed in this pull request?
In V1, we can create table with sorted bucket like the following:
```
      sql("CREATE TABLE tbl(a INT, b INT) USING parquet " +
        "CLUSTERED BY (a) SORTED BY (b) INTO 5 BUCKETS")
```
However, creating a table with sorted buckets in V2 fails with the exception
`org.apache.spark.sql.AnalysisException: Cannot convert bucketing with sort columns to a transform.`

### Why are the changes needed?
This PR adds sorted columns to `BucketTransform` so we can create tables with sorted buckets in V2.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
new UT

Closes #34879 from huaxingao/sortedBucket.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-37789][SQL] Add a class to represent general aggregate functions in DS V2

### What changes were proposed in this pull request?

There are a lot of aggregate functions in SQL and it's a lot of work to add them one by one in the DS v2 API. This PR proposes to add a new `GeneralAggregateFunc` class to represent all the general SQL aggregate functions. Since it's general, Spark doesn't know its aggregation buffer and can only push down the aggregation to the source completely.

As an example, this PR also translates `AVG` to `GeneralAggregateFunc` and pushes it to JDBC V2.

### Why are the changes needed?

To add aggregate functions in DS v2 easier.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

JDBC v2 test

Closes #35070 from cloud-fan/agg.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-37644][SQL][FOLLOWUP] When partition column is same as group by key, pushing down aggregate completely

### What changes were proposed in this pull request?
When the JDBC option "partitionColumn" is specified and it is the same as the group-by key, the aggregate push-down should be complete.

### Why are the changes needed?
Improve the datasource v2 complete aggregate pushdown.

### Does this PR introduce _any_ user-facing change?
'No'. Just changes the inner implementation.

### How was this patch tested?
New tests.

Closes #35052 from beliefer/SPARK-37644-followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-37527][SQL] Translate more standard aggregate functions for pushdown

### What changes were proposed in this pull request?
Currently, Spark aggregate pushdown only translates some standard aggregate functions, so that dialects can compile these functions for a specific database.
After this change, users can override `JdbcDialect.compileAggregate` to compile the standard aggregate functions supported by their database.
This PR translates the ANSI-standard aggregate functions listed below. Mainstream database support for these functions is shown in the table:
| Name | ClickHouse | Presto | Teradata | Snowflake | Oracle | Postgresql | Vertica | MySQL | RedShift | ElasticSearch | Impala | Druid | SyBase | DB2 | H2 | Exasol | Mariadb | Phoenix | Yellowbrick | Singlestore | Influxdata | Dolphindb | Intersystems |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| `VAR_POP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes |
| `VAR_SAMP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No |  Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes |
| `STDDEV_POP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `STDDEV_SAMP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No |  Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| `COVAR_POP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | No |  Yes | Yes | No | No | No | No | Yes | Yes | No |
| `COVAR_SAMP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | No |  Yes | Yes | No | No | No | No | No | No | No |
| `CORR` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | No |  Yes | Yes | No | No | No | No | No | Yes | No |

Because some aggregate functions are converted by the optimizer as shown below, this PR does not need to handle them.

|Input|Parsed|Optimized|
|------|--------------------|----------|
|`Every`| `aggregate.BoolAnd` |`Min`|
|`Any`| `aggregate.BoolOr` |`Max`|
|`Some`| `aggregate.BoolOr` |`Max`|

### Why are the changes needed?
Let `*Dialect` implementations extend the supported aggregate functions by overriding `JdbcDialect.compileAggregate` (sketched below).
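
A sketch of such an override, following the pattern of the built-in dialects and assuming the `GeneralAggregateFunc` representation (name / isDistinct / inputs) added earlier in this series; the dialect name and URL prefix are made up:

```scala
import org.apache.spark.sql.connector.expressions.aggregate.{AggregateFunc, GeneralAggregateFunc}
import org.apache.spark.sql.jdbc.JdbcDialect

case object MyDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mydb")

  // Fall back to the default translation first, then render VAR_POP ourselves.
  override def compileAggregate(aggFunction: AggregateFunc): Option[String] =
    super.compileAggregate(aggFunction).orElse {
      aggFunction match {
        case f: GeneralAggregateFunc if f.name() == "VAR_POP" && f.inputs().length == 1 =>
          val distinct = if (f.isDistinct) "DISTINCT " else ""
          Some(s"VAR_POP($distinct${f.inputs().head})")
        case _ => None
      }
    }
}
```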

### Does this PR introduce _any_ user-facing change?
Yes. Users could pushdown more aggregate functions.

### How was this patch tested?
Existing tests.

Closes #35101 from beliefer/SPARK-37527-new2.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Huaxin Gao <[email protected]>

* [SPARK-37734][SQL][TESTS] Upgrade h2 from 1.4.195 to 2.0.204

### What changes were proposed in this pull request?
This PR aims to upgrade `com.h2database` from 1.4.195 to 2.0.202

### Why are the changes needed?
Fix one vulnerability, ref: https://www.tenable.com/cve/CVE-2021-23463

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #35013 from beliefer/SPARK-37734.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-37527][SQL] Compile `COVAR_POP`, `COVAR_SAMP` and `CORR` in `H2Dialet`

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/35101 translated `COVAR_POP`, `COVAR_SAMP` and `CORR`, but the older H2 version could not support them.

After https://github.com/apache/spark/pull/35013, we can compile these three aggregate functions in `H2Dialect` now.

### Why are the changes needed?
Supplement the implementation of `H2Dialect`.

### Does this PR introduce _any_ user-facing change?
'Yes'. Spark can completely push down `COVAR_POP`, `COVAR_SAMP` and `CORR` into H2.

### How was this patch tested?
Test updated.

Closes #35145 from beliefer/SPARK-37527_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-37839][SQL] DS V2 supports partial aggregate push-down `AVG`

### What changes were proposed in this pull request?
`max`, `min`, `count`, `sum` and `avg` are the most commonly used aggregate functions.
Currently, DS V2 supports complete aggregate push-down of `avg`, but supporting partial aggregate push-down of `avg` is also very useful.

The aggregate push-down algorithm is:

1. Spark translates group expressions of `Aggregate` to DS V2 `Aggregation`.
2. Spark calls `supportCompletePushDown` to check if it can completely push down aggregate.
3. If `supportCompletePushDown` returns true, we preserve the aggregate expressions as the final aggregate expressions. Otherwise, we split `AVG` into 2 functions: `SUM` and `COUNT` (see the sketch after this list).
4. Spark translates final aggregate expressions and group expressions of `Aggregate` to DS V2 `Aggregation` again, and pushes the `Aggregation` to JDBC source.
5. Spark constructs the final aggregate.
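
A worked illustration of step 3; the table and column names are hypothetical:

```scala
// Original query:
//   SELECT dept, AVG(salary) FROM employee GROUP BY dept
// If the source cannot completely push down AVG, Spark pushes the split functions instead:
//   SELECT dept, SUM(salary), COUNT(salary) FROM employee GROUP BY dept
// and finishes the average itself as sum(partial SUM) / sum(partial COUNT) per dept.
val df = spark.sql("SELECT dept, AVG(salary) FROM employee GROUP BY dept")
// With aggregate pushdown enabled, the pushed SUM/COUNT should show up in the scan's
// pushed-aggregates info in the explain output.
df.explain()
```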

### Why are the changes needed?
DS V2 supports partial aggregate push-down `AVG`

### Does this PR introduce _any_ user-facing change?
'Yes'. DS V2 can partially push down `AVG`.

### How was this patch tested?
New tests.

Closes #35130 from beliefer/SPARK-37839.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-36526][SQL] DSV2 Index Support: Add supportsIndex interface

### What changes were proposed in this pull request?
Indexes are database objects created on one or more columns of a table. Indexes are used to improve query performance. A detailed explanation of database index is here https://en.wikipedia.org/wiki/Database_index

 This PR adds `supportsIndex` interface that provides APIs to work with indexes.

### Why are the changes needed?
Many data sources support indexes to improve query performance. In order to take advantage of index support in the data source, this `supportsIndex` interface is added to let users create/drop an index, list indexes, etc.

### Does this PR introduce _any_ user-facing change?
yes, the following new APIs are added:

- createIndex
- dropIndex
- indexExists
- listIndexes

New SQL syntax:
```

CREATE [index_type] INDEX [index_name] ON [TABLE] table_name (column_index_property_list)[OPTIONS indexPropertyList]

    column_index_property_list: column_name [OPTIONS(indexPropertyList)]  [ ,  . . . ]
    indexPropertyList: index_property_name = index_property_value [ ,  . . . ]

DROP INDEX index_name

```
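
Hypothetical concrete forms of the proposed syntax (all names are made up). This PR adds only the interface, so actually running such statements depends on the follow-up SQL and data source work:

```scala
// Example statements following the grammar above (kept as strings; not executed here).
val createIndexSql =
  "CREATE INDEX people_idx ON TABLE people (name OPTIONS (indexType = 'BTREE'))"
val dropIndexSql = "DROP INDEX people_idx"
// Once a source implements the interface, such statements map to its
// createIndex(...), dropIndex(...), indexExists(...) and listIndexes(...) APIs.
```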
### How was this patch tested?
Only the interface is added for now. Tests will be added when doing the implementation.

Closes #33754 from huaxingao/index_interface.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-36913][SQL] Implement createIndex and IndexExists in DS V2 JDBC (MySQL dialect)

### What changes were proposed in this pull request?
Implementing `createIndex`/`IndexExists` in DS V2 JDBC

### Why are the changes needed?
This is a subtask of the V2 Index support. I am implementing index support for DS V2 JDBC so we can have a POC and end-to-end testing. This PR implements `createIndex` and `IndexExists`. The next PR will implement `listIndexes` and `dropIndex`. I intentionally keep the PR small so it's easier to review.

Indexes are not supported by the H2 database, and CREATE/DROP INDEX are not standard SQL syntax. This PR only implements `createIndex` and `IndexExists` in the `MySQL` dialect.

### Does this PR introduce _any_ user-facing change?
Yes, `createIndex`/`IndexExist` in DS V2 JDBC

### How was this patch tested?
new test

Closes #34164 from huaxingao/createIndexJDBC.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>

* [SPARK-36914][SQL] Implement dropIndex and listIndexes in JDBC (MySQL dialect)

### What changes were proposed in this pull request?
This PR implements `dropIndex` and `listIndexes` in MySQL dialect

### Why are the changes needed?
As a subtask of the V2 Index support, this PR completes the implementation for JDBC V2 index support.

### Does this PR introduce _any_ user-facing change?
Yes, `dropIndex/listIndexes` in DS V2 JDBC

### How was this patch tested?
new tests

Closes #34236 from huaxingao/listIndexJDBC.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-37343][SQL] Implement createIndex, IndexExists and dropIndex in JDBC (Postgres dialect)

### What changes were proposed in this pull request?
Implementing `createIndex`/`IndexExists`/`dropIndex` in DS V2 JDBC for Postgres dialect.

### Why are the changes needed?
This is a subtask of the V2 Index support. This PR implements `createIndex`, `IndexExists` and `dropIndex`. After the review of the changes in this PR, I will create a new PR for `listIndexes`, or add it to this PR.

This PR only implements `createIndex`, `IndexExists` and `dropIndex` in Postgres dialect.

### Does this PR introduce _any_ user-facing change?
Yes, `createIndex`/`IndexExists`/`dropIndex` in DS V2 JDBC

### How was this patch tested?
New test.

Closes #34673 from dchvn/Dsv2_index_postgres.

Authored-by: dch nguyen <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-37867][SQL] Compile aggregate functions of build-in JDBC dialect

### What changes were proposed in this pull request?
DS V2 translates a lot of standard aggregate functions.
Currently, only `H2Dialect` compiles these standard aggregate functions. This PR compiles them for the other built-in JDBC dialects.

### Why are the changes needed?
Make the built-in JDBC dialects support complete aggregate push-down.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users can use complete aggregate push-down with the built-in JDBC dialects.

### How was this patch tested?
New tests.

Closes #35166 from beliefer/SPARK-37867.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-37929][SQL][FOLLOWUP] Support cascade mode for JDBC V2

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/35246 added a `cascade` mode to the dropNamespace API.
This PR follows up https://github.com/apache/spark/pull/35246 to make JDBC V2 respect `cascade`.

### Why are the changes needed?
Let JDBC V2 respect `cascade`.

### Does this PR introduce _any_ user-facing change?
Yes.
Users could manipulate `drop namespace` with `cascade` on JDBC V2.

### How was this patch tested?
New tests.

Closes #35271 from beliefer/SPARK-37929-followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-38035][SQL] Add docker tests for build-in JDBC dialect

### What changes were proposed in this pull request?
Currently, Spark only has `PostgresNamespaceSuite` to test DS V2 namespaces in a Docker environment.
Tests for the other built-in JDBC dialects (e.g. Oracle, MySQL) are missing.

This PR also found some compatibility issues. For example, the JDBC API `conn.getMetaData.getSchemas` works badly for MySQL.

### Why are the changes needed?
We need to add tests for the other built-in JDBC dialects.

### Does this PR introduce _any_ user-facing change?
'No'. Just adds developer-facing tests.

### How was this patch tested?
New tests.

Closes #35333 from beliefer/SPARK-38035.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-38054][SQL] Supports list namespaces in JDBC v2 MySQL dialect

### What changes were proposed in this pull request?
Currently, `JDBCTableCatalog.scala` queries namespaces as shown below.
```
      val schemaBuilder = ArrayBuilder.make[Array[String]]
      val rs = conn.getMetaData.getSchemas()
      while (rs.next()) {
        schemaBuilder += Array(rs.getString(1))
      }
      schemaBuilder.result
```

But this code cannot get any information when using the MySQL JDBC driver.
This PR uses `SHOW SCHEMAS` to query the namespaces of MySQL.
This PR also fixes the other issues below:

- Release the docker tests in `MySQLNamespaceSuite.scala`.
- Because MySQL doesn't support creating a comment on a schema, throw `SQLFeatureNotSupportedException`.
- Because MySQL doesn't support `DROP SCHEMA` in `RESTRICT` mode, throw `SQLFeatureNotSupportedException`.
- Refactor `JdbcUtils.executeQuery` to avoid `java.sql.SQLException: Operation not allowed after ResultSet closed`.

### Why are the changes needed?
Make the MySQL dialect support querying namespaces.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Some API changed.

### How was this patch tested?
New tests.

Closes #35355 from beliefer/SPARK-38054.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-36351][SQL] Refactor filter push down in file source v2

### What changes were proposed in this pull request?

Currently in `V2ScanRelationPushDown`, we push the filters (partition filters + data filters) to the file source, then pass all the filters (partition filters + data filters) as post-scan filters to the v2 `Scan`, and later, in `PruneFileSourcePartitions`, we separate the partition filters from the data filters and set them, as `Expression`s, on the file source.

Changes in this PR:
When we push filters to file sources in `V2ScanRelationPushDown`, since we already have the information about the partition columns, we want to separate the partition filters from the data filters there.

The benefit of doing this:
- we can handle all the filter related work for v2 file source at one place instead of two (`V2ScanRelationPushDown` and `PruneFileSourcePartitions`), so the code will be cleaner and easier to maintain.
- we actually have to separate partition filters and data filters at `V2ScanRelationPushDown`, otherwise, there is no way to find out which filters are partition filters, and we can't push down aggregate for parquet even if we only have partition filter.
- By separating the filters early at `V2ScanRelationPushDown`, we only need to check the data filters to find out which ones need to be converted to data source filters (e.g. Parquet predicates, ORC predicates) and pushed down to the file source; right now we are checking all the filters (both partition filters and data filters).
- Similarly, we can only pass data filters as post scan filters to v2 Scan, because partition filters are used for partition pruning only, no need to pass them as post scan filters.

In order to do this, we will have the following changes

-  add `pushFilters` in file source v2. In this method:
    - push both Expression partition filter and Expression data filter to file source. Have to use Expression filters because we need these for partition pruning.
    - data filters are used for filter push down. If the file source needs to push down data filters, it translates the data filters from `Expression` to `sources.Filter`, and then decides which filters to push down.
    - partition filters are used for partition pruning.
- file source v2 no longer needs to implement `SupportsPushDownFilters`, because when we separate the two types of filters we have already set them on the file data sources. It would be redundant to use `SupportsPushDownFilters` to set the filters again on the file data sources.

### Why are the changes needed?

see section one

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #33650 from huaxingao/partition_filter.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>

* [SPARK-36645][SQL] Aggregate (Min/Max/Count) push down for Parquet

### What changes were proposed in this pull request?
Push down Min/Max/Count to Parquet with the following restrictions:

- nested types such as Array, Map or Struct will not be pushed down
- Timestamp not pushed down because INT96 sort order is undefined, Parquet doesn't return statistics for INT96
- If the aggregate column is a partition column, only Count will be pushed down; Min or Max will not be pushed down, because Parquet doesn't return max/min statistics for partition columns.
- If somehow the file doesn't have stats for the aggregate columns, Spark will throw an exception.
- Currently, if filter/GROUP BY is involved, Min/Max/Count will not be pushed down, but the restriction will be lifted if the filter or GROUP BY is on partition column (https://issues.apache.org/jira/browse/SPARK-36646 and https://issues.apache.org/jira/browse/SPARK-36647)

### Why are the changes needed?
Since parquet has the statistics information for min, max and count, we want to take advantage of this info and push down Min/Max/Count to parquet layer for better performance.

### Does this PR introduce _any_ user-facing change?
Yes, `SQLConf.PARQUET_AGGREGATE_PUSHDOWN_ENABLED` was added. If set to true, Min/Max/Count will be pushed down to Parquet.
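
A usage sketch; the SQL config key is assumed to be `spark.sql.parquet.aggregatePushdown` (backing `SQLConf.PARQUET_AGGREGATE_PUSHDOWN_ENABLED`), and the path, view and column names are examples:

```scala
spark.conf.set("spark.sql.parquet.aggregatePushdown", "true")
spark.read.parquet("/data/events").createOrReplaceTempView("events")
// Subject to the restrictions above (no filter/GROUP BY, no nested or timestamp columns,
// stats present in the files), MIN/MAX/COUNT are answered from Parquet footer statistics
// instead of scanning the row data.
spark.sql("SELECT MIN(id), MAX(id), COUNT(*) FROM events").show()
```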

### How was this patch tested?
new test suites

Closes #33639 from huaxingao/parquet_agg.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>

* [SPARK-34960][SQL] Aggregate push down for ORC

### What changes were proposed in this pull request?

This PR is to add aggregate push down feature for ORC data source v2 reader.

At a high level, the PR does:

* The supported aggregate expression is MIN/MAX/COUNT same as [Parquet aggregate push down](https://github.com/apache/spark/pull/33639).
* BooleanType, ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, DateType are allowed in MIN/MAX aggregate push down. All other column types are not allowed in MIN/MAX aggregate push down.
* All columns types are supported in COUNT aggregate push down.
* Nested column's sub-fields are disallowed in aggregate push down.
* If the file does not have valid statistics, Spark will throw exception and fail query.
* If aggregate has filter or group-by column, aggregate will not be pushed down.

At code level, the PR does:
* `OrcScanBuilder`: `pushAggregation()` checks whether the aggregation can be pushed down. The most checking logic is shared between Parquet and ORC, extracted into `AggregatePushDownUtils.getSchemaForPushedAggregation()`. `OrcScanBuilder` will create a `OrcScan` with aggregation and aggregation data schema.
* `OrcScan`: `createReaderFactory` creates an ORC reader factory with aggregation and schema. Similar change to `ParquetScan`.
* `OrcPartitionReaderFactory`: `buildReaderWithAggregates` creates an ORC reader with aggregate push down (i.e. it reads the ORC file footer to process column statistics, instead of reading the actual data in the file). `buildColumnarReaderWithAggregates` creates a columnar ORC reader similarly. Both delegate the real footer-reading work to `OrcUtils.createAggInternalRowFromFooter`.
* `OrcUtils.createAggInternalRowFromFooter`: reads ORC file footer to process columns statistics (real heavy lift happens here). Similar to `ParquetUtils.createAggInternalRowFromFooter`. Leverage utility method such as `OrcFooterReader.readStatistics`.
* `OrcFooterReader`: `readStatistics` reads the ORC `ColumnStatistics[]` into Spark's `OrcColumnStatistics`. The transformation is needed here because ORC `ColumnStatistics[]` stores all column statistics in a flat array, which is hard to process, while Spark's `OrcColumnStatistics` stores the statistics in a nested tree structure (e.g. like `StructType`). This is used by `OrcUtils.createAggInternalRowFromFooter`.
* `OrcColumnStatistics`: the easy-to-manipulate structure for ORC `ColumnStatistics`. This is used by `OrcFooterReader.readStatistics`.

### Why are the changes needed?

To improve the performance of query with aggregate.

### Does this PR introduce _any_ user-facing change?

Yes. A user-facing config `spark.sql.orc.aggregatePushdown` is added to control enabling/disabling the aggregate push down for ORC. By default the feature is disabled.

### How was this patch tested?

Added unit test in `FileSourceAggregatePushDownSuite.scala`. Refactored all unit tests in https://github.com/apache/spark/pull/33639, and it now works for both Parquet and ORC.

Closes #34298 from c21/orc-agg.

Authored-by: Cheng Su <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>

* [SPARK-37960][SQL] A new framework to represent catalyst expressions in DS v2 APIs

### What changes were proposed in this pull request?
This PR provides a new framework to represent catalyst expressions in DS v2 APIs.
`GeneralSQLExpression` is a general SQL expression to represent catalyst expression in DS v2 API.
`ExpressionSQLBuilder` is a builder to generate `GeneralSQLExpression` from catalyst expressions.
`CASE ... WHEN ... ELSE ... END` is just the first use case.

This PR also supports aggregate push down with `CASE ... WHEN ... ELSE ... END`.

### Why are the changes needed?
Support aggregate push down with `CASE ... WHEN ... ELSE ... END`.

### Does this PR introduce _any_ user-facing change?
Yes. Users could use `CASE ... WHEN ... ELSE ... END` with aggregate push down.

### How was this patch tested?
New tests.

Closes #35248 from beliefer/SPARK-37960.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-37867][SQL][FOLLOWUP] Compile aggregate functions for build-in DB2 dialect

### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/35166.
The previously referenced DB2 documentation was incorrect, which resulted in some aggregate functions not being compiled.

The correct documentation is https://www.ibm.com/docs/en/db2/11.5?topic=af-regression-functions-regr-avgx-regr-avgy-regr-count

### Why are the changes needed?
Make the built-in DB2 dialect support complete aggregate push-down for more aggregate functions.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users can use complete aggregate push-down with the built-in DB2 dialect.

### How was this patch tested?
New tests.

Closes #35520 from beliefer/SPARK-37867_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-36568][SQL] Better FileScan statistics estimation

### What changes were proposed in this pull request?
This PR modifies `FileScan.estimateStatistics()` to take the read schema into account.

### Why are the changes needed?
`V2ScanRelationPushDown` can column prune `DataSourceV2ScanRelation`s and change read schema of `Scan` operations. The better statistics returned by `FileScan.estimateStatistics()` can mean better query plans. For example, with this change the broadcast issue in SPARK-36568 can be avoided.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new UT.

Closes #33825 from peter-toth/SPARK-36568-scan-statistics-estimation.

Authored-by: Peter Toth <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-37929][SQL] Support cascade mode for `dropNamespace` API

### What changes were proposed in this pull request?
This PR adds a new API `dropNamespace(String[] ns, boolean cascade)` to replace the existing one, adding a boolean parameter `cascade` that supports deleting all the namespaces and tables under the namespace.

Also include changing the implementations and tests that are relevant to this API.

### Why are the changes needed?
According to [#cmt](https://github.com/apache/spark/pull/35202#discussion_r784463563), the current `dropNamespace` API doesn't support cascade mode, so this PR replaces it to support cascading.
If `cascade` is set to true, all namespaces and tables under the namespace are deleted (see the sketch below).
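
A hypothetical caller-side sketch of the updated API, `dropNamespace(String[], boolean)`:

```scala
import org.apache.spark.sql.connector.catalog.SupportsNamespaces

def dropNamespaceCascade(catalog: SupportsNamespaces, namespace: Array[String]): Boolean = {
  // cascade = true: drop the namespace together with all namespaces and tables under it.
  // cascade = false: expected to fail (e.g. with a non-empty-namespace error) if it is not empty.
  catalog.dropNamespace(namespace, true)
}
```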

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test.

Closes #35246 from dchvn/change_dropnamespace_api.

Authored-by: dch nguyen <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* code format

* [SPARK-38196][SQL] Refactor framework so as JDBC dialect could compile expression by self way

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/35248 provides a new framework to represent catalyst expressions in DS V2 APIs.
Because that framework translates all catalyst expressions to a unified SQL string and cannot keep compatibility between different JDBC databases, it does not work well.

This PR refactors the framework so that each JDBC dialect can compile expressions in its own way.
First, the framework translates catalyst expressions to DS V2 expressions.
Second, the JDBC dialect compiles the DS V2 expressions into its own SQL syntax.

The Javadoc looks as shown below:
![image](https://user-images.githubusercontent.com/8486025/156579584-f56cafb5-641f-4c5b-a06e-38f4369051c3.png)

### Why are the changes needed?
Make the framework more generally usable.

### Does this PR introduce _any_ user-facing change?
'No'.
The feature is not released.

### How was this patch tested?
Existing tests.

Closes #35494 from beliefer/SPARK-37960_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-38361][SQL] Add factory method `getConnection` into `JDBCDialect`

### What changes were proposed in this pull request?
At present, the factory method for obtaining a JDBC connection takes no parameters, because the JDBC URL of some databases is fixed and unique.
However, for databases such as ClickHouse, the connection is related to the shard node.
So the parameter form `getConnection: Partition => Connection` is more general.

This PR adds factory method `getConnection` into `JDBCDialect` according to https://github.com/apache/spark/pull/35696#issuecomment-1058060107.

### Why are the changes needed?
Make factory method `getConnection` more general.

### Does this PR introduce _any_ user-facing change?
'No'.
Just inner change.

### How was this patch tested?
Existing test.

Closes #35727 from beliefer/SPARK-38361_new.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* code format

* [SPARK-38560][SQL] If `Sum`, `Count`, `Any` accompany with distinct, cannot do partial agg push down

### What changes were proposed in this pull request?
Spark may partially push down `sum(distinct col)` and `count(distinct col)` when the data source has multiple partitions, and then sum the partial values again, so the result may be incorrect. For example, if two partitions contain the values {1, 2} and {2, 3}, the per-partition `count(distinct col)` results are 2 and 2, and summing them gives 4 instead of the correct answer 3.
This PR stops doing partial aggregate push-down in this case.

### Why are the changes needed?
Fix the bug where pushing down `sum(distinct col)` or `count(distinct col)` to the data source returns an incorrect result.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users will see the correct behavior.

### How was this patch tested?
New tests.

Closes #35873 from beliefer/SPARK-38560.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-36718][SQL] Only collapse projects if we don't duplicate expensive expressions

### What changes were proposed in this pull request?

The `CollapseProject` rule can combine adjacent projects and merge the project lists. The key ideas behind this rule are that evaluating a project is relatively expensive, that expression evaluation is cheap, and that the expression duplication caused by this rule is not a problem. The last assumption is, unfortunately, not always true:
- A user can invoke some expensive UDF, this now gets invoked more often than originally intended.
- A projection is very cheap in whole stage code generation. The duplication caused by `CollapseProject` does more harm than good here.

This PR addresses this problem, by only collapsing projects when it does not duplicate expensive expressions. In practice this means an input reference may only be consumed once, or when its evaluation does not incur significant overhead (currently attributes, nested column access, aliases & literals fall in this category).
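
An illustration of the duplication concern; it assumes a `SparkSession` named `spark`, and the UDF is just a stand-in for any expensive expression:

```scala
import org.apache.spark.sql.functions.{col, udf}

val expensive = udf((s: String) => { Thread.sleep(10); s.length })  // stand-in for a costly UDF
val df = spark.range(10).selectExpr("CAST(id AS STRING) AS s")
val result = df
  .select(expensive(col("s")).as("n"))
  .select((col("n") + 1).as("a"), (col("n") * 2).as("b"))
// If CollapseProject merges the two selects, `expensive` is inlined into both output
// columns and evaluated twice per row; with this change, the projects are collapsed
// only when no such expensive expression would be duplicated.
```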

### Why are the changes needed?

We have seen multiple complaints about `CollapseProject` in the past, because it may duplicate expensive expressions. The most recent one is https://github.com/apache/spark/pull/33903 .

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

a new UT and existing test

Closes #33958 from cloud-fan/collapse.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-38432][SQL] Refactor framework so as JDBC dialect could compile filter by self way

### What changes were proposed in this pull request?
Currently, Spark DS V2 can push filters down into JDBC sources. However, only the most basic form of filter is supported.
On the other hand, some JDBC sources cannot compile the filters in their own way.

This PR refactors the framework so that each JDBC dialect can compile filters in its own way.
First, the framework translates catalyst expressions to DS V2 filters.
Second, the JDBC dialect compiles the DS V2 filters into its own SQL syntax.

### Why are the changes needed?
Make the framework more generally usable.

### Does this PR introduce _any_ user-facing change?
'No'.
The feature is not released.

### How was this patch tested?
Existing tests.

Closes #35768 from beliefer/SPARK-38432_new.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-38432][SQL][FOLLOWUP] Supplement test case for overflow and add comments

### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/35768 and improves the code.

1. Supplement test case for overflow
2. Not throw IllegalArgumentException
3. Improve V2ExpressionSQLBuilder
4. Add comments in V2ExpressionBuilder

### Why are the changes needed?
Supplement test case for overflow and add comments.

### Does this PR introduce _any_ user-facing change?
'No'.
V2 aggregate pushdown not released yet.

### How was this patch tested?
New tests.

Closes #35933 from beliefer/SPARK-38432_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-38533][SQL] DS V2 aggregate push-down supports project with alias

### What changes were proposed in this pull request?
Currently, Spark DS V2 aggregate push-down doesn't support projects with aliases.

Refer https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala#L96

This PR makes it work well with aliases.

**The first example:**
the original plan is shown below:
```
Aggregate [DEPT#0], [DEPT#0, sum(mySalary#8) AS total#14]
+- Project [DEPT#0, SALARY#2 AS mySalary#8]
   +- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession77978658,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions5f8da82)
```
If we can complete push down the aggregate, then the plan will be:
```
Project [DEPT#0, SUM(SALARY)#18 AS sum(SALARY#2)#13 AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```
If we can partial push down the aggregate, then the plan will be:
```
Aggregate [DEPT#0], [DEPT#0, sum(cast(SUM(SALARY)#18 as decimal(20,2))) AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```

**The second example:**
the original plan is shown below:
```
Aggregate [myDept#33], [myDept#33, sum(mySalary#34) AS total#40]
+- Project [DEPT#25 AS myDept#33, SALARY#27 AS mySalary#34]
   +- ScanBuilderHolder [DEPT#25, NAME#26, SALARY#27, BONUS#28], RelationV2[DEPT#25, NAME#26, SALARY#27, BONUS#28] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession25c4f621,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions345d641e)
```
If we can complete push down the aggregate, then the plan will be:
```
Project [DEPT#25 AS myDept#33, SUM(SALARY)#44 AS sum(SALARY#27)#39 AS total#40]
+- RelationV2[DEPT#25, SUM(SALARY)#44] test.employee
```
If we can partial push down the aggregate, then the plan will be:
```
Aggregate [myDept#33], [DEPT#25 AS myDept#33, sum(cast(SUM(SALARY)#56 as decimal(20,2))) AS total#52]
+- RelationV2[DEPT#25, SUM(SALARY)#56] test.employee
```

### Why are the changes needed?
Supporting aliases makes the push-down more useful.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users can see that DS V2 aggregate push-down supports projects with aliases.

### How was this patch tested?
New tests.

Closes #35932 from beliefer/SPARK-38533_new.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* code format

* [SPARK-37483][SQL][FOLLOWUP] Rename `pushedTopN` to `PushedTopN` and improve JDBCV2Suite

### What changes were proposed in this pull request?
This PR fixes three issues.
**First**, create method `checkPushedInfo` and `checkSortRemoved` to reuse code.
**Second**, remove method `checkPushedLimit`, because `checkPushedInfo` can cover it.
**Third**, rename `pushedTopN` to `PushedTopN`, so it is consistent with the other pushed information.

### Why are the changes needed?
Reuse code and make the pushed information more accurate.

### Does this PR introduce _any_ user-facing change?
'No'. New feature and improve the tests.

### How was this patch tested?
Adjust existing tests.

Closes #35921 from beliefer/SPARK-37483_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-38644][SQL] DS V2 topN push-down supports project with alias

### What changes were proposed in this pull request?
Currently, Spark DS V2 top-N push-down doesn't support projects with aliases.

This PR makes it work well with aliases.

**Example**:
the original plan is shown below:
```
Sort [mySalary#10 ASC NULLS FIRST], true
+- Project [NAME#1, SALARY#2 AS mySalary#10]
   +- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3, IS_MANAGER#4], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3, IS_MANAGER#4] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession7fd4b9ec,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true),StructField(IS_MANAGER,BooleanType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions3c8e4a82)
```
The `pushedLimit` and `sortOrders` of `JDBCScanBuilder` are empty.

If we can push down the top n, then the plan will be:
```
Project [NAME#1, SALARY#2 AS mySalary#10]
+- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3, IS_MANAGER#4], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3, IS_MANAGER#4] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession7fd4b9ec,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true),StructField(IS_MANAGER,BooleanType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions3c8e4a82)
```
The `pushedLimit` of `JDBCScanBuilder` will be `1` and `sortOrders` of `JDBCScanBuilder` will be `SALARY ASC NULLS FIRST`.

### Why are the changes needed?
Supporting aliases makes the push-down more useful.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users can see that DS V2 top-N push-down supports projects with aliases.

### How was this patch tested?
New tests.

Closes #35961 from beliefer/SPARK-38644.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-38391][SQL] Datasource v2 supports partial topN push-down

### What changes were proposed in this pull request?
Currently, Spark pushes down top N completely. But for data sources (e.g. JDBC) that have multiple partitions, we should keep the push-down partial (Spark still performs the final sort).

### Why are the changes needed?
Make the behavior of sort push-down correct.

### Does this PR introduce _any_ user-facing change?
'No'. Just changes the inner implementation.

### How was this patch tested?
New tests.

Closes #35710 from beliefer/SPARK-38391.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-38633][SQL] Support push down Cast to JDBC data source V2

### What changes were proposed in this pull request?
Cast is very useful, and Spark always uses Cast to convert data types automatically.

### Why are the changes needed?
Lets more aggregates and filters be pushed down.

### Does this PR introduce _any_ user-facing change?
'Yes'.
This PR comes after the 3.3.0 cut-off.

### How was this patch tested?
New tests.

Closes #35947 from beliefer/SPARK-38633.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-38432][SQL][FOLLOWUP] Add test case for push down filter with alias

### What changes were proposed in this pull request?
DS V2 predicate push-down to the data source supports columns with aliases.
But Spark is missing a test case for pushing down a filter with an alias.

### Why are the changes needed?
Add a test case for pushing down a filter with an alias.

### Does this PR introduce _any_ user-facing change?
'No'.
Just add a test case.

### How was this patch tested?
New tests.

Closes #35988 from beliefer/SPARK-38432_followup2.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-38633][SQL][FOLLOWUP] JDBCSQLBuilder should build cast to type of databases

### What changes were proposed in this pull request?
DS V2 supports pushing CAST down to the database.
The current implementation only uses the `typeName` of the `DataType`.
For example, `Cast(column, StringType)` will be built as `CAST(column AS String)`.
But it should be `CAST(column AS TEXT)` for Postgres or `CAST(column AS VARCHAR2(255))` for Oracle.

### Why are the changes needed?
Improve the implementation of CAST push-down.

### Does this PR introduce _any_ user-facing change?
'No'.
Just new feature.

### How was this patch tested?
Existing tests

Closes #35999 from beliefer/SPARK-38633_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-37839][SQL][FOLLOWUP] Check overflow when DS V2 partial aggregate push-down `AVG`

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/35130 supports partial aggregate push-down of `AVG` for DS V2.
The behavior is not consistent with `Average` if an overflow occurs in ANSI mode.
This PR closely follows the implementation of `Average` to respect overflow in ANSI mode.

### Why are the changes needed?
Make the behavior consistent with `Average` if an overflow occurs in ANSI mode.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users will see the overflow exception thrown in ANSI mode.

### How was this patch tested?
New tests.

Closes #35320 from beliefer/SPARK-37839_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-37960][SQL][FOLLOWUP] Make the testing CASE WHEN query more reasonable

### What changes were proposed in this pull request?
Some of the testing CASE WHEN queries were not carefully written and do not make sense. In the future, the optimizer may get smarter and get rid of the CASE WHEN completely, and then we lose test coverage.

This PR updates some CASE WHEN queries to make them more reasonable.

### Why are the changes needed?
future-proof test coverage.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
N/A

Closes #36032 from beliefer/SPARK-37960_followup2.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-38761][SQL] DS V2 supports push down misc non-aggregate functions

### What changes were proposed in this pull request?
Currently, Spark has some ANSI-standard misc non-aggregate functions. Please refer to https://github.com/apache/spark/blob/2f8613f22c0750c00cf1dcfb2f31c431d8dc1be7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L362.
These functions show below:
`abs`,
`coalesce`,
`nullif`,
`CASE WHEN`
DS V2 should support pushing down these misc non-aggregate functions.
Because DS V2 already supports pushing down `CASE WHEN`, this PR doesn't need to do that work again.
Because `nullif` extends `RuntimeReplaceable`, this PR doesn't need to handle it either.

### Why are the changes needed?
DS V2 supports push down misc non-aggregate functions

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests.

Closes #36039 from beliefer/SPARK-38761.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-38865][SQL][DOCS] Update document of JDBC options for `pushDownAggregate` and `pushDownLimit`

### What changes were proposed in this pull request?
Because the DS V2 pushdown framework was refactored, we need to add more documentation in `sql-data-sources-jdbc.md` to reflect the new changes. A usage sketch of the two options follows.
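
A usage sketch for the documented options; the URL and table name are placeholders:

```scala
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "sales")
  .option("pushDownAggregate", "true") // allow aggregates to be pushed to the database
  .option("pushDownLimit", "true")     // allow LIMIT / top-N to be pushed to the database
  .load()
// With the options enabled, the aggregate below can be (partially or completely)
// evaluated by the database instead of by Spark.
df.groupBy("region").count().show()
```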

### Why are the changes needed?
Add doc for new changes for `pushDownAggregate` and `pushDownLimit`.

### Does this PR introduce _any_ user-facing change?
'No'. Updated for new feature.

### How was this patch tested?
N/A

Closes #36152 from beliefer/SPARK-38865.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: huaxingao <[email protected]>

* [SPARK-38855][SQL] DS V2 supports push down math functions

### What changes were proposed in this pull request?
Currently, Spark has some math functions from the ANSI standard. Please refer to https://github.com/apache/spark/blob/2f8613f22c0750c00cf1dcfb2f31c431d8dc1be7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L388
These functions are shown below:
`LN`,
`EXP`,
`POWER`,
`SQRT`,
`FLOOR`,
`CEIL`,
`WIDTH_BUCKET`

Support for these functions in mainstream databases is shown below.

|  Function   | PostgreSQL  | ClickHouse  | H2  | MySQL  | Oracle  | Redshift  | Presto  | Teradata  | Snowflake  | DB2  | Vertica  | Exasol  | SqlServer  | Yellowbrick  | Impala  | Mariadb | Druid | Pig | SQLite | Influxdata | Singlestore | ElasticSearch |
|  ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  |
| `LN` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `EXP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `POWER` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes |
| `SQRT` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `FLOOR` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `CEIL` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `WIDTH_BUCKET` | Yes | No | No | No | Yes | No | Yes | Yes | Yes | Yes | Yes | No | No | No | Yes | No | No | No | No | No | No | No |

DS V2 should support pushing down these math functions; see the usage sketch below.
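
A hedged sketch of what this enables (again assuming an existing `SparkSession` `spark` and an illustrative `h2.test.employee` table; whether each expression is actually pushed depends on the JDBC dialect's ability to translate it):

```scala
// Math expressions in the projection and filter become push-down candidates
// once the framework can translate them; names below are illustrative only.
val df = spark.sql(
  """SELECT name, CEIL(SQRT(salary)) AS s
    |FROM h2.test.employee
    |WHERE LN(salary) > 9 AND POWER(bonus, 2) > 100""".stripMargin)
df.explain(true) // pushed expressions, if any, appear in the scan node
```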

### Why are the changes needed?
Let DS V2 support pushing down math functions.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests.

Closes #36140 from beliefer/SPARK-38855.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* update spark version to r61

Co-authored-by: Huaxin Gao <[email protected]>
Co-authored-by: DB Tsai <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Jiaan Geng <[email protected]>
Co-authored-by: Kousuke Saruta <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: dch nguyen <[email protected]>
Co-authored-by: Cheng Su <[email protected]>
Co-authored-by: Peter Toth <[email protected]>
Co-authored-by: dch nguyen <[email protected]>
11 people authored May 5, 2022
1 parent 049ef70 commit 1db8bca
Showing 179 changed files with 8,240 additions and 3,002 deletions.
2 changes: 1 addition & 1 deletion assembly/pom.xml
@@ -21,7 +21,7 @@
<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.12</artifactId>
<version>3.2.0-kylin-4.x-r60</version>
<version>3.2.0-kylin-4.x-r61</version>
<relativePath>../pom.xml</relativePath>
</parent>

2 changes: 1 addition & 1 deletion common/kvstore/pom.xml
@@ -22,7 +22,7 @@
<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.12</artifactId>
<version>3.2.0-kylin-4.x-r60</version>
<version>3.2.0-kylin-4.x-r61</version>
<relativePath>../../pom.xml</relativePath>
</parent>

2 changes: 1 addition & 1 deletion common/network-common/pom.xml
@@ -22,7 +22,7 @@
<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.12</artifactId>
<version>3.2.0-kylin-4.x-r60</version>
<version>3.2.0-kylin-4.x-r61</version>
<relativePath>../../pom.xml</relativePath>
</parent>

2 changes: 1 addition & 1 deletion common/network-shuffle/pom.xml
@@ -22,7 +22,7 @@
<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.12</artifactId>
<version>3.2.0-kylin-4.x-r60</version>
<version>3.2.0-kylin-4.x-r61</version>
<relativePath>../../pom.xml</relativePath>
</parent>

2 changes: 1 addition & 1 deletion common/network-yarn/pom.xml
@@ -22,7 +22,7 @@
<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.12</artifactId>
<version>3.2.0-kylin-4.x-r60</version>
<version>3.2.0-kylin-4.x-r61</version>
<relativePath>../../pom.xml</relativePath>
</parent>

2 changes: 1 addition & 1 deletion common/sketch/pom.xml
@@ -22,7 +22,7 @@
<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.12</artifactId>
<version>3.2.0-kylin-4.x-r60</version>
<version>3.2.0-kylin-4.x-r61</version>
<relativePath>../../pom.xml</relativePath>
</parent>

2 changes: 1 addition & 1 deletion common/tags/pom.xml
@@ -22,7 +22,7 @@
<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.12</artifactId>
<version>3.2.0-kylin-4.x-r60</version>
<version>3.2.0-kylin-4.x-r61</version>
<relativePath>../../pom.xml</relativePath>
</parent>

2 changes: 1 addition & 1 deletion common/unsafe/pom.xml
@@ -22,7 +22,7 @@
<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.12</artifactId>
<version>3.2.0-kylin-4.x-r60</version>
<version>3.2.0-kylin-4.x-r61</version>
<relativePath>../../pom.xml</relativePath>
</parent>

2 changes: 1 addition & 1 deletion core/pom.xml
@@ -21,7 +21,7 @@
<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.12</artifactId>
<version>3.2.0-kylin-4.x-r60</version>
<version>3.2.0-kylin-4.x-r61</version>
<relativePath>../pom.xml</relativePath>
</parent>

28 changes: 23 additions & 5 deletions docs/sql-data-sources-jdbc.md
@@ -9,9 +9,9 @@ license: |
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@@ -191,7 +191,7 @@ logging into the data sources.
<td>write</td>
</td>
</tr>

<tr>
<td><code>cascadeTruncate</code></td>
<td>the default cascading truncate behaviour of the JDBC database in question, specified in the <code>isCascadeTruncate</code> in each JDBCDialect</td>
@@ -241,7 +241,25 @@ logging into the data sources.
<td><code>pushDownAggregate</code></td>
<td><code>false</code></td>
<td>
The option to enable or disable aggregate push-down into the JDBC data source. The default value is false, in which case Spark will not push down aggregates to the JDBC data source. Otherwise, if sets to true, aggregates will be pushed down to the JDBC data source. Aggregate push-down is usually turned off when the aggregate is performed faster by Spark than by the JDBC data source. Please note that aggregates can be pushed down if and only if all the aggregate functions and the related filters can be pushed down. Spark assumes that the data source can't fully complete the aggregate and does a final aggregate over the data source output.
The option to enable or disable aggregate push-down in the V2 JDBC data source. The default value is false, in which case Spark will not push down aggregates to the JDBC data source. Otherwise, if set to true, aggregates will be pushed down to the JDBC data source. Aggregate push-down is usually turned off when the aggregate is performed faster by Spark than by the JDBC data source. Please note that aggregates can be pushed down if and only if all the aggregate functions and the related filters can be pushed down. If <code>numPartitions</code> equals 1 or the group-by key is the same as <code>partitionColumn</code>, Spark will push down the aggregate to the data source completely and not apply a final aggregate over the data source output. Otherwise, Spark will apply a final aggregate over the data source output.
</td>
<td>read</td>
</tr>

<tr>
<td><code>pushDownLimit</code></td>
<td><code>false</code></td>
<td>
The option to enable or disable LIMIT push-down into the V2 JDBC data source. LIMIT push-down also includes LIMIT + SORT, a.k.a. the Top N operator. The default value is false, in which case Spark does not push down LIMIT or LIMIT with SORT to the JDBC data source. Otherwise, if set to true, LIMIT or LIMIT with SORT is pushed down to the JDBC data source. If <code>numPartitions</code> is greater than 1, Spark still applies LIMIT or LIMIT with SORT on the result from the data source even if LIMIT or LIMIT with SORT is pushed down. Otherwise, if LIMIT or LIMIT with SORT is pushed down and <code>numPartitions</code> equals 1, Spark will not apply LIMIT or LIMIT with SORT on the result from the data source.
</td>
<td>read</td>
</tr>

<tr>
<td><code>pushDownTableSample</code></td>
<td><code>false</code></td>
<td>
The option to enable or disable TABLESAMPLE push-down into the V2 JDBC data source. The default value is false, in which case Spark does not push down TABLESAMPLE to the JDBC data source. Otherwise, if set to true, TABLESAMPLE is pushed down to the JDBC data source.
</td>
<td>read</td>
</tr>
@@ -288,7 +306,7 @@ logging into the data sources.

Note that kerberos authentication with keytab is not always supported by the JDBC driver.<br>
Before using <code>keytab</code> and <code>principal</code> configuration options, please make sure the following requirements are met:
* The included JDBC driver version supports kerberos authentication with keytab.
* The included JDBC driver version supports kerberos authentication with keytab.
* There is a built-in connection provider which supports the used database.

There is a built-in connection providers for the following databases:
2 changes: 1 addition & 1 deletion examples/pom.xml
@@ -21,7 +21,7 @@
<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.12</artifactId>
<version>3.2.0-kylin-4.x-r60</version>
<version>3.2.0-kylin-4.x-r61</version>
<relativePath>../pom.xml</relativePath>
</parent>

2 changes: 1 addition & 1 deletion external/avro/pom.xml
@@ -21,7 +21,7 @@
<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.12</artifactId>
<version>3.2.0-kylin-4.x-r60</version>
<version>3.2.0-kylin-4.x-r61</version>
<relativePath>../../pom.xml</relativePath>
</parent>

@@ -62,10 +62,6 @@ case class AvroScan(
pushedFilters)
}

override def withFilters(
partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): FileScan =
this.copy(partitionFilters = partitionFilters, dataFilters = dataFilters)

override def equals(obj: Any): Boolean = obj match {
case a: AvroScan => super.equals(a) && dataSchema == a.dataSchema && options == a.options &&
equivalentFilters(pushedFilters, a.pushedFilters)
@@ -18,7 +18,7 @@ package org.apache.spark.sql.v2.avro

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.StructFilters
import org.apache.spark.sql.connector.read.{Scan, SupportsPushDownFilters}
import org.apache.spark.sql.connector.read.Scan
import org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex
import org.apache.spark.sql.execution.datasources.v2.FileScanBuilder
import org.apache.spark.sql.sources.Filter
@@ -31,7 +31,7 @@ class AvroScanBuilder (
schema: StructType,
dataSchema: StructType,
options: CaseInsensitiveStringMap)
extends FileScanBuilder(sparkSession, fileIndex, dataSchema) with SupportsPushDownFilters {
extends FileScanBuilder(sparkSession, fileIndex, dataSchema) {

override def build(): Scan = {
AvroScan(
@@ -41,17 +41,16 @@ class AvroScanBuilder (
readDataSchema(),
readPartitionSchema(),
options,
pushedFilters())
pushedDataFilters,
partitionFilters,
dataFilters)
}

private var _pushedFilters: Array[Filter] = Array.empty

override def pushFilters(filters: Array[Filter]): Array[Filter] = {
override def pushDataFilters(dataFilters: Array[Filter]): Array[Filter] = {
if (sparkSession.sessionState.conf.avroFilterPushDown) {
_pushedFilters = StructFilters.pushedFilters(filters, dataSchema)
StructFilters.pushedFilters(dataFilters, dataSchema)
} else {
Array.empty[Filter]
}
filters
}

override def pushedFilters(): Array[Filter] = _pushedFilters
}
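
The diff above reflects the refactored file-source pushdown path: `AvroScanBuilder` no longer mixes in `SupportsPushDownFilters` and instead overrides a `pushDataFilters` hook, while the shared `FileScanBuilder` tracks `partitionFilters`, `dataFilters`, and `pushedDataFilters`. A simplified, self-contained model of that pattern (not the real Spark classes) looks like this:

```scala
import org.apache.spark.sql.sources.Filter

// Simplified model of the refactored pattern: the base builder owns the
// pushed-filter bookkeeping, and each file format only answers
// "which of these data filters can I actually push?".
abstract class SketchFileScanBuilder {
  protected var pushedDataFilters: Array[Filter] = Array.empty

  // Called by the framework with filters that reference only data columns.
  final def pushFilters(dataFilters: Array[Filter]): Unit = {
    pushedDataFilters = pushDataFilters(dataFilters)
  }

  // Format-specific hook, mirroring AvroScanBuilder.pushDataFilters above.
  protected def pushDataFilters(dataFilters: Array[Filter]): Array[Filter]
}

class SketchAvroScanBuilder(filterPushDownEnabled: Boolean) extends SketchFileScanBuilder {
  override protected def pushDataFilters(dataFilters: Array[Filter]): Array[Filter] =
    if (filterPushDownEnabled) dataFilters else Array.empty[Filter]
}
```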
7 changes: 6 additions & 1 deletion external/docker-integration-tests/pom.xml
@@ -22,7 +22,7 @@
<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.12</artifactId>
<version>3.2.0-kylin-4.x-r60</version>
<version>3.2.0-kylin-4.x-r61</version>
<relativePath>../../pom.xml</relativePath>
</parent>

@@ -162,5 +162,10 @@
<artifactId>mssql-jdbc</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
</project>
@@ -18,13 +18,14 @@
package org.apache.spark.sql.jdbc.v2

import java.sql.Connection
import java.util.Locale

import org.scalatest.time.SpanSugar._

import org.apache.spark.SparkConf
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog
import org.apache.spark.sql.jdbc.{DatabaseOnDocker, DockerJDBCIntegrationSuite}
import org.apache.spark.sql.jdbc.DatabaseOnDocker
import org.apache.spark.sql.types._
import org.apache.spark.tags.DockerTest

@@ -36,8 +37,9 @@ import org.apache.spark.tags.DockerTest
* }}}
*/
@DockerTest
class DB2IntegrationSuite extends DockerJDBCIntegrationSuite with V2JDBCTest {
class DB2IntegrationSuite extends DockerJDBCIntegrationV2Suite with V2JDBCTest {
override val catalogName: String = "db2"
override val namespaceOpt: Option[String] = Some("DB2INST1")
override val db = new DatabaseOnDocker {
override val imageName = sys.env.getOrElse("DB2_DOCKER_IMAGE_NAME", "ibmcom/db2:11.5.4.0")
override val env = Map(
@@ -59,8 +61,13 @@ class DB2IntegrationSuite extends DockerJDBCIntegrationSuite with V2JDBCTest {
override def sparkConf: SparkConf = super.sparkConf
.set("spark.sql.catalog.db2", classOf[JDBCTableCatalog].getName)
.set("spark.sql.catalog.db2.url", db.getJdbcUrl(dockerIp, externalPort))
.set("spark.sql.catalog.db2.pushDownAggregate", "true")

override def dataPreparation(conn: Connection): Unit = {}
override def tablePreparation(connection: Connection): Unit = {
connection.prepareStatement(
"CREATE TABLE employee (dept INTEGER, name VARCHAR(10), salary DECIMAL(20, 2), bonus DOUBLE)")
.executeUpdate()
}

override def testUpdateColumnType(tbl: String): Unit = {
sql(s"CREATE TABLE $tbl (ID INTEGER)")
@@ -86,4 +93,17 @@ class DB2IntegrationSuite extends DockerJDBCIntegrationSuite with V2JDBCTest {
val expectedSchema = new StructType().add("ID", IntegerType, true, defaultMetadata)
assert(t.schema === expectedSchema)
}

override def caseConvert(tableName: String): String = tableName.toUpperCase(Locale.ROOT)

testVarPop()
testVarPop(true)
testVarSamp()
testVarSamp(true)
testStddevPop()
testStddevPop(true)
testStddevSamp()
testStddevSamp(true)
testCovarPop()
testCovarSamp()
}
@@ -0,0 +1,74 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.sql.jdbc.v2

import java.sql.Connection

import scala.collection.JavaConverters._

import org.apache.spark.sql.jdbc.{DatabaseOnDocker, DockerJDBCIntegrationSuite}
import org.apache.spark.sql.util.CaseInsensitiveStringMap
import org.apache.spark.tags.DockerTest

/**
* To run this test suite for a specific version (e.g., ibmcom/db2:11.5.6.0a):
* {{{
* ENABLE_DOCKER_INTEGRATION_TESTS=1 DB2_DOCKER_IMAGE_NAME=ibmcom/db2:11.5.6.0a
* ./build/sbt -Pdocker-integration-tests "testOnly *v2.DB2NamespaceSuite"
* }}}
*/
@DockerTest
class DB2NamespaceSuite extends DockerJDBCIntegrationSuite with V2JDBCNamespaceTest {
override val db = new DatabaseOnDocker {
override val imageName = sys.env.getOrElse("DB2_DOCKER_IMAGE_NAME", "ibmcom/db2:11.5.6.0a")
override val env = Map(
"DB2INST1_PASSWORD" -> "rootpass",
"LICENSE" -> "accept",
"DBNAME" -> "db2foo",
"ARCHIVE_LOGS" -> "false",
"AUTOCONFIG" -> "false"
)
override val usesIpc = false
override val jdbcPort: Int = 50000
override val privileged = true
override def getJdbcUrl(ip: String, port: Int): String =
s"jdbc:db2://$ip:$port/db2foo:user=db2inst1;password=rootpass;retrieveMessagesFromServerOnGetMessage=true;" //scalastyle:ignore
}

val map = new CaseInsensitiveStringMap(
Map("url" -> db.getJdbcUrl(dockerIp, externalPort),
"driver" -> "com.ibm.db2.jcc.DB2Driver").asJava)

catalog.initialize("db2", map)

override def dataPreparation(conn: Connection): Unit = {}

override def builtinNamespaces: Array[Array[String]] =
Array(Array("NULLID"), Array("SQLJ"), Array("SYSCAT"), Array("SYSFUN"),
Array("SYSIBM"), Array("SYSIBMADM"), Array("SYSIBMINTERNAL"), Array("SYSIBMTS"),
Array("SYSPROC"), Array("SYSPUBLIC"), Array("SYSSTAT"), Array("SYSTOOLS"))

override def listNamespaces(namespace: Array[String]): Array[Array[String]] = {
builtinNamespaces ++ Array(namespace)
}

override val supportsDropSchemaCascade: Boolean = false

testListNamespaces()
testDropNamespaces()
}