
[#1550] feat(spark-connector) support partition,bucket, sortorder table #2540

Merged
merged 12 commits into apache:main from the partition branch on Mar 22, 2024

Conversation

FANNG1
Contributor

@FANNG1 FANNG1 commented Mar 15, 2024

What changes were proposed in this pull request?

Add partition, distribution, and sort order support to the Spark connector.

Why are the changes needed?

Fix: #1550

Does this PR introduce any user-facing change?

no

How was this patch tested?

Added unit tests and integration tests, and also verified in a local environment.
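
For context, a minimal sketch (not taken from this PR) of the kind of Spark SQL DDL this feature targets: a Hive table created through a Spark session with partition, bucket (distribution), and sort order clauses. The catalog, database, and table names below are hypothetical.

import org.apache.spark.sql.SparkSession;

public class PartitionedTableExample {
  public static void main(String[] args) {
    SparkSession spark =
        SparkSession.builder()
            .master("local[*]")
            .appName("gravitino-partition-example")
            .getOrCreate();

    // PARTITIONED BY / CLUSTERED BY / SORTED BY are the Spark-side clauses that
    // the connector maps to Gravitino partitioning, distribution, and sort orders.
    spark.sql(
        "CREATE TABLE hive_catalog.db.events (id BIGINT, name STRING, dt STRING)"
            + " USING hive"
            + " PARTITIONED BY (dt)"
            + " CLUSTERED BY (id) SORTED BY (name) INTO 4 BUCKETS");

    spark.stop();
  }
}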

@FANNG1 FANNG1 marked this pull request as draft March 15, 2024 00:25
@FANNG1 FANNG1 force-pushed the partition branch 3 times, most recently from c8cf9ff to fc2770f on March 15, 2024 at 09:42
@FANNG1 FANNG1 changed the title [SIP][Don't merge] feat(spark-connector) support partition,bucket, sortorder table [#1550] feat(spark-connector) support partition,bucket, sortorder table Mar 15, 2024
@FANNG1 FANNG1 self-assigned this Mar 15, 2024
@FANNG1 FANNG1 force-pushed the partition branch 2 times, most recently from 3086089 to fcbe74b on March 18, 2024 at 03:07
@FANNG1 FANNG1 marked this pull request as ready for review March 18, 2024 03:16
@FANNG1
Contributor Author

FANNG1 commented Mar 18, 2024

It's ready for review now. @jerryshao @qqqttt123 @yuqi1129 @mchades @diqiu50, please help review when you are free.

@FANNG1 FANNG1 force-pushed the partition branch 2 times, most recently from 9644609 to d404224 on March 20, 2024 at 02:11
@FANNG1
Contributor Author

FANNG1 commented Mar 20, 2024

@jerryshao, please help review when you are free, thanks.

@@ -27,6 +28,10 @@ dependencies {
implementation("org.apache.kyuubi:kyuubi-spark-connector-hive_$scalaVersion:$kyuubiVersion")
implementation("org.apache.spark:spark-catalyst_$scalaVersion:$sparkVersion")
implementation("org.apache.spark:spark-sql_$scalaVersion:$sparkVersion")
implementation("org.scala-lang.modules:scala-java8-compat_$scalaVersion:$scalaJava8CompatVersion")
Contributor

Does it work on JDK 11 or 17?

Contributor Author

@FANNG1 FANNG1 Mar 20, 2024

Yes, it's a Java 8 (and up) compatibility kit for Scala, and the integration tests pass with it.

@FANNG1
Contributor Author

FANNG1 commented Mar 21, 2024

I split toGravitinoTransform into two interfaces, toGravitinoPartitions and toGravitinoDistributionAndSortorders; they are not suitable for splitting further because Spark's SortedBucketTransform contains both distribution and sort orders. @jerryshao @yuqi1129 @mchades please help review again.
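
A minimal sketch of that split, using simplified placeholder types rather than the real Spark and Gravitino expression classes (only the two method names come from the comment above; everything else is illustrative):

import java.util.List;

// Placeholder result types; the real code works with Gravitino expression objects.
record Distribution(int bucketNum, List<String> bucketColumns) {}
record SortOrder(String column) {}
record DistributionAndSortOrders(Distribution distribution, List<SortOrder> sortOrders) {}

class TransformConverterSketch {

  // Partition transforms (e.g. identity) only yield Gravitino partitions.
  static List<String> toGravitinoPartitions(List<String> partitionColumns) {
    return List.copyOf(partitionColumns);
  }

  // A sorted-bucket transform carries both bucketing and sort columns, so the
  // distribution and the sort orders have to be produced together by one method.
  static DistributionAndSortOrders toGravitinoDistributionAndSortorders(
      int bucketNum, List<String> bucketColumns, List<String> sortColumns) {
    Distribution distribution = new Distribution(bucketNum, bucketColumns);
    List<SortOrder> sortOrders = sortColumns.stream().map(SortOrder::new).toList();
    return new DistributionAndSortOrders(distribution, sortOrders);
  }
}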

// Gravitino uses ["a","b"] for nested fields while Spark uses "a.b";
private static String getFieldNameFromGravitinoNamedReference(
NamedReference gravitinoNamedReference) {
return String.join(ConnectorConstants.DOT, gravitinoNamedReference.fieldName());
Contributor

@mchades Does Gravitino support nested fields? I remember that in ["a","b"], a means the table reference and b is the real column name?

Contributor

No, the fieldName array is used to access nested fields. For example, if column a is a struct type {b int, c string}, then we can use a.b or a.c to reference a nested field.
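
A small plain-Java illustration (not the connector's actual classes) of the naming difference discussed here: Spark refers to a nested field as "a.b", while Gravitino represents it as the field-name array ["a", "b"].

import java.util.Arrays;
import java.util.List;

class NestedFieldNames {
  private static final String DOT = ".";

  // Gravitino-style ["a", "b"] -> Spark-style "a.b"
  static String toSparkFieldName(List<String> gravitinoFieldName) {
    return String.join(DOT, gravitinoFieldName);
  }

  // Spark-style "a.b" -> Gravitino-style ["a", "b"]
  static List<String> toGravitinoFieldName(String sparkFieldName) {
    return Arrays.asList(sparkFieldName.split("\\."));
  }

  public static void main(String[] args) {
    System.out.println(toSparkFieldName(List.of("a", "b"))); // a.b
    System.out.println(toGravitinoFieldName("a.b"));         // [a, b]
  }
}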

bucketNum, createSparkNamedReference(bucketFields), createSparkNamedReference(sortFields));
}

// columnName could be "a" or "a.b" for nested column
Contributor

So do you need to handle the nested column case here?

Contributor Author

I prefer to handle it, because both the Spark and Gravitino interfaces support nested columns.

Contributor

So I don't see where you do it here?

Contributor

Sorry, I see the code there now, please ignore.

@FANNG1
Contributor Author

FANNG1 commented Mar 22, 2024

@mchades @qqqttt123 @yuqi1129 @jerryshao @diqiu50 all comments have been addressed, please help review again.

@jerryshao
Contributor

I have no further comments; I think we can go ahead to unblock other PRs. If there are any missing parts, we can fix them in another PR.

@jerryshao jerryshao merged commit d2ed24f into apache:main Mar 22, 2024
14 checks passed
coolderli pushed a commit to coolderli/gravitino that referenced this pull request Apr 2, 2024
Successfully merging this pull request may close these issues.

[Subtask] [spark-connector] support hive partition and bucket table