
Add spark submodule to convert SparkSql's LogicalPlan to Substrait Rel. #90

Closed
wants to merge 4 commits

Conversation

LaiZhou

@LaiZhou LaiZhou commented Sep 13, 2022

At the moment, it aims to build a Substrait rel producer for SparkSql at the logical level.
Currently only the type converter and parts of Spark's Read operators are finished; some related work is in progress.
The test cases also integrate a hive-standalone-metastore docker container, which makes it convenient to test some operators, such as the Iceberg/Hive datasource relation operators.
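
For orientation, a hypothetical usage sketch of the producer this PR aims to build; `ToSubstraitRel` and its `convert` method are illustrative names only, not the PR's confirmed API, and the query assumes a table `t` registered in the metastore:

```scala
import org.apache.spark.sql.SparkSession

object ProducerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .enableHiveSupport() // the tests in this PR resolve tables via a Hive metastore
      .getOrCreate()

    // Spark's optimized logical plan for a simple scan + project.
    val logicalPlan = spark.sql("SELECT col FROM t").queryExecution.optimizedPlan

    // ToSubstraitRel is hypothetical: it walks the logical plan and
    // emits the corresponding Substrait Rel.
    val rel = new ToSubstraitRel().convert(logicalPlan)
    println(rel)
  }
}
```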

@rui-mo

rui-mo commented Sep 20, 2022

Similar to this PR, we are working on the conversion from the SparkSql physical plan to a Substrait plan. Currently, we have finished read, filter, project, aggregation, and join. Some expressions used in TPC-H are also supported. We found there are some common parts between the conversions of the logical and physical plans. Can we also upstream our code to substrait-java?
cc @FelixYBW

@FelixYBW

The project Rui refers to: https://github.com/oap-project/gluten

@jacques-n
Collaborator

@jinfengni and @JamesRTaylor, can you review this?

@jinfengni
Contributor

> @jinfengni and @JamesRTaylor, can you review this?

Sorry, I missed reading the earlier message. Will do the review.

@jinfengni
Contributor

My overall impression is that this PR is still at a work-in-progress stage. It defines a type converter, an expression converter, and a logical plan converter, but each converter supports very limited inputs. If I understand correctly, it probably converts only queries like "select col from t".

My suggestion is that we split this PR into smaller-scoped building blocks for a Spark plan converter. Maybe start with the type converter, to handle the majority of Spark's types, then continue with the expression converter covering the majority of expressions. We can then deal with the logical plan converter. For each converter, add UTs to cover the types/expressions/logical relation operators that are supported.
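
As an illustration of the suggested UT coverage, a minimal ScalaTest sketch for the type converter; `ToSubstraitType.convert` is a hypothetical name mirroring the `convert(dataType, nullable)` helper in this PR, not a confirmed API:

```scala
import org.apache.spark.sql.types.{IntegerType, StringType}
import org.scalatest.funsuite.AnyFunSuite

class TypeConverterSuite extends AnyFunSuite {

  test("nullable IntegerType maps to a nullable Substrait type") {
    // ToSubstraitType is hypothetical; see the converter in this PR.
    val t = ToSubstraitType.convert(IntegerType, nullable = true)
    assert(t.nullable())
  }

  test("non-nullable StringType maps to a non-nullable Substrait type") {
    val t = ToSubstraitType.convert(StringType, nullable = false)
    assert(!t.nullable())
  }
}
```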

@LaiZhou
Author

LaiZhou commented Sep 22, 2022

@jinfengni Yeah, this PR just gives a skeleton for building a rel producer from the Spark logical plan; other people can then continue to add more conversions. The next step is to support logical-level expression conversion (WIP).

@winningsix

One general question: what's the use case for converting Spark SQL to Substrait at the logical level?

@FelixYBW

What's the goal of "substrait-java"? My understanding is that it's a standalone module without any dependency on the application level. We used it in our Gluten query plan conversion, but the conversion should be part of the application, instead of substrait-java. That's why we never thought to contribute it here. If we contribute, a separate repo could be more useful.

@jacques-n
Collaborator

> What's the goal of "substrait-java"? My understanding is that it's a standalone module without any dependency on the application level. We used it in our Gluten query plan conversion, but the conversion should be part of the application, instead of substrait-java. That's why we never thought to contribute it here. If we contribute, a separate repo could be more useful.

substrait-java is a place to house Substrait-related Java programs. There are several modules/components within substrait-java. We have a core library for arbitrary use. We also have Isthmus, which is an application for converting between SQL and Substrait. Having more modules like this one (Spark to Substrait) is appropriate here. It's really just focused on anything Substrait and JVM.

@baibaichen
Contributor

baibaichen commented Oct 5, 2022

> substrait-java is a place to house Substrait-related Java programs. There are several modules/components within substrait-java. We have a core library for arbitrary use. We also have Isthmus, which is an application for converting between SQL and Substrait. Having more modules like this one (Spark to Substrait) is appropriate here. It's really just focused on anything Substrait and JVM.

@jacques-n Since you mention Spark: substrait-java only supports Java 17 and above. I know Java 8 is end-of-life, but the mainstream JVM for Spark is still Java 8.

Any suggestion for Java 8?

Thanks

@CLAassistant

CLAassistant commented Oct 6, 2022

CLA assistant check
All committers have signed the CLA.

@LaiZhou
Author

LaiZhou commented Oct 10, 2022

> > substrait-java is a place to house Substrait-related Java programs. There are several modules/components within substrait-java. We have a core library for arbitrary use. We also have Isthmus, which is an application for converting between SQL and Substrait. Having more modules like this one (Spark to Substrait) is appropriate here. It's really just focused on anything Substrait and JVM.
>
> @jacques-n Since you mention Spark: substrait-java only supports Java 17 and above. I know Java 8 is end-of-life, but the mainstream JVM for Spark is still Java 8.
>
> Any suggestion for Java 8?
>
> Thanks

I think the latest Spark 3.3.0 already supports Java 17: https://issues.apache.org/jira/browse/SPARK-33772.

@LaiZhou
Author

LaiZhou commented Oct 10, 2022

> One general question: what's the use case for converting Spark SQL to Substrait at the logical level?

@winningsix
For Iceberg view representation: we may use trino-sql to query an Iceberg view created by spark-sql.
In this case, we need a spark-sql Substrait plan producer and a trino-sql Substrait plan consumer at the logical level.
Furthermore, we could even submit Spark SQL queries to a Trino system.

@FelixYBW

> I think the latest Spark 3.3.0 already supports Java 17: https://issues.apache.org/jira/browse/SPARK-33772.

Java 8 is still very commonly used by Spark customers.

@FelixYBW

> > One general question: what's the use case for converting Spark SQL to Substrait at the logical level?
>
> @winningsix For Iceberg view representation: we may use trino-sql to query an Iceberg view created by spark-sql. In this case, we need a spark-sql Substrait plan producer and a trino-sql Substrait plan consumer at the logical level. Furthermore, we could even submit Spark SQL queries to a Trino system.

We use Substrait to pass the Spark plan to a native library. One issue we met is that the Spark plan isn't a 1:1 mapping to the native library's plan. Our current solution is that in Gluten we convert one Spark operator into one or more Substrait operators; then in the native library, we use a 1:1 mapping from Substrait to native operators. Does trino-sql have a 1:1 mapping to Spark plan nodes?

@LaiZhou
Author

LaiZhou commented Oct 14, 2022

> > > One general question: what's the use case for converting Spark SQL to Substrait at the logical level?
> >
> > @winningsix For Iceberg view representation: we may use trino-sql to query an Iceberg view created by spark-sql. In this case, we need a spark-sql Substrait plan producer and a trino-sql Substrait plan consumer at the logical level. Furthermore, we could even submit Spark SQL queries to a Trino system.
>
> We use Substrait to pass the Spark plan to a native library. One issue we met is that the Spark plan isn't a 1:1 mapping to the native library's plan. Our current solution is that in Gluten we convert one Spark operator into one or more Substrait operators; then in the native library, we use a 1:1 mapping from Substrait to native operators. Does trino-sql have a 1:1 mapping to Spark plan nodes?

No, different SQL engines have different logical/physical plans, and it's difficult to do a 1:1 mapping between these query plans. Substrait was created to provide a standard and open format for these query plans, but there is still a lot of early work to do.

For example:
I found that the Trino engine converts subquery expressions to the plan nodes CorrelatedJoinNode or ApplyNode before the logical plan is generated (https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/sql/planner/SubqueryPlanner.java).
This is very different from what Calcite and Spark do. We have defined the spec for subqueries here: https://substrait.io/expressions/subqueries/#scalar. So if the Trino engine wants to consume a Substrait plan with subqueries, it is responsible for the corresponding conversion.

@jacques-n, a question about field references:
In Substrait, all fields are dealt with on a positional basis, but Spark resolves fields with AttributeReference, which carries a unique global exprId. Do we need to convert an id-based AttributeReference to a position-based FieldReference?

@FelixYBW

> No, different SQL engines have different logical/physical plans, and it's difficult to do a 1:1 mapping between these query plans. Substrait was created to provide a standard and open format for these query plans, but there is still a lot of early work to do.

This is my concern about the conversion here. If we transform Spark's logical plan 1:1 into a Substrait plan, I need another layer, either in Java or C++, to modify the Substrait plan into the native library's plan. Currently we do that modification when we create the Substrait plan. Are you going to create a 1:1 mapping between the Spark logical plan and the Substrait plan?

@jacques-n
Collaborator

> Do we need to convert an id-based AttributeReference to a position-based FieldReference?

Yes, you need to follow the Substrait spec for field references.
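
For illustration, a minimal sketch of that conversion under the assumption that the converter tracks the child operator's output attributes; the helper name is hypothetical:

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}

// Resolve Spark's exprId-based reference to the ordinal position in the
// child's output; Substrait struct-field references are built from this index.
def fieldIndexOf(ref: AttributeReference, childOutput: Seq[Attribute]): Int = {
  val index = childOutput.indexWhere(_.exprId == ref.exprId)
  if (index < 0) {
    throw new IllegalArgumentException(s"Cannot resolve $ref against child output")
  }
  index
}
```

The resulting index can then be wrapped in a Substrait struct-field reference per the spec.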

@baibaichen
Contributor

@LaiZhou

We already integrated substrait-spark in Gluten (apache/incubator-gluten#515). Although Gluten only needs to convert the physical plan to Substrait rel, we still developed a logical-plan-to-Substrait-rel transformation for UTs. Due to schedule pressure, we put it in Gluten for now.

Would you like to do a migration from Gluten to substrait-java?

@MasseGuillaume MasseGuillaume left a comment (Contributor)

minor Scala suggestions

Comment on lines +54 to +77

```scala
var schema: StructType = null
var tableNames: List[String] = null;
plan match {
  case logicalRelation: LogicalRelation =>
    schema = logicalRelation.schema
    tableNames = logicalRelation.catalogTable.get.identifier.unquotedString.split("\\.").toList
    buildNamedScan(schema, tableNames)
  case dataSourceV2ScanRelation: DataSourceV2ScanRelation =>
    schema = dataSourceV2ScanRelation.schema
    tableNames = dataSourceV2ScanRelation.relation.identifier.get.toString.split("\\.").toList
    buildNamedScan(schema, tableNames)
  case dataSourceV2Relation: DataSourceV2Relation =>
    schema = dataSourceV2Relation.schema
    tableNames = dataSourceV2Relation.identifier.get.toString.split("\\.").toList
    buildNamedScan(schema, tableNames)
  case hiveTableRelation: HiveTableRelation =>
    schema = hiveTableRelation.schema
    tableNames = hiveTableRelation.tableMeta.identifier.unquotedString.split("\\.").toList
    buildNamedScan(schema, tableNames)
  // TODO: LocalRelation, Range => Virtual Table, LogicalRelation(HadoopFsRelation) => LocalFiles
  case _ =>
    throw new UnsupportedOperationException(
      "Unable to convert the plan to a substrait AbstractReadRel: " + plan)
}
```

Suggested change

```scala
val (schema, tableNames) = plan match {
  case logicalRelation: LogicalRelation =>
    (logicalRelation.schema,
      logicalRelation.catalogTable.get.identifier.unquotedString.split("\\.").toList)
  case dataSourceV2ScanRelation: DataSourceV2ScanRelation =>
    (dataSourceV2ScanRelation.schema,
      dataSourceV2ScanRelation.relation.identifier.get.toString.split("\\.").toList)
  case dataSourceV2Relation: DataSourceV2Relation =>
    (dataSourceV2Relation.schema,
      dataSourceV2Relation.identifier.get.toString.split("\\.").toList)
  case hiveTableRelation: HiveTableRelation =>
    (hiveTableRelation.schema,
      hiveTableRelation.tableMeta.identifier.unquotedString.split("\\.").toList)
  // TODO: LocalRelation, Range => Virtual Table, LogicalRelation(HadoopFsRelation) => LocalFiles
  case _ =>
    throw new UnsupportedOperationException(
      "Unable to convert the plan to a substrait AbstractReadRel: " + plan)
}
buildNamedScan(schema, tableNames)
```

Comment on lines +10 to +18

```scala
val creator = Type.withNullability(true)
val names = new java.util.ArrayList[String]
val children = new java.util.ArrayList[Type]
schema.fields.foreach(field => {
  names.add(field.name)
  children.add(convert(field.dataType, field.nullable))
})
val struct = creator.struct(children)
NamedStruct.of(names, struct)
```

Suggested change

```scala
import scala.jdk.CollectionConverters._

val creator = Type.withNullability(true)
val names = schema.fields.map(_.name).toList.asJava
val struct = creator.struct(
  schema.fields.map(field => convert(field.dataType, field.nullable)).toList.asJava
)
NamedStruct.of(names, struct)
```

```scala
// TODO: now we lose the nested StructType's field names, do we need them?
// val names = new java.util.ArrayList[String]
val children = new java.util.ArrayList[Type]
fields.foreach(field => {
```

map
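
Presumably this points at replacing the `foreach` over a mutable `ArrayList` with a `map`; a sketch of what that could look like, reusing this PR's `convert` helper and assuming scala-collection-compat (or Scala 2.13) for the converters import:

```scala
import scala.jdk.CollectionConverters._

// Build the children types in one map pass instead of appending
// to a mutable java.util.ArrayList.
val children: java.util.List[Type] =
  fields.map(field => convert(field.dataType, field.nullable)).toList.asJava
```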

@andrew-coleman
Contributor

> Would you like to do a migration from Gluten to substrait-java?

@baibaichen I'd be willing to migrate your substrait-spark code into this repo, assuming that your team and the substrait-java team (@vbarua) are happy for me to do so.

Before it was removed (apache/incubator-gluten#4609) from the Gluten repo, we had already started using it as part of a project here (IBM Research), so we are interested in keeping it alive and open source.

Let me know, then I will prepare a separate PR. Thanks!

implementation("org.apache.spark:spark-sql_2.12:3.3.0")
testImplementation("org.apache.spark:spark-hive_2.12:3.3.0")

// testImplementation("org.apache.spark:spark-sql_2.12:3.3.0:tests")
Member

Remove?

@nielspardon

I guess this PR has been superseded by #271, which was merged recently, so we could close this one.

@EpsilonPrime
Member

Obsoleted by #271
