[TASK][CHALLENGE] Support Spark Connect Frontend/Backend #5383

ulysses-you · 2023-10-09T02:27:04Z

Code of Conduct

I agree to follow this project's Code of Conduct

Search before creating

I have searched in the task list and found no similar tasks.

Mentor

I have sufficient knowledge and experience of this task, and I volunteer to be the mentor of this task to guide contributors to complete the task.

Skill requirements

Knowledge about Spark Connect
Knowledge about Kyuubi architecture
Knowledge about protobuf
Knowledge about grpc
Knowledge about thrift

Background and Goals

Make the Kyuubi server compatible with the Spark Connect protocol, so that people can use a Spark Connect client to connect to the Kyuubi server.

Implementation steps

1. Add a new Spark Connect frontend
1.1 Add a basic gRPC server as the frontend
1.2 Be compatible with the Spark Connect protocol, see https://github.com/apache/spark/blob/master/connector/connect/common/src/main/protobuf/spark/connect/base.proto
1.3 Support ExecutePlan
1.4 Support AnalyzePlan
1.5 Support Config
1.6 Support AddArtifacts
1.7 Support ArtifactStatus
1.8 Support Interrupt
1.9 Support ReattachExecute
1.10 Support ReleaseExecute
1.11 Serialize the protobuf-based request
2. Add a new Spark Connect backend
2.1 Import spark-connect-server and rewrite SparkConnectService https://github.com/apache/spark/blob/master/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectServer.scala
2.2 Deserialize the response into its protobuf-based form
3. Add integration tests (IT)
4. Add docs
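
As a rough illustration of steps 1.3 to 1.10, the sketch below models a thin frontend that routes each Spark Connect RPC to a handler and reports which RPCs are still unimplemented. It uses plain Python with no gRPC dependency. The RPC names follow base.proto as I understand it; `FrontendRouter`, `UnimplementedRpc`, and the handler signature are hypothetical names introduced only for this example, not Kyuubi or Spark APIs.

```python
from typing import Callable, Dict, List

# RPC names taken from the implementation steps above (as defined in base.proto).
SPARK_CONNECT_RPCS = (
    "ExecutePlan",
    "AnalyzePlan",
    "Config",
    "AddArtifacts",
    "ArtifactStatus",
    "Interrupt",
    "ReattachExecute",
    "ReleaseExecute",
)

# Hypothetical handler shape: serialized request bytes in, serialized response bytes out.
Handler = Callable[[bytes], bytes]


class UnimplementedRpc(Exception):
    """Raised when a known RPC has no handler registered yet."""


class FrontendRouter:
    """Hypothetical thin frontend: maps an RPC name to its handler."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, rpc: str, handler: Handler) -> None:
        # Reject names that are not part of the protocol surface listed above.
        if rpc not in SPARK_CONNECT_RPCS:
            raise ValueError(f"unknown Spark Connect RPC: {rpc}")
        self._handlers[rpc] = handler

    def dispatch(self, rpc: str, payload: bytes) -> bytes:
        # Look up the handler and forward the raw payload to it.
        handler = self._handlers.get(rpc)
        if handler is None:
            raise UnimplementedRpc(rpc)
        return handler(payload)

    def missing(self) -> List[str]:
        # RPCs from the checklist that still need a handler, in checklist order.
        return [rpc for rpc in SPARK_CONNECT_RPCS if rpc not in self._handlers]
```

A checklist like `missing()` could help track progress across sub-issues, since each RPC can be implemented and registered independently.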

Additional context

Introduction of #6232

yehere · 2023-10-10T11:53:16Z

I think this is very challenging, but I want to give it a try. Can you assign it to me and help me?

ulysses-you · 2023-10-10T12:03:02Z

Sure, thank you @yehere! This is a kind of umbrella issue; we can create sub-issues one by one later.

pan3793 · 2023-10-10T12:04:12Z

This huge task could be divided into several tasks of different levels. Feel free to go ahead ~ all your contributions will be counted eventually :)

cfmcgrady · 2023-10-11T02:25:18Z

cc @cfmcgrady

zhaomin1423 · 2023-10-12T01:43:04Z

Sure, thank you @yehere! This is a kind of umbrella issue; we can create sub-issues one by one later.

I'm also interested in it, hope to work together.

ulysses-you · 2023-10-12T10:11:06Z

Thank you @zhaomin1423, glad to see you are interested.

minyk · 2023-10-19T02:35:06Z

How about a co-located mode with Kyuubi's Spark SQL engine? A separate service is a good baseline, but it also needs more resources for the extra Spark instances.

ulysses-you · 2023-10-19T02:43:32Z

@minyk they are in different processes, just like the Spark Thrift Server and the Connect server. We are going to add a new module and a new server for Kyuubi Connect. We can do it together if you are interested.

davidyuan1223 · 2024-04-04T11:41:01Z

@ulysses-you Hello, I'm interested in this component and hope to work with you.

davidyuan1223 · 2024-04-16T14:40:38Z

@yaooqinn @pan3793 @ulysses-you
I found that Spark has published the connect module to the Maven repository:
https://mvnrepository.com/artifact/org.apache.spark/spark-connect-client-jvm_2.13/3.4.0
Can we use those packages to simplify the code?
Maybe we could provide a connection string to SparkSession, as described in the code:
https://github.com/apache/spark/blob/master/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala
Based on the spark-connect packages, we could cut down on our own gRPC server and proto code.
What do you think?
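
To make the connection-string idea concrete, here is a small sketch of parsing a Spark Connect style URL into host, port, and parameters. It assumes the `sc://host:port/;key=value;key=value` shape and the default port 15002 from the Spark Connect client documentation as I understand it. `ConnectTarget` and `parse_connect_url` are hypothetical helpers for illustration only; this is not Spark's own parser, and it does not handle IPv6 hosts.

```python
from dataclasses import dataclass, field
from typing import Dict

# Assumption: default Spark Connect port when the URL omits one.
DEFAULT_PORT = 15002


@dataclass
class ConnectTarget:
    """Hypothetical holder for the parts of a connection string."""
    host: str
    port: int
    params: Dict[str, str] = field(default_factory=dict)


def parse_connect_url(url: str) -> ConnectTarget:
    """Parse 'sc://host[:port][/;key=value;...]' into a ConnectTarget."""
    prefix = "sc://"
    if not url.startswith(prefix):
        raise ValueError(f"expected an sc:// URL, got: {url}")

    # Split 'host:port' from everything after the first '/'.
    authority, _, tail = url[len(prefix):].partition("/")
    host, _, port_text = authority.partition(":")
    if not host:
        raise ValueError("missing host in connection string")
    # int() raises ValueError for a non-numeric port.
    port = int(port_text) if port_text else DEFAULT_PORT

    # Parameters are ';'-separated key=value pairs; empty segments are skipped.
    params: Dict[str, str] = {}
    for item in tail.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed parameter: {item}")
        params[key] = value
    return ConnectTarget(host=host, port=port, params=params)
```

For example, `parse_connect_url("sc://kyuubi.example.com:10999/;user_id=alice")` would yield host `kyuubi.example.com`, port `10999`, and one parameter; the host name and port here are made up.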

pan3793 · 2024-04-16T14:52:37Z

I haven't had a deep look at it; my current thought is:

For the server part, we only need a thin gRPC layer; copying the proto files and regenerating the gRPC files is fine.
For the engine part, we can reuse the connect-server module to simplify the code.

ulysses-you · 2024-04-17T04:18:46Z

@davidyuan1223 sure, please go ahead. +1 for @pan3793's thought.

davidyuan1223 · 2024-04-17T11:17:12Z

@ulysses-you @pan3793
Understood. I'd like to try this challenging issue. It could go on for a long time, as I need to go through the whole architecture of kyuubi-server and figure out the differences between it and spark-connect, and in the process I might have some discussions with you.
Could you assign this issue to me?

tgravescs · 2024-04-17T13:35:41Z

Just to clarify here, the intention is to support spark connect client as another connection type to the engine - so you could still use jdbc or notebook (via rest) to the same Spark engine and have all those clients to the same application?

davidyuan1223 · 2024-04-17T15:50:45Z

Just to clarify here, the intention is to support spark connect client as another connection type to the engine - so you could still use jdbc or notebook (via rest) to the same Spark engine and have all those clients to the same application?

Yes, my initial assumption is to create a 3.4-based SparkSession by providing a remote connection string as a configuration item, and then merge it with the Thrift service to provide the corresponding engine (so this configuration must enforce a check that the Spark version is > 3.4; spark-connect-client already implements SparkSession, which reduces our development effort). What do you think?

pan3793 · 2024-04-18T02:05:46Z

@tgravescs that's a good question, and we did have an offline discussion about it.

TL;DR, your assumption will be the ultimate version, but not at the beginning.

As you know the current main flow of Kyuubi is:

       ===[http]
client ===[thrift]====> Server ===[thrift]===> Engine
       ===[etc.]               ---[thrift]---> STS/HS2/Impala (we know someone implemented such a feature internally)

The engine itself is kind of a regular Spark app that basically only consumes Spark's public API, making it easily compatible with multiple Spark versions. As Connect is a new feature and connect-server is not supposed to be exposed to the user directly (I suppose only the gRPC API is public API in this case), pulling connect-server into the current Spark engine module directly would break the current assumption. So in the experimental phase we are going to create a dedicated engine module for the connect engine; we may call it SPARK_CONNECT (the current one is SPARK_SQL).

Another important case is Server ===[thrift]===> Engine. Currently, we use Thrift (more specifically, the HiveServer2 Thrift protocol) as the internal RPC protocol, but for Connect, obviously gRPC should be used. Keeping two internal RPC protocols is quite complex and redundant, so we tend to create a dedicated experimental server that keeps a similar architecture but rewrites the RPC implementation.

Once the PoC is completed, we can consider merging servers and engines to achieve the final vision as you said.

       ===[http]
       ===[grpc]
client ===[thrift]====> Server ===[grpc]===> Engine
       ===[etc.]               --[thrift]--> STS/HS2/Impala (we know someone implemented such a feature internally)

Maybe @yaooqinn can share more information

davidyuan1223 · 2024-04-18T17:12:40Z

@pan3793 @yaooqinn @ulysses-you @tgravescs
Hello, I have analyzed the processing flow of spark-connect, as shown in the following figure.

1. SparkSession.builder.remote(host:port).getOrCreate() creates a SparkConnectClient (RPC client).
2. spark.sql(xxx) actually builds an RPC request and then uses the RPC client to talk to the Spark-Connect-Server.
3. The Spark-Connect-Server receives the request, processes it with its local SparkSession, and finally returns the RPC response.
4. The client SparkSession receives the RPC response, resolves it, and returns the result.

As mentioned above, I believe that in Kyuubi's SparkConnect-based RPC request process we no longer need the involvement of SparkSession, so I have designed the following process:

1. We will implement a KyuubiSparkConnectClient (an RPC client based on SparkConnectClient). It will be created when we use EngineRef.getOrCreate to create a KyuubiSparkConnectEngine.
2. For example, when we use beeline to execute SQL, it will send a Thrift request to the KyuubiSparkConnectFrontendService.
3. The frontend service will not do anything special, just like other engines; it will then pass the request to KyuubiSparkConnectService(client: KyuubiSparkConnectClient).
4. The backend service, also like other engines, will use the corresponding operation to handle the request.
5. The operation will proceed as follows:
5.1 Process the Thrift request and transform it into an RPC request
5.2 Call the client method to process the request
5.3 Receive the RPC response from the Spark-Connect-Server
5.4 Transform the RPC response into a Thrift response

Based on the RPC client, we don't need to create a SparkSession.

What do you think?
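
The operation steps 5.1 to 5.4 above can be sketched roughly as follows. This is a toy model under stated assumptions, not the proposed implementation: all request and response classes are simplified stand-ins (the real ones would be HiveServer2 Thrift structs and Spark Connect protobuf messages), and `ExecuteStatementOperation`, `ConnectClient`, and `FakeConnectClient` are hypothetical names. The fake client only understands `select 1`.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple


@dataclass
class ThriftExecuteRequest:
    """Stand-in for a Thrift execute-statement request."""
    session_id: str
    statement: str


@dataclass
class ThriftExecuteResponse:
    """Stand-in for a Thrift result: rows of string cells."""
    rows: List[List[str]]


@dataclass
class ConnectExecuteRequest:
    """Stand-in for a Spark Connect execute-plan request."""
    session_id: str
    sql: str


@dataclass
class ConnectExecuteResponse:
    """Stand-in for the collected Spark Connect result rows."""
    rows: List[Tuple]


class ConnectClient(Protocol):
    """Hypothetical RPC client interface used by the operation."""
    def execute(self, request: ConnectExecuteRequest) -> ConnectExecuteResponse: ...


class ExecuteStatementOperation:
    def __init__(self, client: ConnectClient) -> None:
        self._client = client

    def run(self, request: ThriftExecuteRequest) -> ThriftExecuteResponse:
        # 5.1 Transform the Thrift request into an RPC request.
        rpc_request = ConnectExecuteRequest(request.session_id, request.statement)
        # 5.2 / 5.3 Call the client and receive the RPC response.
        rpc_response = self._client.execute(rpc_request)
        # 5.4 Transform the RPC response into a Thrift response.
        rows = [[str(value) for value in row] for row in rpc_response.rows]
        return ThriftExecuteResponse(rows)


class FakeConnectClient:
    """In-memory fake so the sketch runs without a Spark Connect server."""
    def execute(self, request: ConnectExecuteRequest) -> ConnectExecuteResponse:
        if request.sql.strip().lower() == "select 1":
            return ConnectExecuteResponse(rows=[(1,)])
        raise ValueError(f"fake client cannot run: {request.sql}")
```

Keeping the client behind a small interface like this would let the operation be tested without starting a Spark Connect server.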

tigrulya-exe · 2024-08-15T13:24:55Z

@pan3793 @yaooqinn Hi! Just to clarify - do I understand correctly, that for the first iteration, we need to somehow allow gRPC-based engines to coexist with Thrift ones (all current engines) in order to add the SPARK_CONNECT engine? That also means, that we will need to rewrite a significant part of the internal communication logic, including KyuubiSession, SessionManager, etc. or decouple it from Thrift.

Or is it expected that we start directly by rewriting the current internal RPC mechanism from Thrift (HS2) to gRPC and changing the internal API (kyuubi frontend server <--> engine), so that it will include logical methods from both the old API and the Spark Connect API?

currently, we use Thrift(more specifically, the HiveServer2 Thrift protocol) as the internal RPC protocol, but for connect, obviously gRPC should be used

pan3793 · 2024-08-15T13:51:35Z

for the first iteration, we need to somehow allow gRPC-based engines to coexist with Thrift ones (all current engines) in order to add the SPARK_CONNECT engine? That also means, that we will need to rewrite a significant part of the internal communication logic, including KyuubiSession, SessionManager, etc. or decouple it from Thrift.

@tigrulya-exe Exactly! I'm doing some experiments in this way, and it does involve lots of refactoring work to support both Thrift and gRPC and reuse code as much as possible. I can not promise an ETA since I'm not sure how much time I can spend on this task in the next few months. But I will open a draft PR once I make the pipeline work (for example, successfully executing select 1 using a spark-connect client), meanwhile, I will separate the refactoring changes and push them to the master branch gradually.
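
One way to picture "support both Thrift and gRPC and reuse code as much as possible" is to keep the session logic independent of the wire protocol and push protocol details behind a small interface. The sketch below is only an illustration of that decoupling idea under my own assumptions; `EngineTransport`, `Session`, and the fake transports are hypothetical names and do not reflect the actual Kyuubi classes or the PoC design.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class EngineTransport(ABC):
    """Hypothetical protocol-agnostic interface between server and engine."""

    @abstractmethod
    def open_session(self, user: str) -> str:
        """Open an engine session and return its handle."""

    @abstractmethod
    def execute(self, handle: str, statement: str) -> List[Tuple[str, str]]:
        """Run a statement in the given session and return result rows."""


class InMemoryTransport(EngineTransport):
    """Shared fake; a real subclass would wrap a Thrift or gRPC stub."""
    protocol = "memory"

    def __init__(self) -> None:
        self._sessions: Dict[str, str] = {}

    def open_session(self, user: str) -> str:
        handle = f"{self.protocol}-{len(self._sessions) + 1}"
        self._sessions[handle] = user
        return handle

    def execute(self, handle: str, statement: str) -> List[Tuple[str, str]]:
        if handle not in self._sessions:
            raise KeyError(handle)
        # Echo the protocol and statement so callers can see which path ran.
        return [(self.protocol, statement)]


class FakeThriftTransport(InMemoryTransport):
    protocol = "thrift"


class FakeGrpcTransport(InMemoryTransport):
    protocol = "grpc"


class Session:
    """Session logic written once, reused for either transport."""

    def __init__(self, user: str, transport: EngineTransport) -> None:
        self._transport = transport
        self.handle = transport.open_session(user)

    def run(self, statement: str) -> List[Tuple[str, str]]:
        return self._transport.execute(self.handle, statement)
```

With this shape, existing Thrift-based engines and a gRPC-based SPARK_CONNECT engine could share the same session bookkeeping while differing only in the transport they are given.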

tigrulya-exe · 2024-08-15T14:10:23Z

@pan3793 great! I would like to participate in the development process, if it's possible :) Do you already have a list of kyuubi parts/classes/modules to refactor, so we can break this big task down into smaller parts to be able to work simultaneously?

Btw, I also noticed that there is #6412 PR, related to this issue. @davidyuan1223 Hi! Is it still active?

pan3793 · 2024-08-15T14:16:50Z

@tigrulya-exe I will share with you more details in the next one or two weeks.

davidyuan1223 · 2024-08-16T01:48:37Z

@pan3793 great! I would like to participate in the development process, if it's possible :) Do you already have a list of kyuubi parts/classes/modules to refactor, so we can break this big task down into smaller parts to be able to work simultaneously?

Btw, I also noticed that there is #6412 PR, related to this issue. @davidyuan1223 Hi! Is it still active?

Yeah, it's active; you can see PR #6412. We first need to verify the feasibility of this solution, but the latest spark-connect version, 3.5.1, has some issues, so I was waiting for the new 3.5.2 release (it has now been released). I will verify spark-connect 3.5.2 this week.

pan3793 · 2024-08-23T16:43:26Z

A quick and dirty version of Kyuubi Connect is available at #6642

tigrulya-exe · 2024-09-02T13:44:51Z

@pan3793 Hi! I checked your PoC and built it locally. I tried to run some queries using pyspark and they finished successfully, nice work! Now, I suggest creating a list of tasks that are required to complete this solution. These tasks include supporting all gRPC Spark Connect API methods and refactoring the current code to seamlessly integrate the PoC. This will allow us to work simultaneously and add functionality to the master branch more quickly.

Could you please share any changes that break the current thrift-based logic and any things that need to be refactored that you noticed during the implementation of this solution, so we can use this information as a starting point?

ulysses-you added the hacktoberfest label Oct 9, 2023

ulysses-you added this to 2023 Kyuubi Code Contribution Program Oct 9, 2023

yehere mentioned this issue Oct 26, 2023

[Umbrella] [KYUUBI#5383]Take Apart The Task For Support Spark Connect Frontend/Backend #5541

Closed

22 tasks

pan3793 removed this from 2023 Kyuubi Code Contribution Program Apr 3, 2024

cfmcgrady assigned davidyuan1223 Apr 17, 2024

jaykchen mentioned this issue Apr 18, 2024

note budget allocated jaykchen/stt#1

Open

davidyuan1223 mentioned this issue Apr 19, 2024

[Umbrella] Support Spark Connect Frontend/Backend #6324

Open

18 tasks
