[FEA] Support Multiple Spark versions in the same jar #355
Comments
Some specific things that have changed in Spark 3.1 that we need to handle:
I have a prototype working and this is what the API and high-level design look like.

Directory structure for modules: we keep the existing modules and just add version-specific ones plus dist. The new modules need a dependency on the sql-plugin module, and each one builds against the appropriate version of Spark: the spark30 pom file uses spark.version=3.0.0 and spark31 uses spark.version=3.1.0-SNAPSHOT. All of those modules create their own jar, and then in our dist/ module we package them all together into a rapids-4-spark_2.12-{version}.jar. This does require that each module use its own package name, for instance com.nvidia.spark.rapids.shims.spark30 vs com.nvidia.spark.rapids.shims.spark31, or else make sure class names are different: GpuShuffledHashJoinExec30 vs GpuShuffledHashJoinExec31. For distribution-specific Spark builds (Databricks, GCP, AWS) we will have to build directly against their jars, so those submodules would be built against those jars and then pulled in as dependencies. The combined jar can then be run through integration tests on each of the versions and distributions. The module layout is sketched below; after that comes the high-level design of the ShimLoader.
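A rough picture of that layout (only the modules mentioned above are shown; other existing modules are unchanged, and the tree is illustrative rather than the exact prototype):

```
sql-plugin/   # existing common code; the shim modules depend on it
spark30/      # built with spark.version=3.0.0, package com.nvidia.spark.rapids.shims.spark30
spark31/      # built with spark.version=3.1.0-SNAPSHOT, package com.nvidia.spark.rapids.shims.spark31
dist/         # packages all of the version-specific jars into rapids-4-spark_2.12-{version}.jar
```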
The base SparkShims trait which each version would implement would be something like this:
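A minimal sketch of such a trait, assuming ExecRule and ExprRule are the plugin's existing rule wrappers; the exact method set here is illustrative rather than the prototype's real API:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.SparkPlan

// ExecRule/ExprRule are the rule wrapper types already defined in sql-plugin.
trait SparkShims {
  // The Spark version this shim targets, e.g. "3.0.0" or "3.1.0".
  def getSparkShimVersion: String

  // Execs that only exist in (or differ for) this Spark version.
  def getExecs: Map[Class[_ <: SparkPlan], ExecRule[_ <: SparkPlan]]

  // Expressions that only exist in (or differ for) this Spark version.
  def getExprs: Map[Class[_ <: Expression], ExprRule[_ <: Expression]]
}
```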
An example implementation for Spark 3.0 would look like this:
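A correspondingly hedged sketch of a 3.0 implementation; the class name Spark30Shims is an assumption, and the rule maps are left empty for brevity:

```scala
package com.nvidia.spark.rapids.shims.spark30

import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.SparkPlan

// Illustrative shim built against spark.version=3.0.0.
// SparkShims, ExecRule and ExprRule come from the sql-plugin module.
class Spark30Shims extends SparkShims {
  override def getSparkShimVersion: String = "3.0.0"

  override def getExecs: Map[Class[_ <: SparkPlan], ExecRule[_ <: SparkPlan]] = Map(
    // e.g. a rule that plans GpuShuffledHashJoinExec30 for this version's join exec
  )

  override def getExprs: Map[Class[_ <: Expression], ExprRule[_ <: Expression]] = Map(
    // e.g. TimeSub, which exists in Spark 3.0 but was replaced by TimeAdd in 3.1
  )
}
```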
Examples of how this would be called from the sql-plugin code:
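For instance, the execs map in GpuOverrides could be assembled like this; commonExecs is a placeholder for the rules shared across versions:

```scala
// Common rules plus whatever the shim for the running Spark version contributes.
val execs: Map[Class[_ <: SparkPlan], ExecRule[_ <: SparkPlan]] =
  commonExecs ++ ShimLoader.getSparkShims.getExecs
```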
Here you can see the ++ ShimLoader.getSparkShims.getExecs call, which adds in any execs specific to the version of Spark you are running against.
Would it make sense to organize the various shims into their own submodule?
The problem with putting them into another submodule is that you run into cyclic dependency issues. I started out having them in their own submodule and it caused a lot of issues, unless you had specific ideas.
Now, I had things mostly working by pulling GpuOverrides and a few other classes out and moving them into the spark-shim layer itself, but there were still a few issues to resolve, and changing to this approach made that problem go away entirely.
There already is a circular dependency in this design, and it only works because it uses reflection to avoid the explicit dependency. I think we could solve the issue a bit more cleanly. For example, sql-plugin would provide a probe or detector interface, something like this:

```scala
trait ShimProbe {
  def buildSparkShim(sparkVersion: String): Option[SparkShim]
}
```

sql-plugin would then use …
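A minimal sketch of how sql-plugin might consume such probes, assuming a java.util.ServiceLoader-style discovery; ShimDiscovery, the META-INF/services registration, and the error handling are illustrative, not part of the proposal:

```scala
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Each shim module would register its ShimProbe implementation under
// META-INF/services; sql-plugin asks every probe and keeps the first
// SparkShim built for the running Spark version.
object ShimDiscovery {
  def findSparkShim(sparkVersion: String): SparkShim =
    ServiceLoader.load(classOf[ShimProbe]).asScala
      .flatMap(_.buildSparkShim(sparkVersion))
      .headOption
      .getOrElse(throw new IllegalStateException(
        s"No SparkShim found for Spark version $sparkVersion"))
}
```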
It would be nice if we could have a single spark-rapids jar that supports multiple versions of Spark (Spark 3.0, Spark 3.1, etc.), including different distributions based on the same Spark version (GCP, Databricks, AWS, etc.).
Types of incompatibilities it would need to handle:
Proposal to handle them:
Create a ShimLoader class that looks at the Spark version and then loads the proper rapids subclass. This can take a couple of forms: a SparkShim layer that is a wrapper around simple function calls, and, for classes that share a base class, a loader that picks the version-specific subclass. For instance, GpuBroadcastNestedLoopJoinExec could have a GpuBroadcastNestedLoopJoinExecBase class and then version-specific GpuBroadcastNestedLoopJoinExec30 and GpuBroadcastNestedLoopJoinExec31 subclasses, as sketched below.
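A minimal sketch of that pattern; the shim class names and the version check are simplified, and in practice the loader would instantiate the shim reflectively so sql-plugin never depends on the version-specific modules at compile time (the cyclic-dependency point discussed in the comments above):

```scala
import org.apache.spark.SPARK_VERSION

// Shared GPU logic lives in a base class...
abstract class GpuBroadcastNestedLoopJoinExecBase {
  // ...common implementation...
}

// ...and each Spark version provides a thin subclass in its own module.
class GpuBroadcastNestedLoopJoinExec30 extends GpuBroadcastNestedLoopJoinExecBase
class GpuBroadcastNestedLoopJoinExec31 extends GpuBroadcastNestedLoopJoinExecBase

// Simplified loader: pick the shim matching the running Spark version and
// load it reflectively so there is no compile-time dependency on it.
object ShimLoader {
  lazy val getSparkShims: SparkShims = SPARK_VERSION match {
    case v if v.startsWith("3.0.") =>
      newShimInstance("com.nvidia.spark.rapids.shims.spark30.Spark30Shims")
    case v if v.startsWith("3.1.") =>
      newShimInstance("com.nvidia.spark.rapids.shims.spark31.Spark31Shims")
    case v =>
      throw new IllegalArgumentException(s"Unsupported Spark version: $v")
  }

  private def newShimInstance(className: String): SparkShims =
    Class.forName(className).getDeclaredConstructor().newInstance().asInstanceOf[SparkShims]
}
```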
Each Spark version then gets its own submodule where all the classes specific to Spark 3.0, Spark 3.1, Spark 3.0 Databricks, etc. live. The loader looks at the version and then loads the corresponding class.
Some Spark distributions make things a bit harder in that we can't compile against them at the same time as Spark 3.0 and Spark 3.1. For those, the jars will have to be built separately and ideally pulled in as dependencies so that we can still package up and distribute a single jar. The Apache Spark 3.0 and Spark 3.1 jars can be built in a single build.
To handle expressions and execs being added and removed, we can have GpuOverrides call down into the ShimLoader to get any version-specific ones. For instance, TimeSub in Spark 3.1 was changed to TimeAdd; the GpuOverrides definition of exprs would leave it out and instead call into the shim layer to get the version-specific entries (++ ShimLoader.getSparkShims.getExprs), as sketched below.
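As an illustration of that split, a 3.1 shim might register TimeAdd while the 3.0 shim registers TimeSub; the class name Spark31Shims and the gpuTimeAddRule value are placeholders rather than the plugin's real API:

```scala
package com.nvidia.spark.rapids.shims.spark31

import org.apache.spark.sql.catalyst.expressions.{Expression, TimeAdd}
import org.apache.spark.sql.execution.SparkPlan

// Illustrative shim compiled against Spark 3.1, where TimeAdd replaced TimeSub,
// so the common GpuOverrides code never references a class that is missing at
// runtime. SparkShims, ExecRule and ExprRule come from sql-plugin.
class Spark31Shims extends SparkShims {
  override def getSparkShimVersion: String = "3.1.0"

  override def getExecs: Map[Class[_ <: SparkPlan], ExecRule[_ <: SparkPlan]] = Map()

  override def getExprs: Map[Class[_ <: Expression], ExprRule[_ <: Expression]] = Map(
    classOf[TimeAdd] -> gpuTimeAddRule // gpuTimeAddRule: this version's ExprRule for TimeAdd (placeholder)
  )
}
```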