From 9378142faab3bde2f07e893f8007087c532280f1 Mon Sep 17 00:00:00 2001
From: PHILO-HE
Date: Sat, 2 Apr 2022 09:06:59 +0800
Subject: [PATCH] [NSE-761] Update document to reflect spark 3.2.x support (#817)

* Update document to reflect spark 3.2.x support
* Correct a statement
---
 docs/Installation.md      | 11 ++++++++---
 docs/Prerequisite.md      |  2 +-
 docs/SparkInstallation.md | 10 ++++++----
 docs/User-Guide.md        | 12 ++++++++++--
 docs/limitations.md       |  2 +-
 docs/memory.md            |  2 +-
 docs/performance.md       |  2 +-
 7 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/docs/Installation.md b/docs/Installation.md
index e34bcdf87..0315e29b0 100644
--- a/docs/Installation.md
+++ b/docs/Installation.md
@@ -12,9 +12,14 @@ yum install gmock
 ## Build Gazelle Plugin
 
 ``` shell
-git clone -b ${version} https://github.com/oap-project/native-sql-engine.git
-cd oap-native-sql
-mvn clean package -DskipTests -Dcpp_tests=OFF -Dbuild_arrow=ON -Dcheckstyle.skip
+git clone -b ${version} https://github.com/oap-project/gazelle_plugin.git
+cd gazelle_plugin
+mvn clean package -Pspark-3.1 -DskipTests -Dcpp_tests=OFF -Dbuild_arrow=ON -Dcheckstyle.skip
+```
+Please note two Spark profiles (`spark-3.1`, `spark-3.2`) are provided to build packages with different versions of Spark dependencies.
+Currently, a few unit tests are not compatible with spark 3.2. So if profile `spark-3.2` is used, `-Dmaven.test.skip` should be added to skip compiling unit tests.
+```
+mvn clean package -Pspark-3.2 -Dmaven.test.skip -Dcpp_tests=OFF -Dbuild_arrow=ON -Dcheckstyle.skip
 ```
 
 Based on the different environment, there are some parameters can be set via -D with mvn.
diff --git a/docs/Prerequisite.md b/docs/Prerequisite.md
index 3c29c492f..fa4df5579 100644
--- a/docs/Prerequisite.md
+++ b/docs/Prerequisite.md
@@ -9,7 +9,7 @@ Please make sure you have already installed the software in your system.
 4. cmake 3.16 or higher version
 5. Maven 3.6.3 or higher version
 6. Hadoop 2.7.5 or higher version
-7. Spark 3.1.1 or higher version
+7. Spark 3.1.x or Spark 3.2.x
 8. Intel Optimized Arrow 4.0.0
 
 ## gcc installation
diff --git a/docs/SparkInstallation.md b/docs/SparkInstallation.md
index 5cf87d169..9018e8ff2 100644
--- a/docs/SparkInstallation.md
+++ b/docs/SparkInstallation.md
@@ -1,6 +1,6 @@
-### Download Spark 3.1.1
+### Download Spark binary
 
-Currently Gazelle Plugin works on the Spark 3.1.1 version.
+Currently Gazelle Plugin can work on Spark 3.1.x & 3.2.x. Take Spark 3.1.1 as example.
 
 ```
 wget http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
@@ -9,7 +9,9 @@ sudo cd /opt/spark && sudo tar -xf spark-3.1.1-bin-hadoop3.2.tgz
 export SPARK_HOME=/opt/spark/spark-3.1.1-bin-hadoop3.2/
 ```
 
-### [Or building Spark from source](https://spark.apache.org/docs/latest/building-spark.html)
+### Build Spark from source
+
+Ref. [link](https://spark.apache.org/docs/latest/building-spark.html).
 
 ``` shell
 git clone https://github.com/intel-bigdata/spark.git
@@ -27,7 +29,7 @@ Specify SPARK_HOME to spark path
 export SPARK_HOME=${HADOOP_PATH}
 ```
 
-### Hadoop building from source
+### Build Hadoop from source
 
 ``` shell
 git clone https://github.com/apache/hadoop.git
diff --git a/docs/User-Guide.md b/docs/User-Guide.md
index f99afd05e..b9564d8b7 100644
--- a/docs/User-Guide.md
+++ b/docs/User-Guide.md
@@ -52,8 +52,16 @@ There are three ways to use OAP: Gazelle Plugin,
 
 Please go to [OAP's Maven Central Repository](https://repo1.maven.org/maven2/com/intel/oap/) to find Gazelle Plugin jars.
 For usage, you will require below two jar files:
-1. spark-arrow-datasource-standard--jar-with-dependencies.jar is located in com/intel/oap/spark-arrow-datasource-standard//
-2. spark-columnar-core--jar-with-dependencies.jar is located in com/intel/oap/spark-columnar-core//
+1. `spark-arrow-datasource-standard--jar-with-dependencies.jar` is located in com/intel/oap/spark-arrow-datasource-standard//
+2. `spark-columnar-core--jar-with-dependencies.jar` is located in com/intel/oap/spark-columnar-core//
+
+Since 1.3.1 release, there are two extra jars to work with different Spark minor releases. For spark 3.1.x, the jar whose `` is `spark311` should be used.
+And for spark 3.2.x, the jar whose `` is `spark321` should be used.
+
+3. `spark-sql-columnar-shims-common--SNAPSHOT.jar`
+
+4. `spark-sql-columnar-shims---SNAPSHOT.jar`
+
 Please notice the files are fat jars shipped with our custom Arrow library and pre-compiled from our server(using GCC 9.3.0 and LLVM 7.0.1), which means you will require to pre-install GCC 9.3.0 and LLVM 7.0.1 in your system for normal usage.
 
 ### Building by Conda
diff --git a/docs/limitations.md b/docs/limitations.md
index e6efbd4d9..2c4326cfa 100644
--- a/docs/limitations.md
+++ b/docs/limitations.md
@@ -1,7 +1,7 @@
 # Limitations for Gazelle Plugin
 
 ## Spark compability
-Gazelle Plugin currenlty works with Spark 3.0.0 only. There are still some trouble with latest Shuffle/AQE API from Spark 3.0.1, 3.0.2 or 3.1.x.
+Currently, Gazelle Plugin is workable with Spark 3.1.x & 3.2.x.
 
 ## Operator limitations
 All performance critical operators in TPC-H/TPC-DS should be supported. For those unsupported operators, Gazelle Plugin will automatically fallback to row operators in vanilla Spark.
diff --git a/docs/memory.md b/docs/memory.md
index 66005dc2a..1ecc13e16 100644
--- a/docs/memory.md
+++ b/docs/memory.md
@@ -1,4 +1,4 @@
-# Memory allocation in Gazelle Plugin
+# Memory Allocation in Gazelle Plugin
 
 ## Java memory allocation
 By default, Arrow columnar vector Java API is using netty [pooledbytebuffer allocator](https://github.com/apache/arrow/blob/master/java/memory/memory-netty/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java), which will try to hold on the "free memory" by not returning back to System immediately for better performance. This will result big memory footprint on operators relying on this API, e.g., [CoalesceBatches](https://github.com/oap-project/gazelle_plugin/blob/master/native-sql-engine/core/src/main/scala/com/intel/oap/execution/CoalesceBatchesExec.scala). We changed to use unsafe API since 1.2 release, which means the freed memory will be returned to system directly. Performance tests showed the performance of this change is negatable.
diff --git a/docs/performance.md b/docs/performance.md
index 899b4f1db..84717f06f 100644
--- a/docs/performance.md
+++ b/docs/performance.md
@@ -1,4 +1,4 @@
-# Performance tuning for Gazelle Plugin
+# Performance Tuning for Gazelle Plugin
 
 It is complicated to tune for Spark workloads as each varies a lot. Here are several general tuning options on the most popular TPCH/TPC-DS benchmarking.