[NSE-761] Update document to reflect spark 3.2.x support (#817)
* Update document to reflect spark 3.2.x support

* Correct a statement
PHILO-HE authored Apr 2, 2022
1 parent f40940c commit 34bb293
Showing 7 changed files with 18 additions and 11 deletions.
8 changes: 6 additions & 2 deletions docs/Installation.md
@@ -14,9 +14,13 @@
yum install gmock
``` shell
git clone -b ${version} https://github.com/oap-project/gazelle_plugin.git
cd gazelle_plugin
-mvn clean package -PSpark-3.2 -DskipTests -Dcpp_tests=OFF -Dbuild_arrow=ON -Dcheckstyle.skip
+mvn clean package -Pspark-3.1 -DskipTests -Dcpp_tests=OFF -Dbuild_arrow=ON -Dcheckstyle.skip
```
+Please note two Spark profiles (`spark-3.1`, `spark-3.2`) are provided to build packages with different versions of Spark dependencies.
+Currently, a few unit tests are not compatible with Spark 3.2. So if profile `spark-3.2` is used, `-Dmaven.test.skip` should be added to skip compiling unit tests.
+```
+mvn clean package -Pspark-3.2 -Dmaven.test.skip -Dcpp_tests=OFF -Dbuild_arrow=ON -Dcheckstyle.skip
+```
-Two Spark profiles(spark-3.1, spark-3.2) were provided to build packages for different Spark.

Depending on the environment, some parameters can be set via -D with mvn.

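For instance, a build that reuses a pre-installed Arrow might look like this (a hedged sketch: `-Dbuild_arrow=OFF` and `-Darrow_root` are assumed parameter spellings based on the project's full Installation guide, and the path is a placeholder):

``` shell
# Hypothetical: skip the bundled Arrow build and point at an existing installation
mvn clean package -Pspark-3.2 -Dmaven.test.skip -Dcpp_tests=OFF \
    -Dbuild_arrow=OFF -Darrow_root=/usr/local -Dcheckstyle.skip
```
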
2 changes: 1 addition & 1 deletion docs/Prerequisite.md
@@ -9,7 +9,7 @@
Please make sure you have already installed the software in your system.
4. cmake 3.16 or higher version
5. Maven 3.6.3 or higher version
6. Hadoop 2.7.5 or higher version
-7. Spark 3.1.1 or higher version
+7. Spark 3.1.x or Spark 3.2.x
8. Intel Optimized Arrow 4.0.0
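
A quick way to sanity-check these prerequisites from a shell (a minimal sketch; it assumes the tools are on PATH and `SPARK_HOME` is set):

``` shell
gcc --version | head -n1                  # compiler available?
cmake --version | head -n1                # expect 3.16 or higher
mvn -v | head -n1                         # expect 3.6.3 or higher
hadoop version | head -n1                 # expect 2.7.5 or higher
$SPARK_HOME/bin/spark-submit --version 2>&1 | head -n5   # expect 3.1.x or 3.2.x
```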

## gcc installation
10 changes: 6 additions & 4 deletions docs/SparkInstallation.md
@@ -1,6 +1,6 @@
-### Download Spark 3.1.1
+### Download Spark binary

-Currently Gazelle Plugin works on the Spark 3.1.1 version.
+Currently Gazelle Plugin can work on Spark 3.1.x & 3.2.x. Take Spark 3.1.1 as an example.

```
wget http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
@@ -9,7 +9,9 @@
cd /opt/spark && sudo tar -xf spark-3.1.1-bin-hadoop3.2.tgz
export SPARK_HOME=/opt/spark/spark-3.1.1-bin-hadoop3.2/
```

-### [Or building Spark from source](https://spark.apache.org/docs/latest/building-spark.html)
+### Build Spark from source
+
+Ref. [link](https://spark.apache.org/docs/latest/building-spark.html).

``` shell
git clone https://github.com/intel-bigdata/spark.git
@@ -27,7 +29,7 @@
Specify SPARK_HOME to spark path
export SPARK_HOME=${HADOOP_PATH}
```

-### Hadoop building from source
+### Build Hadoop from source

``` shell
git clone https://github.com/apache/hadoop.git
```
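
For reference, the build itself typically continues like this (a hedged sketch based on Hadoop's upstream BUILDING.txt rather than this diff; it assumes protoc and cmake are installed):

``` shell
cd hadoop
# Build a binary distribution tarball with native libraries
mvn package -Pdist,native -DskipTests -Dtar
```
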
3 changes: 2 additions & 1 deletion docs/User-Guide.md
@@ -55,7 +55,8 @@
For usage, you will require the two jar files below:
1. `spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar` is located in com/intel/oap/spark-arrow-datasource-standard/<version>/
2. `spark-columnar-core-<version>-jar-with-dependencies.jar` is located in com/intel/oap/spark-columnar-core/<version>/

-Since 1.3.1 release, there are two extra jars to work with different Spark minor releases
+Since the 1.3.1 release, there are two extra jars to work with different Spark minor releases. For Spark 3.1.x, the jar whose `<spark-version>` is `spark311` should be used.
+And for Spark 3.2.x, the jar whose `<spark-version>` is `spark321` should be used.

3. `spark-sql-columnar-shims-common-<version>-SNAPSHOT.jar`

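To illustrate wiring these jars in (a hedged sketch, not part of this diff: the directory, version strings, and the `spark321` shim jar name are placeholders, and the `spark.sql.extensions` value follows the project's user guide; pick the `spark311` or `spark321` shim to match your Spark):

``` shell
JARS=$HOME/gazelle-jars   # hypothetical directory holding the built jars
CP=$JARS/spark-columnar-core-1.3.1-jar-with-dependencies.jar:$JARS/spark-arrow-datasource-standard-1.3.1-jar-with-dependencies.jar:$JARS/spark-sql-columnar-shims-common-1.3.1-SNAPSHOT.jar:$JARS/spark-sql-columnar-shims-spark321-1.3.1-SNAPSHOT.jar
spark-shell \
  --conf spark.driver.extraClassPath=$CP \
  --conf spark.executor.extraClassPath=$CP \
  --conf spark.sql.extensions=com.intel.oap.ColumnarPlugin
```
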
2 changes: 1 addition & 1 deletion docs/limitations.md
@@ -1,7 +1,7 @@
# Limitations for Gazelle Plugin

## Spark compatibility
-Gazelle Plugin currenlty works with Spark 3.0.0 only. There are still some trouble with latest Shuffle/AQE API from Spark 3.0.1, 3.0.2 or 3.1.x.
+Currently, Gazelle Plugin is workable with Spark 3.1.x & 3.2.x.

## Operator limitations
All performance critical operators in TPC-H/TPC-DS should be supported. For those unsupported operators, Gazelle Plugin will automatically fall back to row operators in vanilla Spark.
2 changes: 1 addition & 1 deletion docs/memory.md
@@ -1,4 +1,4 @@
-# Memory allocation in Gazelle Plugin
+# Memory Allocation in Gazelle Plugin

## Java memory allocation
By default, the Arrow columnar vector Java API uses the netty [pooled ByteBuf allocator](https://github.com/apache/arrow/blob/master/java/memory/memory-netty/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java), which tries to hold on to "free" memory rather than returning it to the system immediately, for better performance. This results in a big memory footprint for operators relying on this API, e.g., [CoalesceBatches](https://github.com/oap-project/gazelle_plugin/blob/master/native-sql-engine/core/src/main/scala/com/intel/oap/execution/CoalesceBatchesExec.scala). Since the 1.2 release we changed to the unsafe API, which means freed memory is returned to the system directly. Performance tests showed the impact of this change is negligible.
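
When experimenting with the allocation behavior described above, Arrow's Java allocation manager can typically be chosen with a JVM system property (a hedged sketch: `arrow.allocation.manager.type`, with values like `Netty` and `Unsafe`, is upstream Arrow's generic switch, not something this document states; verify it against your Arrow version):

``` shell
# Hypothetical: pin the allocation manager in both driver and executors
spark-shell \
  --conf "spark.driver.extraJavaOptions=-Darrow.allocation.manager.type=Unsafe" \
  --conf "spark.executor.extraJavaOptions=-Darrow.allocation.manager.type=Unsafe"
```
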
2 changes: 1 addition & 1 deletion docs/performance.md
@@ -1,4 +1,4 @@
-# Performance tuning for Gazelle Plugin
+# Performance Tuning for Gazelle Plugin

It is complicated to tune Spark workloads as each one varies a lot. Here are several general tuning options for the popular TPC-H/TPC-DS benchmarks.

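A common starting point is to reserve enough off-heap memory for the native engine (an illustrative sketch, not taken from this diff: the size is a placeholder to tune per workload, and the class names follow the project's user guide, so verify them against your release):

``` shell
spark-shell \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=20g \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --conf spark.sql.extensions=com.intel.oap.ColumnarPlugin
```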
