Skip to content
This repository has been archived by the owner on Sep 18, 2023. It is now read-only.

Commit

Permalink
[NSE-283] Pick S3/CSV supports to OAP 1.1 (#284)
Browse files Browse the repository at this point in the history
* [NSE-237] Add ARROW_CSV=ON to default C++ build commands (#238)

* [NSE-261] ArrowDataSource: Add S3 Support (#270)

Closes #261

* [NSE-276] Add option to switch Hadoop version

* [NSE-119] clean up on comments (#288)

Signed-off-by: Yuan Zhou <[email protected]>

* [NSE-206]Update installation guide and configuration guide. (#289)

* [NSE-206]Update installation guide and configuration guide.

* Fix numaBinding setting issue. & Update description for protobuf

* [NSE-206]Fix Prerequisite and Arrow Installation Steps. (#290)

Co-authored-by: Yuan <[email protected]>
Co-authored-by: Wei-Ting Chen <[email protected]>
  • Loading branch information
3 people authored Apr 27, 2021
1 parent 5b2bf6c commit 230df38
Show file tree
Hide file tree
Showing 17 changed files with 209 additions and 72 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tpch.yml
Original file line number Diff line number Diff line change
Expand Up @@ -44,7 +44,7 @@ jobs:
git clone https://github.com/oap-project/arrow.git
cd arrow && git checkout arrow-3.0.0-oap-1.1 && cd cpp
mkdir build && cd build
cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON -DARROW_JEMALLOC=OFF && make -j2
cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON -DARROW_JEMALLOC=OFF && make -j2
sudo make install
cd ../../java
mvn clean install -B -Dorg.slf4j.simpleLogger.log.org.apache.maven.cli.transfer.Slf4jMavenTransferListener=warn -P arrow-jni -am -Darrow.cpp.build.dir=/tmp/arrow/cpp/build/release/ -DskipTests -Dcheckstyle.skip
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/unittests.yml
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,7 @@ jobs:
git clone https://github.com/oap-project/arrow.git
cd arrow && git checkout arrow-3.0.0-oap-1.1 && cd cpp
mkdir build && cd build
cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON -DGTEST_ROOT=/usr/src/gtest && make -j2
cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON -DGTEST_ROOT=/usr/src/gtest && make -j2
sudo make install
- name: Run unit tests
run: |
Expand Down
27 changes: 21 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,20 @@ We implemented columnar shuffle to improve the shuffle performance. With the col

Please check the operator supporting details [here](./docs/operators.md)

## Build the Plugin
## How to use OAP: Native SQL Engine

There are three ways to use OAP: Native SQL Engine,
1. Use precompiled jars
2. Building by Conda Environment
3. Building by Yourself

### Use precompiled jars

Please go to [OAP's Maven Central Repository](https://repo1.maven.org/maven2/com/intel/oap/) to find Native SQL Engine jars.
For usage, you will require below two jar files:
1. spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar is located in com/intel/oap/spark-arrow-datasource-standard/<version>/
2. spark-columnar-core-<version>-jar-with-dependencies.jar is located in com/intel/oap/spark-columnar-core/<version>/
Please notice the files are fat jars shipped with our custom Arrow library and pre-compiled from our server(using GCC 9.3.0 and LLVM 7.0.1), which means you will require to pre-install GCC 9.3.0 and LLVM 7.0.1 in your system for normal usage.

### Building by Conda

Expand All @@ -51,18 +64,18 @@ Then you can just skip below steps and jump to Getting Started [Get Started](#ge

If you prefer to build from the source code on your hand, please follow below steps to set up your environment.

### Prerequisite
#### Prerequisite

There are some requirements before you build the project.
Please check the document [Prerequisite](./docs/Prerequisite.md) and make sure you have already installed the software in your system.
If you are running a SPARK Cluster, please make sure all the software are installed in every single node.

### Installation
Please check the document [Installation Guide](./docs/Installation.md)
#### Installation

### Configuration & Testing
Please check the document [Configuration Guide](./docs/Configuration.md)
Please check the document [Installation Guide](./docs/Installation.md)

## Get started

To enable OAP NativeSQL Engine, the previous built jar `spark-columnar-core-<version>-jar-with-dependencies.jar` should be added to Spark configuration. We also recommend to use `spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar`. We will demonstrate an example by using both jar files.
SPARK related options are:

Expand All @@ -75,6 +88,8 @@ SPARK related options are:
For Spark Standalone Mode, please set the above value as relative path to the jar file.
For Spark Yarn Cluster Mode, please set the above value as absolute path to the jar file.

More Configuration, please check the document [Configuration Guide](./docs/Configuration.md)

Example to run Spark Shell with ArrowDataSource jar file
```
${SPARK_HOME}/bin/spark-shell \
Expand Down
2 changes: 1 addition & 1 deletion arrow-data-source/.travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ jobs:
- cd arrow && git checkout oap-master && cd cpp
- sed -i "s/\${Python3_EXECUTABLE}/\/opt\/pyenv\/shims\/python3/g" CMakeLists.txt
- mkdir build && cd build
- cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON && make
- cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON && make
- sudo make install
- cd ../../java
- mvn clean install -q -P arrow-jni -am -Darrow.cpp.build.dir=/tmp/arrow/cpp/build/release/ -DskipTests -Dcheckstyle.skip
Expand Down
2 changes: 1 addition & 1 deletion arrow-data-source/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -125,7 +125,7 @@ git clone -b <version> https://github.com/Intel-bigdata/arrow.git
cd arrow/cpp
mkdir build
cd build
cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON ..
cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON ..
make
// build and install arrow jvm library
Expand Down
2 changes: 1 addition & 1 deletion arrow-data-source/docs/ApacheArrowInstallation.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@ git clone https://github.com/Intel-bigdata/arrow.git
cd arrow && git checkout branch-0.17.0-oap-1.0
mkdir -p arrow/cpp/release-build
cd arrow/cpp/release-build
cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON ..
cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON ..
make -j
make install

Expand Down
68 changes: 59 additions & 9 deletions arrow-data-source/pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
<groupId>com.intel.oap</groupId>
<artifactId>native-sql-engine-parent</artifactId>
<version>1.1.0</version>
</parent>
</parent>

<modelVersion>4.0.0</modelVersion>
<groupId>com.intel.oap</groupId>
Expand All @@ -18,12 +18,6 @@
<module>parquet</module>
</modules>
<properties>
<scala.version>2.12.10</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<spark.version>3.0.0</spark.version>
<arrow.version>3.0.0</arrow.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<script.dir>${arrow.script.dir}</script.dir>
<datasource.cpp_tests>${cpp_tests}</datasource.cpp_tests>
<datasource.build_arrow>${build_arrow}</datasource.build_arrow>
Expand All @@ -48,6 +42,50 @@
</pluginRepositories>

<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aws</artifactId>
<version>${hadoop.version}</version>
<exclusions>
<exclusion>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
</exclusion>
<exclusion>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-annotations</artifactId>
</exclusion>
<exclusion>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
</exclusion>
<exclusion>
<groupId>javax.servlet</groupId>
<artifactId>servlet-api</artifactId>
</exclusion>
<exclusion>
<groupId>com.sun.jersey</groupId>
<artifactId>jersey-core</artifactId>
</exclusion>
<exclusion>
<groupId>com.sun.jersey</groupId>
<artifactId>jersey-json</artifactId>
</exclusion>
<exclusion>
<groupId>com.sun.jersey</groupId>
<artifactId>jersey-server</artifactId>
</exclusion>
<exclusion>
<groupId>commons-httpclient</groupId>
<artifactId>commons-httpcore</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpcore</artifactId>
<version>4.2</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
Expand All @@ -61,7 +99,7 @@
<exclusions>
<exclusion>
<groupId>org.apache.arrow</groupId>
<artifactId>arrow-format</artifactId>
<artifactId>arrow-vector</artifactId>
</exclusion>
</exclusions>
<scope>provided</scope>
Expand All @@ -83,13 +121,25 @@
<groupId>org.apache.spark</groupId>
<artifactId>spark-catalyst_2.12</artifactId>
<version>${spark.version}</version>
<exclusions>
<exclusion>
<groupId>org.apache.arrow</groupId>
<artifactId>arrow-vector</artifactId>
</exclusion>
</exclusions>
<type>test-jar</type>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>${spark.version}</version>
<exclusions>
<exclusion>
<groupId>org.apache.arrow</groupId>
<artifactId>arrow-vector</artifactId>
</exclusion>
</exclusions>
<type>test-jar</type>
<scope>test</scope>
</dependency>
Expand Down Expand Up @@ -118,7 +168,7 @@
<configuration>
<executable>bash</executable>
<arguments>
<argument>${script.dir}/build_arrow.sh</argument>
<argument>${script.dir}/build_arrow.sh</argument>
<argument>--tests=${datasource.cpp_tests}</argument>
<argument>--build_arrow=${datasource.build_arrow}</argument>
<argument>--static_arrow=${datasource.static_arrow}</argument>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -156,6 +156,11 @@ object ArrowUtils {

private def rewriteUri(uriStr: String): String = {
val uri = URI.create(uriStr)
if (uri.getScheme == "s3" || uri.getScheme == "s3a") {
val s3Rewritten = new URI("s3", uri.getAuthority,
uri.getPath, uri.getQuery, uri.getFragment).toString
return s3Rewritten
}
val sch = uri.getScheme match {
case "hdfs" => "hdfs"
case "file" => "file"
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -106,10 +106,18 @@ class ArrowDataSourceTest extends QueryTest with SharedSparkSession {
verifyParquet(
spark.read
.option(ArrowOptions.KEY_ORIGINAL_FORMAT, "parquet")
.option(ArrowOptions.KEY_FILESYSTEM, "hdfs")
.arrow(path))
}

test("simple sql query on s3") {
val path = "s3a://mlp-spark-dataset-bucket/test_arrowds_s3_small"
val frame = spark.read
.option(ArrowOptions.KEY_ORIGINAL_FORMAT, "parquet")
.arrow(path)
frame.createOrReplaceTempView("stab")
assert(spark.sql("select id from stab").count() === 1000)
}

test("create catalog table") {
val path = ArrowDataSourceTest.locateResourcePath(parquetFile1)
spark.catalog.createTable("ptab", path, "arrow")
Expand All @@ -130,7 +138,6 @@ class ArrowDataSourceTest extends QueryTest with SharedSparkSession {
val path = ArrowDataSourceTest.locateResourcePath(parquetFile1)
val frame = spark.read
.option(ArrowOptions.KEY_ORIGINAL_FORMAT, "parquet")
.option(ArrowOptions.KEY_FILESYSTEM, "hdfs")
.arrow(path)
frame.createOrReplaceTempView("ptab")
verifyParquet(spark.sql("select * from ptab"))
Expand All @@ -142,7 +149,6 @@ class ArrowDataSourceTest extends QueryTest with SharedSparkSession {
val path = ArrowDataSourceTest.locateResourcePath(parquetFile3)
val frame = spark.read
.option(ArrowOptions.KEY_ORIGINAL_FORMAT, "parquet")
.option(ArrowOptions.KEY_FILESYSTEM, "hdfs")
.arrow(path)
frame.createOrReplaceTempView("ptab")
val sqlFrame = spark.sql("select * from ptab")
Expand All @@ -163,7 +169,6 @@ class ArrowDataSourceTest extends QueryTest with SharedSparkSession {
val path = ArrowDataSourceTest.locateResourcePath(parquetFile1)
val frame = spark.read
.option(ArrowOptions.KEY_ORIGINAL_FORMAT, "parquet")
.option(ArrowOptions.KEY_FILESYSTEM, "hdfs")
.arrow(path)
frame.createOrReplaceTempView("ptab")
spark.sql("select col from ptab where col = 1").explain(true)
Expand All @@ -178,7 +183,6 @@ class ArrowDataSourceTest extends QueryTest with SharedSparkSession {
val path = ArrowDataSourceTest.locateResourcePath(parquetFile2)
val frame = spark.read
.option(ArrowOptions.KEY_ORIGINAL_FORMAT, "parquet")
.option(ArrowOptions.KEY_FILESYSTEM, "hdfs")
.arrow(path)
frame.createOrReplaceTempView("ptab")
val rows = spark.sql("select * from ptab where col = 'b'").collect()
Expand Down Expand Up @@ -215,7 +219,6 @@ class ArrowDataSourceTest extends QueryTest with SharedSparkSession {
val path = ArrowDataSourceTest.locateResourcePath(parquetFile1)
val frame = spark.read
.option(ArrowOptions.KEY_ORIGINAL_FORMAT, "parquet")
.option(ArrowOptions.KEY_FILESYSTEM, "hdfs")
.arrow(path)
frame.createOrReplaceTempView("ptab")

Expand Down
26 changes: 5 additions & 21 deletions docs/ApacheArrowInstallation.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,25 +24,16 @@ make install
```

# cmake:
Arrow will download package during compiling, in order to support SSL in cmake, build cmake is optional.
``` shell
wget https://github.com/Kitware/CMake/releases/download/v3.15.0-rc4/cmake-3.15.0-rc4.tar.gz
tar xf cmake-3.15.0-rc4.tar.gz
cd cmake-3.15.0-rc4/
./bootstrap --system-curl --parallel=64 #parallel num depends on your server core number
make -j
make install
cmake --version
cmake version 3.15.0-rc4
```
Please make sure your cmake version is qualified based on the prerequisite.


# Apache Arrow
``` shell
git clone https://github.com/Intel-bigdata/arrow.git
cd arrow && git checkout branch-0.17.0-oap-1.0
cd arrow && git checkout <version>
mkdir -p arrow/cpp/release-build
cd arrow/cpp/release-build
cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON ..
cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON ..
make -j
make install

Expand All @@ -60,11 +51,4 @@ mvn test -pl adapter/parquet -P arrow-jni
mvn test -pl gandiva -P arrow-jni
```

# Copy binary files to oap-native-sql resources directory
Because oap-native-sql plugin will build a stand-alone jar file with arrow dependency, if you choose to build Arrow by yourself, you have to copy below files as a replacement from the original one.
You can find those files in Apache Arrow installation directory or release directory. Below example assume Apache Arrow has been installed on /usr/local/lib64
``` shell
cp /usr/local/lib64/libarrow.so.17 $native-sql-engine-dir/cpp/src/resources
cp /usr/local/lib64/libgandiva.so.17 $native-sql-engine-dir/cpp/src/resources
cp /usr/local/lib64/libparquet.so.17 $native-sql-engine-dir/cpp/src/resources
```
After arrow installed in the specific directory, please make sure to set up -Dbuild_arrow=OFF -Darrow_root=/path/to/arrow when building Native SQL Engine.
Loading

0 comments on commit 230df38

Please sign in to comment.