[NSE-206]Update documents and License for 1.1.0 (#292)
* [NSE-206]Update documents and remove duplicate parts

* Modify documents by comments
Hong authored and zhixingheyi-tian committed Apr 30, 2021
1 parent 1126320 commit e2eb35d
Showing 37 changed files with 13,008 additions and 1,140 deletions.
285 changes: 284 additions & 1 deletion arrow-data-source/CHANGELOG.md → CHANGELOG.md

Large diffs are not rendered by default.

1,957 changes: 1,957 additions & 0 deletions LICENSE

Large diffs are not rendered by default.

8 changes: 6 additions & 2 deletions README.md
@@ -1,3 +1,7 @@
##### \* LEGAL NOTICE: Your use of this software and any required dependent software (the "Software Package") is subject to the terms and conditions of the software license agreements for the Software Package, which may also include notices, disclaimers, or license terms for third party or open source software included in or with the Software Package, and your use indicates your acceptance of all such terms. Please refer to the "TPP.txt" or other similarly-named text file included with the Software Package for additional details.

##### \* Optimized Analytics Package for Spark* Platform is under Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0).

# Spark Native SQL Engine

A Native Engine for Spark SQL with vectorized SIMD optimizations
@@ -10,7 +14,7 @@ You can find all the Native SQL Engine documents on the [project web page](h

![Overview](./docs/image/nativesql_arch.png)

-Spark SQL works very well with structured row-based data. It used WholeStageCodeGen to improve the performance by Java JIT code. However Java JIT is usually not working very well on utilizing latest SIMD instructions, especially under complicated queries. [Apache Arrow](https://arrow.apache.org/) provided CPU-cache friendly columnar in-memory layout, its SIMD optimized kernels and LLVM based SQL engine Gandiva are also very efficient. Native SQL Engine used these technoligies and brought better performance to Spark SQL.
+Spark SQL works very well with structured row-based data. It used WholeStageCodeGen to improve the performance by Java JIT code. However Java JIT is usually not working very well on utilizing latest SIMD instructions, especially under complicated queries. [Apache Arrow](https://arrow.apache.org/) provided CPU-cache friendly columnar in-memory layout, its SIMD optimized kernels and LLVM based SQL engine Gandiva are also very efficient. Native SQL Engine used these technologies and brought better performance to Spark SQL.

## Key Features

@@ -58,7 +62,7 @@ Please notice the files are fat jars shipped with our custom Arrow library and p
### Building by Conda

If you already have a working Hadoop Spark Cluster, we provide a Conda package which will automatically install dependencies needed by OAP, you can refer to [OAP-Installation-Guide](./docs/OAP-Installation-Guide.md) for more information. Once finished [OAP-Installation-Guide](./docs/OAP-Installation-Guide.md), you can find built `spark-columnar-core-<version>-jar-with-dependencies.jar` under `$HOME/miniconda2/envs/oapenv/oap_jars`.
-Then you can just skip below steps and jump to Getting Started [Get Started](#get-started).
+Then you can just skip below steps and jump to [Get Started](#get-started).
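
As a minimal, hedged sketch (the jar path follows the Conda layout described above, and the settings are standard Spark classpath options rather than anything this commit prescribes), the built jar can be put on Spark's classpath roughly like this:

```
# Illustrative only: add the OAP-built jar to the driver and executor classpath.
# The jar location is assumed from the Conda install described above.
OAP_JAR=$HOME/miniconda2/envs/oapenv/oap_jars/spark-columnar-core-<version>-jar-with-dependencies.jar

${SPARK_HOME}/bin/spark-shell \
  --conf spark.driver.extraClassPath=${OAP_JAR} \
  --conf spark.executor.extraClassPath=${OAP_JAR}
```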

### Building by yourself

10,639 changes: 10,639 additions & 0 deletions TPP.txt

Large diffs are not rendered by default.

201 changes: 0 additions & 201 deletions arrow-data-source/LICENSE.txt

This file was deleted.

18 changes: 7 additions & 11 deletions arrow-data-source/README.md
@@ -6,36 +6,32 @@ A Spark DataSource implementation for reading files into Arrow compatible column

The development of this library is still in progress. As a result, some functionality may not yet be stable enough for production use, since testing coverage has so far been limited.

## Online Documentation

You can find all the Native SQL Engine documents on the [project web page](https://oap-project.github.io/arrow-data-source/).

## Build

### Prerequisite

There are some requirements before you build the project.
Please make sure the following software is already installed on your system; a quick version check is sketched after the list.

-1. gcc 9.3 or higher version
+1. GCC 7.0 or higher version
2. java8 OpenJDK -> yum install java-1.8.0-openjdk
-3. cmake 3.2 or higher version
-4. maven 3.1.1 or higher version
+3. cmake 3.16 or higher version
+4. maven 3.6 or higher version
5. Hadoop 2.7.5 or higher version
6. Spark 3.0.0 or higher version
7. Intel Optimized Arrow 3.0.0
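
The commands below are a hedged convenience for sanity-checking the toolchain versions listed above, using only standard version flags; they are not part of the project's own instructions, and the Intel Optimized Arrow build is only verified later when it is installed.

```
# Illustrative check of the prerequisites listed above
gcc --version
java -version
cmake --version
mvn -version
hadoop version
${SPARK_HOME}/bin/spark-submit --version
```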

### Building by Conda

-If you already have a working Hadoop Spark Cluster, we provide a Conda package which will automatically install dependencies needed by OAP, you can refer to [OAP-Installation-Guide](./docs/OAP-Installation-Guide.md) for more information. Once finished [OAP-Installation-Guide](./docs/OAP-Installation-Guide.md), you can find built `spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar` under `$HOME/miniconda2/envs/oapenv/oap_jars`.
+If you already have a working Hadoop Spark Cluster, we provide a Conda package which will automatically install dependencies needed by OAP, you can refer to [OAP-Installation-Guide](../docs/OAP-Installation-Guide.md) for more information. Once finished [OAP-Installation-Guide](../docs/OAP-Installation-Guide.md), you can find built `spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar` under `$HOME/miniconda2/envs/oapenv/oap_jars`.
Then you can just skip steps below and jump to [Get Started](#get-started).

### cmake installation

If you run into trouble installing cmake, please follow the steps below.

```
-// installing cmake 3.2
+// installing cmake 3.16.1
sudo yum install cmake3
// If you have an existing cmake, you can use below command to set it as an option within alternatives command
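// Illustrative continuation (an assumption, since the rest of this snippet is
// collapsed in this excerpt): register cmake3 as "cmake" via the alternatives tool.
sudo alternatives --install /usr/local/bin/cmake cmake /usr/bin/cmake3 20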
@@ -121,7 +117,7 @@ You have to use a customized Arrow to support our datasets Java API.

```
// build arrow-cpp
-git clone -b <version> https://github.com/Intel-bigdata/arrow.git
+git clone -b arrow-3.0.0-oap-1.1 https://github.com/oap-project/arrow.git
cd arrow/cpp
mkdir build
cd build
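// Illustrative continuation with generic flags only (an assumption; the
// project's required Arrow build options are not shown in this excerpt):
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)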
@@ -213,7 +209,7 @@ spark.sql("SELECT * FROM my_temp_view LIMIT 10").show(10)

To validate if ArrowDataSource works, you can go to the DAG to check if ArrowScan has been used from the above example query.

-![Image of ArrowDataSource Validation](./docs/image/arrowdatasource_validation.png)
+![Image of ArrowDataSource Validation](../docs/image/arrowdatasource_validation.png)
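
As a hedged sketch (the jar name matches the Conda output mentioned earlier in this README; the path itself is an assumption), the example can be run with the data source on the classpath and then checked in the Spark UI:

```
# Illustrative only: launch spark-shell with the Arrow data source jar, run the
# example query above, then open the Spark UI "SQL" tab and look for ArrowScan.
${SPARK_HOME}/bin/spark-shell \
  --jars /path/to/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar
```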


## Work together with ParquetDataSource (experimental)
70 changes: 0 additions & 70 deletions arrow-data-source/docs/ApacheArrowInstallation.md

This file was deleted.

