This repository has been archived by the owner on Sep 18, 2023. It is now read-only.

[NSE-206]Update documents and License for 1.1.0 #292

Merged 2 commits on Apr 30, 2021

285 changes: 284 additions & 1 deletion arrow-data-source/CHANGELOG.md → CHANGELOG.md

Large diffs are not rendered by default.

1,957 changes: 1,957 additions & 0 deletions LICENSE

Large diffs are not rendered by default.

8 changes: 6 additions & 2 deletions README.md
@@ -1,3 +1,7 @@
##### \* LEGAL NOTICE: Your use of this software and any required dependent software (the "Software Package") is subject to the terms and conditions of the software license agreements for the Software Package, which may also include notices, disclaimers, or license terms for third party or open source software included in or with the Software Package, and your use indicates your acceptance of all such terms. Please refer to the "TPP.txt" or other similarly-named text file included with the Software Package for additional details.

##### \* Optimized Analytics Package for Spark* Platform is under Apache 2.0 (https://www.apache.org/licenses/LICENSE-2.0).

# Spark Native SQL Engine

A Native Engine for Spark SQL with vectorized SIMD optimizations
@@ -10,7 +14,7 @@ You can find the all the Native SQL Engine documents on the [project web page](h

![Overview](./docs/image/nativesql_arch.png)

Spark SQL works very well with structured row-based data. It used WholeStageCodeGen to improve the performance by Java JIT code. However Java JIT is usually not working very well on utilizing latest SIMD instructions, especially under complicated queries. [Apache Arrow](https://arrow.apache.org/) provided CPU-cache friendly columnar in-memory layout, its SIMD optimized kernels and LLVM based SQL engine Gandiva are also very efficient. Native SQL Engine used these technoligies and brought better performance to Spark SQL.
Spark SQL works very well with structured row-based data. It used WholeStageCodeGen to improve the performance by Java JIT code. However Java JIT is usually not working very well on utilizing latest SIMD instructions, especially under complicated queries. [Apache Arrow](https://arrow.apache.org/) provided CPU-cache friendly columnar in-memory layout, its SIMD optimized kernels and LLVM based SQL engine Gandiva are also very efficient. Native SQL Engine used these technologies and brought better performance to Spark SQL.

## Key Features

@@ -58,7 +62,7 @@ Please notice the files are fat jars shipped with our custom Arrow library and p
### Building by Conda

If you already have a working Hadoop Spark Cluster, we provide a Conda package which will automatically install dependencies needed by OAP, you can refer to [OAP-Installation-Guide](./docs/OAP-Installation-Guide.md) for more information. Once finished [OAP-Installation-Guide](./docs/OAP-Installation-Guide.md), you can find built `spark-columnar-core-<version>-jar-with-dependencies.jar` under `$HOME/miniconda2/envs/oapenv/oap_jars`.
Then you can just skip below steps and jump to Getting Started [Get Started](#get-started).
Then you can just skip below steps and jump to [Get Started](#get-started).

### Building by yourself

10,639 changes: 10,639 additions & 0 deletions TPP.txt

Large diffs are not rendered by default.

201 changes: 0 additions & 201 deletions arrow-data-source/LICENSE.txt

This file was deleted.

18 changes: 7 additions & 11 deletions arrow-data-source/README.md
@@ -6,36 +6,32 @@ A Spark DataSource implementation for reading files into Arrow compatible column

The development of this library is still in progress. As a result some of the functionality may not be constantly stable for being used in production environments that have not been fully considered due to the limited testing capabilities so far.

## Online Documentation

You can find the all the Native SQL Engine documents on the [project web page](https://oap-project.github.io/arrow-data-source/).

## Build

### Prerequisite

There are some requirements before you build the project.
Please make sure you have already installed the software in your system.

1. gcc 9.3 or higher version
1. GCC 7.0 or higher version
2. java8 OpenJDK -> yum install java-1.8.0-openjdk
3. cmake 3.2 or higher version
4. maven 3.1.1 or higher version
3. cmake 3.16 or higher version
4. maven 3.6 or higher version
5. Hadoop 2.7.5 or higher version
6. Spark 3.0.0 or higher version
7. Intel Optimized Arrow 3.0.0
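
The minimum versions in the updated list can be checked before starting a build. The following is a minimal sketch and not part of the project: `version_ge` and `check_min` are hypothetical helper names, GNU `sort -V` is assumed to be available, and only gcc, cmake, and maven are covered, using the thresholds from the revised lines (7.0, 3.16, 3.6).

```shell
#!/usr/bin/env bash
# Sketch only: compares installed tool versions against the listed minimums.
# version_ge and check_min are hypothetical helpers, not part of the project.

# Succeeds when version $1 is greater than or equal to version $2.
# Assumes GNU sort with the -V (version sort) option.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Usage: check_min <label> <found-version> <minimum-version>
check_min() {
  if [ -z "$2" ]; then
    echo "$1: not found (need >= $3)"
  elif version_ge "$2" "$3"; then
    echo "$1: $2 OK (>= $3)"
  else
    echo "$1: $2 is too old (need >= $3)"
  fi
}

# An empty version string means the tool is missing or printed nothing.
check_min gcc   "$(gcc -dumpfullversion -dumpversion 2>/dev/null)" 7.0
check_min cmake "$(cmake --version 2>/dev/null | awk 'NR==1 {print $3}')" 3.16
check_min maven "$(mvn -v 2>/dev/null | awk 'NR==1 {print $3}')" 3.6
```

The script only reports; it does not install or change anything. Java, Hadoop, Spark, and Arrow are left out because their version output varies more between distributions.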

### Building by Conda

If you already have a working Hadoop Spark Cluster, we provide a Conda package which will automatically install dependencies needed by OAP, you can refer to [OAP-Installation-Guide](./docs/OAP-Installation-Guide.md) for more information. Once finished [OAP-Installation-Guide](./docs/OAP-Installation-Guide.md), you can find built `spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar` under `$HOME/miniconda2/envs/oapenv/oap_jars`.
If you already have a working Hadoop Spark Cluster, we provide a Conda package which will automatically install dependencies needed by OAP, you can refer to [OAP-Installation-Guide](../docs/OAP-Installation-Guide.md) for more information. Once finished [OAP-Installation-Guide](../docs/OAP-Installation-Guide.md), you can find built `spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar` under `$HOME/miniconda2/envs/oapenv/oap_jars`.
Then you can just skip steps below and jump to [Get Started](#get-started).

### cmake installation

If you are facing some trouble when installing cmake, please follow below steps to install cmake.

```
// installing cmake 3.2
// installing cmake 3.16.1
sudo yum install cmake3

// If you have an existing cmake, you can use below command to set it as an option within alternatives command
@@ -121,7 +117,7 @@ You have to use a customized Arrow to support for our datasets Java API.

```
// build arrow-cpp
git clone -b <version> https://github.com/Intel-bigdata/arrow.git
git clone -b arrow-3.0.0-oap-1.1 https://github.com/oap-project/arrow.git
cd arrow/cpp
mkdir build
cd build
@@ -213,7 +209,7 @@ spark.sql("SELECT * FROM my_temp_view LIMIT 10").show(10)

To validate if ArrowDataSource works, you can go to the DAG to check if ArrowScan has been used from the above example query.

![Image of ArrowDataSource Validation](./docs/image/arrowdatasource_validation.png)
![Image of ArrowDataSource Validation](../docs/image/arrowdatasource_validation.png)


## Work together with ParquetDataSource (experimental)
70 changes: 0 additions & 70 deletions arrow-data-source/docs/ApacheArrowInstallation.md

This file was deleted.
