[Doc]update download docs for 2308 version[skip ci] #8948

Merged: 6 commits on Aug 10, 2023
8 changes: 4 additions & 4 deletions CONTRIBUTING.md
@@ -113,15 +113,15 @@ mvn -pl dist -PnoSnapshots package -DskipTests
Verify that shim-specific classes are hidden from a conventional classloader.

```bash
$ javap -cp dist/target/rapids-4-spark_2.12-23.06.0-SNAPSHOT-cuda11.jar com.nvidia.spark.rapids.shims.SparkShimImpl
$ javap -cp dist/target/rapids-4-spark_2.12-23.08.0-SNAPSHOT-cuda11.jar com.nvidia.spark.rapids.shims.SparkShimImpl
Error: class not found: com.nvidia.spark.rapids.shims.SparkShimImpl
```

However, its bytecode can be loaded if the class name is prefixed with `spark3XY`, a prefix that is not part of the package name:

```bash
$ javap -cp dist/target/rapids-4-spark_2.12-23.06.0-SNAPSHOT-cuda11.jar spark320.com.nvidia.spark.rapids.shims.SparkShimImpl | head -2
Warning: File dist/target/rapids-4-spark_2.12-23.06.0-SNAPSHOT-cuda11.jar(/spark320/com/nvidia/spark/rapids/shims/SparkShimImpl.class) does not contain class spark320.com.nvidia.spark.rapids.shims.SparkShimImpl
$ javap -cp dist/target/rapids-4-spark_2.12-23.08.0-SNAPSHOT-cuda11.jar spark320.com.nvidia.spark.rapids.shims.SparkShimImpl | head -2
Warning: File dist/target/rapids-4-spark_2.12-23.08.0-SNAPSHOT-cuda11.jar(/spark320/com/nvidia/spark/rapids/shims/SparkShimImpl.class) does not contain class spark320.com.nvidia.spark.rapids.shims.SparkShimImpl
Compiled from "SparkShims.scala"
public final class com.nvidia.spark.rapids.shims.SparkShimImpl {
```
@@ -163,7 +163,7 @@ mvn package -pl dist -am -Dbuildver=340 -DallowConventionalDistJar=true
Verify `com.nvidia.spark.rapids.shims.SparkShimImpl` is conventionally loadable:
```bash
$ javap -cp dist/target/rapids-4-spark_2.12-23.06.0-SNAPSHOT-cuda11.jar com.nvidia.spark.rapids.shims.SparkShimImpl | head -2
$ javap -cp dist/target/rapids-4-spark_2.12-23.08.0-SNAPSHOT-cuda11.jar com.nvidia.spark.rapids.shims.SparkShimImpl | head -2
Compiled from "SparkShims.scala"
public final class com.nvidia.spark.rapids.shims.SparkShimImpl {
```
22 changes: 11 additions & 11 deletions docs/FAQ.md
@@ -10,25 +10,25 @@ nav_order: 12

### What versions of Apache Spark does the RAPIDS Accelerator for Apache Spark support?

The RAPIDS Accelerator for Apache Spark requires version 3.1.1, 3.1.2, 3.1.3, 3.2.0, 3.2.1, 3.2.2, 3.2.3, 3.3.0, 3.3.1 or 3.3.2 of
Apache Spark. Because the plugin replaces parts of the physical plan that Apache Spark considers to
be internal the code for those plans can change even between bug fix releases. As a part of our
process, we try to stay on top of these changes and release updates as quickly as possible.
Please see the [Software Requirements](download.md#software-requirements) section for the complete list of
Apache Spark versions supported by the RAPIDS plugin. The plugin replaces parts of the physical plan that
Apache Spark considers internal, and the code for these plans can change even between bug fix releases.
As part of our process, we try to stay on top of these changes and release updates as quickly as possible.

### Which distributions are supported?

The RAPIDS Accelerator for Apache Spark officially supports:
- [Apache Spark](get-started/getting-started-on-prem.md)
- [AWS EMR 6.2+](get-started/getting-started-aws-emr.md)
- [Databricks Runtime 10.4, 11.3](get-started/getting-started-databricks.md)
- [Google Cloud Dataproc 2.0](get-started/getting-started-gcp.md)
- [Databricks Runtime](get-started/getting-started-databricks.md)
- [Google Cloud Dataproc](get-started/getting-started-gcp.md)
- [Azure Synapse](get-started/getting-started-azure-synapse-analytics.md)
- Cloudera provides the plugin packaged through
[CDS 3.2](https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/cds-3/topics/spark-spark-3-overview.html)
and [CDS 3.3](https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/cds-3/topics/spark-spark-3-overview.html).

Most distributions based on a supported Apache Spark version should work, but the plugin
replaces parts of the physical plan that Apache Spark considers to be internal the code for those
replaces parts of the physical plan that Apache Spark considers to be internal. The code for these
plans can change from one distribution to another. We are working with most cloud service providers
to set up testing and validation on their distributions.

@@ -39,10 +39,10 @@ release.

### What hardware is supported?

The plugin is tested and supported on P100, V100, T4, A2, A10, A30, A100 and L4 datacenter GPUs. It is possible
to run the plugin on GeForce desktop hardware with Volta or better architectures. GeForce hardware
does not support [CUDA forward
compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#forward-compatibility-title),
Please see the [Hardware Requirements](download.md#hardware-requirements) section for the list of GPUs on which
the RAPIDS plugin has been tested. It is possible to run the plugin on GeForce desktop hardware with the Volta
architecture or newer. GeForce hardware does not support
[CUDA forward compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#forward-compatibility-title),
and will need CUDA 11.5 installed. If not, the following error will be displayed:

```
2 changes: 1 addition & 1 deletion docs/additional-functionality/rapids-shuffle.md
@@ -60,7 +60,7 @@ pools is the number of cores in the system divided by the number of executors pe
---
**NOTE:**

As of the spark-rapids 23.06 release, UCX packages support CUDA 11.
As of the spark-rapids 23.08 release, UCX packages support CUDA 11.
UCX support for CUDA 12 in the RAPIDS Accelerator will be added in a future release.

---
64 changes: 64 additions & 0 deletions docs/archive.md
@@ -5,6 +5,70 @@
---
Below are archived releases for RAPIDS Accelerator for Apache Spark.

## Release v23.06.0
Hardware Requirements:

The plugin is tested on the following architectures:

GPU Models: NVIDIA P100, V100, T4 and A2/A10/A30/A100 GPUs

Software Requirements:

OS: Ubuntu 20.04, Ubuntu 22.04, CentOS 7, or Rocky Linux 8

CUDA & NVIDIA Drivers*: 11.x & v470+

Apache Spark 3.1.1, 3.1.2, 3.1.3, 3.2.0, 3.2.1, 3.2.2, 3.2.3, 3.2.4, 3.3.0, 3.3.1, 3.3.2, 3.4.0; Databricks 10.4 ML LTS or 11.3 ML LTS Runtime; and GCP Dataproc 2.0, Dataproc 2.1

Python 3.6+, Scala 2.12, Java 8, Java 17

*Some hardware may have a minimum driver version greater than v450.80.02+. Check the GPU spec sheet
for your hardware's minimum driver version.

*For Cloudera and EMR support, please refer to the
[Distributions](./FAQ.md#which-distributions-are-supported) section of the FAQ.

### Download v23.06.0
* Download the [RAPIDS
Accelerator for Apache Spark 23.06.0 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.06.0/rapids-4-spark_2.12-23.06.0.jar)

This package is built against CUDA 11.8; all CUDA 11.x and 12.x versions are supported through [CUDA forward
compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html). It is tested
on V100, T4, A2, A10, A30, A100, L4 and H100 GPUs with CUDA 11.8-12.0. For those using other types of GPUs
which do not have CUDA forward compatibility (for example, GeForce), CUDA 11.8 or later is required. Users will
need to ensure the minimum driver (450.80.02) and CUDA toolkit are installed on each Spark node.

### Verify signature
* Download the [RAPIDS Accelerator for Apache Spark 23.06.0 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.06.0/rapids-4-spark_2.12-23.06.0.jar)
and [RAPIDS Accelerator for Apache Spark 23.06.0 jars.asc](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.06.0/rapids-4-spark_2.12-23.06.0.jar.asc)
* Download the [PUB_KEY](https://keys.openpgp.org/[email protected]).
* Import the public key: `gpg --import PUB_KEY`
* Verify the signature: `gpg --verify rapids-4-spark_2.12-23.06.0.jar.asc rapids-4-spark_2.12-23.06.0.jar`

If the signature verifies, the output is:

gpg: Good signature from "NVIDIA Spark (For the signature of spark-rapids release jars) <[email protected]>"

### Release Notes
New functionality and performance improvements for this release include:
* Enhanced operator support with an OOM retry framework to minimize OOM or GPU specific config changes
* Spill framework to reduce OOM issues and minimize OOM or GPU specific config changes
* AQE for skewed broadcast hash join performance improvement
* Support JSON to struct
* Support StringTranslate
* Support window functions with string input in the order by clause
* Support regular expressions with line anchors in choice input
* Support rlike function with line anchor input
* Improve the performance of ORC small file reads
* Qualification and Profiling tool:
* Qualification tool support for Azure Databricks
* The Qualification and Profiling tools do not require a live cluster, and only require read permissions on clusters
* Improve Profiling tool recommendations to support more tuning options


For a detailed list of changes, please refer to the
[CHANGELOG](https://github.com/NVIDIA/spark-rapids/blob/main/CHANGELOG.md).

## Release v23.04.1
Hardware Requirements:

4 changes: 2 additions & 2 deletions docs/dev/testing.md
@@ -5,5 +5,5 @@
parent: Developer Overview
---
An overview of testing can be found within the repository at:
* [Unit tests](https://github.com/NVIDIA/spark-rapids/tree/branch-23.06/tests#readme)
* [Integration testing](https://github.com/NVIDIA/spark-rapids/tree/branch-23.06/integration_tests#readme)
* [Unit tests](https://github.com/NVIDIA/spark-rapids/tree/branch-23.08/tests#readme)
* [Integration testing](https://github.com/NVIDIA/spark-rapids/tree/branch-23.08/integration_tests#readme)
64 changes: 36 additions & 28 deletions docs/download.md
@@ -18,66 +18,74 @@ cuDF jar, that is either preinstalled in the Spark classpath on all nodes or sub
that uses the RAPIDS Accelerator For Apache Spark. See the [getting-started
guide](https://nvidia.github.io/spark-rapids/Getting-Started/) for more details.
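
For reference, a minimal sketch of the "submitted with each job" option follows; the jar path, master, and GPU resource amounts are illustrative placeholders rather than values taken from this PR:

```bash
# Minimal sketch: submit a job with the RAPIDS Accelerator jar on the classpath.
# The jar path, master, and GPU resource amounts below are illustrative placeholders.
spark-submit \
  --master yarn \
  --jars /opt/sparkRapidsPlugin/rapids-4-spark_2.12-23.08.0.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.25 \
  your_app.py
```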

## Release v23.06.0
Hardware Requirements:
## Release v23.08.0
### Hardware Requirements:

The plugin is tested on the following architectures:

GPU Models: NVIDIA P100, V100, T4 and A2/A10/A30/A100 GPUs
GPU Models: NVIDIA P100, V100, T4, A10/A100, L4 and H100 GPUs

Software Requirements:
### Software Requirements:

OS: Ubuntu 20.04, Ubuntu 22.04, CentOS 7, or Rocky Linux 8

CUDA & NVIDIA Drivers*: 11.x & v470+

Apache Spark 3.1.1, 3.1.2, 3.1.3, 3.2.0, 3.2.1, 3.2.2, 3.2.3, 3.2.4, 3.3.0, 3.3.1, 3.3.2, 3.4.0 Databricks 10.4 ML LTS or 11.3 ML LTS Runtime and GCP Dataproc 2.0, Dataproc 2.1
NVIDIA Driver*: 470+

Python 3.6+, Scala 2.12, Java 8, Java 17

Supported Spark versions:
Apache Spark 3.1.1, 3.1.2, 3.1.3
Apache Spark 3.2.0, 3.2.1, 3.2.2, 3.2.3, 3.2.4
Apache Spark 3.3.0, 3.3.1, 3.3.2
Apache Spark 3.4.0, 3.4.1

Supported Databricks runtime versions:
Azure/AWS:
Databricks 10.4 ML LTS (GPU, Scala 2.12, Spark 3.2.1)
Databricks 11.3 ML LTS (GPU, Scala 2.12, Spark 3.3.0)
Databricks 12.2 ML LTS (GPU, Scala 2.12, Spark 3.3.2)

Supported Dataproc versions:
GCP Dataproc 2.0
GCP Dataproc 2.1

*Some hardware may have a minimum driver version greater than v450.80.02+. Check the GPU spec sheet
for your hardware's minimum driver version.

*For Cloudera and EMR support, please refer to the
[Distributions](./FAQ.md#which-distributions-are-supported) section of the FAQ.

### Download v23.06.0
### Download v23.08.0
* Download the [RAPIDS
Accelerator for Apache Spark 23.06.0 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.06.0/rapids-4-spark_2.12-23.06.0.jar)
Accelerator for Apache Spark 23.08.0 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.08.0/rapids-4-spark_2.12-23.08.0.jar)

This package is built against CUDA 11.8; all CUDA 11.x and 12.x versions are supported through [CUDA forward
compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html). It is tested
on V100, T4, A2, A10, A30, A100, L4 and H100 GPUs with CUDA 11.8-12.0. For those using other types of GPUs
which do not have CUDA forward compatibility (for example, GeForce), CUDA 11.8 or later is required. Users will
need to ensure the minimum driver (450.80.02) and CUDA toolkit are installed on each Spark node.
on V100, T4, A10, A100, L4 and H100 GPUs with CUDA 11.8-12.0. For those using other types of GPUs
which do not have CUDA forward compatibility (for example, GeForce), CUDA 11.8 or later is required.
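
As a quick sanity check, the driver version on each node can be confirmed before installing the jar; a minimal sketch, assuming a recent `nvidia-smi` that supports `--query-gpu`:

```bash
# Sketch: confirm the installed NVIDIA driver meets the documented minimum (470+).
# The query flags exist in recent nvidia-smi releases; output formatting may vary.
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
```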

### Verify signature
* Download the [RAPIDS Accelerator for Apache Spark 23.06.0 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.06.0/rapids-4-spark_2.12-23.06.0.jar)
and [RAPIDS Accelerator for Apache Spark 23.06.0 jars.asc](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.06.0/rapids-4-spark_2.12-23.06.0.jar.asc)
* Download the [RAPIDS Accelerator for Apache Spark 23.08.0 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.08.0/rapids-4-spark_2.12-23.08.0.jar)
and [RAPIDS Accelerator for Apache Spark 23.08.0 jars.asc](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.08.0/rapids-4-spark_2.12-23.08.0.jar.asc)
* Download the [PUB_KEY](https://keys.openpgp.org/[email protected]).
* Import the public key: `gpg --import PUB_KEY`
* Verify the signature: `gpg --verify rapids-4-spark_2.12-23.06.0.jar.asc rapids-4-spark_2.12-23.06.0.jar`
* Verify the signature: `gpg --verify rapids-4-spark_2.12-23.08.0.jar.asc rapids-4-spark_2.12-23.08.0.jar`

If the signature verifies, the output is:

gpg: Good signature from "NVIDIA Spark (For the signature of spark-rapids release jars) <[email protected]>"
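
The steps above can also be chained into a single script; a minimal sketch, with the public-key URL left as a placeholder for the PUB_KEY link listed above:

```bash
# Sketch: fetch the jar and its signature, import the signing key, then verify.
# The jar URL matches the one listed above; <PUB_KEY_URL> stands in for the PUB_KEY link above.
JAR_URL=https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.08.0/rapids-4-spark_2.12-23.08.0.jar
wget "$JAR_URL" "$JAR_URL.asc"
wget -O PUB_KEY "<PUB_KEY_URL>"
gpg --import PUB_KEY
gpg --verify rapids-4-spark_2.12-23.08.0.jar.asc rapids-4-spark_2.12-23.08.0.jar
```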

### Release Notes
New functionality and performance improvements for this release include:
* Enhanced operator support with an OOM retry framework to minimize OOM or GPU specific config changes
* Spill framework to reduce OOM issues to minimize OOM or GPU specific config changes
* AQE for skewed broadcast hash join performance improvement
* Support JSON to struct
* Support StringTranslate
* Support windows function with string input in order by clause
* Support regular expressions with line anchors in choice input
* Support rlike function with line anchor input
* Improve the performance of ORC small file reads
* Compatibility with Databricks AWS & Azure 12.2 ML LTS.
* Enhanced stability and support for ORC and Parquet.
* Reduction of out-of-memory (OOM) occurrences.
* Corner case evaluation for data formats, operators and expressions
* Qualification and Profiling tool:
* Qualification tool support for Azure Databricks
* The Qualification and Profiling tools do not require a live cluster, and only require read permissions on clusters
* Improve Profiling tool recommendations to support more tuning options

* Profiling tool now supports Azure Databricks and AWS Databricks.
* Qualification tool can provide advice on unaccelerated operations.
* Improve user experience through CLI design.
* Qualification tool provides configuration and migration recommendations for Dataproc and EMR.

For a detailed list of changes, please refer to the
[CHANGELOG](https://github.com/NVIDIA/spark-rapids/blob/main/CHANGELOG.md).
11 changes: 5 additions & 6 deletions docs/get-started/getting-started-databricks.md
@@ -10,10 +10,9 @@ This guide will run through how to set up the RAPIDS Accelerator for Apache Spar
At the end of this guide, the reader will be able to run a sample Apache Spark application that runs
on NVIDIA GPUs on Databricks.

## Prerequisites
* Apache Spark 3.x running in Databricks Runtime 10.4 ML or 11.3 ML with GPU
* AWS: 10.4 LTS ML (GPU, Scala 2.12, Spark 3.2.1) or 11.3 LTS ML (GPU, Scala 2.12, Spark 3.3.0)
* Azure: 10.4 LTS ML (GPU, Scala 2.12, Spark 3.2.1) or 11.3 LTS ML (GPU, Scala 2.12, Spark 3.3.0)
## Supported runtime versions
Please see the [Software Requirements](../download.md#software-requirements) section for the complete list of
Databricks runtime versions supported by the RAPIDS plugin.

Databricks may do [maintenance
releases](https://docs.databricks.com/release-notes/runtime/maintenance-updates.html) for their
@@ -67,7 +66,7 @@ Navigate to your home directory in the UI and select **Create** > **File** from
create an `init.sh` script with the following contents:
```bash
#!/bin/bash
sudo wget -O /databricks/jars/rapids-4-spark_2.12-23.06.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.06.0/rapids-4-spark_2.12-23.06.0.jar
sudo wget -O /databricks/jars/rapids-4-spark_2.12-23.08.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.08.0/rapids-4-spark_2.12-23.08.0.jar
```
Then create a Databricks cluster by going to "Compute", then clicking `+ Create compute`. Ensure the
cluster meets the prerequisites above by configuring it as follows:
@@ -116,7 +115,7 @@ cluster meets the prerequisites above by configuring it as follows:
```bash
spark.rapids.sql.python.gpu.enabled true
spark.python.daemon.module rapids.daemon_databricks
spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-23.06.0.jar:/databricks/spark/python
spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-23.08.0.jar:/databricks/spark/python
```
Note that the Python memory pool requires the cudf library, so you need to install cudf on
each worker node (`pip install cudf-cu11 --extra-index-url=https://pypi.nvidia.com`) or disable the Python memory pool
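
If the memory pool stays enabled, the cudf install can be folded into the same cluster init script as the plugin jar; a minimal sketch, where the pip path is an assumption about the Databricks runtime environment:

```bash
#!/bin/bash
# Sketch: one init script that stages the plugin jar and installs cudf on every worker node.
# The jar URL matches the init.sh above; /databricks/python/bin/pip is assumed to be the cluster Python.
sudo wget -O /databricks/jars/rapids-4-spark_2.12-23.08.0.jar \
  https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.08.0/rapids-4-spark_2.12-23.08.0.jar
sudo /databricks/python/bin/pip install cudf-cu11 --extra-index-url=https://pypi.nvidia.com
```
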
4 changes: 3 additions & 1 deletion docs/get-started/getting-started-gcp.md
@@ -7,7 +7,9 @@ parent: Getting-Started

# Getting Started with the RAPIDS Accelerator on GCP Dataproc
[Google Cloud Dataproc](https://cloud.google.com/dataproc) is Google Cloud's fully managed Apache
Spark and Hadoop service. The quick start guide will go through:
Spark and Hadoop service. Please see the [Software Requirements](../download.md#software-requirements)
section for the complete list of Dataproc versions supported by the RAPIDS plugin.
The quick start guide will go through:

* [Create a Dataproc Cluster Accelerated by GPUs](#create-a-dataproc-cluster-accelerated-by-gpus)
* [Create a Dataproc Cluster using T4's](#create-a-dataproc-cluster-using-t4s)
4 changes: 3 additions & 1 deletion docs/get-started/getting-started-on-prem.md
@@ -8,7 +8,9 @@ parent: Getting-Started
# Getting Started with RAPIDS Accelerator with on premise cluster or local mode
## Spark Deployment Methods
The way you decide to deploy Spark affects the steps you must take to install and setup Spark and
the RAPIDS Accelerator for Apache Spark. The primary methods to deploy Spark are:
the RAPIDS Accelerator for Apache Spark. Please see the [Software Requirements](../download.md#software-requirements)
section for the complete list of Spark versions supported by the RAPIDS plugin. The primary methods to
deploy Spark are:
* [Local mode](#local-mode) - this is for dev/testing only, not for production
* [Standalone Mode](#spark-standalone-cluster)
* [On a YARN cluster](#running-on-yarn)