docs: Add TF compat info (#1528)
mikemckiernan authored Jun 27, 2022
1 parent 60e316f commit bcf7baa
Showing 2 changed files with 60 additions and 19 deletions.
75 changes: 56 additions & 19 deletions docs/source/resources/troubleshooting.md
# Troubleshooting

## Checking the Schema of the Parquet File

NVTabular expects that all input parquet files have the same schema, which includes column types and the nullable (not null) option. If you encounter the error

```
RuntimeError: Schemas are inconsistent, try using to_parquet(..., schema="infer"),
or pass an explicit pyarrow schema. Such as to_parquet(..., schema={"column1": pa.string()})
```

when you load the dataset as shown below, one of your parquet files might have a different schema:

```python
ds = nvt.Dataset(PATH, engine="parquet", part_size="1000MB")
ds.to_ddf().head()
```

The easiest way to fix this is to load your dataset with dask_cudf and save it again in parquet format (`dask_cudf.read_parquet("INPUT_FOLDER").to_parquet("OUTPUT_FOLDER")`) so that the parquet files are standardized and the `_metadata` file is generated.
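
A minimal sketch of that round trip (the folder names are placeholders):

```python
import dask_cudf

# Read all parquet files from the input folder and write them back out so that every
# file shares one schema and a _metadata file is generated.
ddf = dask_cudf.read_parquet("INPUT_FOLDER")
ddf.to_parquet("OUTPUT_FOLDER")
```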

If you want to identify which parquet files contain columns with different schemas, you can run one of these scripts:
- [PyArrow](https://github.com/dask/dask/issues/6504#issuecomment-675465645)
- [cudf=0.17](https://github.com/rapidsai/cudf/pull/6796#issue-522934284)

These scripts check for schema consistency and generate only the `_metadata` file instead of converting all the parquet files. If the schema is inconsistent across the files, the script raises an exception. For additional information, see [this issue](https://github.com/NVIDIA/NVTabular/issues/429).
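
If you prefer a quick inline check, the following sketch compares each file's schema against the first file (a simplified stand-in for the linked scripts, with a placeholder folder name):

```python
import glob

import pyarrow.parquet as pq

# Compare every parquet file's schema against the first file and report mismatches.
paths = sorted(glob.glob("INPUT_FOLDER/*.parquet"))
reference = pq.ParquetFile(paths[0]).schema_arrow
for path in paths[1:]:
    if not pq.ParquetFile(path).schema_arrow.equals(reference):
        print(f"Schema mismatch: {path}")
```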

NVTabular is designed to scale to datasets that are larger than GPU or host memory. In our experiments, we were able to [scale to 1.3TB of uncompressed click logs](https://github.com/NVIDIA/NVTabular/tree/main/examples/scaling-criteo). However, some workflows can result in OOM errors such as `cudaErrorMemoryAllocation out of memory`, which can usually be addressed with small configuration changes.

### Setting the Row Group Size for the Parquet Files

You can use most dataframe frameworks to set the row group size (number of rows) for your parquet files. In the following Pandas and cuDF examples, `row_group_size` is the number of rows stored in each row group (an internal structure within the parquet file):

```python
# Pandas
pandas_df.to_parquet("/file/path", engine="pyarrow", row_group_size=10000)
# cuDF
cudf_df.to_parquet("/file/path", engine="pyarrow", row_group_size=10000)
```

The row group **memory** size of the parquet files should be smaller than the **part_size** that
you set for the NVTabular dataset, for example `nvt.Dataset(TRAIN_DIR, engine="parquet", part_size="1000MB")`. To determine how much memory a row group will hold, you can slice your dataframe to a specific number of rows and use the following function to get the memory usage in bytes. You can then set `row_group_size` (number of rows) accordingly when you save the parquet file. A row group memory size close to 128MB is recommended.

```python
def _memory_usage(df):
    # Minimal sketch: total in-memory size of the dataframe in bytes
    # (works for Pandas and cuDF dataframes).
    size = df.memory_usage(deep=True, index=True).sum()
    return size
```
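
For example, a quick way to size a candidate row group (a sketch; `df` stands in for your own dataframe):

```python
# Estimate how much memory 10,000 rows occupy, then adjust row_group_size so that
# a row group stays close to the recommended 128MB.
sample = df.head(10000)
print(f"10,000 rows use about {_memory_usage(sample) / 1e6:.0f} MB")
```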

### Initializing a Dask CUDA Cluster

Even if you only have a single GPU to work with, it is best practice to use a distributed Dask-CUDA cluster to execute memory-intensive NVTabular workflows. If no distributed `client` object is passed to an NVTabular `Workflow`, it falls back on Dask's single-threaded "synchronous" scheduler at computation time. The primary advantage of using a Dask-CUDA cluster is that the Dask-CUDA workers enable GPU-aware memory spilling. In our experience, many OOM errors can be resolved by initializing a Dask-CUDA cluster with an appropriate `device_memory_limit` setting and passing a corresponding client to NVTabular. It is easy to deploy a single-machine Dask-CUDA cluster using `LocalCUDACluster`.

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import nvtabular as nvt

# Illustrative values: set device_memory_limit below your GPU's capacity so that
# workers spill from device memory to host memory instead of running out of memory.
cluster = LocalCUDACluster(device_memory_limit="24GB")
client = Client(cluster)

# Pass the client to the workflow so that NVTabular runs on the Dask-CUDA workers.
# `features` stands in for your own graph of NVTabular operators.
workflow = nvt.Workflow(features, client=client)

# Shut down the cluster after the workflow has finished executing.
client.shutdown()
client.close()
```

### String Column Error

If you run into an error stating that the size of your string column is too large, like the following:

```
Exception: RuntimeError('cuDF failure at: /opt/conda/envs/rapids/conda-bld/libcudf_1618503955512/work/cpp/src/copying/concatenate.cu:368: Total number of concatenated rows exceeds size_type range')
```

This error is usually caused by string columns in parquet files. To fix it, decrease the size of the partitions of your dataset. If, after decreasing the partition size, you get a warning about picking a partition size smaller than the row group size, reformat the dataset with a smaller row group size (refer to the section on setting the row group size above). cuDF currently has a 2GB maximum size for concatenated string columns; for details, refer to [this issue](https://github.com/rapidsai/cudf/issues/3958).
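
A sketch of re-loading the dataset with smaller partitions (the `part_size` value is illustrative):

```python
import nvtabular as nvt

# Smaller partitions keep the concatenated string data in each partition well below
# cuDF's 2GB limit for string columns.
ds = nvt.Dataset(PATH, engine="parquet", part_size="256MB")
```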

### Special Considerations for TensorFlow 2.7 and Lower

The example notebooks in the repository are developed and tested with the latest [Merlin containers](https://catalog.ngc.nvidia.com/containers?filters=&orderBy=dateModifiedDESC&query=merlin) that are available from the NGC Catalog.
If you run the example notebooks in an environment that has a different version of TensorFlow, you might experience an out-of-memory condition that requires you to perform additional configuration.
The version of TensorFlow in each Merlin container is available from the [support matrix](https://nvidia-merlin.github.io/Merlin/main/support_matrix/index.html) in the Merlin documentation.

TensorFlow 2.8 uses `cuda_malloc_async` as the default GPU memory allocation function.
The allocation function is specified with the `TF_GPU_ALLOCATOR` environment variable.

In the 2.7 release, the setting was available as an experimental feature and was not activated by default.
In releases 2.6 and lower, the setting is not available at all.

If the `cuda_malloc_async` function is not available or not specified, TensorFlow reserves all GPU memory.
In these cases, you can limit the amount of memory used by TensorFlow with the `TF_MEMORY_ALLOCATION` environment variable.

If you experience an out-of-memory issue with an older TensorFlow version, try specifying the `cuda_malloc_async` function as the GPU memory allocation function or set the `TF_MEMORY_ALLOCATION` environment variable.

For TensorFlow 2.7, use code like the following example:

```python
import os
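# Set the allocator before TensorFlow initializes its GPU devices.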
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
```

For TensorFlow 2.6 and lower, use code like the following example:

```python
import os
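# Give TensorFlow 50% of the GPU memory; the value is a fraction between 0 and 1.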
os.environ["TF_MEMORY_ALLOCATION"] = "0.5"
```

Specify a value between `0` and `1`.
The value indicates the fraction of GPU memory to allocate to TensorFlow.

## Reducing Memory Consumption for NVTabular Dataloaders

If the dataloader runs out of GPU memory, you can reduce the `part_size` of the `nvt.Dataset` that feeds it, as in the following examples.

PyTorch:

```python
# Imports assumed for this snippet; TorchAsyncItr is NVTabular's PyTorch dataloader.
import nvtabular as nvt
from nvtabular.loader.torch import TorchAsyncItr

train_loader = TorchAsyncItr(
    nvt.Dataset(train_files, part_size="300MB"),
    batch_size=1024*64,
    cats=CATEGORICAL_COLUMNS,
    conts=CONTINUOUS_COLUMNS,
    labels=LABEL_COLUMNS
)
```

TensorFlow:

```python
# Imports assumed for this snippet; KerasSequenceLoader is NVTabular's TensorFlow dataloader.
import nvtabular as nvt
from nvtabular.loader.tensorflow import KerasSequenceLoader

# Mirrors the PyTorch example above with the KerasSequenceLoader argument names.
train_loader = KerasSequenceLoader(
    nvt.Dataset(train_files, part_size="300MB"),
    batch_size=1024*64,
    label_names=LABEL_COLUMNS,
    cat_names=CATEGORICAL_COLUMNS,
    cont_names=CONTINUOUS_COLUMNS
)
```

4 changes: 4 additions & 0 deletions examples/README.md
To run the example notebooks using Docker containers, do the following:
1. Open a browser and use the `127.0.0.1` URL provided in the messages from JupyterLab.

1. After you log in to JupyterLab, navigate to the `/nvtabular` directory to try out the example notebooks.

## Troubleshooting

If you experience any trouble running the example notebooks, check the latest [troubleshooting](https://nvidia-merlin.github.io/NVTabular/main/resources/troubleshooting.html) documentation.
