This plugin enables data loading from an Apache Cassandra NoSQL database to NVIDIA Data Loading Library (DALI) (which can be used to load and preprocess images for PyTorch or TensorFlow).
The plugin has been tested and is compatible with DALI v1.41.
The easiest way to test the cassandra-dali-plugin is by using the provided Dockerfile (derived from NVIDIA PyTorch NGC), which also contains NVIDIA DALI, Cassandra C++ and Python drivers, a Cassandra server, PyTorch and Apache Spark, as shown in the commands below.
# Build and run cassandra-dali-plugin docker container
$ docker build -t cassandra-dali-plugin .
$ docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
--cap-add=sys_admin cassandra-dali-plugin
Alternatively, for better performance and for data persistence, it is
advised to mount a host directory for Cassandra on a fast disk (e.g.,
/mnt/fast_disk/cassandra
):
# Run cassandra-dali-plugin docker container with external data dir
$ docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-v /mnt/fast_disk/cassandra:/cassandra/data:rw \
--cap-add=sys_admin cassandra-dali-plugin
Once installed the plugin can be loaded with
import crs4.cassandra_utils
import nvidia.dali.plugin_manager as plugin_manager
import nvidia.dali.fn as fn
import pathlib
plugin_path = pathlib.Path(crs4.cassandra_utils.__path__[0])
plugin_path = plugin_path.parent.parent.joinpath("libcrs4cassandra.so")
plugin_path = str(plugin_path)
plugin_manager.load_library(plugin_path)
At this point the plugin can be integrated in a DALI
pipeline,
for example replacing a call to fn.readers.file
with
images, labels = fn.crs4.cassandra(
name="Reader", cassandra_ips=["127.0.0.1"],
table="imagenet.train_data", label_col="label", label_type="int",
data_col="data", id_col="img_id",
source_uuids=train_uuids, prefetch_buffers=2,
)
Below, we'll provide a full summary of the parameters' meanings. If you prefer to skip this section, here you can find some working examples.
name
: name of the module to be passed to DALI (e.g. "Reader")cassandra_ips
: list of IP pointing to the DB (e.g.,["127.0.0.1"]
)cassandra_port
: Cassandra TCP port (default:9042
)table
: data table (e.g.,imagenet.train_data
)label_col
: name of the label column (e.g.,label
)label_type
: type of label: "int", "blob" or "none" ("int" is typically used for classification, "blob" for segmentation)data_col
: name of the data column (e.g.,data
)id_col
: name of the UUID column (e.g.,img_id
)source_uuids
: full list of UUIDs, as strings, to be retrieved
Cassandra server provides a wide range of (non-mandatory) options for configuring authentication and authorization. Our plugin supports them by using the following parameters:
username
: username for Cassandrapassword
: password for Cassandrause_ssl
: use SSL to encrypt the transfers: True or Falsessl_certificate
: public key of the Cassandra server (e.g., "server.crt")ssl_own_certificate
: public key of the client (e.g., "client.crt")ssl_own_key
: private key of the client (e.g., "client.key")ssl_own_key_pass
: password protecting the private key of the client (e.g., "blablabla")cloud_config
: Astra-like configuration (e.g.,{'secure_connect_bundle': 'secure-connect-blabla.zip'}
)
Their use is demonstrated in private_data.py file, used by the examples
This plugin offers extensive internal parallelism that can be adjusted to enhance pipeline performance. Refer for example to this discussion on how to improve the throughput over a long fat network.
The main idea behind this plugin is that relatively small files can be efficiently stored and retrieved as BLOBs in a NoSQL DB. This enables scalability in data loading through prefetching and pipelining. Furthermore, it enables data to be stored in a separate location, potentially even at a significant distance from where it is processed, as discussed here. This capability also facilitates storing data along with a comprehensive set of associated metadata, which can be more conveniently utilized during machine learning.
For the sake of convenience and improved performance, we choose to store data and metadata in separate tables within the database. The metadata table will be utilized for selecting the images that need to be processed. These images are identified by UUIDs and stored as BLOBs in the data table. During the machine learning process, we will exclusively access the data table. Below, you will find examples of functional code for creating and populating these tables in the database.
See the following annotated example for details on how to use this plugin:
A variant of the same example implemented with PyTorch Lightning is available in:
A (less) annotated example for segmentation can be found in:
An example showing how to save and decode multilabels as serialized numpy tensors can be found in:
An example of how to automatically create a single file with data split to feed the training application:
This plugin also supports efficient inference via NVIDIA Triton server:
cassandra-dali-plugin requires:
- NVIDIA DALI
- Cassandra C/C++ driver
- Cassandra Python driver
The details of how to install missing dependencies, in a system which provides only some of the dependencies, can be deduced from the Dockerfile, which contains all the installation commands for the packages above.
Once the dependencies have been installed, the plugin can easily be installed with pip:
$ pip3 install .
Cassandra Data Loader is developed by
- Francesco Versaci, CRS4 [email protected]
- Giovanni Busonera, CRS4 [email protected]
cassandra-dali-plugin is licensed under the under the Apache License, Version 2.0. See LICENSE for further details.
- Jakob Progsch for his ThreadPool code