This reposistory contains implementations of four GPU-based join implementations and code to evaluate them. The implementations contain
- SMJI - State-of-the-art sort merge join using radix sort and Merge Path. "SMJ-UM" in the paper.
- SMJ - Improved SMJI using the GFTR technique. "SMJ-OM" in the paper.
- PHJ - State-of-the-art partitioned hash join implementation from Sioulas et al.. "PHJ-UM" in the paper.
- SHJ - Improved PHJ using the GFTR pattern with a redesigned partitioning strategy. "PHJ-OM" in the paper.
Our code is developed and tested under the following environment.
- CUDA 12.2 or 12.3 (including nvcc, cub, thrust and so on)
- GCC 10.2.1 or 11.4.0 (10.2.1 for the A100 machine, 11.4.0 for the RTX 3090 machine)
There are two places in the code base that need to be customized to your machine.
- In the
Makefile
, specify the compute capability of your GPU. We have tested the code on RTX 3090 (8.6) and A100 (8.0). - In
src/volcano/utils.cuh
, change themem_pool_size
variable to your own GPU's memory capacity. - In
src/volcano/tpc_utils.hpp
, change theTPC_DATA_PREFIX
(absolute path) to the directory where the TPC-H and TPC-DS data are stored. See more in "Run the TPC-H/DS benchmarks".
Then in the project home directory, run
sh configure.sh
This will compile all available executables to bin/volcano/
, including
bin/volcano/join_exp_4b4b
: 4-byte keys + 4-byte non-keys for Section 5.2.1 - 5.2.6.bin/volcano/join_exp_4b8b
: 4-byte keys + 8-byte non-keys for Section 5.2.5.bin/volcano/join_exp_8b8b
: 8-byte keys + 8-byte non-keys for Section 5.2.5.bin/volcano/join_pipeline
: sequence of joins for Section 5.2.7.bin/volcano/tpch_[7,18,19]
: Joins extracted from TPC-H Q7, 18, 19 for Section 5.3.bin/volcano/tpcds_[64,95]
: Joins extracted from TPC-DS Q64, 95 for Section 5.3.
Because the microbenchmark will generate input data and the data generation is slow,
it is more efficient to cache the generated data on disk.
Suppose you want to store the input data in <path_to_input_data>
, then run
mkdir -p <path_to_input_data>/int/ <path_to_input_data>/long/
You can directly invoke the four executables and pass in the configurations. You can check the instructions by passing the -h
flag to each executable.
Alternatively, we have prepared a script to run all the microbenchmarks in run.sh
. Simply run the following command in the project home directory.
sh run.sh <repeat times> <path_to_input_data>
The results will be written to the exp_results/gpu_join/
directory, and each microbenchmark is stored as a separate CSV file.
Running TPC-H/DS requires input data to be generated from the data generators, i.e., dbgen and dsdgen. For ease of use, we have uploaded all relevant data to the polybox. Download the vldb24_tpch_tpcds.tar.xz
to your machine and then decompress and extract the data.
xz -d -v vldb24_tpch_tpcds.tar.xz
tar -xvf file.tar -C .
In the end, make sure your TPC_DATA_PREFIX
directory have the following structure.
TPC_DATA_PREFIX
tpch_sf10
tpcds_sf100
q64
q95
- It is recommended to turn on the persistence mode to reduce the program launch time. See this guide. You can check if it is turned out via
nvidia-smi
.