This repository contains artifacts for evaluating gSampler. It includes code to reproduce Figures 7, 8, and 10 from the paper:
- Figure 7: Time comparison between gSampler and baseline systems for 3 simple graph sampling algorithms.
- Figure 8: Time comparison between gSampler and baseline systems for 4 complex graph sampling algorithms.
- Figure 10: Ablation study of gSampler's optimizations on PD and PP graphs.
To replicate the paper's findings, we recommend using an AWS p3.16xlarge instance equipped with NVIDIA V100 GPUs, 64 vCPUs, and 480GB of memory. Running all scripts takes approximately 6-8 hours.
If you use your own hardware to reproduce the results, please ensure it has at least 256GB of memory for preprocessing the two large-scale graphs, ogbn_papers100M and friendster, into CSC format. (friendster alone has roughly 1.8 billion edges, so each int64 edge-list copy occupies tens of GB, and preprocessing holds several copies in memory at once.)
To reproduce the paper's results, use the latest version of this repository's main branch together with the gSampler source code at https://github.com/gsampler9/gSampler.
The data for this project should be stored in the `./dataset` directory and include the following datasets:
- friendster
- livejournal
- ogbn_papers100M
- ogbn_products
Download the ogbn_products and ogbn_papers100M datasets from OGB, and the livejournal and friendster datasets from SNAP.
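For the two OGB datasets, the `ogb` Python package can download and extract the data automatically. Below is a minimal sketch, assuming the DGL-backed loader and the `./dataset` root used above; the artifact's own preprocessing scripts may use a different loader:

```python
# Hedged example: ogb's DglNodePropPredDataset fetches the raw data into
# the given root directory on first use (the loader choice is an assumption).
from ogb.nodeproppred import DglNodePropPredDataset

for name in ("ogbn-products", "ogbn-papers100M"):
    dataset = DglNodePropPredDataset(name=name, root="./dataset")
    graph, labels = dataset[0]  # a DGLGraph and its node labels
    print(name, graph.num_nodes(), graph.num_edges())
```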
To use other datasets, follow these steps:
- Prepare the graph in CSC format.
- Load the dataset using the `m.load_graph` API, as shown below.
```python
# Prepare the graph in CSC format
csc_indptr, csc_indices = load_graph(...)

# Load the graph into GPU memory
m = gs.Matrix()
m.load_graph("CSC", [csc_indptr.cuda(), csc_indices.cuda()])

# For large-scale graphs, use Unified Virtual Addressing (UVA)
# by pinning the tensors in host memory instead
m.load_graph("CSC", [csc_indptr.pin_memory(), csc_indices.pin_memory()])

# To utilize super-batching, convert the Matrix to a BatchMatrix
bm = gs.BatchMatrix()
bm.load_from_matrix(m)
```
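If your dataset starts as an edge list rather than CSC, the `(indptr, indices)` pair can be built with a sort and a prefix sum. The sketch below shows one way to implement the `load_graph(...)` placeholder above; the `edges.pt` file and the helper's signature are hypothetical, not the artifact's actual preprocessing code:

```python
import torch

def load_graph(path: str, num_nodes: int):
    """Hypothetical helper: build CSC (indptr, indices) from an edge list."""
    src, dst = torch.load(path)  # two int64 tensors of shape (num_edges,)
    # CSC groups edges by destination: after sorting by dst, the row
    # indices of each column form a contiguous slice of csc_indices.
    perm = torch.argsort(dst)
    csc_indices = src[perm]
    # indptr[v] is the number of edges with dst < v (prefix sum of in-degrees).
    degrees = torch.bincount(dst, minlength=num_nodes)
    csc_indptr = torch.zeros(num_nodes + 1, dtype=torch.int64)
    csc_indptr[1:] = torch.cumsum(degrees, dim=0)
    return csc_indptr, csc_indices
```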
The repository is organized as follows:
```
gsampler-artifact-evaluation
├── README.md
├── fig_examples   # Example output figures
├── examples       # End-to-end demo with DGL
├── figure10       # Reproduce Figure 10
├── figure7        # Reproduce Figure 7
├── figure8        # Reproduce Figure 8
├── clean.sh       # Delete all results
└── run.sh         # Run all reproduction workloads
```
To ease the burden of setting up a software-hardware environment, we provide reviewers with direct access to a pre-configured instance. For login instructions, please refer to comments A2 and A3 in sosp23ae.
If using our AWS EC2 server, simply run `conda activate gsampler-ae` and proceed to step 4. For other setups, follow these instructions:
- Clone the repository and initialize its submodules:
```bash
git submodule update --init --recursive
```
- Follow this guide to build gSampler. Ensure you are in the `gsampler-ae` Conda environment with the dgl and gs libraries installed.
- Install PyG: refer to https://pytorch-geometric.readthedocs.io/en/latest/install/installation.html.
- Install DGL: refer to https://www.dgl.ai/pages/start.html.
- Build SkyWalker (a Figure 7 baseline):
```bash
cd figure7/skywalker/
git checkout gsampler-baseline
mkdir build
cd build
cmake ..
make -j
```
- Build GunRock (a Figure 7 baseline):
```bash
cd figure7/gunrock/
git checkout gsampler-baseline
mkdir build
cd build
cmake ..
make sage
```
- Install cuGraph: refer to https://github.com/rapidsai/cugraph.
To reproduce Figure 8, Figure 10, and Table 10 together, navigate to the project root directory and run the following commands:
```bash
cd ${workspace}
bash clean.sh
bash run.sh
```
Results will be generated in the subdirectories. Note that Figure 7 requires building three additional systems, so it will be generated separately.
To build and run the baselines for Figure 7, you will need cuGraph, GunRock, SkyWalker, PyG, and DGL. Please install them first, and refer to `figure7` for detailed instructions.
To build and run the baselines for Figure 8, you will need DGL and PyG. Please install them first, and refer to `figure8` for detailed instructions.
To build and run the baseline for Figure 10, you will need DGL. Please install it first, and refer to `figure10` for detailed instructions.