Why Are Learned Indexes So Effective but Sometimes Ineffective?

Ⅰ. Benchmark Datasets (4 real datasets and 3 synthetic datasets)

We adopt 4 real datasets from SOSD [1], which can be directly downloaded via the following links.

Specifically:

fb: A set of user IDs randomly sampled from Facebook.
wiki: A set of edit timestamp IDs committed to Wikipedia.
books: A dataset of book popularity from Amazon.
osm_cellids: A set of cell IDs from OpenStreetMap.

We also generate 3 synthetic datasets by sampling from uniform, normal, and log-normal distributions, following a process similar to [1, 2]. All keys are stored as 64-bit unsigned integers (uint64_t in C++).

cd ./data
bash gen_data.sh

Statistics of benchmark datasets.

Dataset	Category	Keys	Raw Size	$h_D$	$\overline{Cov}$
fb	Real	200 M	1.6 GB	3.88	94
wiki	Real	200 M	1.6 GB	1.77	877
books	Real	800 M	6.4 GB	5.39	101
osm	Real	800 M	6.4 GB	1.91	129
uniform	Synthetic	400 M	3.2 GB	Varied	N.A.
normal	Synthetic	400 M	3.2 GB	Varied	N.A.
lognormal	Synthetic	400 M	3.2 GB	Varied	N.A.

References:
[1] Marcus, et al. SOSD: A Benchmark Suite for Similarity Search over Sorted Data. PVLDB, 2020.
[2] Zhang, et al. Making Learned Indexes Practical: A Comprehensive Study on Data Distribution and Model Selection. PVLDB, 2024.

II. RUN RMI BENCHMARK

1. Generate RMI Models

RMI code: https://github.com/learnedsystems/RMI/tree/master

The following scripts can be used to generate a set of RMI model configurations with various index sizes. Notably, wiki_200M_uint64 can be replaced to other dataset name.

cargo run --release -- --optimize optimizer_out_wiki.json wiki_200M_uint64
cargo run wiki_200M_uint64 --param-grid optimizer_out_wiki.json -d YOUR_RMI_SAVE_FOLDER --threads 8 --zero-build-time

The above process will generate 10 RMI models for each dataset in YOUR_RMI_SAVE_FOLDER. For example, for dataset fb_200M_uint64, it includes 10 RMI model parameter files (fb_200M_uint64_i_L1_PARAMETERS, i=0,...,9) and 30 RMI source code files (3 code files for each RMI model, named fb_200M_uint64_i_data.h, fb_200M_uint64_i.cpp, and fb_200M_uint64_i.h, for i=0,...,9).

To make reproduction easier, we have generated all the required RMI parameter files and source files, available at link. One can directly download and put them under ./exp_rmi folder.

2. Run Benchmark

To run the benchmarks:

cd exp_rmi
make -f Makefile_all run_all

III. RUN PGM BENCHMARK

The original PGM-Index implementation is from: https://github.com/gvinciguerra/PGM-index

Our PGM++ implementation is based on the original PGM-Index. The codes are all put in folder ./exp_pgm.

To run the benchmarks:

cd exp_pgm
g++ main.cpp -std=c++17 -I. -o main -fopenmp
./main data_file_path result_output_path

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
data		data
exp_pgm		exp_pgm
exp_rmi		exp_rmi
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Why Are Learned Indexes So Effective but Sometimes Ineffective?

Ⅰ. Benchmark Datasets (4 real datasets and 3 synthetic datasets)

II. RUN RMI BENCHMARK

1. Generate RMI Models

2. Run Benchmark

III. RUN PGM BENCHMARK

About

Releases

Packages

Languages

qyliu-hkust/bench_search

Folders and files

Latest commit

History

Repository files navigation

Why Are Learned Indexes So Effective but Sometimes Ineffective?

Ⅰ. Benchmark Datasets (4 real datasets and 3 synthetic datasets)

II. RUN RMI BENCHMARK

1. Generate RMI Models

2. Run Benchmark

III. RUN PGM BENCHMARK

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages