Nonspacing Bench

This repo is a sample app that's benchmarking different data structures to filter the nonspacing marks.

It's the result of that charabia/issues/137 issue.

Running benchmarks

Run cargo bench and that's it.

You can also run cargo run to ensure that all the data structures filter the same number of chars.

Results on my machine

Naive                   time:   [22.296 ms 22.378 ms 22.461 ms]  

BTreeSet                time:   [14.278 ms 14.335 ms 14.394 ms] 

RoaringBitmap           time:   [15.217 ms 15.274 ms 15.333 ms] 

HashSet                 time:   [9.5395 ms 9.5683 ms 9.5972 ms]

It looks like HashSet is the fastest one, even though it takes more memory.

Memory Usage

I'm still not sure how to properly get the dynamic variable size in Rust, so I captured some samples with jemalloc and put them in jemalloc_stats.

Text sample

The repo is using an Arabic text sample from that repo AliOsm/arabic-text-diacritization

Nonspacing Marks

The nonspacing marks are finite set (1281 chars) of unicode chars show there https://www.compart.com/en/unicode/category/Mn

The repo stores them as bin and txt files inside dicts.

The code uses bin representation only.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
benches		benches
dicts		dicts
jemalloc_stats		jemalloc_stats
samples		samples
src		src
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Nonspacing Bench

Running benchmarks

Results on my machine

Memory Usage

Text sample

Nonspacing Marks

About

Releases

Packages

Languages

crudiedo/nonspacing-bench

Folders and files

Latest commit

History

Repository files navigation

Nonspacing Bench

Running benchmarks

Results on my machine

Memory Usage

Text sample

Nonspacing Marks

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages