An implementation of Parallel Run-Length Encoding(PARLE) in CUDA You can read the details on my blog. You can also find benchmarking results there.