Validate: improve performance #164

Closed
mhuang74 opened this issue Feb 10, 2022 · 12 comments
@mhuang74
Contributor

mhuang74 commented Feb 10, 2022

Validate currently runs on a single thread and could become a bottleneck for data validation pipelines.

  • Would like to improve performance through higher concurrency
  • Concurrency should be controlled via the --jobs option or the QSV_MAX_JOBS env var
  • Concurrency should not exceed the CPU count
  • When jobs is set to 0, apply the same rules as the stats command to calculate optimal concurrency
  • Use Rayon to automatically manage concurrency
  • Include validate performance numbers in the performance suite
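The job-count rules above (--jobs flag, QSV_MAX_JOBS env var, 0 meaning "optimal", clamped to the CPU count) could be resolved with a small helper along these lines. This is a hypothetical sketch, not qsv's actual code; `effective_jobs` is an invented name, and it uses only the standard library:

```rust
use std::env;
use std::thread;

/// Hypothetical helper: resolve the effective number of validation jobs
/// from a --jobs value and the QSV_MAX_JOBS environment variable,
/// clamped to the number of logical CPUs. 0 means "use all CPUs".
fn effective_jobs(jobs_flag: usize) -> usize {
    let cpus = thread::available_parallelism().map_or(1, |n| n.get());
    let requested = if jobs_flag > 0 {
        jobs_flag
    } else {
        env::var("QSV_MAX_JOBS")
            .ok()
            .and_then(|v| v.parse::<usize>().ok())
            .filter(|&n| n > 0)
            .unwrap_or(cpus)
    };
    requested.min(cpus)
}

fn main() {
    println!("effective jobs = {}", effective_jobs(0));
}
```

The resolved count would then be handed to Rayon (e.g. via its thread-pool configuration) rather than spawning threads by hand.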
```console
$ head -50000 NYC_311_SR_2010-2020-sample-1M.csv > NYC-short.csv
$ qsvlite index NYC-short.csv
$ time qsvlite schema NYC-short.csv --value-constraints --enum-threshold=25
Schema written to NYC-short.csv.schema.json

real	0m6.941s
user	0m12.050s
sys	0m3.960s
$ time qsvlite validate NYC-short.csv NYC-short.csv.schema.json
[00:00:08] [==================== 100% validated 49,999 records.] (6,015/sec)
0 out of 49,999 records invalid.

real	0m8.424s
user	0m8.202s
sys	0m0.128s
```
@jqnatividad jqnatividad pinned this issue Feb 10, 2022
@jqnatividad
Collaborator

Awesome!

BTW, while I was revamping qsv's benchmarks last year, I found cargo-flamegraph very useful for quickly finding the hotspots to focus tuning on. You may want to check it out.

@mhuang74
Contributor Author

@jqnatividad I played around with Rayon instead of using threadpool, which seems unmaintained. Please let me know your thoughts.

#170

@jqnatividad
Collaborator

@mhuang74, indeed, rayon is the current go-to crate for parallelism. Last year, I was thinking of enabling parallelism in other commands too, but I also found threadpool's support lacking.

Let's go with rayon, and migrate the other commands to it over time so we can remove the threadpool dependency.

@mhuang74
Contributor Author

mhuang74 commented Feb 20, 2022

Thanks @jqnatividad. Follow-up PR that mainly fixes the --fail-fast panic, with a slight speed-up from not doing I/O inside the parallel processing.

#171

@jqnatividad
Collaborator

Merged! Just in time for this week's release...

@mhuang74
Contributor Author

mhuang74 commented Feb 21, 2022

Generated a flamegraph, but haven't studied it in depth. Some thoughts:

  • Avoid calling serde_json on the jsonschema output for every single row validated; this should only be done for invalid rows.

https://github.com/mhuang74/qsv/blob/flamegraph/docs/flamegraph-validate-command.svg
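The serde_json point above boils down to: run the cheap validity check on every row, but only build the expensive serialized error report for rows that actually fail. A minimal stdlib-only sketch, where `row_is_valid` is a stand-in check and a plain `format!` stands in for serde_json serialization of the jsonschema output:

```rust
/// For each row, return None if valid, or a serialized error report
/// if invalid. The expensive report is only built for failing rows.
fn validate_rows(rows: &[&str]) -> Vec<Option<String>> {
    rows.iter()
        .map(|row| {
            if row_is_valid(row) {
                None // valid row: skip serialization entirely
            } else {
                Some(format!("{{\"row\":{:?},\"error\":\"empty field\"}}", row))
            }
        })
        .collect()
}

/// Stand-in check: a row is valid if no comma-separated field is empty.
fn row_is_valid(row: &str) -> bool {
    !row.split(',').any(|f| f.is_empty())
}

fn main() {
    for report in validate_rows(&["a,b", "a,", "c,d"]).into_iter().flatten() {
        println!("{report}");
    }
}
```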

Command used:

```console
(base) ➜  tmp git:(validate_performance) ✗ flamegraph --no-inline --image-width 10000 ../target/release/qsvlite validate NYC_311_SR_2010-2020-sample-1M.csv NYC-short.csv.schema.json
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict and /proc/sys/kernel/perf_event_paranoid.

Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.

Samples in kernel modules won't be resolved at all.

If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.

Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
[00:00:50] [==================== 100% validated 1,000,000 records.] (22,952/sec)
6,494 out of 1,000,000 records invalid.
```

@jqnatividad
Collaborator

jqnatividad commented Feb 21, 2022

Avoiding calling serde_json on every row should improve performance.

As for profiling rayon-enabled code, it seems to be a known issue (rayon-rs/rayon#591).

It's a little better with export RAYON_NUM_THREADS=1, but the rayon wrappers obscure the call stack...

```console
$ export RAYON_NUM_THREADS=1
$ flamegraph --no-inline target/release/qsvlite validate scripts/NYC_311_SR_2010-2020-sample-1M.csv scripts/nyc-50k.csv.schema.json
```

(flamegraph image: flamegraph-validate-1thread)

It's still useful, though, as it gives you some insight into where most of the time is spent: per the flamegraph above, about 33% of the time is in jsonschema code, and 24% in serde::ser::SerializeMap::serialize_entry.

Regardless, the rayon payoff is clear on my Ubuntu VM with 8 logical CPUs:

```console
# qsv v0.32.1 (before rayon)
[00:01:44] [==================== 100% validated 1,000,000 records.] (9,589/sec)
621 out of 1,000,000 records invalid.

# qsv v0.32.2 (with rayon and RAYON_NUM_THREADS=1)
# confirming the overhead of rayon to be minimal, as there were other optimizations in 0.32.2
[00:01:42] [==================== 100% validated 1,000,000 records.] (10,046/sec)
621 out of 1,000,000 records invalid.

# qsv v0.32.2 (with rayon without setting RAYON_NUM_THREADS, more than 6x faster!)
[00:00:17] [==================== 100% validated 1,000,000 records.] (68,212/sec)
621 out of 1,000,000 records invalid.
```

@mhuang74
Contributor Author

mhuang74 commented Feb 22, 2022

Thanks @jqnatividad. That indeed made a huge difference (~2.6x faster). I also simplified the error output format for readability. validate feels much more production-ready now.

#172

@jqnatividad
Collaborator

jqnatividad commented Feb 22, 2022

@mhuang74 It's blazing fast!

```console
[00:00:06] [==================== 100% validated 1,000,000 records.] (283,848/sec)
621 out of 1,000,000 records invalid.
```

Yep! It's production-ready indeed.

The throughput is amazing: from ~9,500 records/sec on v0.32.1 pre-rayon to ~280,000 records/sec after all your performance tweaking.

It'd be interesting to see how it performs with regex patterns, but even then, that will be in the jsonschema engine, so I think we can close this. WDYT?

@jqnatividad
Collaborator

Hi @mhuang74 , I tweaked it a little more and managed to squeeze a little more performance - #173

```console
$ qsv validate NYC_311_SR_2010-2020-sample-1M.csv nyc-50k.csv.schema.json
[00:00:03] [==================== 100% validated 1,000,000 records.] (298,775/sec)
Writing invalid/valid/error files...
621 out of 1,000,000 records invalid.
```

@mhuang74
Contributor Author

Thanks @jqnatividad. #173 looks good. It takes 3-4 seconds to write the files while the user reads the message.

@jqnatividad
Collaborator

Closing this for now until the next round of performance tweaks.
