Validate: improve performance #164

Closed
mhuang74 opened this issue Feb 10, 2022 · 12 comments
@mhuang74
Contributor

mhuang74 commented Feb 10, 2022

Validate currently runs on a single thread and could become a bottleneck for data validation pipelines.

  • Would like to improve performance through higher concurrency
  • Concurrency should be controlled via the --jobs option or the QSV_MAX_JOBS env var
  • Concurrency should not exceed the CPU count
  • When jobs is set to 0, apply the same rules as the stats command to calculate optimal concurrency
  • Use Rayon to automatically manage concurrency
  • Include validate performance numbers in the performance suite
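The job-count rules above (--jobs flag, QSV_MAX_JOBS env var, 0 meaning "optimal", clamped to the CPU count) could be resolved with a small helper along these lines. This is a hypothetical sketch, not qsv's actual code; `effective_jobs` is an invented name, and it uses only the standard library:

```rust
use std::env;
use std::thread;

/// Hypothetical helper: resolve the effective number of validation jobs
/// from a --jobs value and the QSV_MAX_JOBS environment variable,
/// clamped to the number of logical CPUs. 0 means "use all CPUs".
fn effective_jobs(jobs_flag: usize) -> usize {
    let cpus = thread::available_parallelism().map_or(1, |n| n.get());
    let requested = if jobs_flag > 0 {
        jobs_flag
    } else {
        env::var("QSV_MAX_JOBS")
            .ok()
            .and_then(|v| v.parse::<usize>().ok())
            .filter(|&n| n > 0)
            .unwrap_or(cpus)
    };
    requested.min(cpus)
}

fn main() {
    println!("effective jobs = {}", effective_jobs(0));
}
```

The resolved count would then be handed to Rayon (e.g. via its thread-pool configuration) rather than spawning threads by hand.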
```console
$ head -50000 NYC_311_SR_2010-2020-sample-1M.csv > NYC-short.csv
$ qsvlite index NYC-short.csv
$ time qsvlite schema NYC-short.csv --value-constraints --enum-threshold=25
Schema written to NYC-short.csv.schema.json

real	0m6.941s
user	0m12.050s
sys	0m3.960s
$ time qsvlite validate NYC-short.csv NYC-short.csv.schema.json
[00:00:08] [==================== 100% validated 49,999 records.] (6,015/sec)
0 out of 49,999 records invalid.

real	0m8.424s
user	0m8.202s
sys	0m0.128s
```
@jqnatividad jqnatividad pinned this issue Feb 10, 2022
@jqnatividad
Collaborator

Awesome!

BTW, while I was revamping qsv's benchmarks last year, I found cargo-flamegraph very useful for quickly finding the hotspots to focus tuning on. You may want to check it out.

@mhuang74
Contributor Author

@jqnatividad I played around with Rayon instead of using threadpool, which seems unmaintained. Please let me know your thoughts.

#170

@jqnatividad
Collaborator

@mhuang74, indeed, rayon is the current go-to crate for parallelism. Last year, I was thinking of enabling parallelism in other commands too, but I also found threadpool's support lacking.

Let's go with rayon, and migrate the other commands to it over time so we can remove the threadpool dependency.

@mhuang74
Contributor Author

mhuang74 commented Feb 20, 2022

Thanks @jqnatividad. Follow-up PR that mainly fixes the --fail-fast panic, with a slight speed-up from not doing I/O inside the parallel processing.

#171

@jqnatividad
Collaborator

Merged! Just in time for this week's release...

@mhuang74
Contributor Author

mhuang74 commented Feb 21, 2022

Generated a flamegraph, but haven't studied it in depth. Some thoughts:

  • Avoid calling serde_json on the jsonschema output for every single row validated; this should only be done for invalid rows.

https://github.com/mhuang74/qsv/blob/flamegraph/docs/flamegraph-validate-command.svg
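The serde_json point above boils down to: run the cheap validity check on every row, but only build the expensive serialized error report for rows that actually fail. A minimal stdlib-only sketch, where `row_is_valid` is a stand-in check and a plain `format!` stands in for serde_json serialization of the jsonschema output:

```rust
/// For each row, return None if valid, or a serialized error report
/// if invalid. The expensive report is only built for failing rows.
fn validate_rows(rows: &[&str]) -> Vec<Option<String>> {
    rows.iter()
        .map(|row| {
            if row_is_valid(row) {
                None // valid row: skip serialization entirely
            } else {
                Some(format!("{{\"row\":{:?},\"error\":\"empty field\"}}", row))
            }
        })
        .collect()
}

/// Stand-in check: a row is valid if no comma-separated field is empty.
fn row_is_valid(row: &str) -> bool {
    !row.split(',').any(|f| f.is_empty())
}

fn main() {
    for report in validate_rows(&["a,b", "a,", "c,d"]).into_iter().flatten() {
        println!("{report}");
    }
}
```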

Command used:

```console
(base) ➜  tmp git:(validate_performance) ✗ flamegraph --no-inline --image-width 10000 ../target/release/qsvlite validate NYC_311_SR_2010-2020-sample-1M.csv NYC-short.csv.schema.json
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict and /proc/sys/kernel/perf_event_paranoid.

Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.

Samples in kernel modules won't be resolved at all.

If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.

Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
[00:00:50] [==================== 100% validated 1,000,000 records.] (22,952/sec)
6,494 out of 1,000,000 records invalid.
```

@jqnatividad
Collaborator

jqnatividad commented Feb 21, 2022

Avoiding calling serde_json on every row should improve performance.

As for profiling rayon-enabled code, it seems to be a known issue (rayon-rs/rayon#591).

It's a little better with export RAYON_NUM_THREADS=1, but the rayon wrappers obscure the call stack...

```console
$ export RAYON_NUM_THREADS=1
$ flamegraph --no-inline target/release/qsvlite validate scripts/NYC_311_SR_2010-2020-sample-1M.csv scripts/nyc-50k.csv.schema.json
```

(flamegraph image: flamegraph-validate-1thread)

It's still useful, though, as it gives you some insight into where most of the time is spent: per the flamegraph above, about 33% of the time is in jsonschema code, and 24% in serde::ser::SerializeMap::serialize_entry.

Regardless, the rayon payoff is clear on my Ubuntu VM with 8 logical CPUs:

```console
# qsv v0.32.1 (before rayon)
[00:01:44] [==================== 100% validated 1,000,000 records.] (9,589/sec)
621 out of 1,000,000 records invalid.

# qsv v0.32.2 (with rayon and RAYON_NUM_THREADS=1)
# confirming the overhead of rayon to be minimal, as there were other optimizations in 0.32.2
[00:01:42] [==================== 100% validated 1,000,000 records.] (10,046/sec)
621 out of 1,000,000 records invalid.

# qsv v0.32.2 (with rayon without setting RAYON_NUM_THREADS, more than 6x faster!)
[00:00:17] [==================== 100% validated 1,000,000 records.] (68,212/sec)
621 out of 1,000,000 records invalid.
```

@mhuang74
Contributor Author

mhuang74 commented Feb 22, 2022

Thanks @jqnatividad. That indeed made a huge difference (~2.6x faster). I also simplified the error output format for readability. validate feels much more production-ready now.

#172

@jqnatividad
Collaborator

jqnatividad commented Feb 22, 2022

@mhuang74 It's blazing fast!

```console
[00:00:06] [==================== 100% validated 1,000,000 records.] (283,848/sec)
621 out of 1,000,000 records invalid.
```

Yep! It's production-ready indeed.

The throughput is amazing: from ~9,500 records/sec on v0.32.1 pre-rayon to ~280,000 records/sec after all your performance tweaking.

It'd be interesting to see how it performs with regex patterns, but even then, that will be in the jsonschema engine, so I think we can close this. WDYT?

@jqnatividad
Collaborator

Hi @mhuang74 , I tweaked it a little more and managed to squeeze a little more performance - #173

```console
$ qsv validate NYC_311_SR_2010-2020-sample-1M.csv nyc-50k.csv.schema.json
[00:00:03] [==================== 100% validated 1,000,000 records.] (298,775/sec)
Writing invalid/valid/error files...
621 out of 1,000,000 records invalid.
```

@mhuang74
Contributor Author

Thanks @jqnatividad. #173 looks good. It takes 3-4 seconds to write the files while the user reads the message.

@jqnatividad
Collaborator

Closing this for now until the next round of performance tweaks.
