Better handling of big json files #21

Closed
srwilson opened this issue Sep 7, 2016 · 6 comments

Comments

@srwilson

srwilson commented Sep 7, 2016

Currently, running gron on large JSON files is very slow. For example, a 40MB file takes about a minute and a half:

> time gron big.json > foo

real    1m28.850s
user    1m37.038s
sys     0m2.333s

My guess is that the time goes into the sorting phase. Would it be possible to avoid sorting altogether? Maybe doing a streaming decode of the JSON would help too.

At the very least, it should be possible to disable sorting via a command-line option.
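To illustrate why skipping the sort is plausible, here is a minimal sketch (not gron's actual implementation, which is written in Go) of a walk over a parsed JSON value that emits gron-style assignment lines in document order. The function name `statements` and the simplification that every key is a valid identifier are assumptions made for the example.

```python
import json


def statements(value, path="json"):
    """Yield gron-style assignment lines for a parsed JSON value, in document order."""
    if isinstance(value, dict):
        yield f"{path} = {{}};"
        for key, child in value.items():
            # Assumption: keys are valid identifiers; other keys would need quoting.
            yield from statements(child, f"{path}.{key}")
    elif isinstance(value, list):
        yield f"{path} = [];"
        for index, child in enumerate(value):
            yield from statements(child, f"{path}[{index}]")
    else:
        # Scalars (strings, numbers, booleans, null) are re-encoded as JSON.
        yield f"{path} = {json.dumps(value)};"


doc = json.loads('{"data": [{"a": "a", "b": "b"}]}')
lines = list(statements(doc))
print("\n".join(lines))
```

Because the lines come out in traversal order, they can be written as they are produced, and a sort is only needed if a stable, key-ordered output is wanted.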

@tomnomnom
Owner

That sounds like a reasonable guess. I'll do some profiling and see what crops up.

Would you be able to share the source of your big JSON file so I can get a reasonable comparison?

@srwilson
Author

srwilson commented Sep 7, 2016

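Can't share the one I originally ran, but here's a Python script I made to create a file:

import json

# One small object, repeated a million times inside a top-level "data" array.
d = {"a": "a", "b": "b", "c": "c"}
dd = [d] * 1000000
# The parenthesized print works on both Python 2 and Python 3.
print(json.dumps({"data": dd}))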

That makes a 31MB file that did even worse:

real    2m44.392s
user    2m58.819s
sys     0m5.225s

@tomnomnom
Owner

Great, thanks!

@tomnomnom
Owner

tomnomnom commented Sep 7, 2016

@srwilson I'm not done yet, but I've made some changes in 2e2114b that should help you a bit.

There are a couple of really minor speedups here and there, but the two main things are:

  1. The sorting no longer bothers stripping color codes from the statements if you're using --monochrome
  2. There's a --no-sort option that, somewhat predictably, disables sorting
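The first point can be sketched roughly as follows. This is an illustrative Python version, not the Go code from the commit: the names `strip_colors` and `sort_statements` and the ANSI pattern are assumptions. The idea is that colored statements have to be compared on their uncolored text, whereas monochrome statements can be sorted as they are.

```python
import re

# Matches ANSI SGR color sequences such as "\x1b[31m" and "\x1b[0m".
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")


def strip_colors(statement):
    """Return the statement with any ANSI color codes removed."""
    return ANSI_RE.sub("", statement)


def sort_statements(lines, monochrome):
    """Sort statements by their visible text."""
    if monochrome:
        # No color codes present, so the stripping work can be skipped.
        return sorted(lines)
    return sorted(lines, key=strip_colors)


colored = ["\x1b[34mjson.a\x1b[0m = 1;", "\x1b[31mjson.b\x1b[0m = 2;"]
plain = [strip_colors(s) for s in colored]

print(sort_statements(plain, monochrome=True))
print(sort_statements(colored, monochrome=False) == colored)
```

Sorting the colored strings directly would order them by their escape codes rather than by path, which is why the stripping step is needed at all when colors are on.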

Using a JSON file generated from your python script:

tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron testdata/verybig.json > /dev/null

real    2m23.393s
user    2m33.124s
sys     0m0.932s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --monochrome testdata/verybig.json > /dev/null

real    0m35.218s
user    0m37.208s
sys     0m0.680s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --no-sort testdata/verybig.json > /dev/null

real    0m13.636s
user    0m15.208s
sys     0m0.680s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --no-sort --monochrome testdata/verybig.json > /dev/null

real    0m8.768s
user    0m10.148s
sys     0m0.632s

The --no-sort option is the major win, but sorted output is also far more acceptable than it was, as long as you use --monochrome. There's a little extra that can be done there too: the --monochrome flag could be forced when the output isn't a TTY, rather than having to be specified manually.
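The TTY check mentioned here is a common pattern. A minimal Python sketch of it follows; the helper name `use_color` is hypothetical, and gron itself would do the equivalent in Go.

```python
import io
import sys


def use_color(monochrome_flag, stream=sys.stdout):
    """Emit colors only if not disabled by flag and the stream is a terminal."""
    return not monochrome_flag and stream.isatty()


# An in-memory buffer stands in for redirected output: it is never a TTY.
redirected = io.StringIO()
print(use_color(False, redirected))
```

With a check like this, piping or redirecting the output gets the cheaper monochrome path without the user having to ask for it.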

I've had a think about using a streaming JSON parser, but you could only use it when --no-sort is in use, and it might make for significant complications in other parts of the code. I might do a bit of a POC to see how bad it would be though.
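The reason streaming only pays off without sorting can be shown with a small Python sketch. The generator `numbered` below is a hypothetical stand-in for a streaming parser; it records how many statements have been produced so far.

```python
def numbered(n, consumed):
    """Lazily produce n statements, recording each one as it is generated."""
    for i in range(n):
        consumed.append(i)
        yield f"json.data[{i}] = {i};"


# Unsorted: the first line is available after producing a single statement.
consumed = []
first = next(numbered(1000, consumed))
pulled_for_first_unsorted = len(consumed)

# Sorted: every statement must be produced before the first line is known.
consumed_sorted = []
first_sorted = sorted(numbered(1000, consumed_sorted))[0]
pulled_for_first_sorted = len(consumed_sorted)

print(pulled_for_first_unsorted, pulled_for_first_sorted)
```

A sort has to see its whole input before it can emit anything, so any memory or latency benefit from streaming the parse is lost as soon as sorting is enabled.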

There's still more to be done to make things better so I'm not going to close this issue right now.

In the meantime, I've tagged and released what I've done so far as 0.3.4.

Thanks again!

@tomnomnom
Owner

@srwilson nothing's tagged yet, but I thought you might be interested to know I've made some pretty big changes to gron's inner workings to make the sort more efficient (ec6e312).

The outcome is that worst-case performance (colors and sorting enabled) is now around 5 times better.

The slightly unfortunate thing is that the best-case performance (monochrome, no sorting) is slightly worse - mostly because of an increased number of allocations. Thankfully the massive refactor opens up new avenues for meaningful optimisation now that the sorting doesn't dominate quite so much.

Here are the same tests from above, repeated with a build from master:

tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron ~/tmp/big.json > /dev/null

real    0m28.844s
user    0m34.744s
sys     0m1.204s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --monochrome ~/tmp/big.json > /dev/null

real    0m22.123s
user    0m27.708s
sys     0m1.084s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --no-sort ~/tmp/big.json > /dev/null

real    0m18.683s
user    0m24.720s
sys     0m1.180s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --no-sort --monochrome ~/tmp/big.json > /dev/null

real    0m12.171s
user    0m17.404s
sys     0m1.072s

@tomnomnom
Owner

tomnomnom commented Sep 9, 2016

A few commits later and I've made some more improvements. I removed some unnecessary copies and made monochrome mode the default whenever the output isn't a terminal:

tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron ~/tmp/big.json > /dev/null

real    0m15.914s
user    0m17.804s
sys     0m1.280s
tom@girru:~/src/github.com/tomnomnom/gron (master)▶ time gron --no-sort ~/tmp/big.json > /dev/null

real    0m8.471s
user    0m10.928s
sys     0m1.216s

That puts worst case when stdout is redirected at about 9 times better, and best case (i.e. with --no-sort) about 17 times better than when this issue was raised.
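For reference, the ratios can be reproduced from the `real` timings quoted in this thread. The sketch below assumes the baseline is the 2m23.393s run from the first set of measurements, which is consistent with the figures quoted; the helper name `seconds` is made up for the example.

```python
import re


def seconds(t):
    """Convert a `time`-style duration such as '2m23.393s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", t)
    return int(m.group(1)) * 60 + float(m.group(2))


baseline = seconds("2m23.393s")  # assumed baseline: first run, sorting and colors on
worst = seconds("0m15.914s")     # latest run, sorted, stdout redirected
best = seconds("0m8.471s")       # latest run with --no-sort

worst_speedup = baseline / worst
best_speedup = baseline / best
print(f"{worst_speedup:.1f}x, {best_speedup:.1f}x")
```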

I'm going to consider the issue 'fixed', although I will continue to make things faster.

I've released all the changes as 0.3.6.

@srwilson thanks again for your input!
