Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parseJson is extremely slow #12152

Closed
FedericoCeratto opened this issue Sep 6, 2019 · 22 comments
Closed

parseJson is extremely slow #12152

FedericoCeratto opened this issue Sep 6, 2019 · 22 comments

Comments

@FedericoCeratto
Copy link
Member

FedericoCeratto commented Sep 6, 2019

parseJson on Nim 0.20.2 on amd64 is 5x slower than Python (!)
Steps to reproduce:

import json, times
let t = epochTime()
for l in lines("j"):
  discard parseJson(l)
echo epochTime() - t

curl -s https://ooni-data.s3.amazonaws.com/autoclaved/jsonl.tar.lz4/2019-07-20/20190720T180822Z-BR-AS28573-web_connectivity-20190720T180824Z_AS28573_yoUArbw59XNf9gyshmMwTJa6oeODxoRrp7ara6VhTBog0l5Izf-0.2.0-probe.json.lz4 | lz4 -dcfm > j
nim c -d:release --hints:off -r bench.nim 

The test file contains 2854 lines and takes 20s for Nim and 4s for Python 3

@krux02
Copy link
Contributor

krux02 commented Sep 6, 2019

Did you try packed json?

@cheatfate
Copy link
Member

@FedericoCeratto use -d:danger not -d:release, because -d:release is now equal to -d:debug.

@FedericoCeratto
Copy link
Member Author

FedericoCeratto commented Sep 7, 2019

Update: running a benchmark with only one JSON object (one line of the file), using -d:danger and Nim devel, picking the best parseJson time out of 10000 loops:

line #    len    Python time    Nim time
1     2717    40.7 usec      131 us
2   138561    1.12 msec     3431 us
3   497396    4.39 msec    11519 us
4    62906     834 usec     1474 us

Same against Python's ujson:
len    Python time    Nim time
2717	 27.6 usec	132 us	
138561  547 usec	3870 us	
497396  2.12 msec	11414 us	
62906	 320 usec	1477 us	
2869	 26.3 usec	130 us	

@dom96
Copy link
Contributor

dom96 commented Sep 7, 2019

@FedericoCeratto use -d:danger not -d:release, because -d:release is now equal to -d:debug.

-d:release is not equal to -d:debug: https://github.com/nim-lang/Nim/blob/devel/config/nim.cfg#L72-L79

@disruptek
Copy link
Contributor

I wouldn't say json is slow, but rather, that ujson is fast. Also, calling it Python isn't really fair; there's very little actual Python running in a simple repro. Try running Python's json library instead and see how it does. 😉

@FedericoCeratto
Copy link
Member Author

@disruptek Only the second chunk of the last test is done against ujson, everything else was against Python's json.

@disruptek
Copy link
Contributor

Can we see your Python? I thought maybe there was some cheating going on whereby data wasn't being unpacked until accessed. I made this native Python to test that theory; it's about twice as slow as Nim:

import json

fh = open("j")
for line in fh.readlines():
    x = json.loads(line)
    if x["test_version"] == "nah":
        break

@Clyybber
Copy link
Contributor

Clyybber commented Sep 7, 2019

Did you try packed json?

@krux02 I did, with -d:release it fails with a RangeError, with -d:danger it takes ~20% longer

@FedericoCeratto
Copy link
Member Author

@disruptek the Python you wrote takes 4.6s vs 11s on Nim devel with -d:danger
Replacing json with ujson in Python goes down to 2.6s

@cheatfate
Copy link
Member

@dom96 in terms of performance this options are now equal.

Difference between release and debug is

stacktrace:off
excessiveStackTrace:off
linetrace:off
debugger:off
line_dir:off
opt:speed

All of this options cannot seriously affect performance.

obj_checks:off
field_checks:off
range_checks:off
bound_checks:off
overflow_checks:off
assertions:off
@if nimHasNilChecks:
  nilchecks:off
@end

But this list of options is options which can kill all the performance.

@FedericoCeratto
Copy link
Member Author

FedericoCeratto commented Sep 8, 2019

A Nim snippet to pre-load and split the file in lines and time only the parseJson:

import strutils, json, times
let mylines = readFile("j").strip().splitLines()
echo mylines.len
let t = epochTime()
for l in mylines:
  discard parseJson(l)
echo epochTime() - t

@timotheecour
Copy link
Member

timotheecour commented Sep 9, 2019

pre-load and split the file in lines and time only the parseJson

  • benchmarking code like for l in mylines: needs to be done with caution, depending on how large mylines is, the main bottleneck can become page faults. Ideally the data used by that benchmarks fits in the cache
  • also it may be a good idea to ensure code doesn't get optimized away (both in python and nim), with:
var count=0 # dummy counter
for l in mylines:
  count += parseJson(l).len # or whatever is needed to avoid optimizing away
doAssert count != 0

one performance advantage python (and D) has over nim is that slicing strings/seq can be done both safely and without allocating (see https://dlang.org/articles/d-array-article.html). This obviously has serious performance implications in some applications.
At least supporting unsafe (first-class) slices would help for performance critical applications. See also nim-lang/RFCs#88

@Araq
Copy link
Member

Araq commented Sep 9, 2019

This obviously has serious performance implications in some applications.

No it doesn't for the simple reason that Python doesn't have O(1) slicing. (It could do it, but it doesn't. At least up to version 3.3 or something, can't check every Python version.)

@timotheecour
Copy link
Member

timotheecour commented Sep 9, 2019

for seq O(1) slicing is the raison-d'etre of numpy in python:

import numpy as np
a=np.array([1,2,3])
b=a[0:2]
a[0]=10
b # array([10,  2])

for string, admittedly, it's more awkward (and probably not used much for user facing code), but still doable:

a=bytearray(b"hello world")
a2 = memoryview(a)
a2[0]=105
a # bytearray(b'iello world')

as for D, strings are just special case of dynamic arrays (immutable(char)[]) and also has safe O(1) slicing (char[] can be used for mutable slices). nim doesn't have immutability but unsafe slicing would be a good compromise. It would need to be another type than seq[T] / string obviously, but it could potentially be openArray[T] under nim-lang/RFCs#88

@Araq
Copy link
Member

Araq commented Sep 9, 2019

It is planned to extend openArray[T] for this indeed.

@timotheecour
Copy link
Member

timotheecour commented Sep 9, 2019

that's really good to hear. Is there an issue to track that besides nim-lang/RFCs#88 ? (ideally a PR to an RFC markdown file so it can be edited by anyone, like nim-lang/RFCs#167)

@FedericoCeratto
Copy link
Member Author

@timotheecour I tested the benchmarks with discard and with tricks to prevent optimizing away the JSON parsing and it was the same.

@FedericoCeratto
Copy link
Member Author

@timotheecour
Copy link
Member

timotheecour commented Dec 27, 2019

I've reproduced json timing results from https://embark.status.im/news/2019/11/18/nim-vs-crystal-part-1-performance-interoperability/index.html on json file; nim is indeed slower than crystal by a factor 1.3X to 1.5X; parseJson is the slow part

I've tried packedjson as recommended here: #12152 (comment) and nim is now 1.08X faster than crystal:

  import pkg/packedjson
  let jobj = parseFile("/tmp/1.json")
  let coordinates = jobj["coordinates"]
  let len = float(coordinates.len)
  var x = 0.0
  var y = 0.0
  var z = 0.0

  for coord in coordinates:
    x += coord["x"].getFloat
    y += coord["y"].getFloat
    z += coord["z"].getFloat

  echo x / len
  echo y / len
  echo z / len

so maybe std/json could be improved and this issue re-opened? (without the 'extremely' part...)

note that parseJsonFragments is not used in https://embark.status.im/news/2019/11/18/nim-vs-crystal-part-1-performance-interoperability/index.html

@Araq
Copy link
Member

Araq commented Dec 30, 2019

That particular speed issue was solved though and packedjson gets its speed by breaking the API.

@FedericoCeratto
Copy link
Member Author

JSON parsing using the stdlib is still quite slow and people use it for comparative benchmarks across languages. IMO we should warn users in the stdlib documentation and keep an open issue.

@FedericoCeratto
Copy link
Member Author

Related: #3809

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants