Experimental implementation of the On Demand API. #13
cc @jkeiser |
Holy wow, this is the first time (I think?) I've seen an interpreted language plugin that gets us into multiple GB/s! Nice. |
@lemire - I have one question that I cannot crack: Does the On Demand parser assume that the structure of the JSON to be parsed is known in advance? Thanks. |
It does not. However, if you do know the schema, then you can benefit from that knowledge with on demand. |
Let me clarify. The idea is that you (the user) are supposed to take the value and do something with it... Even with the rewind functionality, it would still not be right to rewind whenever you want to access a value. So if you have cc @jkeiser |
@lemire @jkeiser - ok, thanks for the clarification. It more or less matches my "mental model". The problematic bit in Python is that if you want to store the value in the Python-native way, you have to construct that native type, and that is exactly what is slow ... pysimdjson/cysimdjson take advantage of delaying that conversion as much as possible; this is essential for the high performance of the bindings. So my current thinking is that some kind of "intermediate" storage at the C++ level will be needed for the On Demand API. And the question is how different this is from the "previous" API. |
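The delayed conversion described above can be illustrated with a minimal pure-Python sketch. It is not how pysimdjson or cysimdjson are implemented: `LazyDocument` and its `conversions` counter are hypothetical names, and a dict of raw JSON bytes stands in for the C++-side intermediate storage. The only point is that a native Python object is built when a value is first accessed, not at parse time.

```python
import json


class LazyDocument:
    """Hypothetical sketch: hold raw JSON per field, convert only on access."""

    def __init__(self, raw_fields):
        # key -> raw JSON bytes; a stand-in for storage kept on the C++ side
        self._raw = raw_fields
        self._cache = {}
        self.conversions = 0  # number of native Python objects built so far

    def __getitem__(self, key):
        if key not in self._cache:
            # The expensive step: constructing the native Python object
            self._cache[key] = json.loads(self._raw[key])
            self.conversions += 1
        return self._cache[key]


doc = LazyDocument({"a": b"[1, 2]", "b": b'"text"'})
first = doc["a"]   # converts "a" only
again = doc["a"]   # served from the cache, no second conversion
```

Fields that are never accessed (here `"b"`) never pay the object-construction cost.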
Right. The python-C++ interface is a nasty challenge. I am aware. :-) |
Note that the on demand interface has matured considerably since... |
Unfortunately the gotcha still exists even with the matured API - it's the same reason pysimdjson has avoided it so far. Given the overwhelming overhead of object construction, the only benefit simdjson wrappers offer to Python over some easier-to-integrate options (like yyjson) is the DOM model for delayed object creation. |
@TkTech Granted, but I wanted to stress that many of the earlier comments in this issue are obsolete. |
@TkTech assuming Python has optimizations for short-lived objects (which I imagine it does), one design I've been thinking about for On Demand Python is to forgo the simdjson frontend entirely: make a single call to the tokenizer (stage 1), stash those indices in a Python array, and then do an On Demand frontend in Python. That way the opaque C++ boundary doesn't get in the way and Python can do any optimizations it wants (as opposed to when you have to call out to C++ for each value). |
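A rough sketch of the design described above, written entirely in Python so that it runs on its own. Everything here is an assumption made for illustration: `structural_indices` is a slow pure-Python stand-in for a single call to simdjson's stage 1 (no such Python-level call is confirmed by this thread), and `get_top_level` and `value_end` are hypothetical helpers playing the role of an On Demand frontend that walks the stashed indices and materializes only the requested value. The input is assumed to be valid JSON.

```python
import json
from array import array


def structural_indices(buf):
    """Stand-in for stage 1: offsets of structural characters and token starts."""
    out = array("Q")
    in_string = escaped = False
    at_token_start = True  # True when the next non-space byte starts a token
    for i, b in enumerate(buf):
        if in_string:
            if escaped:
                escaped = False
            elif b == 0x5C:      # backslash escapes the next byte
                escaped = True
            elif b == 0x22:      # closing quote
                in_string = False
            continue
        if b in b" \t\r\n":
            at_token_start = True
        elif b in b"{}[]:,":
            out.append(i)
            at_token_start = True
        elif b == 0x22:          # opening quote starts a string token
            out.append(i)
            in_string = True
            at_token_start = False
        elif at_token_start:     # first byte of a number, true, false or null
            out.append(i)
            at_token_start = False
    return out


def value_end(buf, idx, t):
    """Byte offset just past the value that starts at token number t."""
    if buf[idx[t]] not in b"{[":
        return idx[t + 1]        # scalar or string: runs up to the next token
    depth = 0
    for j in range(t, len(idx)):
        c = buf[idx[j]]
        if c in b"{[":
            depth += 1
        elif c in b"}]":
            depth -= 1
            if depth == 0:
                return idx[j] + 1
    raise ValueError("unbalanced document")


def get_top_level(buf, idx, key):
    """Find a top-level object key and build a Python object for its value only."""
    # Simplification: compares the raw key bytes, so escaped keys are not handled.
    wanted = json.dumps(key, ensure_ascii=False).encode()
    depth = 0
    for t in range(len(idx)):
        pos = idx[t]
        c = buf[pos]
        if c in b"{[":
            depth += 1
        elif c in b"}]":
            depth -= 1
        elif c == 0x22 and depth == 1 and t + 2 < len(idx) and buf[idx[t + 1]] == 0x3A:
            # A string followed by ':' at depth 1 is a top-level key
            if buf[pos:idx[t + 1]].rstrip() == wanted:
                return json.loads(buf[idx[t + 2]:value_end(buf, idx, t + 2)])
    raise KeyError(key)


doc = b'{"a": [1, {"b": 2}], "b": "x,y", "c": 3.5}'
idx = structural_indices(doc)      # the one "tokenizer" call
value = get_top_level(doc, idx, "b")
```

In this sketch the index array holds only integers, and no Python string is created for any key or value other than the one that is requested.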
@jkeiser that would be an interesting approach, and it would be nice functionality to have for other things (cases when the end user knows they will need the entire document at some point), but even the cost of creating that initial array is certainly higher than sparse access through the DOM model. Creating the strings in the array is extremely expensive because of how Python stores strings internally, which requires a copy. |
Ah, I might have misunderstood. The source buffer comes from Python (which will usually already be in UTF-8 internally, so we can get a zero-cost string pointer for C++), and we're only calling simdjson to get the numeric indices for tokens. |
I think that @jkeiser's proposal is worth investigating and it is on my todo. Whether it works out is a research question, but it is practical in the sense that it does not require 'years' of difficult implementation. Although, I must say, there are difficulties. Related discussion: simdjson/simdjson#1912 Note that it applies to JavaScript runtimes as well... oven-sh/bun#2570 |
Let me quote the results of the recent JavaScript efforts (@Jarred-Sumner)...
|
I'm not sure what @ateska's plans are for this repo (I feel like I'm hijacking his issue :)) but on the pysimdjson side ideally we'd get something like... from_buffer(const char *buffer, uint64_t size_of_buffer, uint64_t **output, int *bytes_read, int *indices_written, malloc_func_t, realloc_func_t) ... which would be very easy to integrate and allow us to do proper memory tracking and re-use. |
In JSC's case, strings must be either latin1 or UTF-16. If the programming language internally supports UTF-8 strings it may be cheaper. |
simdjson recently introduced the On Demand API as its default API.
This is an experiment that employs this API in Python/Cython.
The following issues have been identified so far:
Speed is indeed impressive: