[Feature]: Faster JSON dumps in format_message with an alternative JSON SerDe library #1046

Comments
That's right, although the SDK never used that since it depended on an older version of …. But after #979, we can perhaps explore using …. Also, there are multiple ways to tackle performance in Singer pipelines, and we've already received feedback about improvements using ….
That's great feedback @edgarrmondragon, also thanks for taking the time to provide the back story re orjson.
This has been updated in my fork of the SDK. The method is:

```python
import decimal

import orjson

def format_message(message, option=0):
    def default(obj):
        if isinstance(obj, decimal.Decimal):
            return int(obj) if float(obj).is_integer() else float(obj)
        raise TypeError
    return orjson.dumps(message.asdict(), option=option, default=default)
```
Testing a half-million (500,000) row table using a branch in my tap-db2 variant (a rewrite using the Meltano SDK), the ….

The speed of the same table in seconds for Meltano SDK vs tap-db2 1.0.4: SDK - 126.02 seconds; tap-db2 - …. @edgarrmondragon
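For anyone wanting to reproduce this kind of comparison on a smaller scale, a minimal stdlib-only timing harness might look like the following. The message shape and counts here are illustrative, not the actual tap's data; any alternative library's dumps/encode callable can be passed in place of `json.dumps`:

```python
import json
import timeit

# Illustrative RECORD message shaped like a Singer row.
MESSAGE = {
    "type": "RECORD",
    "stream": "users",
    "record": {"id": 1, "name": "Ada", "balance": "1234.56"},
}

def time_dumps(dumps, number=10_000):
    """Time `number` serializations of MESSAGE with the given dumps() callable."""
    return timeit.timeit(lambda: dumps(MESSAGE), number=number)

stdlib_seconds = time_dumps(json.dumps)
print(f"stdlib json: {stdlib_seconds:.3f}s for 10,000 messages")
# Swap in another serializer once installed, e.g. time_dumps(orjson.dumps).
```

Running the same callable count for each candidate library keeps the comparison apples-to-apples.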
While we're considering alternatives for JSON reading/writing in Python, I'll call out msgspec. Its benchmarks show it being significantly more performant than the alternatives. I haven't personally used it, but it (along with others) is worth considering before we choose a third-party library to handle JSON writing in the SDK.
Latest msgspec only seems to support Python 3.8+, so if we decide for it then it'd be good to follow #1659 with this.
Having updated the SDK in my own fork to use orjson, I can see some speed benefits; however, the libraries handle decimals differently.

simplejson

```python
>>> simplejson.dumps(decimal.Decimal("1111111111111.333333333333333333"), use_decimal=True)
'1111111111111.333333333333333333'
```

Digging down into the module code, this is because the core encoder emits Decimal values via str():

https://github.com/simplejson/simplejson/blob/master/simplejson/encoder.py#L521-L522

```python
elif _use_decimal and isinstance(value, Decimal):
    yield buf + str(value)
```

orjson

Without handler:

```python
>>> orjson.dumps(decimal.Decimal("1111111111111.333333333333333333"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Type is not JSON serializable: decimal.Decimal
```

With handler: implement a default function, as described in the orjson documentation:

```python
>>> import orjson, decimal
>>> def default(obj):
...     if isinstance(obj, decimal.Decimal):
...         return str(obj)
...     raise TypeError
...
>>> orjson.dumps(decimal.Decimal("1111111111111.333333333333333333"), default=default)
b'"1111111111111.333333333333333333"'
```

The result is wrapped in quotes and would end up in JSON as:

```
"record": {"decm_scale": "1111111111111.333333333333333333"}
```

At this point, we could try to use float instead:

```python
>>> import orjson, decimal
>>> def default(obj):
...     if isinstance(obj, decimal.Decimal):
...         return float(obj)
...     raise TypeError
...
>>> orjson.dumps(decimal.Decimal("1111111111111.333333333333333333"), default=default)
b'1111111111111.3333'
```

msgspec

```python
>>> msgspec.json.encode(decimal.Decimal("1111111111111.333333333333333333"))
b'"1111111111111.333333333333333333"'
```

Conclusion

Using a float handler loses precision (the value is rounded to 1111111111111.3333), while encoding as a string preserves every digit; simplejson and msgspec both emit the full value.

Further reading on SO: https://stackoverflow.com/questions/1960516/python-json-serialize-a-decimal-object

@edgarrmondragon @WillDaSilva - I guess the questions for these two improved modules are: ….
@mjsqu thanks for that amazing deep dive! I think we're gonna have to live with the loss of precision 😕. Probably very few taps are de-serializing number values in API responses to decimal.Decimal. In order not to break taps that do rely on that, … (see sdk/singer_sdk/helpers/_typing.py, line 469 in dfcec5f).

With that in mind, we'd sacrifice numeric precision for a minority of taps for a gain in performance.

Also, I really like msgspec's typed structs:

```python
import typing as t

import msgspec

class Message(msgspec.Struct, tag_field="type", tag=str.upper):
    pass

class Record(Message):
    stream: str
    record: dict[str, t.Any]

class Schema(Message):
    stream: str
    schema: dict[str, t.Any]

msg = Record("users", {"id": 1})
buf = msgspec.json.encode(msg)
roundtrip = msgspec.json.decode(buf, type=t.Union[Record, Schema])
assert isinstance(roundtrip, Record)
```
@edgarrmondragon I ran across rapidjson. You can see its benchmarks here: https://python-rapidjson.readthedocs.io/en/latest/benchmarks.html
I would really like to advocate for accurate movement of decimal data. We can't assume we understand the usage of the data, or whether the accuracy of the scale or precision is important. For the sake of trust in the overall Meltano ecosystem, taps, and targets, I would advocate accuracy over speed. It is for this reason that we are using the Singer.Decimal notation: to ensure we don't get rounding in our decimal and float datatypes.
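The precision trade-off under discussion is easy to demonstrate with the stdlib alone; the value below is the same one used in the examples earlier in the thread:

```python
import decimal
import json

value = decimal.Decimal("1111111111111.333333333333333333")

# Serializing via float silently rounds to double precision.
as_float = json.dumps(float(value))   # '1111111111111.3333'

# Serializing via str keeps every digit, at the cost of quoting the number.
as_string = json.dumps(str(value))    # '"1111111111111.333333333333333333"'

# The string form round-trips to the exact Decimal; the float form does not.
assert decimal.Decimal(json.loads(as_string)) == value
assert decimal.Decimal(str(json.loads(as_float))) != value
```

Any target that cares about exact decimals therefore needs the string form (or a Singer.Decimal-style convention) somewhere in the pipeline.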
Oh, the other thought is: I wonder if …. There seems to be some discussion about this as an issue on the … repo.
I really like how
I agree with this 100 percent.
There have been issues raised about adding decimals to orjson - ijl/orjson#172: feature request for dumps of decimal. It looks like there was a PR, but it was not merged because no clear decision was made by the authors of the PR whether to serialize as a ….
I am having a hard time understanding how to use msgspec.

I am still reading and playing with it.
Hi,

This isn't correct. If no types are specified (the default), msgspec works like any other JSON library:

```python
In [1]: import msgspec

In [2]: msgspec.json.decode('{"hello": 1, "world": 2}')  # works the same as any other json.loads
Out[2]: {'hello': 1, 'world': 2}

In [3]: msgspec.json.encode(["no", "need", "to", "use", "structs"])  # works the same as any other json.dumps
Out[3]: b'["no","need","to","use","structs"]'
```

There's no need to use structs for typed decoding if you don't want to. The highest performance on the decoding side will come if you use msgspec for both decoding and validation, but you can also use msgspec just for serialization, the same as you would any other JSON library.

If you have any suggestions for how we could make that clearer in our docs, I'd appreciate them. This is covered both on the main page and in our usage docs, but since there was a misunderstanding here, clearly we can do better :).

I'm not familiar with …. Anyway, no pressure to use ….
I have been playing with the three JSON libraries mentioned in this issue and thought it might be good to publish my somewhat-working versions of the SDK utilizing them as draft PRs. I promise all three completed pytest locally (ok, except the S3 test). I have been doing some testing locally with ….

I then run ….

Here are some notes I have on all three.

msgspec

msgspec requires Python 3.8 or greater. I bumped the lowest version of Python in the …. Tracked this down to the fact SQLAlchemy results return the type ….

orjson

Adding orjson was uneventful. It also returns `bytes` ….

rapidjson

Adding rapidjson was uneventful. The main issue is Python doesn't seem to understand a thing about it: no autocomplete or info on methods in VSCode. It runs and works, but I miss the visual cues. A ….
Thanks for the informative writeup @BuzzCutNorman!
Worth mentioning that the SDK will be dropping support for Python 3.7 soon, since Python 3.7 reaches end-of-life on 2023-06-27: |
A few notes on your msgspec implementation:
While this works, it does have a performance cost. If you're trying to optimize JSON performance you likely don't want whitespace anyway, and comparing msgspec with the extra format call to other libraries without it unnecessarily hampers msgspec's performance.
I'd be happy to push a fix upstream for this, supporting string subclasses as keys in a dict.

As I mentioned above, I highly recommend using strings to represent decimal types, as encoding them as numbers makes it harder for downstream tools to decode without losing precision. That said, we have an open issue for this, and if it's a blocker for Meltano's use case I'd be happy to push up a fix (it's pretty simple to support).
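The decimals-as-strings recommendation is the same escape-hook pattern shown earlier in the thread. Here it is with the stdlib json's `default=` parameter for illustration (msgspec's equivalent hook is `enc_hook`):

```python
import decimal
import json

def default(obj):
    # Emit decimals as strings so no precision is lost in transit.
    if isinstance(obj, decimal.Decimal):
        return str(obj)
    raise TypeError(f"Type is not JSON serializable: {type(obj).__name__}")

record = {"decm_scale": decimal.Decimal("1111111111111.333333333333333333")}
encoded = json.dumps(record, default=default)
print(encoded)  # {"decm_scale": "1111111111111.333333333333333333"}
```

Downstream tools can then parse the string back into an exact decimal at their leisure.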
@jcrist Thank you 🙏 for creating that case I will update it with the info I have. |
Here are the updated times for msgspec with the ….

Comparison between the two msgspec times, with and without ….

msgspec
@jcrist Thank you 🙏 for getting this added so quickly. I did some testing this weekend and more this morning, and it is working great. No need to convert all keys to str.
Some of you might have already known this, but I just learned we don't need to run ….
I was wondering if there is a reason ….
This issue has now been resolved (PR here). I still recommend encoding decimal values as strings for cross-tool compatibility, but the upcoming msgspec release will include support for encoding and decoding decimal.Decimal values.
I just updated the SDK PR #1784 with msgspec to use version 0.17.0, which has all the fixes and features mentioned above. The ….

A big THANKS 🙏 to @jcrist
Does anyone have any objections to me closing the SDK PRs for …?
Just seeing this. The answer is simply oversight 😅.
@BuzzCutNorman None at all. I think folks (including myself) in this discussion feel more inclined to further investigate msgspec.
@edgarrmondragon Cool, I will delete those draft PRs tomorrow. Thanks for answering the question about ….
Thank you for your work on this @BuzzCutNorman
@BuzzCutNorman Yeah, I'm very curious if there are gains in deserialization too. Although I don't think it's a blocker for the serialization side, so feel free to add it or punt it as you see fit 😁.
@jcrist I have been working on adding ….

I set up a ….

I looked at …. Here is an example: ….

Using ….
I was wondering if we should change the title of this issue from "[Feature]: Faster JSON dumps in format_message with orjson" to "[Feature]: Faster JSON dumps in format_message with an alternative JSON SerDe library".
There currently isn't a clean way of doing this, but it's a feature I'd be happy to add. I've opened jcrist/msgspec#507 to track this. I'd expect this to be a fairly quick feature to add; I should be able to get it out with the upcoming release next week.

If the decoding side knows the expected schema (either statically, or dynamically as you've mentioned here), then parsing into a msgspec.Struct will be the fastest option. If knowing the schema beforehand (or converting the schema into Python type annotations) is tricky, then using the ….
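The typed-decoding idea (parsing straight into a known shape instead of generic dicts) can be sketched with stdlib dataclasses; the class and field names below are illustrative. msgspec's Struct decoding fuses the decode and coerce steps into one, which is where its speed comes from:

```python
import json
from dataclasses import dataclass
from typing import Any

@dataclass
class Record:
    stream: str
    record: dict[str, Any]

def decode_record(raw: str) -> Record:
    # Generic decode first, then coerce into the known shape.
    # msgspec.json.decode(raw, type=Record) performs both in one step.
    data = json.loads(raw)
    return Record(stream=data["stream"], record=data["record"])

rec = decode_record('{"type": "RECORD", "stream": "users", "record": {"id": 1}}')
print(rec)  # Record(stream='users', record={'id': 1})
```

With msgspec the intermediate dict never materializes, so typed decoding is faster than generic decoding rather than slower.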
(The issue title was later changed from "[Feature]: Faster JSON dumps in format_message with orjson" to "[Feature]: Faster JSON dumps in format_message with an alternative JSON SerDe library".)
That is funny, I was just writing an update to say I got a kind of working ….
I thought I'd just drop this in here for people's awareness since it hasn't been raised so far: https://github.com/simdjson/simdjson. As far as I'm aware, simdjson is by far the most performant JSON library.
Looking at the Python bindings available for it:

- libpy_simdjson
- cysimdjson
- pysimdjson
Hi, I have fully implemented msgspec support (including singer_decimal), updated and supported in the original pipelinewise-singer-python framework. I found moving from ….

https://github.com/s7clarke10/pipelinewise-singer-python/releases/tag/4.0.0

I intend to start updating a number of taps which I maintain that use the older singer-python framework to this latest patched version.
That's great! And FWIW I think there may be even bigger gains to be found when using msgspec in targets, but I've yet to confirm :) |
Feature scope
Taps (catalog, state, stream maps, etc.)
Description
Speed of writing messages can add up when you are dumping a large amount of data. One of the things in the original pipelinewise-singer-python was the use of orjson for JSON dumps, as it is a lot faster.
The line in question is this:
sdk/singer_sdk/_singerlib/messages.py
Line 165 in 796ccda
We have a variant of pipelinewise-singer-python which uses orjson in place of json.dumps for improved speed. The code for that section is as follows.
Sorry the formatting is all out, I couldn't work out how to get the code formatting working.
Best look at the example below.
https://github.com/mjsqu/pipelinewise-singer-python/blob/2a839d245b92d1ec6d5801a30f5280fa77b22d0c/singer/messages.py#L295
I note that while orjson is faster, it is not perfect; for example, we need to convert decimal data to a string. There are some other considerations mentioned with the module, but if they are not an issue: we have measured a considerable performance gain using this approach and verified it with cProfile by looking at the time consumed.
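A cProfile verification along those lines can be sketched entirely with the stdlib; the message loop below is an illustrative stand-in for a tap's output, not the actual tap code:

```python
import cProfile
import io
import json
import pstats

def write_messages(n=1000):
    # Illustrative stand-in for a tap serializing a stream of RECORD messages.
    for i in range(n):
        json.dumps({"type": "RECORD", "stream": "users", "record": {"id": i}})

profiler = cProfile.Profile()
profiler.enable()
write_messages()
profiler.disable()

# Report cumulative time, which is where the dumps() cost shows up.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

Swapping the serializer (orjson, msgspec, rapidjson) and re-running the same profile makes the time-consumed comparison concrete.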