-
-
Notifications
You must be signed in to change notification settings - Fork 746
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update JSON serialization and deserialization code to use orjson
library
#5153
Conversation
orjson
library. ## TODO - [ ] Add new json_encode
and json_decode
functions which handle json encoding and decoding using diffrent backends (json, ujson, orjson) - [ ] Add config option / feature flag which dictates which JSON library to use - [ ] Update API layer code (requests, responses) to use json_encode and decode - [ ] Update rest of the code to use json_encode and decode - [ ] Update fast_deepcopy
to use orjson - [ ] Decide if / when orjson
should be default for the new config option valuorjson
library.
orjson
library.orjson
library
library will be using for encoding and decoding. Default it to orjson and add some tests for it.
orjson
libraryorjson
library
A few thoughts:
I think this is not a configuration option we should give to users. We should trust our tests and testing procedures to catch and uncover JSON parsing issues and switch to the backend that we (developers) think gives the most benefit to our users. I do think that orjson is probably the best library for this, and I'm happy to switch to it, but I want us to be prepared for any issues that crop up, and I don't think we should make this user-configurable. |
Which magic method are you referring to? If you mean
That's exactly the reason why that configuration option / feature flag is there. In general, as mentioned in the discussions thread, this change already exposed some issues with incorrect code in some places which mostly results from incorrect and inconsistent conversion between bytes <-> unicode. Incorrect in a sense that right now the code doesn't throw, but the response will be "incorrect" - incorrect in a sense that some string data in responses will contain
I think that's living in an ideal and realistic world which simply doesn't exist in real-life (aka there will always be edge cases which tests won't catch). I do agree though that we shouldn't publicly advertise that feature and treat it more as a feature flag which should be used in the worst case scenario (which hopefully won't happen). I believe we utilized configuration options in similar situations in the past. |
And in general I also think In the past, we have been bit so many times by non-strict nature of various casting and other code. Problem with using best effort, non strict, etc. approach is that it may work for a lot of scenarios, but when it breaks, it's very hard to troubleshoot and fix it. |
I believe packages step is failing because we are using older pip version which doesn't include support for manywheel wheel format orjson uses so it tries to compile it from scratch instead of using a wheel (https://github.com/pypa/manylinux). Will look into fixing that. |
Related PR to get st2-packages build to pass - StackStorm/st2packaging-dockerfiles#103. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
I noticed one minor docstring typo, but whatever. This looks awesome.
just rely on module level constant which we can patch during tests to perform compatibility tests.
Co-authored-by: Jacob Floyd <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can drop the changes in tools/config_gen.py now that json_library
is a module constant instead of a config option.
@blag Can I please get a review when you get a chance. I would like to wrap this on then move to wrapping up DB one now that I'm unblocked by pip stuff. I removed the config option, changed it to a module level constant so we can still exercise it in compatibility tests. |
On mobile right now, will review later today. 👍 |
@blag Thanks. I'm somewhat confused by u16 e2e test failure. Is this a known issue, or a race or smth? Looking at the test output, I'm quite confused - https://gist.github.com/Kami/0d83da44842daa3cbb9dcef670299df6. That alias doesn't have representation (https://github.com/StackStorm-Exchange/stackstorm-st2/blob/master/aliases/actions_list.yaml) so that assertion looks wrong to me. But how did it pass before? And my code touched none of that. I would assume it was race or smth, but assertion doesn't seem to make sense since it doesn't add up with the actual alias defined in stackstorm-st2 pack. To me it seems like that bats test should check for Same tests also pass on other distros which makes me think it's actually some kind of race - if it was a bigger issue related to this change, that would already be caught by other tests. |
OK, ignore my comment above - I'm still digging into end to end tests output and the instance itself... Made some progress: 2021-03-06 22:51:19,147 139824208471152 DEBUG shell [-] Returning.
2021-03-06 22:51:19,148 139824208471152 DEBUG python_runner [-] Returning values: 1, %%%%%~=~=~=************=~=~=~%%%%, Traceback (most recent call last):
File "/opt/stackstorm/st2/lib/python3.6/site-packages/python_runner/python_action_wrapper.py", line 381, in <module>
obj.run()
File "/opt/stackstorm/st2/lib/python3.6/site-packages/python_runner/python_action_wrapper.py", line 252, in run
sys.stdout.write(print_output + "\n")
UnicodeEncodeError: 'ascii' codec can't encode character '\u2022' in position 58: ordinal not in range(128)
, False Looks like it's some default sys.encoding issue on u16, looks like it's set to ascii and not utf8 and that's why it's failing. EDIT: I believe the issue is that locate on u16 is not set correctly for action runner process. |
OK, yeah I was able to reproduce the issue - it's indeed locale related on Ubuntu 16.04. If it's set to utf8 it works fine, but it's set to ascii it blows up due to change in how orjson serialized unicodes (it doesn't require serialized it to ascii escape sequences, but utilizes unicode directly which is preferred). I'm having troubles dynamically setting Previously it would work because @blag I believe this is the same issue you noticed with regards to very large audit logs and a bunch of ``\\\```. This issue results in that as well - it basically happens when server is not using utf-8 locale (which is the case here_ and something tries to log a unicode message. This would cause endless loop when trying to format logging value and constantly failing. The issue is a combination of Python 3 (using unicode instead of string with ascii escape sequence) and server / action runner not having set correct locale. In short, it looks like this issue predates my PR, but my PR made it more likely to happen and got exposes since u16 doesn't seem to have locale set up correctly for action runner. |
sys.tdout.buffer works with bytes where as sys.stdout works with unicode. orjson returns bytes and this way we avoid unncessary conversation back and forth between bytes and unicode. In addition to that, using bytes means it will also work correctly if the system locale is not set to utf-8.
This will make it easier to troubleshoot locale / encoding related issue. Also make sure we print those version related messages under INFO log level instead of DEBUG since they may be material when troubleshooting various issues so we should use INFO.
OK, finally the whole build is green 🎉 🍾 |
tools/config_gen.py
Outdated
@@ -41,6 +41,7 @@ | |||
] | |||
|
|||
SKIP_GROUPS = ["api_pecan", "rbac", "results_tracker"] | |||
SKIP_OPTIONS = ["json_library"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that json_library
is not a config option, do you still want to leave this here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. We talked about not having user config options for json lib.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the config was removed but the this flag was left here so we can also support ignoring specific config options (previously this script didn't support that).
I will change SKIP_OPTIONS
to an empty list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly OK but some clean up requests.
@@ -48,7 +48,7 @@ | |||
from st2common.util import jinja as jinja_utils | |||
from st2common.util import param as param_utils | |||
from st2common.util.config_loader import get_config | |||
from st2common.util.ujson import fast_deepcopy | |||
from st2common.util.deep_copy import fast_deepcopy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we call this module json util or something so to be clear this is json specific operation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intentionally renamed it to deep_copy
since I think previous name was a bad one (I know, I picked it originally :D) - previous name was leaking implementation details which should not matter to the end user.
I think deep_copy
is a better name since it conveys what the module is used for - deep copying dictionaries.
And the module is not JSON specific either, it's can deep copy arbitrary dictionaries with simple types. orjson
is just an implementation detail of that function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want this to be confused with copy.deepcopy which is not json or dict specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I can rename it to fast_deepcopy_dict
then? (the function name that is)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I guess I need to be more explicitly - it doesn't just support dicts either (that's just how we use it).
It supports any kind of value as long as it only contains simple types (so no class instances, etc.).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per discussion on Slack, I pushed a couple of changes - add some more tests, update function docstring comment, rename it to fast_deepcopy_dict
(bd4eb01, c80d07a, 9ba8d45).
As discussed, it can also be used on other simple values (think lists) and not just dictionaries, but using it on dictionaries with simple value types is our primary use case for it so I think that name is an OK compromise for now.
tools/config_gen.py
Outdated
@@ -41,6 +41,7 @@ | |||
] | |||
|
|||
SKIP_GROUPS = ["api_pecan", "rbac", "results_tracker"] | |||
SKIP_OPTIONS = ["json_library"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. We talked about not having user config options for json lib.
dictionari with simple (think JSON) types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Kami Thank you! This is going to significantly improve st2.
This pull request updates code base to use
orjson
library for serialization and de-serialization everywhere.Background, Context
Right now StackStorm code-base uses native python
json
library for serializing / de-serialiazing JSON. Only exception to that is our version ofcopy.deepcopy
function which usesujson
since it's up to 10x faster than nativecopy.deepcopy
.Now that we have dropped support for Python 2, we can look at using
orjson
(https://github.com/ijl/orjson) everywhere.orjson
should be more or less backward compatible withjson
, but also quite a bit faster (even faster thanujson
and unlikeujson
, it's also actively maintainer).Proposed Change
I will update / add new
json_encode()
andjson_decode()
function which should be used everywhere in the code where we need to deal with JSON.Those functions will utilize new feature flag / config option which will allow user to specify which json library to use. This feature flag should mostly serve as a safe-guard so user can revert back to native
json
library in case issues arise once the code is out.TODO
json_encode
andjson_decode
functions which handle json encoding and decoding using diffrent backends (json, ujson, orjson)fast_deepcopy
to use orjsonorjson
should be default for the new config option value