Optimize storage (serialization and de-serialization) of very large dictionaries inside MongoDB #4846
Conversation
serialize and unserialize large dictionary data (such as result field, etc.).
(opt-in) by default.
escaping the values adds tons of overhead.
Force-pushed from 3f9ca22 to 59e87f9
st2common/st2common/fields.py
```python
def to_mongo(self, value, use_db_field=True, fields=None):
    if not cfg.CONF.database.use_json_dict_field:
        value = mongoescape.escape_chars(value)
    return super(stormbase.EscapedDynamicField, self).to_mongo(
        value, use_db_field=use_db_field, fields=fields
    )
```
This bugs out because JSONDictEscapedFieldCompatibilityField is incompatible with EscapedDynamicField.
```python
@@ -152,6 +152,10 @@ class ActionExecutionAPI(BaseAPI):
    @classmethod
    def from_model(cls, model, mask_secrets=False):
        doc = cls._from_model(model, mask_secrets=mask_secrets)

        import json
        doc['result'] = json.loads(doc['result'])
```
I have to add this to get the result in JSONDictEscapedFieldCompatibilityField with the ujson/cjson backend to work properly. This is because _from_model will call to_mongo, which dumps the result dict to a JSON string; without this, the CLI output is displayed as a string.
This experimental patch is not ready yet. I tested it and it is buggy, and I also don't like that it needs to json.dumps/loads more often than necessary. I commented on the problem places in the code in this PR.
Thanks for contributing to this issue. As it has been 90 days since the last activity, we are automatically marking it as stale. If this issue is not relevant or applicable anymore (problem has been fixed in a new version or similar), please close the issue or let us know so we can close it. On the contrary, if the issue is still relevant, there is nothing you need to do, but if you have any additional details or context which would help us when working on this issue, please include it as a comment to this issue.
I have taken a look at this. +1 from me. Saving the result as a string instead of a mongo object/dict is ideal. There is no current use case to index nested dicts from workflow transitions while running.
using two different approaches for serializing execution / live action result.
After some discussions on Slack, I decided to attack this problem again and wrap up this change - the slow insertion of large executions (which is fairly common in the StackStorm world) still hasn't gone away and is as much alive as it was 3 years ago.

I already performed a lot of benchmarking, research and prototyping in the past, but since quite a lot of time has passed since then (and the data is scattered across various issues and PRs), I decided to start from scratch with micro benchmarks which compare different approaches for storing the result field. A good thing about micro benchmarks is that it's very easy to add a new one or use a different fixture file, etc. to see how that affects the numbers (and they are fully reproducible by everyone).

First, to refresh everyone's memory (including mine), the two issues which contribute to slow insertion / saving of executions with large results:

1. Escaping (and unescaping) the special characters $ and . in the keys of large, nested result dictionaries on every write and read.
2. The overhead of the ORM layer converting those large nested dictionaries to and from MongoDB-compatible types.
And one very important invariant on the result field itself - we use the dictionary type for the execution result field, but for the purposes of the database itself, that field is an opaque binary blob - we don't perform any querying on its values. That's very important, because it allows us to treat it as a string / binary blob, which makes things easier (in fact, one of the options I already debated in the past is storing the result in a separate collection as a string value and then adding a FK to the LiveAction collection).

Proposed Approach

The proposed change to mitigate / fix this issue is still the same one I proposed and researched more than a year ago - storing the result dictionary serialized as a JSON string / blob. The only difference is that now I use orjson for the serialization (a minimal sketch of such a field is included below).

A lot of other approaches were also brought up in the past - things such as removing and replacing mongo altogether, using a different ORM library, etc. Yes, many of those would likely result in much better performance if they were executed and implemented correctly (if!), but most of them are not even close to realistic - they are much riskier (a much larger change, harder to make backward compatible, etc.) and would require many person-months worth of work (and that's the ideal scenario; a less ideal and more realistic scenario, where various edge cases and compatibility issues crop up, is closer to 6 months). This means that this approach is one of the few realistic approaches we have left.

This approach exploits the invariant described above: since the database treats the result field as an opaque blob, we are free to store it in whatever format is cheapest to serialize and deserialize.

Initial Results

Initial results from the micro benchmarks are very promising (the "Save" and "Read" benchmark charts are not reproduced here). For a very large execution (8 MB), we see an almost ~10x improvement in insertion speed (5s vs 500ms) and a ~5x improvement in read speed (800ms vs 150ms). The good news is that this change also scales down - aka it's not slower for smaller results; in fact, it's a bit faster in those scenarios as well.

Those numbers are from running the micro benchmarks on my computer. They also run continuously on CI, and everyone who wants to can easily run them themselves and add a different dataset (e.g. some real-life execution result data).

Things to experiment with

I still want to experiment with compressing the actual JSON blob. The goal here is not to save bytes on disk (those were never really a big issue), but to reduce the number of bytes which need to be sent over the network when talking to the database - this could perhaps reduce the overall operation time even more for very large executions, but I doubt it will be worth the effort in the end. We'll see what the data shows.

TODO
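To make the proposed approach concrete, here is a minimal sketch of such a field, assuming mongoengine and orjson; the details are illustrative and not the actual implementation from this PR:

```python
# Minimal sketch, assuming mongoengine and orjson; details are
# illustrative and not the actual implementation from this PR.
import orjson
from mongoengine.fields import BinaryField


class JSONDictField(BinaryField):
    """Stores a dict as a single opaque JSON blob instead of a native
    (key-escaped) MongoDB document."""

    def to_mongo(self, value):
        # One serialization call for the whole dict - no per-key escaping
        # and no per-value type conversion by the ORM.
        return orjson.dumps(value)

    def to_python(self, value):
        # Values read back from MongoDB arrive as bytes; deserialize them.
        if isinstance(value, bytes):
            return orjson.loads(value)
        return value
```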
single key with a large value.
all JSON fixture files so they contain at least one key with a character which needs to be escaped.
I added micro benchmarks just for the escape and unescape functions, and also unit tests for the new fields. Running the existing tests did uncover some issues - in some places we run a "set" query for updating a live action and pass a dict for the value (which also kind of breaks our ORM layer / DB abstraction, and shows why such changes are hard - no abstraction is perfect and there are always leaks). Sadly this won't work out of the box; I will need to think and experiment some more on how to best approach this. Technically, I think I should be able to make it work, as long as I add some additional code to the places which use such set queries.
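For illustration, a hypothetical example of the kind of "set" query the comment refers to, using a simplified stand-in model rather than the real LiveActionDB:

```python
# Hypothetical, simplified stand-in for the real model; illustrates the
# abstraction leak: a partial "set" update hands a raw dict straight to
# the driver, bypassing the field's to_mongo() serialization.
from mongoengine import Document, fields


class LiveActionDB(Document):
    result = fields.DictField()


def update_result(liveaction_id, result):
    # With a JSON-serialized result field, this raw dict would need to be
    # serialized explicitly before being passed to the update.
    LiveActionDB.objects(id=liveaction_id).update(set__result=result)
```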
For now, I'm leaning towards adding a header to the serialized field value, even though we will likely just support a single serialization library and format (orjson). Something along the lines of a short format identifier, followed by a delimiter and the actual serialized value (see the sketch below). This adds some overhead, but it's really just 4 bytes, so it's negligible (especially if we ever support compression).
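A hypothetical version of such a header scheme; the marker bytes and names below are illustrative assumptions, not the exact format from this PR:

```python
# Illustrative sketch of a header scheme; marker bytes and names are
# assumptions, not the exact format from this PR.
import orjson

HEADER_DELIMITER = b":"
FORMAT_ORJSON = b"o"


def serialize(value):
    # Produces e.g. b'o:{"foo":"bar"}' - a tiny format marker, a
    # delimiter, then the serialized payload.
    return FORMAT_ORJSON + HEADER_DELIMITER + orjson.dumps(value)


def deserialize(blob):
    marker, _, payload = blob.partition(HEADER_DELIMITER)
    if marker == FORMAT_ORJSON:
        return orjson.loads(payload)
    raise ValueError("Unknown serialization format header: %r" % marker)
```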
Can we be opinionated here and support one specific method of serialization? There are too many options to support (e.g. orjson, ujson, cjson), and I'd rather JSON be JSON, without any prefix that only st2 understands.
LGTM - just added a minor comment.
Would like someone else to review as well though, if possible.
```python
@@ -21,6 +21,11 @@

import yaml

try:
    from yaml import CSafeDumper as YamlSafeDumper
```
Could we possibly add a comment to explain why we may not have CSafeDumper, and need this catch/exception?
Funny enough, I just added a similar change recently (51f811c).
Technically, as long as people are using the official packages, it should be there. But if they do a from-source install, etc., it may not be.
Agreed it's a good idea, and I will add a comment as well :)
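For reference, the commented fallback could look something like this; a sketch of the pattern, not the exact committed code:

```python
import yaml

try:
    # CSafeDumper comes from PyYAML's optional libyaml C extension. It is
    # available with the official packages, but may be missing on source
    # installs built without the libyaml headers.
    from yaml import CSafeDumper as YamlSafeDumper
except ImportError:
    # Fall back to the pure Python dumper. Functionally identical, but
    # YAML serialization will be significantly slower.
    from yaml import SafeDumper as YamlSafeDumper
```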
means YAML loading and serialization will be significantly slower.
Force-pushed from 8362768 to bdd8e3c
Sorry - realised I hadn't seen the full PR changes on last review.
Just a couple of other queries...
"This message was generated by StackStorm action " | ||
"send_mail running on %s" % (HOSTNAME) | ||
) | ||
expected_body = ( |
It looks like we've kept the code from the if branch, which, if I read it correctly, only ran on Python 2. Should we not have kept the else logic instead?
The version of the library used has also been changed / upgraded, so the behavior is now different / more correct.
To clarify - we don't want to use escape sequences, since those are Python specific and would not display correctly in email clients, etc.
I assume we just made that change when working on Python 3 support a long time ago, since it was the easiest option, though technically not the most correct one :)
"This message was generated by StackStorm action " | ||
"send_mail running on %s" % (HOSTNAME) | ||
) | ||
expected_body = ( |
It looks like we've kept the code from the if branch, which, if I read it correctly, only ran on Python 2. Should we not have kept the else logic instead?
Same as above.
```yaml
        description: True to compress the response body
        type: boolean
        default: false
      - name: pretty_format
```
Out of interest, is the UI going to use the pretty format or the non-pretty format, and does this affect how the response appears in the UI at all?
Yeah, the UI will use the pretty format, but keep in mind that this API endpoint is only used when displaying results for large executions - we don't display those directly in the UI anymore (since that would freeze the browser, etc.); we just display a link to the API endpoint which returns the raw result.
Here is a WIP st2web change - StackStorm/st2web#868 (https://github.com/StackStorm/st2web/pull/868/files#diff-38caf478018c81662a6660b616ab9dd030868fb0aec1c01a6c3916465d44f6f0R48).
In short - it doesn't affect the output of st2web for executions whose results we display directly in the UI (those go through the code highlighter widget, which re-formats them anyway).
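Purely as an illustration of how a client might call this endpoint - the URL shape and parameter names below are assumptions based on the API spec snippet above, not confirmed by this PR:

```python
# Assumed endpoint path and parameter names; adjust for a real deployment.
import requests

response = requests.get(
    "https://st2.example.com/api/v1/executions/<execution-id>/result",
    params={"pretty_format": 1},
    headers={"X-Auth-Token": "<auth-token>"},
)
print(response.text)  # the raw (optionally pretty-printed) result
```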
LGTM
LGTM, but there are a few small improvements that could be made.
@blag Thanks for the review. I will go ahead and merge this since it's already a massive PR, but I plan to open another small PR on top of it in the near future.

Having said that, I'm a bit OCD, and that extra query on get in the st2web change (StackStorm/st2web#868) was not making me happy, so I identified another approach I could use to avoid the additional query in st2web when getting the execution result for result size < threshold. Basically, I will modify the API and add a new query parameter for the maximum result size.

Instead of performing a simple "get one by id" query, we will perform a more complex query (something along the lines of filtering on the execution ID and on the result size being below the threshold; see the sketch below).

The good news is that this query is fast, since the first filter is already on the PK, so it should add negligible overhead (funny enough, based on the testing I did, it sometimes even returns faster than a plain get-by-ID query, no idea why), but it will allow us to simplify the st2web change and also avoid one extra API operation on the st2web side.
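A rough sketch of that query in mongoengine syntax; the "result_size" field name and the helper are assumptions for illustration:

```python
# Rough sketch; the "result_size" field name is an assumption.
from st2common.models.db.execution import ActionExecutionDB


def get_execution_if_result_small(execution_id, max_result_size):
    # The first predicate filters on the PK, so the extra size check adds
    # negligible overhead. A None return means the execution either does
    # not exist or its result exceeds the threshold; in the latter case
    # the client falls back to the raw result API endpoint.
    return ActionExecutionDB.objects(
        id=execution_id, result_size__lt=max_result_size
    ).first()
```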
I'm looking to patch this PR into my current st2 installation. Any documentation I should refer to? Tips? I'm running v3.4 on Ubuntu 18 and st2 slows to a crawl when we run a few nested workflows, even though I set …
@jegesh this has been merged and will be included in the next release, which should be out soon. In the meantime, you can use the dev unstable packages for testing if you feel adventurous.
This pull request builds on top of various research I did in other PRs and issues in the past (#4837, #4838, etc).
The goal is to speed up serialization and de-serialization of very large dictionaries (e.g. large results produced by various actions) inside MongoDB.
Right now we store those dictionaries as native MongoDB dictionaries, but this means we need to escape the keys, since they can't contain $ and . (which are special characters in MongoDB). This escaping is inefficient and wasteful with large (and nested) dictionaries.
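To illustrate why this is expensive: escaping has to walk every key of the (possibly deeply nested) structure on every save, and unescaping has to do the same on every read. A simplified sketch follows; the replacement characters used by the real mongoescape module may differ:

```python
# Simplified sketch; the real st2 mongoescape module may use different
# replacement characters.
def escape_chars(value):
    if isinstance(value, dict):
        # Every key of every nested dict is rewritten and every container
        # is copied - O(size of the whole structure) on each save.
        return {
            key.replace("$", "\uff04").replace(".", "\uff0e"): escape_chars(val)
            for key, val in value.items()
        }
    if isinstance(value, list):
        return [escape_chars(item) for item in value]
    return value
```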
Proposed Implementation / Improvement
This pull request tries to optimize that by serializing those dictionaries as JSON instead.
Keep in mind that this is fine, since we always treat those values as opaque strings anyway.
In fact, a future improvement would be to have a result database object per runner; that way we could utilize a simple "string" type for the actual result, but that's a much more involved change.
Some actual numbers with a synthetic dataset are available in the other issues mentioned above. Actual numbers with real-life production data will follow once we deploy this to a CI/CD server.
The change is fully backward compatible, but it's behind a feature flag (opt-in) until, after running it in production for a while with diverse datasets, we are really sure it doesn't make some corner / edge case worse.
Related optimization I talked about with @m4dcoder is getting rid of the EscapedDict DB field type in scenarios where keys can't contain special characters (e.g. potentially various workflow related models).

TODO