Optimize storage (serialization and deserialization) of very large dictionaries inside MongoDB #4846

Merged 107 commits on Mar 20, 2021

Commits on Dec 20, 2019

  1. Add new JSONDictField which allows us to more efficiently store,

    serialize and deserialize large dictionary data (such as result field,
    etc.).
    Kami committed Dec 20, 2019
    a93f9c2

Commits on Feb 21, 2020

  1. Add new JSONDictField which allows us to more efficiently store,

    serialize and deserialize large dictionary data (such as result field,
    etc.).
    Kami authored and m4dcoder committed Feb 21, 2020
    a89d658
  2. Add a feature flag for using new json dict field, set it to false

    (opt-in) by default.
    Kami authored and m4dcoder committed Feb 21, 2020
    fe5e33d
  3. Use new JSON dict field for dictionaries which can be very large where

    escaping the values adds tons of overhead.
    Kami authored and m4dcoder committed Feb 21, 2020
    f0919c9

Commits on Feb 22, 2020

  1. 59e87f9

Commits on Feb 18, 2021

  1. 2024a1e
  2. Add a micro-benchmark which compares execution save + read times for

    using two different approaches for serializing execution / live action
    result.
    Kami committed Feb 18, 2021
    2f969f8
  3. Add another micro benchmark fixture which represents a dictionary with a

    single key with a large value.
    Kami committed Feb 18, 2021
    5971302
  4. Add micro-benchmark for escape_chars() and unescape_chars() and update

    all JSON fixture files so they contain at least one key with a character
    that needs to be escaped.
    Kami committed Feb 18, 2021
    f19f0fd
  5. 81672f2
  6. Merge branch 'optimize_escaped_dict_fields' of github.com:StackStorm/st2 into optimize_escaped_dict_fields

    Kami committed Feb 18, 2021
    2a15e8b
  7. 44bdbad
  8. Add some more tests.

    Kami committed Feb 18, 2021
    052fde7
  9. dbc5f3d

Commits on Feb 19, 2021

  1. 424f3d7
  2. ff485ca
  3. Update docstring.

    Kami committed Feb 19, 2021
    6b1abf0
  4. Add new "finalized_timestamp" field to the Execution and LiveAction

    object.
    
    This will provide us better visibility into how long action runner needs
    to process the execution completely - this means not just the runner
    running the action, but also the action runner container persisting the
    result and corresponding objects to the database.
    Kami committed Feb 19, 2021
    89617ff
  5. Add changelog entry.

    Kami committed Feb 19, 2021
    fa03d2f
  6. 68feb47
  7. 4cb7a1d
  8. Fix lint.

    Kami committed Feb 19, 2021
    e1d085d
  9. Add python runner action which can be used for testing and timing large

    execution result save times.
    Kami committed Feb 19, 2021
    d9ad62b
  10. Update changelog.

    Kami committed Feb 19, 2021
    e20242f
  11. Add TODO comment.

    Kami committed Feb 19, 2021
    95de8ff
  12. Update the field and implement another approach which uses additional

    header for the binary field value.
    
    This header tells us which serialization format and compression (if any)
    is used for a specific field value.
    
    Using a header format gives us more flexibility, makes the field more
    future-proof (e.g. the format can be changed later), and makes it
    possible to implement things such as per-field compression.
    Kami committed Feb 19, 2021
    42f70e7
  13. Fix lint.

    Kami committed Feb 19, 2021
    8017e2a
  14. cbb4cb1
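
To illustrate the header approach described in commit 42f70e7, here is a minimal sketch. It is not the PR's implementation: it uses only the standard library (json and zlib) as stand-ins for whatever serializer and compressor the field really uses, and the one-byte codes, the ":" separator and the function names are all assumptions made for this example.

```python
import json
import zlib

# Hypothetical header codes; the real field's header layout may differ.
COMPRESSION_NONE = b"n"
COMPRESSION_ZLIB = b"z"
FORMAT_JSON = b"j"
SEPARATOR = b":"


def serialize(value, compress=False):
    """Return bytes shaped like b"<compression>:<format>:<payload>"."""
    payload = json.dumps(value, separators=(",", ":")).encode("utf-8")
    compression = COMPRESSION_NONE
    if compress:
        payload = zlib.compress(payload)
        compression = COMPRESSION_ZLIB
    return compression + SEPARATOR + FORMAT_JSON + SEPARATOR + payload


def deserialize(data):
    """Read the header, undo compression if any, then decode the payload."""
    # maxsplit=2 keeps any ":" bytes inside the payload untouched.
    compression, fmt, payload = data.split(SEPARATOR, 2)
    if fmt != FORMAT_JSON:
        raise ValueError("unsupported serialization format: %r" % fmt)
    if compression == COMPRESSION_ZLIB:
        payload = zlib.decompress(payload)
    elif compression != COMPRESSION_NONE:
        raise ValueError("unsupported compression: %r" % compression)
    return json.loads(payload.decode("utf-8"))
```

Because the reader dispatches on the header, values written with different settings can sit side by side in the same collection, which is what makes later format changes and per-field compression possible.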

Commits on Feb 20, 2021

  1. Re-generate requirements files.

    Kami committed Feb 20, 2021
    49b4033
  2. d93bd9d

Commits on Feb 21, 2021

  1. Also apply the same field optimization changes to all the workflow

    related models.

    Based on end-to-end testing, this results in massive speed-ups for
    workflows which pass larger data sets around.
    
    See #4846 (comment)
    for some numbers and details.
    Kami committed Feb 21, 2021
    c8f4022
  2. Update changelog.

    Kami committed Feb 21, 2021
    88151da
  3. For now, only utilize JSONDictField for fields which are for all

    intents and purposes already "immutable" and make sure we always write
    them out to the database, even on partial dict update.
    
    Also add tests for it.
    Kami committed Feb 21, 2021
    a239061
  4. Implement dict value change tracking for our custom JSONDictField.

    This dict value tracking allows us to track when a dict item value has
    changed and only write the value to the database on existing document /
    model update in case it has changed.
    
    This is a very important property since it allows us to implement
    efficient partial document updates.
    
    With that change, JSONDictField now also works in exactly the same
    manner as existing mongoengine DictField field type.
    
    Also add tests for various edge cases which would fail if value change
    tracking was not correctly implemented or working.
    Kami committed Feb 21, 2021
    7cd3ec4
  5. 82965ef
  6. Add orquesta workflow action which can be used to test passing large

    data around (both returning it as a result and passing it as the next
    task's context).
    Kami committed Feb 21, 2021
    eaccea2
  7. 9682fac
  8. bc9e9c2
  9. Apply same optimizations to trigger_instance.payload field.

    This way we also get better throughput and lower CPU utilization for
    rules engine when working with larger trigger instances.
    Kami committed Feb 21, 2021
    2ac3fda
  10. Add correct file.

    Kami committed Feb 21, 2021
    49d6134
  11. 242c676
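
The value change tracking described in commit 7cd3ec4 can be pictured with a small sketch. This is an illustration of the general idea only, not the field's actual code: the TrackedDict class and the fields_to_write() helper are hypothetical names, and the document is modelled as a plain mapping of field name to tracked dict.

```python
class TrackedDict(dict):
    """A dict that remembers whether it was modified after creation."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.changed = False

    def __setitem__(self, key, value):
        # Only flag a change when the stored value really differs.
        if key not in self or self[key] != value:
            self.changed = True
        super().__setitem__(key, value)

    def __delitem__(self, key):
        super().__delitem__(key)
        self.changed = True


def fields_to_write(document):
    """Return only the fields that need to be written on a partial update."""
    return {
        name: dict(value) for name, value in document.items() if value.changed
    }
```

A partial update then serializes and sends only what fields_to_write() returns, so an untouched multi-megabyte result is never re-serialized. The sketch deliberately ignores update(), pop() and nested containers; tracking changes inside nested lists is the kind of edge case commit e8745d8 later addresses.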

Commits on Feb 23, 2021

  1. Also add benchmark for model with multiple fields of the same type and

    also for the native dict field type.
    Kami committed Feb 23, 2021
    75ab254
  2. Hook micro benchmarks to CI.

    Kami committed Feb 23, 2021
    4933573
  3. Update the new field type and make sure we also correctly track changes

    in dict list items and mark parent dict field as changed if any dict
    list item has changed.
    Kami committed Feb 23, 2021
    e8745d8
  4. Use consistent action name.

    Kami committed Feb 23, 2021
    560d616

Commits on Feb 24, 2021

  1. Simplify the code - instead of having another finalized_timestamp

    attribute, update end_timestamp instead at the very end.
    
    This way execution duration will be more accurately reported.
    Kami committed Feb 24, 2021
    147a02b
  2. Update st2 execution get command to also display log attribute by

    default.
    
    This should make it easier to infer actual execution run time duration
    and state transitions.
    Kami committed Feb 24, 2021
    405e039
  3. Update affected tests.

    Kami committed Feb 24, 2021
    78f89ab

Commits on Feb 25, 2021

  1. 7810aa8
  2. Update affected tests - live action and action execution timestamp may

    now be a bit different, depending on how long it takes to persist each
    corresponding object in the database.
    
    Also fix tests to utilize correct dict type for the result.
    Kami committed Feb 25, 2021
    b31c006
  3. Throw more user-friendly error.

    Kami committed Feb 25, 2021
    5988b5c

Commits on Feb 26, 2021

  1. The micro-benchmarks task is very slow on CI, so for now only run it on

    a nightly schedule.
    Kami committed Feb 26, 2021
    71791b3
  2. 1d178df

Commits on Feb 27, 2021

  1. Include the following changes which make action registration 15-20%

    faster (especially visible with packs which have many actions such as
    the aws one):
    
    * Utilize ``fast_deepcopy`` for making deep copies of dicts in json
      schema code (that code only works with simple native JSON type so this
      function can be used without any issues).
    * Update registrar code to use runner db cache. This means that
      instead of doing N queries where N is the number of actions to be
      registered, we now do only M queries where M is the number of unique
      runners those actions utilize (in most cases that's < 4).
    * Update existing action retrieval code to only retrieve fields we need
      (id, pack, ref). We really only need ID to check if the object already
      exists and perform upsert. Retrieving all the fields we don't use
      is wasteful and slow for actions with many parameters.
    * Use C version of the YAML safe loader when loading YAML metadata. C
      version is a lot faster.
    Kami committed Feb 27, 2021
    053bd93
  2. 824c2ea
  3. Fix failing test.

    Kami committed Feb 27, 2021
    1a99eee
  4. Fix rst syntax.

    Kami committed Feb 27, 2021
    ca49b10
  5. 4289b9e
  6. Update more places in the code where we only work with simple / native

    JSON types to utilize fast_deepcopy() instead of copy.deepcopy().
    
    This should result in faster copy times and, as such, faster secret
    masking, etc.
    Kami committed Feb 27, 2021
    9f0a6ba
  7. Update nose tests target to exclude resource registrar debug log

    messages by default.
    
    This should make troubleshooting failures a lot easier - before that
    change, those log messages would add tons of noise (we load resource
    fixtures for each single test) and make actual test failures hard to
    troubleshoot.
    Kami committed Feb 27, 2021
    4793ba3
  8. dbc1460
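
Commits 053bd93 and 9f0a6ba replace copy.deepcopy() with fast_deepcopy() wherever the data consists only of simple, native JSON types. One way such a helper can work is a JSON round trip, sketched below with the standard library. That the real helper works along these lines is an assumption, and the name fast_deepcopy_sketch is hypothetical.

```python
import json


def fast_deepcopy_sketch(value):
    """Deep-copy a JSON-compatible value by serializing and parsing it.

    Only safe for plain dicts, lists, strings, numbers, booleans and None:
    tuples come back as lists, non-string dict keys come back as strings,
    and arbitrary objects cannot be copied this way at all.
    """
    return json.loads(json.dumps(value))
```

Those restrictions are why the commit messages stress that the function is only used where the code works with simple JSON types. For anything else copy.deepcopy() remains the correct choice.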

Commits on Mar 6, 2021

  1. Merge branch 'master' of github.com:StackStorm/st2 into optimize_escaped_dict_fields
    
    Also format new code with black.
    Kami committed Mar 6, 2021
    af961fb
  2. Use lazy import since right now zstandard is only used for tests and

    benchmarks and it's a testing dependency.
    Kami committed Mar 6, 2021
    64dbe5a
  3. 167ca3f
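
The lazy import mentioned in commit 64dbe5a follows a common Python pattern: import an optional dependency inside the function that needs it rather than at module level. The sketch below shows the pattern under assumed names; the compress() function and its algorithm argument are hypothetical and not taken from the PR.

```python
def compress(data, algorithm="none"):
    """Compress bytes, importing optional codecs only when requested."""
    if algorithm == "none":
        return data
    if algorithm == "zstandard":
        # Imported here instead of at the top of the module, so importing
        # this module never fails when the optional package is missing.
        import zstandard

        return zstandard.ZstdCompressor().compress(data)
    raise ValueError("unknown compression algorithm: %s" % algorithm)
```

With this layout, production code paths that never ask for zstandard compression work without the package installed, while tests and benchmarks that do ask for it get a normal ImportError if it is absent.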

Commits on Mar 7, 2021

  1. 8902d06
  2. 6eabd5b
  3. Make sure we don't call unescape_chars() on the JSONDictField field

    values since it's not required and may break things by decoding bytes to
    string and adding trailing character.
    Kami committed Mar 7, 2021
    2c2cb74
  4. Update changelog.

    Kami committed Mar 7, 2021
    93d859c
  5. Remove unused options.

    Kami committed Mar 7, 2021
    2ea37db

Commits on Mar 12, 2021

  1. Add additional timer metrics to the action runner which will provide

    better operational visibility into some steps of the action runner.
    Kami committed Mar 12, 2021
    0f293ee
  2. a46831e

Commits on Mar 14, 2021

  1. c8c3b91
  2. Remove incorrect log message which was causing unnecessary log churn in

    action runner.
    
    That exception does not represent a fatal error so we should not log
    anything.
    
    In fact, it's quite a common and expected scenario that a key doesn't
    contain a JSON string.
    Kami committed Mar 14, 2021
    b2ed03b

Commits on Mar 15, 2021

  1. Also use json instead of orjson so the action can also be used with older

    versions of StackStorm.
    Kami committed Mar 15, 2021
    9feb81e
  2. Store "result_size" field on the ActionExecutionDB.

    This field is populated lazily on model save.

    It will allow us to implement more efficient data retrieval in the web
    UI and other clients since we will be able to avoid retrieving the whole
    result for executions with very large results.
    Kami committed Mar 15, 2021
    9f4f523
  3. Add new WIP API endpoint for returning / downloading raw action

    execution result.
    
    This endpoint is to be used with webui for executions with large
    results.
    Kami committed Mar 15, 2021
    d0f0d78
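
To show how a stored result size (commit 9f4f523) lets a client avoid fetching very large results, here is a rough sketch. How the real field is computed is not stated on this page, so measuring the JSON-serialized length in bytes is an assumption, and the threshold, field list and function names are hypothetical.

```python
import json

# Hypothetical limit; a real client would pick its own threshold.
MAX_INLINE_RESULT_BYTES = 1024 * 1024


def compute_result_size(result):
    """Return the size in bytes of the JSON-serialized result."""
    return len(json.dumps(result).encode("utf-8"))


def fields_to_request(result_size):
    """Decide which fields to fetch, knowing only the stored size."""
    fields = ["id", "status", "result_size"]
    if result_size <= MAX_INLINE_RESULT_BYTES:
        fields.append("result")
    return fields
```

A client can first read the small result_size value and request the full result only when it is below its limit. For larger results it can fall back to a download-style endpoint such as the raw result endpoint added in commit d0f0d78.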

Commits on Mar 16, 2021

  1. b0dea78
  2. Update URL path, add tests.

    Kami committed Mar 16, 2021
    756b916
  3. Update "result_size" field for action execution and live action DB model

    inside action runner at the end after save.
    
    I was hoping we will be able to avoid one additional serialization, but
    sadly we can't if we don't want to massively hack and monkey patch
    mongoengine.
    
    And that monkeypatching is not worth it since serialization is fast
    enough.
    
    To put things into perspective - it takes 7 ms for a 4 MB result, which
    is nothing compared to the duration of other DB operations. And for
    smaller results it even gets into the sub-millisecond (microsecond) range.
    Kami committed Mar 16, 2021
    a47461b
  4. Move calculation and setting of the result_size field to the

    update_execution() service method and don't update end timestamp for
    liveaction and execution DB model at the end.
    
    Technically with new perf optimizations code, DB operations are very
    fast already and this way we avoid 2 additional queries and save up to
    500ms when storing very large executions.
    
    And doing it inside that function also means we can correctly update it
    for workflow executions when they finish.
    Kami committed Mar 16, 2021
    2005126
  5. Add changelog entry.

    Kami committed Mar 16, 2021
    086be02
  6. Re-generate api spec.

    Kami committed Mar 16, 2021
    8e0c312
  7. Fix typo.

    Kami committed Mar 16, 2021
    cd9eba7
  8. Fix failing test.

    Kami committed Mar 16, 2021
    1a932ca
  9. e72215f
  10. Fix merge conflicts.

    Kami committed Mar 16, 2021
    9e336d8
  11. Fix test method name.

    Kami committed Mar 16, 2021
    d373cf5

Commits on Mar 18, 2021

  1. 224dfba
  2. Add micro benchmark which times saving and reading large string value

    from a database using string and binary field type.
    Kami committed Mar 18, 2021
    3cc71ef
  3. Merge branch 'optimize_escaped_dict_fields' of github.com:StackStorm/st2 into optimize_escaped_dict_fields

    Kami committed Mar 18, 2021
    ac4efbd

Commits on Mar 19, 2021

  1. 051a691
  2. Update CLI to use C version of the YAML safe dumper when pretty

    formatting execution result for display and orjson when parsing API
    response.
    
    This should make "st2 execution get" and other commands finish
    faster, especially when working with large executions.
    
    For example, locally running st2 execution get on execution with 8 MB
    result takes 18 seconds before this change and less than 6 seconds with
    this change.
    Kami committed Mar 19, 2021
    94b6298
  3. Clarify the comment.

    Kami committed Mar 19, 2021
    d1df1cd
  4. b13c195
  5. Log a warning message if pyyaml C bindings are not available since it

    means YAML loading and serialization will be significantly slower.
    Kami committed Mar 19, 2021
    51f811c
  6. 4672495
  7. a25efa6
  8. bdd8e3c
  9. a098315
  10. Use fast dict copy.

    Kami committed Mar 19, 2021
    e818158
  11. For performance reasons, use udatetime library for parsing rfc3339 /

    iso8601 date strings where possible.
    Kami committed Mar 19, 2021
    48d612d
  12. ujson is now only used for tests / benchmarks so move it to

    tests-requirements.
    Kami committed Mar 19, 2021
    71ffb1a
  13. Fix typo.

    Kami committed Mar 19, 2021
    46ba2c9
  14. Add TODO comment.

    Kami committed Mar 19, 2021
    a27245f
  15. Fix affected test.

    Kami committed Mar 19, 2021
    1a91394

Commits on Mar 20, 2021

  1. Apply suggestions from code review

    Co-authored-by: blag
    Kami and blag authored Mar 20, 2021
    Configuration menu
    Copy the full SHA
    cbd0259 View commit details
    Browse the repository at this point in the history
  2. Fix syntax, add comments.

    Kami committed Mar 20, 2021
    3b47856