
Writes to MongoDB in actionrunner are slow and CPU-intensive with large documents #4798

Closed
jdmeyer3 opened this issue Oct 2, 2019 · 5 comments

@jdmeyer3 (Contributor) commented Oct 2, 2019

SUMMARY

When an action returns ~4 MB of data, the actionrunner's writes to MongoDB take ~20 seconds per document. During the writes, CPU utilization spikes to 100%.

STACKSTORM VERSION

st2 v3.1.0

OS: CentOS 7.7.1908, kernel 3.10.0-957.el7.x86_64
Kubernetes v1.14.1
Docker 19.03.2
Base Docker Image: CentOS 7.6.1810 (custom image)

Steps to reproduce the problem

In a Python 3 pack, create a Python action with the following code:

#!/usr/bin/env python3

import json

import requests

from st2common.runners.base_action import Action


class RequestLargeDataset(Action):
    """
    Retrieves ~4 MB of JSON data and returns it as the action result.
    """

    def run(self, **kwargs):
        # Download a large public dataset (~4 MB of JSON); returning it
        # forces the actionrunner to persist a large document to MongoDB.
        dataset = requests.get('https://data.ct.gov/api/views/rybz-nyjw/rows.json?accessType=DOWNLOAD')
        return json.loads(dataset.text)
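
Register the pack and run the action (the pack and action names are whatever you register it under, e.g. st2 run mypack.request_large_dataset), then watch the actionrunner process while the result is persisted to MongoDB.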

Expected Results

Low CPU utilization and the result returning within a relatively short time (<5 seconds)

Actual Results

CPU spiked to 100% for 20-30 seconds

(screenshot: CPU utilization graph spiking to 100% for 20-30 seconds)

Adding some logging around st2actions/container/base.py:296:

    def _update_status(self, liveaction_id, status, result, context):
        try:
            LOG.error("updating liveaction")
            LOG.debug('Setting status: %s for liveaction: %s', status, liveaction_id)
            liveaction_db, state_changed = self._update_live_action_db(
                liveaction_id, status, result, context)
            LOG.error("done updating liveaction")
        except Exception as e:
            LOG.exception(
                'Cannot update liveaction '
                '(id: %s, status: %s, result: %s).' % (
                    liveaction_id, status, result)
            )
            raise e

        try:
            LOG.error('updating execution')
            executions.update_execution(liveaction_db, publish=state_changed)
            LOG.error('done updating execution')
            extra = {'liveaction_db': liveaction_db}
            LOG.debug('Updated liveaction after run', extra=extra)
        except Exception as e:
            LOG.exception(
                'Cannot update action execution for liveaction '
                '(id: %s, status: %s, result: %s).' % (
                    liveaction_id, status, result)
            )
            raise e

        return liveaction_db

I'm getting logs like

2019-10-02 18:17:23,854 ERROR [-] updating liveaction
2019-10-02 18:17:28,963 ERROR [-] done updating liveaction
2019-10-02 18:17:28,964 ERROR [-] updating execution
2019-10-02 18:17:33,268 ERROR [-] done updating execution

Here it takes ~5 seconds to update the liveaction and another ~4 seconds to update the execution.

Additionally, I ran cProfile against the actionrunner and saw the following:
(screenshot: cProfile output)

It looks like the CPU spends most of its time building the mongoengine document object. This may be related to MongoEngine/mongoengine#1230.
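
For anyone who wants to reproduce the profiling result without a full st2 install, here is a minimal sketch that isolates the mongoengine conversion cost; the document class and data shape below are made up for illustration and are not st2's actual models:

# Minimal sketch: time mongoengine's per-field document conversion on a
# large nested dict. FakeLiveAction is a stand-in, not an st2 model.
import time

import mongoengine


class FakeLiveAction(mongoengine.Document):
    result = mongoengine.DictField()


# Roughly 4 MB of JSON-like data, similar in shape to the action's result.
big_result = {'rows': [{'id': i, 'payload': 'x' * 100} for i in range(40000)]}

start = time.time()
doc = FakeLiveAction(result=big_result)
doc.to_mongo()  # converts every nested value to BSON-compatible types
print('mongoengine conversion took %.2fs' % (time.time() - start))

No network or database connection is involved here, so any time measured is pure Python-side document construction, which matches where the cProfile output points.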

@Kami (Member) commented Oct 2, 2019

Thanks for reporting this.

There have indeed been multiple mongoengine issues related to writing large documents in the past.

Performance has improved in mongoengine itself over the releases, and we've also made some optimizations on our side, but there may be more issues hiding (we would need to dig in).

In the past, we also discussed the option of skipping the mongoengine serialization / conversion layer altogether and using pymongo directly in critical code paths. But this would require quite a few code changes.
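
Conceptually, that idea looks something like the sketch below; this is not st2 code, and the connection URI, database, collection, and document shape are all illustrative assumptions:

# Sketch: write a large result with pymongo directly, handing the dict to
# the BSON encoder in one step instead of converting each nested value
# through mongoengine field objects. All names below are illustrative.
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
collection = client['st2']['executions']  # illustrative collection name

big_result = {'rows': [{'id': i, 'payload': 'x' * 100} for i in range(40000)]}

collection.update_one(
    {'_id': 'example-liveaction-id'},
    {'$set': {'result': big_result}},
    upsert=True,
)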

@Kami (Member) commented Feb 21, 2021

I know this issue was opened quite a long time ago, but the problem hasn't gone away, and I finally had a chance to work on #4846 and have made good progress on it.

Initial micro-benchmarks and end-to-end load tests are very promising - we see database write speed improvements of up to 10x, and up to 5x for reads, when writing / reading execution results. All of that of course also translates to much faster action execution and workflow completion times for executions / workflows which operate on larger datasets.

Some more data and numbers are available here - #4846 (comment), #4846 (comment), #4846 (comment).

I hope there will be no last-minute surprises and that we will be able to include those changes in the v3.5.0 release.

@Kami (Member) commented Feb 21, 2021

Oh, and when I tested the fix end to end with a Python runner action which returns 4 MB of data (very similar to your example scenario), the whole action execution with the new code now takes ~1 second on my computer vs ~12 seconds with current master.

This includes the whole action execution flow end to end - from the execution being scheduled, to the action being run by the Python runner, to all the state being persisted to the database (action execution, live action objects, etc.). The actual database writes for those two models are now in the ~200 ms range.

@amanda11 (Contributor) commented

As per the comments above, this should be much improved in 3.5.0.

@arm4b (Member) commented Aug 22, 2021

I really think we can close this now with the #4846 performance improvements by @Kami.

@jdmeyer3 Feel free to re-open it if you're still seeing unacceptable performance with st2 v3.5.0.

arm4b closed this as completed on Aug 22, 2021