
heavily optimize the write speed of the callback receiver #5618

Merged

Conversation

@ryanpetrello (Contributor) commented Jan 9, 2020:

see: #5514
see: #5590

Below added by @chrismeyersfsu

Notes for QE:

  • Any tests that rely on or change logging settings now need a 1-second sleep (Ryan ran integration tests and didn't see any failures related to this, though)
  • @kdelee has been working on more robust websocket job event tests. Would be good to get his input on testing this PR.
  • Want to make sure we don't regress on UI stdout/websocket testing

            return 'FLUSH'

    def flush(self):
        should_flush = (
@ryanpetrello (Contributor, Author) commented Jan 9, 2020:

The main optimization is to avoid an insert per event, and instead buffer event saves using bulk_create. The way this works is that we'll buffer up to 1000 events at once (though it's very doubtful that anything on the other end is publishing 4k events per second).

Additionally, in the work loop that pulls events over the IPC pipe, we time out every quarter of a second and use a special FLUSH message to signal that we should flush:

https://github.com/ansible/awx/pull/5618/files#diff-f37b92a11438678a6a32ac23a7790f05R39

So effectively, inserts buffer for up to a quarter of a second. If a lot of events are coming in, we do bulk inserts to cut down on individual insert + commit costs, and this makes a very notable performance difference for high-write workloads.
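Roughly, the receive-and-flush loop looks like the following minimal sketch (the constant names are illustrative, not AWX's actual settings, and the real implementation buffers per event class, as discussed further down):

from queue import Empty as QueueEmpty

from awx.main.models import JobEvent

BUFFER_MAX = 1000       # flush once the buffer reaches this size
FLUSH_INTERVAL = 0.25   # seconds to wait before flushing a partial buffer

def work_loop(queue):
    buff = []
    while True:
        try:
            body = queue.get(block=True, timeout=FLUSH_INTERVAL)
        except QueueEmpty:
            body = 'FLUSH'  # nothing new arrived; flush whatever we have
        if body != 'FLUSH':
            buff.append(JobEvent(**body))  # assumes body maps to real columns
        if body == 'FLUSH' or len(buff) >= BUFFER_MAX:
            if buff:
                # one INSERT for the whole batch instead of one per event
                JobEvent.objects.bulk_create(buff)
                buff = []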

            update_fields.append(field)

        # Update host related field from host_name.
        if hasattr(self, 'job') and not self.host_id and self.host_name:
@ryanpetrello (Contributor, Author) commented Jan 9, 2020:

Another very large expense is that on every event insert, we do an inventory/host query here to map a host name for the current job to a host ID.

These individual queries aren't slow, but this model means that if you INSERT 100K events, you're also doing 100K SELECT queries.

It turns out that this is completely unnecessary: because we compose the inventory ourselves, we can instead just pass the Tower Host ID through hostvars and include it in the event payload before we even get to this code. That implementation lives here:

https://github.com/ansible/awx/pull/5618/files#diff-9d4ea1dd908b35fb92eaede4bd10bb46R1004
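A minimal sketch of the idea (the hostvars key and helper names below are illustrative, not necessarily AWX's exact ones): since we generate the inventory ourselves, each host's database ID can ride along in hostvars, and the callback stamps host_id onto the event without any per-event SELECT.

# hypothetical inventory structure carrying the Host model's primary key
inventory = {
    '_meta': {
        'hostvars': {
            'web01.example.org': {
                'remote_tower_id': 42,  # the Host row's primary key
            },
        },
    },
}

def enrich_event(event_data, hostvars):
    # runs on the callback side, before the event is published
    host_name = event_data.get('host_name')
    if host_name in hostvars:
        event_data['host_id'] = hostvars[host_name].get('remote_tower_id')
    return event_data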

        if self.failed is True:
            kwargs['failed'] = True
        if kwargs:
            JobEvent.objects.filter(job_id=self.job_id, uuid=self.parent_uuid).update(**kwargs)
@ryanpetrello (Contributor, Author) commented Jan 9, 2020:

This is one of the worst offenders. Much like the per-event host-name lookup, this query means that on every INSERT where there's a parent_uuid (many of them, depending on your playbook), we're also doing a per-event UPDATE. Even worse, when you've got many events that share the same parent (such as within a with_items loop), you're issuing the exact same query over and over. As you can imagine, this is really expensive if your main_jobevent table is particularly large.

I observed this exact issue recently while profiling slowness in a customer install.

Instead of handling this per event, the behavior can be implemented much more efficiently once, at the very end of the job run:

https://github.com/ansible/awx/pull/5618/files#diff-9b1e5b80fcb01fab3e7a0ece4e371a50R299
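A minimal sketch of that end-of-run pass (simplified; the actual query in the PR may differ, and this version only propagates one level): one UPDATE per flag replaces thousands of per-event UPDATEs.

def propagate_to_parents(job):
    for field in ('changed', 'failed'):
        # collect the parent UUIDs of every event that set this flag...
        parent_uuids = (
            JobEvent.objects
            .filter(job_id=job.id, **{field: True})
            .exclude(parent_uuid='')
            .values_list('parent_uuid', flat=True)
            .distinct()
        )
        # ...then flip the flag on all of those parents in a single UPDATE
        JobEvent.objects.filter(
            job_id=job.id, uuid__in=list(parent_uuids)
        ).update(**{field: True})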

@@ -456,38 +409,6 @@ def get_absolute_url(self, request=None):
    def __str__(self):
        return u'%s @ %s' % (self.get_event_display2(), self.created.isoformat())

    def _update_from_event_data(self):
        # Update job event hostname
@ryanpetrello (Contributor, Author) commented Jan 9, 2020:

We no longer need any of this code, because it's already in the event that goes across the message bus.

        return updated_fields

    def _update_hosts(self, extra_host_pks=None):
        # Update job event hosts m2m from host_name, propagate to parent events.
@ryanpetrello (Contributor, Author) commented:

This method was dead code that nothing was calling. I can't see any evidence of its use since ~2014.

A Member commented:

👍

        # If update_fields has been specified, add our field names to it,
        # if it hasn't been specified, then we're just doing a normal save.
        update_fields = kwargs.get('update_fields', [])

    def _update_from_event_data(self):
@ryanpetrello (Contributor, Author) commented:

bulk_create doesn't actually call .save(), so I've changed this method to be something we can call manually to update the Django ORM objects before they're passed to bulk_create.

A Member commented:

copy_ansible_runner_on_event_status_into_tower_runner_on_event_status or some other, more descriptive function name.

A Member commented:

Organizationally, I see how it's a bit confusing; see the discussion on host_name elsewhere here. That field was promoted from the "Ansible" event data to the "runner" event data in the callback receiver before the message was sent. It would be better to have all of these promotion tasks in the same place ("copy", in your wording), but that organizational issue won't be helped by improved naming. So if we're not refactoring, I would leave it as-is.

Also, the logging here is a really important feature, and I would certainly like to separate that from the other actions, for testing and other things.

@ryanpetrello (Contributor, Author) commented Jan 14, 2020:

Agreed; all of these methods and the inheritance in models/events.py are really starting to show how complex they've grown. I wouldn't mind refactoring to clean it up some, but I'm also not interested in doing that right now.

@@ -271,34 +271,44 @@ def get_event_display2(self):

    def _update_from_event_data(self):
@ryanpetrello (Contributor, Author) commented Jan 9, 2020:

Same comment as https://github.com/ansible/awx/pull/5618/files#r364842220

You'll notice I've removed the updated_fields tracking, because this method doesn't actually call .save() anymore (it just updates properties of self).

        for e in events:
            e.created = e.modified = now
        cls.objects.bulk_create(events)
        for e in events:
@ryanpetrello (Contributor, Author) commented:

bulk_create doesn't fire signals, so we have to iterate and call post_save ourselves so that websocket messages are emitted.
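A rough sketch of the shape of this (simplified from the surrounding diff; the exact sender and arguments in AWX may differ):

from django.db.models.signals import post_save
from django.utils.timezone import now as tz_now

def flush_events(cls, events):
    now = tz_now()
    for e in events:
        e.created = e.modified = now
    cls.objects.bulk_create(events)
    for e in events:
        # bulk_create skipped .save(), so emit the signal ourselves to keep
        # websocket emission (and anything else on post_save) working
        post_save.send(sender=cls, instance=e, created=True)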

A Member commented:

oof

@ryanpetrello (Contributor, Author) commented Jan 9, 2020:

Actually, it's perfectly fast, to be honest (now that I've sped up the signal code to not use a DRF serializer).

    instance = kwargs['instance']
    created = kwargs['created']
    if created:
        event_serializer = serializer(instance)
@ryanpetrello (Contributor, Author) commented Jan 9, 2020:

It turns out that this serializer has a lot of (mostly Django and DRF) overhead, and we don't need or display the majority of the content it generates anyway (like related links and summary fields).

This change makes the code a bit more brittle, but this is a critically important section of our code from a performance perspective. In my profiling, this change cut per-event costs by over 50% 😱.
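An illustrative sketch of the swap (the field list below is hypothetical, not AWX's exact payload): build the small dict the websocket consumers actually need by hand, instead of paying DRF's per-instance field-resolution costs.

def serialize_event_fast(instance):
    # hand-rolled payload; no DRF serializer instantiation, no related
    # links, no summary fields
    return {
        'id': instance.id,
        'uuid': instance.uuid,
        'event': instance.event,
        'counter': instance.counter,
        'stdout': instance.stdout,
        'created': instance.created.isoformat(),
    }

# the slow path this replaces:
# event_serializer = serializer(instance)
# payload = event_serializer.data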

        if isinstance(instance, JobEvent):
            url = '/api/v2/job_events/{}'.format(instance.id)
        if isinstance(instance, AdHocCommandEvent):
            url = '/api/v2/ad_hoc_command_events/{}'.format(instance.id)
@ryanpetrello (Contributor, Author) commented Jan 9, 2020:

These are the only two types of events with top-level URLs in the API.

@softwarefactory-project-zuul (Contributor): Build failed.

@@ -128,7 +131,7 @@ def work_loop(self, queue, finished, idx, *args):
        if os.getppid() != ppid:
            break
        try:
            body = queue.get(block=True, timeout=1)
            body = self.read(queue)


A Member commented:

no, looks like this change is just to accommodate the pattern in the callback-specific worker code

@ryanpetrello (Contributor, Author) commented Jan 9, 2020:

Right, this is just to persist the behavior for the base case (the dispatcher).

@@ -573,9 +573,6 @@ def IS_TESTING(argv=None):
# Additional environment variables to be passed to the ansible subprocesses
AWX_TASK_ENV = {}

# Flag to enable/disable updating hosts M2M when saving job events.
CAPTURE_JOB_EVENT_HOSTS = False
A Member commented:

This was the original shortcut to avoid doing this if someone needed to eke out performance gains.

@ryanpetrello (Contributor, Author) commented Jan 9, 2020:

Looks like it got turned off as default behavior years ago - 🤷‍♂

@ryanpetrello (Contributor, Author) commented:

        try:
            return queue.get(block=True, timeout=.25)
        except QueueEmpty:
            return 'FLUSH'
@ryanpetrello (Contributor, Author) commented Jan 9, 2020:

If there haven't been any new events from RMQ in .25 seconds, flush with bulk_create.


    def flush(self):
        should_flush = (
            any([len(events) == 2000 for events in self.buff.values()]) or
@ryanpetrello (Contributor, Author) commented Jan 9, 2020:

2000 is probably overkill; I doubt we'd see this many events (effectively, 8k/s) per worker process from a real Ansible playbook, because playbooks that are actually doing real work don't emit events this fast.

        job_identifier = 'unknown job'
        job_key = 'unknown'
        for key in event_map.keys():
        for key, cls in event_map.items():
@ryanpetrello (Contributor, Author) commented Jan 9, 2020:

We've actually got to track a dictionary of buffers indexed by event class. That way, if we get:

JobEvent, JobEvent, ProjectUpdateEvent, InventoryUpdateEvent, JobEvent

then when it's time to flush, we run:

JobEvent.objects.bulk_create([JE, JE, JE])
ProjectUpdateEvent.objects.bulk_create([PUE])
InventoryUpdateEvent.objects.bulk_create([IUE])
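A minimal sketch of that bookkeeping (the class and method names here are illustrative, not the PR's exact ones):

from collections import defaultdict

class EventBuffer:
    def __init__(self):
        self.buff = defaultdict(list)  # {EventClass: [unsaved instances]}

    def add(self, cls, event):
        self.buff[cls].append(event)

    def flush(self):
        for cls, events in self.buff.items():
            if events:
                # one bulk INSERT per event class
                cls.objects.bulk_create(events)
        self.buff = defaultdict(list)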

@@ -104,7 +133,14 @@ def _save_event_data():
        retries = 0
        while retries <= self.MAX_RETRIES:
            try:
                _save_event_data()
                kwargs = cls.create_from_data(**body)
                workflow_job_id = kwargs.pop('workflow_job_id', None)
@ryanpetrello (Contributor, Author) commented Jan 9, 2020:

workflow_job_id isn't an actual column, so we can't pass it to the event's __init__().

Instead, we use setattr on the instance so that the message sent to external loggers contains it.

related: #4731
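In sketch form (simplified from the diff context above):

kwargs = cls.create_from_data(**body)
workflow_job_id = kwargs.pop('workflow_job_id', None)
event = cls(**kwargs)  # only real columns may go to __init__
if workflow_job_id:
    # carried in-memory only, so external log emitters still see it
    setattr(event, 'workflow_job_id', workflow_job_id)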

@softwarefactory-project-zuul (Contributor): Build failed.

        except (AttributeError, TypeError):
            pass

        if isinstance(self, JobEvent):
            hostnames = self._hostnames()
@ryanpetrello (Contributor, Author) commented Jan 9, 2020:

This is all job-event-specific processing that happens for playbook_on_stats, which I've moved here for consistency.

@ryanpetrello (Contributor, Author) commented Jan 9, 2020:

I tested this by spinning up two m3.2xlarge instances in EC2, one for AWX and one for PostgreSQL.

Given that this is an 8-core box, I decided to set JOB_EVENT_WORKERS = 8.

Next, I stopped the callback receiver and pushed 1.5M rows into the callback_tasks queue:

~ awx-manage shell_plus
>>> from awx.main.queue import CallbackQueueDispatcher
>>> d = CallbackQueueDispatcher()
>>> import uuid
>>> for i in range(1500000):
...     d.dispatch({'uuid': str(uuid.uuid4()), 'stdout': 'Line %d' % i, 'job_id': 149})

(...and waited a few minutes)

Next, I restarted the callback receiver and waited to see what throughput looked like:

awx=> SELECT COUNT(*) FROM main_jobevent WHERE created > now() - '1 minute'::interval;
 count
--------
 109279
(1 row)

So, around 1,800 events per second (109,279 events / 60 s ≈ 1,821/s).

I couldn't continue the experiment further because I ran out of disk space 😂.

@softwarefactory-project-zuul (Contributor): Build succeeded. (×6)

class Command(BaseCommand):

    def handle(self, *args, **options):
        super(Command, self).__init__()
A Member commented:

is there any reason that handle calls the super __init__?

@ryanpetrello (Contributor, Author) commented Jan 14, 2020:

Nope, I think that's just a copy-paste typo. Good eye.
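The corrected shape would drop the stray super() call (the body here is hypothetical; the real command profiles event insertion rates):

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = 'Profile the insertion rate of job events.'

    def handle(self, *args, **options):
        # no super(Command, self).__init__() here; handle() just does the work
        pass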

@softwarefactory-project-zuul (Contributor): Build succeeded.

additionally, optimize away several per-event host lookups and
changed/failed propagation lookups

we've always performed these (fairly expensive) queries *on every event
save* - if you're processing tens of thousands of events in short
bursts, this is way too slow

this commit also introduces a new command for profiling the insertion
rate of events, `awx-manage callback_stats`

see: ansible#5514
@AlanCoding (Member) left a comment:

The only other thing I want to say is that I really want to have more things planned for the management command. But I have no problem with adding it at the moment, or having this as its default behavior.

@softwarefactory-project-zuul (Contributor): Build succeeded.

@softwarefactory-project-zuul (Contributor): Build succeeded (gate pipeline).

@softwarefactory-project-zuul (bot) merged commit b12c2a1 into ansible:devel Jan 14, 2020
ryanpetrello added commits to ryanpetrello/awx that referenced this pull request Jan 22, 2020:

the callback receiver is still fairly slow when logging is enabled due
to constant setting lookups; this speeds things up considerably

related: ansible#5618