Improve general performance for a variety of high-load job launch use cases #8403

Conversation

chrismeyersfsu
Member

reduce per-job database query count

Do not query the database, once per job, for the set of Instances that belong to the group we are trying to fit the job onto. Instead, cache the set of instances per instance group.
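The caching idea above can be sketched in plain Python. This is a hypothetical illustration, not AWX's actual task-manager code: `fetch_instances` stands in for the per-group database query that previously ran once per job.

```python
class InstanceGroupCache:
    """Fetch each instance group's instance set at most once per scheduling pass.

    Sketch only: `fetch_instances` is a callable mapping a group name to its
    list of instances (the expensive database query in the real code).
    """

    def __init__(self, fetch_instances):
        self._fetch = fetch_instances
        self._cache = {}

    def instances_for(self, group_name):
        # Hit the database only on the first lookup for each group;
        # every subsequent job scheduled into the group reuses the result.
        if group_name not in self._cache:
            self._cache[group_name] = self._fetch(group_name)
        return self._cache[group_name]
```

With N jobs targeting the same group, this turns N identical queries into one per scheduling pass.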
reduce parent->child lock contention

We update the parent unified job template to point at newly created jobs, and we update a similar foreign key when a job finishes running. This causes lock contention when the job template has allow_simultaneous set and many jobs from that template are running in parallel; I've seen a finishing job wait as long as 5 minutes for the lock.
This change moves the parent->child update OUTSIDE of the transaction if the job is allow_simultaneous (inherited from the parent unified job template). We sacrifice a bit of correctness for performance. The logic is: if you are launching 1,000 parallel jobs, do you really care that the job template points at the last one you launched? Probably not. If you do, you can always query the jobs related to the job template, sorted by creation time.
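The deferral described above uses Django's connection.on_commit hook to run the parent-pointer update only after the transaction commits. A minimal stand-in (not Django; ToyConnection and finish_job are hypothetical names) shows the mechanics:

```python
class ToyConnection:
    """Minimal stand-in for a database connection's on_commit hook."""

    def __init__(self):
        self._callbacks = []

    def on_commit(self, func):
        # Defer `func` until after the surrounding transaction commits, so
        # it does not hold the parent row's lock for the transaction's span.
        self._callbacks.append(func)

    def commit(self):
        callbacks, self._callbacks = self._callbacks, []
        for cb in callbacks:
            cb()


def finish_job(job, connection, update_parent):
    # allow_simultaneous jobs defer the parent-pointer update to dodge
    # lock contention; other jobs update inside the transaction as before.
    if getattr(job, "allow_simultaneous", False):
        connection.on_commit(update_parent)
    else:
        update_parent()
```

The trade-off is visible here: between commit and the deferred callback, the parent's foreign key briefly lags behind reality, which the PR accepts for simultaneous jobs.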

@softwarefactory-project-zuul
Contributor

Build succeeded.

# This dodges lock contention at the expense of the foreign key not being
# completely correct.
if getattr(self, 'allow_simultaneous', False):
connection.on_commit(self._update_parent_instance)
Contributor

Is this necessary if self.status == status_before?

Would this be better?

if self.status != status_before:
    if getattr(self, 'allow_simultaneous', False):
        connection.on_commit(self._update_parent_instance)
    else:
        self._update_parent_instance()

@chrismeyersfsu chrismeyersfsu force-pushed the fix-same_jt_abuse_devel branch from 7b0ce7a to 2eac5a8 Compare October 19, 2020 14:56
@softwarefactory-project-zuul
Contributor

Build succeeded.

@ryanpetrello ryanpetrello changed the title Fix same jt abuse devel Improve performance for a variety of high-load job launch use cases Oct 19, 2020
@ryanpetrello ryanpetrello changed the title Improve performance for a variety of high-load job launch use cases Improve general performance for a variety of high-load job launch use cases Oct 19, 2020
@ryanpetrello
Contributor

regate

@softwarefactory-project-zuul
Contributor

Build succeeded (gate pipeline).
