
Investigating performance issues with good_job #896

Open
ollym opened this issue Mar 17, 2023 · 28 comments

ollym commented Mar 17, 2023

Our team is investigating performance bottlenecks with high-volume queries on the database, and over the last 24 hours good_job has the 3rd worst query by total load time.

The query specifically is this one:

SELECT
  $3 AS one
FROM
  "good_jobs"
LEFT JOIN
  pg_locks
ON
  pg_locks.locktype = $4
  AND pg_locks.objsubid = $5
  AND pg_locks.classid = ($6 || SUBSTR(MD5($7 || $8 || "good_jobs"."active_job_id"::text), $9, $10))::bit(32)::int
  AND pg_locks.objid = (($11 || SUBSTR(MD5($12 || $13 || "good_jobs"."active_job_id"::text), $14, $15))::bit(64) << $16)::bit(32)::int
WHERE
  "good_jobs"."finished_at" IS NULL
  AND ("pg_locks"."pid" = pg_backend_pid())
  AND "good_jobs"."id" = $1
LIMIT
  $2

The query plan analysis looks like this:

[screenshot: query plan analysis (CleanShot 2023-03-17 at 12 49 41)]

Specifically, the function scan seems suspicious, and we suspect it's because using pg_backend_pid() in the query forces the function to be called for every individual row, so nothing can be optimised by the query planner.

We tracked the issue down to here:
https://github.com/bensheldon/good_job/blob/main/app/models/good_job/lockable.rb#L345-L360

How about rewriting this:

SELECT 1 AS one
FROM pg_locks
WHERE pg_locks.locktype = 'advisory'
  AND pg_locks.objsubid = 1
  AND pg_locks.classid = ('x' || substr(md5($1::text), 1, 16))::bit(32)::int
  AND pg_locks.objid = (('x' || substr(md5($2::text), 1, 16))::bit(64) << 32)::bit(32)::int
  AND pg_locks.pid = pg_backend_pid()

To this:

WITH pg_backend AS MATERIALIZED (SELECT pg_backend_pid() AS pid)
SELECT 1 AS one
FROM pg_locks, pg_backend
WHERE pg_locks.locktype = 'advisory'
  AND pg_locks.objsubid = 1
  AND pg_locks.classid = ('x' || substr(md5($1::text), 1, 16))::bit(32)::int
  AND pg_locks.objid = (('x' || substr(md5($2::text), 1, 16))::bit(64) << 32)::bit(32)::int
  AND pg_locks.pid = pg_backend.pid

Which will avoid the function scan.

I'm creating this as an issue instead of a PR because we haven't tried it yet, so we're not sure whether this is the main issue or not.

@bensheldon

@ollym oooh, nice find! That function scan looks bad.

I could imagine it being one of two problems:

  • The function scan. I like your solution of turning that into a Materialized CTE (I think it could also be a nested select, but I like CTEs)
  • Querying pg_locks being slow, which is more difficult to address.

Let me know if you're able to profile it on your side. Otherwise I'm happy to make the change speculatively.


ollym commented Mar 17, 2023

@bensheldon this is the change we're using in production:
#897

Will report back on what difference it's made


ollym commented Mar 18, 2023

Closing the PR because after 24 hours we observed no difference in performance; it's still the 3rd worst-performing query by load time even with the change.

So unfortunately it appears to be related to querying pg_locks.


bensheldon commented Mar 18, 2023

@ollym nuts!

Well, it's not a difficult change for me to simply take a 2nd lock on the record to accomplish the same thing. It will lead to one more query to unlock the job, but I have to imagine that's still more performant than what you're seeing from querying pg_locks.
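For reference, a minimal SQL sketch of what that re-lock check could look like, assuming GoodJob's md5-derived advisory-lock key (the literal active_job_id is a placeholder):

-- Sketch: instead of joining against pg_locks, try taking the advisory lock again.
-- Session-level advisory locks are reentrant, so if this session already holds the
-- lock, this succeeds and increments the lock count (hence the extra unlock query
-- later). Note it also succeeds when no session holds the lock at all, so on its
-- own it is not a strict ownership check.
SELECT pg_try_advisory_lock(
  ('x' || substr(md5('<active_job_id>'), 1, 16))::bit(64)::bigint
) AS locked;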

Sort of on a tangent: can you confirm that this query is in fact slower than the immediately preceding query that fetches-and-locks the job in the first place? (In other words, if this is the 3rd slowest query in your app, is a different GoodJob query the 1st or 2nd?)


ollym commented Mar 18, 2023

@bensheldon The first 2 queries are completely unrelated to good_job and fully expected to be up there.

Further down the list, we also have this:

SELECT
  "good_jobs"."active_job_id"
FROM
  "good_jobs"
LEFT JOIN
  pg_locks
ON
  pg_locks.locktype = $3
  AND pg_locks.objsubid = $4
  AND pg_locks.classid = ($5 || SUBSTR(MD5($6 || $7 || "good_jobs"."active_job_id"::text), $8, $9))::bit(32)::int
  AND pg_locks.objid = (($10 || SUBSTR(MD5($11 || $12 || "good_jobs"."active_job_id"::text), $13, $14))::bit(64) << $15)::bit(32)::int
WHERE
  "good_jobs"."finished_at" IS NULL
  AND "good_jobs"."concurrency_key" = $1
  AND "pg_locks"."locktype" IS NOT NULL
ORDER BY
  COALESCE(performed_at, scheduled_at, created_at) ASC
LIMIT
  $2

Query plan for that one here:

[screenshot: query plan (CleanShot 2023-03-18 at 16 34 50)]

Likely for the same reason, but less impactful because we don't use concurrency keys on all jobs.

TL;DR: querying pg_locks appears to be a performance bottleneck, at our scale at least, and I think it would be better avoided if at all possible.

We're processing around 2000 jobs per minute but with spikes up to 10X that, and performance on pg_locks doesn't appear to be linear, so when the job backlog starts to increase, the query latency does as well:

[screenshot: query latency graph (CleanShot 2023-03-18 at 16 39 28)]

If you have a PR you want us to test out in production I'm happy to.


bensheldon commented Mar 18, 2023

@ollym I just pushed up #898. I have to head out for the day, but I'll look at it more later. Thank you again for working on this with me!


ollym commented Mar 18, 2023

Running it in production now. Will report back results.


ollym commented Mar 18, 2023

@bensheldon So far so good! It's dropped out of our top 100 queries, so I'd say this is a massive performance boost, and honestly probably the preferred way to check locks with Postgres (rather than querying the pg_locks view).

We still see this one in our logs though:

SELECT
  "good_jobs"."active_job_id"
FROM
  "good_jobs"
LEFT JOIN
  pg_locks
ON
  pg_locks.locktype = $3
  AND pg_locks.objsubid = $4
  AND pg_locks.classid = ($5 || SUBSTR(MD5($6 || $7 || "good_jobs"."active_job_id"::text), $8, $9))::bit(32)::int
  AND pg_locks.objid = (($10 || SUBSTR(MD5($11 || $12 || "good_jobs"."active_job_id"::text), $13, $14))::bit(64) << $15)::bit(32)::int
WHERE
  "good_jobs"."finished_at" IS NULL
  AND "good_jobs"."concurrency_key" = $1
  AND "pg_locks"."locktype" IS NOT NULL
ORDER BY
  COALESCE(performed_at, scheduled_at, created_at) ASC
LIMIT
  $2

I assume that's just because you haven't applied the same fix to the concurrency logic. I'd say even at this early stage that it's a big win, and wherever you're querying pg_locks you can just replace it with another try-advisory-lock.


ollym commented Mar 19, 2023

@bensheldon just reporting back 24 hours in: there's no trace of the query in our logs, so all good!

@bensheldon

@ollym Bad news: I did some stress testing and I'm not confident that merely taking a 2nd lock is fully safe. I think you should go back to the release version, despite the performance problem.

I need to do more investigation, but I'm occasionally seeing jobs be performed twice when taking the 2nd lock.


ollym commented Mar 19, 2023

Have you thought about wrapping the execution in a transaction and then using pg_advisory_xact_lock?

We use something similar internally for something else and it's been great.
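For illustration, a rough sketch of the transaction-level variant (the key derivation mirrors the queries above; the literal id is a placeholder):

BEGIN;
-- pg_advisory_xact_lock blocks until the lock is acquired and releases it
-- automatically at COMMIT/ROLLBACK, so there's no explicit unlock and no
-- orphaned session lock if the backend dies.
SELECT pg_advisory_xact_lock(
  ('x' || substr(md5('<active_job_id>'), 1, 16))::bit(64)::bigint
);
-- ...perform the job while the transaction is open...
COMMIT;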

@bensheldon

Here's what I'm observing with my change in #898, which is also the clearest understanding I've had of the challenges of GoodJob's CTE-advisory-lock strategy.

Good/expected: the query results correctly match the where-conditions and limit count, and the query has acquired an advisory lock on the record. This happens most of the time.

Sometimes, bad/unexpected things happen.

  • The result doesn't match the where-conditions: a record is returned that has already been returned. My speculation is that something like this is happening:
    1. Thread 1: starts the query and begins materializing the CTE results
    2. Thread 2: finishing up execution, updates the record and unlocks the advisory lock
    3. Thread 1: having materialized the CTE, takes an advisory lock on the record
  • The returned record doesn't have an advisory lock held by the current connection.
  • A different, unreturned record has an advisory lock held by the current connection. I'm not quite sure why this or the preceding happens, given that pg_try_advisory_lock is one of the where-conditions.

I'm not quite sure whether these can all happen at once, or whether one precludes the others (i.e. either the correct record is locked or the wrong record is locked, but never will 2 records be incorrectly locked); I'm not sure.

As it is, checking whether the current record is locked is correct (and I'm thinking I should also reload and double-check that performed_at is null too). I will try to make the lock-check query faster.

Have you thought about wrapping the execution in a transaction and then using pg_advisory_xact_lock?

Unfortunately, I don't want to wrap the entire job execution within a transaction. I think that would have its own performance problems.


ollym commented Mar 21, 2023

@bensheldon what are your concerns about performance? I've been discussing it with the team and we can't think of a good reason why you should need advisory locks at all. We think this method:

def self.perform_with_advisory_lock(parsed_queues: nil, queue_select_limit: nil)
  execution = nil
  result = nil
  unfinished.dequeueing_ordered(parsed_queues).only_scheduled.limit(1).with_advisory_lock(unlock_session: true, select_limit: queue_select_limit) do |executions|
    execution = executions.first
    break if execution.blank?
    break :unlocked unless execution&.executable?

    yield(execution) if block_given?
    result = execution.perform
  end
  execution&.run_callbacks(:perform_unlocked)

  result
end

Could be made dramatically simpler, to something like:

def self.perform_with_advisory_lock(parsed_queues: nil, queue_select_limit: nil)
  execution = nil
  result = nil

  transaction do
    execution = unfinished.dequeueing_ordered(parsed_queues).only_scheduled.lock('FOR UPDATE SKIP LOCKED').first
    break if execution.blank?

    yield(execution) if block_given?
    result = execution.perform
  end
  execution&.run_callbacks(:perform_unlocked)

  result
end

And just leverage the locking capabilities on the execution row itself, rather than having to mess around with advisory locks or querying the pg_locks table, etc.

Curious what you think of this approach?

@bensheldon

The downsides of long-running transactions in a job runner, where jobs could run for hours or days, as I understand them are:

  • VACUUM problems. Postgres has to maintain its log all the way back to the start of the oldest active transaction. There are also some other really bad states, like transaction ID wraparound, that can happen.
  • Implicit locks and deadlocks. It's a lot easier to inadvertently run a conflicting UPDATE query during the normal course of job administration/Dashboard use than it is to trigger a conflict with an advisory lock. It's more to be mindful of, and easier to get wrong.
  • Application concerns. I don't want to burden the application developer with knowing their code is running inside a transaction (and maybe they don't want it to). I think it's possible to make the transaction unjoinable, but I dunno if that's perfect.


ollym commented Mar 22, 2023

The job runner could checkout and maintain its own dedicated connection that it uses for these locks.

Agreed, that would mean, for example, that updating state on the job record itself wouldn't be possible during an execution, but I'd argue that's a benefit: move state-change/progress reporting off the job row itself, maybe into a new table.

FOR UPDATE SKIP LOCKED is the feature specifically intended for what you're trying to do, so PG will be heavily optimised for it, as opposed to the workaround of querying pg_locks and using advisory locks, which I think will just continue to get more twisted and convoluted as the library and the workloads that use it grow.
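As a rough sketch (column names and ordering taken from the queries above), the claim step would reduce to something like:

BEGIN;
-- Claim one runnable job; SKIP LOCKED makes concurrent workers silently skip rows
-- already row-locked by another transaction instead of blocking on them.
SELECT *
FROM good_jobs
WHERE finished_at IS NULL
  AND (scheduled_at <= now() OR scheduled_at IS NULL)
ORDER BY priority ASC NULLS LAST, created_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- ...perform the job, mark it finished, then COMMIT to release the row lock...
COMMIT;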

The remaining issue is long-running jobs (hours or days); the general consensus is that these are bad practice anyway, and you shouldn't compromise library optimisations to support them. The solution is to break your job down into batches (which you have first-class support for). Large applications with swarms of job workers running in k8s pods that are constantly autoscaled by queue size will inevitably kill a long-running job midway through anyway.

This is all definitely a target for v4 or beyond. Would love it if some others would give their thoughts too.


ollym commented Apr 19, 2023

@bensheldon this query has now taken the 2nd spot in the query list, so we're going to look at doing a good_job fork to change the behaviour to use FOR UPDATE SKIP LOCKED and avoid querying the pg_locks view directly.

@bensheldon

@ollym I'll be curious to see what you come up with.

I'm slowly making progress towards using row-level locks (more details in #831), but you'll have a big head start not being constrained by everyone else and semantic versioning. I've tried standing on best practice, but gave that up.

In terms of the steps in my mind, that will probably be 2 major versions away:


ollym commented Apr 28, 2023

@bensheldon the team came up with another idea I wanted to run by you.

(1) Adjust JobPerformer to take "process_id" parameter in its initializer.
(2) Add process_id (uuid) to Job to track which process is currently performing that job
(3) Adjust JobPerformer#next to do something like:

job = nil

transaction do
  job = job_query.where(process_id: nil).lock('FOR UPDATE SKIP LOCKED').first
  job&.update!(process_id: process_id) # claim the job for this worker process
end

job&.perform

(4) Keep track of process_ids in the good_job_processes table, and in the event of a crash etc., just make the process_id column null again and it'll get picked up by the next worker.

This would avoid having to perform the job within a transaction and should handle recovering from SIGKILLs and network partitions if you implement point 2.
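To make the recovery step concrete, a hedged sketch in SQL (the process_id column is the hypothetical one proposed above, and this assumes good_job_processes.id is the process uuid):

-- Release jobs claimed by processes that no longer have a row in good_job_processes,
-- so the next worker can pick them up.
UPDATE good_jobs
SET process_id = NULL
WHERE finished_at IS NULL
  AND process_id IS NOT NULL
  AND process_id NOT IN (SELECT id FROM good_job_processes);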

What do you think?


ollym commented Apr 28, 2023

Edit:

Sorry, I should have read your proposal, which appears to be exactly what I mentioned.

I also wanted to share https://github.com/rainforestapp/queue_classic_plus/blob/master/lib/queue_classic_plus/queue_classic/queue.rb#L13-L33,
which appears to be the exact implementation.

@bensheldon

I just released v3.15.7 with #946

I don't think it will address the big picture performance issues, but might be a small improvement.

From spending some time with EXPLAIN ANALYZE this morning, it looks like the function scan is coming from querying pg_locks, because that view calls into pg_lock_status():

[screenshot: EXPLAIN ANALYZE output (Screenshot 2023-04-30 at 12 17 55 PM)]

From casual searching, no one is saying how to optimize that, so I assume it's not possible.

Thanks for sharing the SQL from queue_classic_plus. That looks exactly right.


ollym commented May 1, 2023

Just deployed v3.15.7; will let you know if we notice any difference.


ollym commented Jun 15, 2023

@bensheldon we're currently running the latest version, and this is still our worst-performing query over the last 24 hours.

[screenshot: query performance overview (CleanShot 2023-06-15 at 10 48 31)]

This is the query:

SELECT
  "good_jobs".*
FROM
  "good_jobs"
WHERE
  "good_jobs"."id" IN (
  WITH
    "rows" AS MATERIALIZED (
    SELECT
      "good_jobs"."id",
      "good_jobs"."active_job_id"
    FROM
      "good_jobs"
    WHERE
      "good_jobs"."queue_name" = $2
      AND "good_jobs"."finished_at" IS NULL
      AND ("good_jobs"."scheduled_at" <= $3
        OR "good_jobs"."scheduled_at" IS NULL)
    ORDER BY
      priority ASC NULLS LAST,
      "good_jobs"."created_at" ASC
    LIMIT
      $4 )
  SELECT
    "rows"."id"
  FROM
    "rows"
  WHERE
    pg_try_advisory_lock(($5 || SUBSTR(MD5($6 || $7 || "rows"."active_job_id"::text), $8, $9))::bit(64)::bigint)
  LIMIT
    $1)
ORDER BY
  priority ASC NULLS LAST,
  "good_jobs"."created_at" ASC

The query plan being reported in GCP is this:

[screenshot: query plan (CleanShot 2023-06-15 at 10 49 52)]


ollym commented Jun 15, 2023

Our team has the following suggestions that might improve the query:

  1. Make good_jobs.scheduled_at not null, and use a default value of now()
  2. Remove the CTE

Something like this:

SELECT
  "good_jobs".*
FROM
  "good_jobs"
WHERE
  "good_jobs"."queue_name" = $2
  AND "good_jobs"."finished_at" IS NULL
  AND "good_jobs"."scheduled_at" <= $3
  AND pg_try_advisory_lock(($5 || SUBSTR(MD5($6 || $7 || "good_jobs"."active_job_id"::text), $8, $9))::bit(64)::bigint)
ORDER BY
  "good_jobs"."priority" ASC NULLS LAST,
  "good_jobs"."created_at" ASC
LIMIT $1


stem commented Oct 7, 2023

@ollym I have the same performance issue as you do, and found that a reindex of the index_good_jobs_on_concurrency_key_when_unfinished index helps a lot.

For the same query, the EXPLAIN output shows:

  • Execution Time went from ~375ms to ~22ms
  • Buffers: shared hit went from 107803 to 31

You can see the query latency drop in the following screenshot:
[screenshot: query latency drop (Screenshot 2023-10-07 at 18 46 26)]

As the whole point of a job is to be finished at some point, I think that all indexes having a WHERE finished_at IS NULL should be reindexed from time to time.

I've tried reindexing the 3 others as well, and the index sizes went down from

> \di+ index_good_job*_(unfinished|scheduled_at)
                                        List of relations
                            Name                             | Access method |  Size  
-------------------------------------------------------------+---------------+--------
 index_good_jobs_jobs_on_priority_created_at_when_unfinished | btree         | 226 MB 
 index_good_jobs_on_concurrency_key_when_unfinished          | btree         | 376 kB 
 index_good_jobs_on_queue_name_and_scheduled_at              | btree         | 232 MB 
 index_good_jobs_on_scheduled_at                             | btree         | 182 MB 
(4 rows)

to

> \di+ index_good_job*_(unfinished|scheduled_at)*
                                        List of relations
                            Name                             | Access method |  Size  
-------------------------------------------------------------+---------------+--------
 index_good_jobs_jobs_on_priority_created_at_when_unfinished | btree         | 544 kB 
 index_good_jobs_on_concurrency_key_when_unfinished          | btree         | 400 kB 
 index_good_jobs_on_queue_name_and_scheduled_at              | btree         | 640 kB 
 index_good_jobs_on_scheduled_at                             | btree         | 184 kB 
(4 rows)

Any thoughts?


bensheldon commented Oct 23, 2023

FYI, I've removed the verification of the advisory lock (i.e. querying pg_locks) during job performance: #1113

...and released just now in v3.20.0


ollym commented Oct 24, 2023

@bensheldon we rolled out v3.20 this morning and early indications are that it has made very little difference.

SELECT
  $3 AS one
FROM
  "good_jobs"
LEFT JOIN
  pg_locks
ON
  pg_locks.locktype = $4
  AND pg_locks.objsubid = $5
  AND pg_locks.classid = ($6 || SUBSTR(MD5($7 || $8 || "good_jobs"."active_job_id"::text), $9, $10))::bit(32)::int
  AND pg_locks.objid = (($11 || SUBSTR(MD5($12 || $13 || "good_jobs"."active_job_id"::text), $14, $15))::bit(64) << $16)::bit(32)::int
WHERE
  "good_jobs"."finished_at" IS NULL
  AND ("pg_locks"."pid" = pg_backend_pid())
  AND "good_jobs"."id" = $1
LIMIT
  $2

Query results in:

[screenshot: query plan (CleanShot 2023-10-24 at 10 32 53@2x)]

Interestingly, the function scan is the slowest part, which makes me think that having pg_backend_pid() within the query prevents any optimisations from taking place. Given this value is static for the connection, it should be possible to cache the pid against the connection and just include it as a parameter in the query; I suspect that will speed things up significantly.
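A sketch of what that might look like, assuming the pid is fetched once per connection (e.g. via SELECT pg_backend_pid()) and bound as an extra parameter ($17 here is hypothetical):

-- Same lock check, but comparing against a bound pid instead of calling
-- pg_backend_pid() for each row.
SELECT $3 AS one
FROM "good_jobs"
LEFT JOIN pg_locks
  ON pg_locks.locktype = $4
  AND pg_locks.objsubid = $5
  AND pg_locks.classid = ($6 || SUBSTR(MD5($7 || $8 || "good_jobs"."active_job_id"::text), $9, $10))::bit(32)::int
  AND pg_locks.objid = (($11 || SUBSTR(MD5($12 || $13 || "good_jobs"."active_job_id"::text), $14, $15))::bit(64) << $16)::bit(32)::int
WHERE "good_jobs"."finished_at" IS NULL
  AND "pg_locks"."pid" = $17
  AND "good_jobs"."id" = $1
LIMIT $2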

@bensheldon

@ollym That query should no longer be taking place when jobs are dequeued/performed; it was removed entirely: https://github.com/bensheldon/good_job/pull/1113/files#diff-9d00af5d59d9a2b36d9ae265b052ec47618988f8d6e648b001901a7da8c3dc8aL493-R497

Are you still seeing the query being run after upgrading to v3.20.0?


ollym commented Nov 3, 2023

@bensheldon sorry for the slow reply.

So yes, that query is no longer performed, which is great. The other query was still slow and still our worst-performing. We process 1,500 jobs a minute on average, and with the default configuration our good_jobs table had 20M records in it.

We changed the configuration to:

config.good_job.cleanup_preserved_jobs_before_seconds_ago = 10.minutes

(default was 14 days)

And the query is now down to 7th/8th on our list, which is great. We also occasionally saw the issue reported here:
#896 (comment)

And running REINDEX TABLE good_jobs and REINDEX TABLE good_job_executions also fixed the issue for us. When there were 20M rows it was locking the table for up to 2 minutes, which wasn't great, but now it runs in under a second, so we have it in a daily cron job.
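If the table lock ever becomes a problem again, PostgreSQL 12+ supports a CONCURRENTLY option for REINDEX that avoids blocking writes, at the cost of a longer rebuild; a rough sketch:

-- Rebuild the indexes without taking locks that block writes to the tables.
-- Note: REINDEX ... CONCURRENTLY cannot run inside a transaction block.
REINDEX TABLE CONCURRENTLY good_jobs;
REINDEX TABLE CONCURRENTLY good_job_executions;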

Ideally we would not preserve jobs at all (we simply have too many), but that would break cron, which we use. So @bensheldon, in conjunction with #1130 and getting cron to work with un-preserved jobs (#927), we should be in a good final place and I'll close this issue.
