Investigating performance issues with good_job #896
@ollym oooh, nice find! That function scan looks bad. I could imagine it being one of two problems:
Let me know if you're able to profile it on your side. Otherwise I'm happy to make the change speculatively.
@bensheldon this is the change we're using in production: Will report back on what difference it makes.
Closing the PR because after 24hrs we observed no difference in performance; it's still the 3rd worst-performing query by load time even with the change. So unfortunately it appears related to querying
@ollym nuts! Well, it's not a difficult change for me to simply take a 2nd lock on the record to accomplish the same thing. It will lead to one more query to unlock the job, but I have to imagine that's still more performant than what you're seeing by querying

Sort of on a tangent: can you confirm that this query is in fact slower than the immediately preceding query that fetches-and-locks the job in the first place? (In other words, if this is the 3rd slowest query in your app, is a different GoodJob query the 1st or 2nd?)
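A rough sketch of the "2nd lock" approach described above, assuming Lockable-style advisory_lock/advisory_unlock helpers; this is not the actual change from the linked PR, which isn't shown here:

```ruby
# Hedged sketch only -- not the PR's actual diff. Take an additional session-level
# advisory lock on the fetched record so "do I own the lock?" can be answered without
# re-querying pg_locks, then release it with one extra unlock query afterwards.
execution.advisory_lock
begin
  execution.perform
ensure
  execution.advisory_unlock
end
```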
@bensheldon The first 2 queries are completely unrelated to

Further down the list, we also have this:

SELECT
"good_jobs"."active_job_id"
FROM
"good_jobs"
LEFT JOIN
pg_locks
ON
pg_locks.locktype = $3
AND pg_locks.objsubid = $4
AND pg_locks.classid = ($5 || SUBSTR(MD5($6 || $7 || "good_jobs"."active_job_id"::text), $8, $9))::bit(32)::int
AND pg_locks.objid = (($10 || SUBSTR(MD5($11 || $12 || "good_jobs"."active_job_id"::text), $13, $14))::bit(64) << $15)::bit(32)::int
WHERE
"good_jobs"."finished_at" IS NULL
AND "good_jobs"."concurrency_key" = $1
AND "pg_locks"."locktype" IS NOT NULL
ORDER BY
COALESCE(performed_at, scheduled_at, created_at) ASC
LIMIT
$2

Query plan for that one here:

Likely for the same reason, but less impactful because we don't use concurrency keys on all jobs.

TL;DR: querying

We're processing around 2,000 jobs per minute, with spikes up to 10X that, and performance on

If you have a PR you want us to test out in production, I'm happy to.
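As an aside on the concurrency-key query above: a partial index matching its filter can keep that lookup cheap as the table grows. The index below is purely illustrative (GoodJob may already ship an equivalent) and is not something proposed in this thread:

```sql
-- Illustrative partial index for the concurrency-key lookup; check the existing
-- indexes on good_jobs before adding anything like this.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_good_jobs_unfinished_concurrency_key
  ON good_jobs (concurrency_key, created_at)
  WHERE finished_at IS NULL;
```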
Running it in production now. Will report back with results.
@bensheldon So far so good! It's dropped out of our top 100 queries, so I'd say this is a massive performance boost, and honestly probably the preferred way to check locks with Postgres (rather than querying the pg_locks view). We still see this one in our logs though:

SELECT
"good_jobs"."active_job_id"
FROM
"good_jobs"
LEFT JOIN
pg_locks
ON
pg_locks.locktype = $3
AND pg_locks.objsubid = $4
AND pg_locks.classid = ($5 || SUBSTR(MD5($6 || $7 || "good_jobs"."active_job_id"::text), $8, $9))::bit(32)::int
AND pg_locks.objid = (($10 || SUBSTR(MD5($11 || $12 || "good_jobs"."active_job_id"::text), $13, $14))::bit(64) << $15)::bit(32)::int
WHERE
"good_jobs"."finished_at" IS NULL
AND "good_jobs"."concurrency_key" = $1
AND "pg_locks"."locktype" IS NOT NULL
ORDER BY
COALESCE(performed_at, scheduled_at, created_at) ASC
LIMIT
$2

I assume that's just because you haven't applied the same fix to the concurrency logic. I'd say even at this early stage that it's a big win, and wherever you're querying
@bensheldon just reporting back 24hrs in: there's no trace of the query in our logs, so all good!
@ollym Bad news: I did some stress testing and I'm not confident that merely taking a 2nd lock is fully safe. I think you should go back to the release version, despite the performance problem. I need to do more investigation, but I'm occasionally seeing jobs be performed twice when taking the 2nd lock.
Have you thought about wrapping the execution in a transaction and then using pg_advisory_xact_lock? We use something similar internally for something else and it's been great.
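For readers unfamiliar with transaction-level advisory locks, here is a minimal sketch of the suggestion; the key derivation mirrors the md5/bit(64) pattern visible in the queries above, but the exact key scheme here is illustrative:

```sql
-- Minimal sketch of pg_advisory_xact_lock: the lock is tied to the transaction and
-- released automatically at COMMIT or ROLLBACK, so no separate unlock query is needed.
BEGIN;
SELECT pg_advisory_xact_lock(
  ('x' || substr(md5('good_jobs' || '-' || 'some-active-job-id'), 1, 16))::bit(64)::bigint
);
-- ... perform the job inside the transaction ...
COMMIT;  -- advisory lock released here
```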
Here's what I'm observing with my change in #898; it's also the clearest understanding I've had of the challenges of GoodJob's CTE-advisory-lock strategy. Good/Expected: the query results correctly match the where-conditions and limit count, and an advisory lock has been acquired on the record. This happens most of the time. Sometimes, bad/unexpected things happen.
I'm not quite sure whether these can all happen at once, or whether one precludes the others, i.e. either the correct record is locked or the wrong record is locked, but never two records incorrectly locked; I'm not sure. As it is, checking whether the current record is locked is correct (and I'm thinking I should also reload and double-check that
Unfortunately, I don't want to wrap the entire job execution within a transaction. I think that would have its own performance problems.
@bensheldon what are your concerns about performance? I've been discussing this with the team and we can't think of a good reason why you should need advisory locks at all. We think this method:

def self.perform_with_advisory_lock(parsed_queues: nil, queue_select_limit: nil)
  execution = nil
  result = nil
  unfinished.dequeueing_ordered(parsed_queues).only_scheduled.limit(1).with_advisory_lock(unlock_session: true, select_limit: queue_select_limit) do |executions|
    execution = executions.first
    break if execution.blank?
    break :unlocked unless execution&.executable?
    yield(execution) if block_given?
    result = execution.perform
  end
  execution&.run_callbacks(:perform_unlocked)
  result
end

could be made dramatically simpler, something like:

def self.perform_with_advisory_lock(parsed_queues: nil, queue_select_limit: nil)
  execution = nil
  result = nil
  transaction do
    execution = unfinished.dequeueing_ordered(parsed_queues).only_scheduled.lock('FOR UPDATE SKIP LOCKED').first
    break if execution.blank?
    yield(execution) if block_given?
    result = execution.perform
  end
  execution&.run_callbacks(:perform_unlocked)
  result
end

and just leverage the locking capabilities on the execution row itself, rather than having to mess around with advisory locks, querying the pg_locks table, etc. Curious about your approach?
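For context, this is roughly the SQL that the FOR UPDATE SKIP LOCKED version above would issue for a dequeue; the column names and ordering mirror the dequeue query elsewhere in this thread, but the exact statement ActiveRecord generates may differ:

```sql
-- Illustrative dequeue with row-level locking: rows already locked by another
-- worker are skipped instead of waited on, so workers don't contend.
SELECT "good_jobs".*
FROM "good_jobs"
WHERE "good_jobs"."finished_at" IS NULL
  AND ("good_jobs"."scheduled_at" <= now() OR "good_jobs"."scheduled_at" IS NULL)
ORDER BY priority ASC NULLS LAST, "good_jobs"."created_at" ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;
```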
The downsides of long-running transactions in a job runner where jobs could run for hours or days, as I understand them, are:
The job runner could check out and maintain its own dedicated connection that it uses for these locks. Agreed, that would mean, for example, that updating state on the job record itself wouldn't be possible during an execution, but I'd argue that's a benefit. Move state change/progress reporting off the job row itself - maybe into a new table. Using

The remaining issue is long-running jobs (hours or days); the general consensus is that this is bad practice anyway, and you shouldn't be compromising library optimisations to support it. The solution is to break your job down into batches (which you have first-class support for). Large applications with swarms of job workers running in k8s pods that are constantly being autoscaled by queue size will inevitably kill a long-running job midway through anyway.

This is all definitely a target for v4 or beyond. Would love it if some others would give their thoughts too.
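A minimal sketch of the "dedicated connection" idea above, using standard ActiveRecord connection-pool APIs; this is illustrative only, not how GoodJob currently manages its connections, and the lock key is a placeholder:

```ruby
# Illustrative sketch: hold one connection purely for advisory locks, so long-held
# locks don't tie up the regular pool used for job work. The key 12345 is a placeholder.
lock_connection = ActiveRecord::Base.connection_pool.checkout
begin
  lock_connection.execute("SELECT pg_advisory_lock(12345)")
  # ... perform the job using other pooled connections ...
  lock_connection.execute("SELECT pg_advisory_unlock(12345)")
ensure
  ActiveRecord::Base.connection_pool.checkin(lock_connection)
end
```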
@bensheldon this query has now taken the 2nd spot in the query list, so we're going to look at doing a good_job fork to change the behaviour to use
@ollym I'll be curious to see what you come up with. I'm slowly making progress towards using row-level locks (more details in #831), but you'll have a big head start not being constrained by everyone else and semantic versioning. I've tried standing on best practice, but gave that up. In terms of the steps in my mind, that will probably be 2 major versions away:
@bensheldon the team came up with another idea I wanted to run by you.

(1) Adjust

transaction do
  job = job_query.where(process_id: nil).lock('FOR UPDATE SKIP LOCKED').first
  job.update!(process_id: process_id)
end
job.perform

(4) Keep track of process_ids in the good_job_processes table, and in the event of a crash etc., just make the process_id column null again and it'll get picked up by the next worker.

This would avoid having to perform the job within a transaction, and should handle recovering from SIGKILLs and network partitions if you implement point 2. What do you think?
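A minimal sketch of the crash-recovery step described in point (4), assuming a hypothetical process_id column on good_jobs that references good_job_processes (GoodJob's current schema does not have this column in this form):

```sql
-- Hypothetical recovery query for the idea above (process_id is an assumed column):
-- release claims whose worker process is no longer registered, so another worker
-- can pick the job up.
UPDATE good_jobs
SET process_id = NULL
WHERE finished_at IS NULL
  AND process_id IS NOT NULL
  AND process_id NOT IN (SELECT id FROM good_job_processes);
```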
Edit: Sorry, I should have read your proposal, which appears to be exactly what I mentioned. I also wanted to share https://github.com/rainforestapp/queue_classic_plus/blob/master/lib/queue_classic_plus/queue_classic/queue.rb#L13-L33
I just released v3.15.7 with #946. I don't think it will address the big-picture performance issues, but it might be a small improvement. From spending some time

From casual searching, no one is saying how to optimize that, so I assume it's not possible. Thanks for sharing the SQL from
Just deployed
@bensheldon we're currently running the latest version; it's still the worst-performing query over the last 24hrs. This is the query:

SELECT
"good_jobs".*
FROM
"good_jobs"
WHERE
"good_jobs"."id" IN (
WITH
"rows" AS MATERIALIZED (
SELECT
"good_jobs"."id",
"good_jobs"."active_job_id"
FROM
"good_jobs"
WHERE
"good_jobs"."queue_name" = $2
AND "good_jobs"."finished_at" IS NULL
AND ("good_jobs"."scheduled_at" <= $3
OR "good_jobs"."scheduled_at" IS NULL)
ORDER BY
priority ASC NULLS LAST,
"good_jobs"."created_at" ASC
LIMIT
$4 )
SELECT
"rows"."id"
FROM
"rows"
WHERE
pg_try_advisory_lock(($5 || SUBSTR(MD5($6 || $7 || "rows"."active_job_id"::text), $8, $9))::bit(64)::bigint)
LIMIT
$1)
ORDER BY
priority ASC NULLS LAST,
"good_jobs"."created_at" ASC

The query plan being reported in GCP is this:
Our team has the following suggestions that might improve the query:
Something like this:

SELECT
"good_jobs".*
FROM
"good_jobs"
WHERE
"good_jobs"."queue_name" = $2
AND "good_jobs"."finished_at" IS NULL
AND "good_jobs"."scheduled_at" <= $3
AND pg_try_advisory_lock(($5 || SUBSTR(MD5($6 || $7 || "good_jobs"."active_job_id"::text), $8, $9))::bit(64)::bigint)
ORDER BY
"good_jobs"."priority" ASC NULLS LAST,
"good_jobs"."created_at" ASC
LIMIT $1
@ollym I have the same performance issue as you do, and found that a reindex of the

For the same query, the explain output:
You can see the query latency drop in the following screenshot:

As the whole point of a job is to be finished at some point, I think that all indexes having a

I've tried to reindex all 3 of the others and went down from
to
Any thoughts?
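For reference, a reindex like the one described can be done without blocking writes on PostgreSQL 12+; the index name below is illustrative, so substitute the actual GoodJob index name from your schema:

```sql
-- Illustrative: rebuild a bloated partial index in place without taking write locks.
REINDEX INDEX CONCURRENTLY index_good_jobs_on_queue_name_and_scheduled_at;
```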
fyi, I've removed the verification of the advisory lock (e.g. querying ...and released just now in
@bensheldon we rolled out

SELECT
$3 AS one
FROM
"good_jobs"
LEFT JOIN
pg_locks
ON
pg_locks.locktype = $4
AND pg_locks.objsubid = $5
AND pg_locks.classid = ($6 || SUBSTR(MD5($7 || $8 || "good_jobs"."active_job_id"::text), $9, $10))::bit(32)::int
AND pg_locks.objid = (($11 || SUBSTR(MD5($12 || $13 || "good_jobs"."active_job_id"::text), $14, $15))::bit(64) << $16)::bit(32)::int
WHERE
"good_jobs"."finished_at" IS NULL
AND ("pg_locks"."pid" = pg_backend_pid())
AND "good_jobs"."id" = $1
LIMIT
$2

Query results in:

Interestingly, the function scan is the slowest part, which makes me think having
@ollym That query should no longer be taking place when jobs are dequeued/performed; it was removed entirely: https://github.com/bensheldon/good_job/pull/1113/files#diff-9d00af5d59d9a2b36d9ae265b052ec47618988f8d6e648b001901a7da8c3dc8aL493-R497

Are you still seeing the query being run after upgrading to v3.20.0?
@bensheldon sorry for the slow reply. So yes, that query is no longer performed, which is great. The other query was still slow and still our worst performer. We process 1,500 jobs a minute on average, and with the default configuration our good_jobs table had 20M records in it. We changed the configuration to:
(default was 14 days)

And the query is now down to 7/8 on our list, which is great. We also occasionally saw the issue reported here:

And running

Ideally we do not want to preserve jobs at all (we simply have too many), but that will break cron (which we use). So @bensheldon, in conjunction with #1130 and getting CRON to work with un-preserved jobs #927, we should be in a good final place and I'll close this issue.
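The exact retention value the team switched to isn't shown above. For illustration, GoodJob's preserved-job retention is configured roughly like this (the 1-day value here is only an example; the default is 14 days):

```ruby
# config/initializers/good_job.rb -- example value only; the actual value the team
# chose is not shown in this thread.
Rails.application.configure do
  config.good_job.cleanup_preserved_jobs_before_seconds_ago = 1.day.to_i
end
```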
Our team is investigating performance bottlenecks with high-volume queries on the database, and over the last 24hrs good_job has the 3rd worst query by total load time. The query specifically is this one:
The query plan analysis looks like this:
Specifically, the Function Scan seems suspicious, and we suspect it's because when you use pg_backend_pid() in the query it has to call the function on every individual row, and nothing can be optimised by the query engine. Tracking the issue down specifically to here:
https://github.com/bensheldon/good_job/blob/main/app/models/good_job/lockable.rb#L345-L360
How about you rewrite this:
To this:
Which will avoid the function scan.
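The before/after snippets referenced here weren't captured in this copy of the issue. As a rough illustration of the kind of rewrite being described (fetch the backend PID once and pass it in as a bind value instead of calling pg_backend_pid() inside the query), it might look something like this; the scope and method names are assumptions, not GoodJob's actual code:

```ruby
# Hypothetical sketch only -- not the issue's actual proposal. Fetch the current
# backend PID once, then filter pg_locks with a plain bind value rather than
# invoking pg_backend_pid() within the query itself.
def self.owns_advisory_lock_rows
  backend_pid = connection.select_value("SELECT pg_backend_pid()")
  joins_advisory_locks.where("pg_locks.pid = ?", backend_pid)
end
```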
Creating this as an issue instead of a PR, as we haven't tried it yet, so we're not sure whether that's the main issue or not.