fix performance regression in all releases-views #1237

Merged: 1 commit, Jan 7, 2021

syphar
Member

@syphar syphar commented Jan 2, 2021

All pages that show data from web::releases::get_releases are insanely slow (>1s).
This PR fixes this.

Local tests and the EXPLAIN output look good now. I could only test with fake data locally, but the behaviour will likely be the same on prod.

The behaviour is unchanged for the views sorted by release time. For the views sorted by GitHub stars, we no longer show releases whose repository is not on GitHub.

I also restructured the related tests a bit; I'm happy to revert if you don't like it. In my mind it fits better when each test checks a single thing, with a visible structure of input data, execution, and assertion.

old query plan, releases by github stars

┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                        QUERY PLAN                                                                         │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit  (cost=20504.68..20504.69 rows=1 width=557) (actual time=4157.308..4157.910 rows=25 loops=1)                                                        │
│   ->  Sort  (cost=20504.68..20504.69 rows=1 width=557) (actual time=4157.288..4157.596 rows=25 loops=1)                                                   │
│         Sort Key: github_repos.stars DESC NULLS LAST                                                                                                      │
│         Sort Method: top-N heapsort  Memory: 27kB                                                                                                         │
│         ->  Nested Loop Left Join  (cost=3099.40..20504.67 rows=1 width=557) (actual time=866.612..3749.925 rows=51776 loops=1)                           │
│               Join Filter: ((releases.github_repo)::text = (github_repos.id)::text)                                                                       │
│               Rows Removed by Join Filter: 51761                                                                                                          │
│               ->  Gather  (cost=3099.40..20503.65 rows=1 width=582) (actual time=866.550..1314.885 rows=51776 loops=1)                                    │
│                     Workers Planned: 2                                                                                                                    │
│                     Workers Launched: 2                                                                                                                   │
│                     ->  Hash Join  (cost=2099.40..19503.55 rows=1 width=582) (actual time=869.435..2655.396 rows=17259 loops=3)                           │
│                           Hash Cond: ((releases.crate_id = crates.id) AND (releases.id = crates.latest_version_id))                                       │
│                           ->  Parallel Seq Scan on releases  (cost=0.00..16716.26 rows=131026 width=579) (actual time=0.095..814.734 rows=104821 loops=3) │
│                           ->  Hash  (cost=1322.76..1322.76 rows=51776 width=19) (actual time=868.599..868.616 rows=51776 loops=3)                         │
│                                 Buckets: 65536  Batches: 1  Memory Usage: 3198kB                                                                          │
│                                 ->  Seq Scan on crates  (cost=0.00..1322.76 rows=51776 width=19) (actual time=0.040..429.363 rows=51776 loops=3)          │
│               ->  Seq Scan on github_repos  (cost=0.00..1.01 rows=1 width=33) (actual time=0.008..0.016 rows=1 loops=51776)                               │
│ Planning Time: 1.205 ms                                                                                                                                   │
│ Execution Time: 4158.608 ms                                                                                                                               │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(19 rows)

new query plan, releases by github stars

┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                            QUERY PLAN                                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit  (cost=1.14..137.06 rows=25 width=557) (actual time=0.039..0.381 rows=25 loops=1)                                                                          │
│   ->  Nested Loop  (cost=1.14..283513.21 rows=52146 width=557) (actual time=0.038..0.377 rows=25 loops=1)                                                        │
│         ->  Nested Loop  (cost=0.84..178928.89 rows=316801 width=550) (actual time=0.032..0.283 rows=85 loops=1)                                                 │
│               ->  Index Scan using github_repos_stars_idx on github_repos  (cost=0.42..10360.44 rows=316801 width=22) (actual time=0.014..0.028 rows=85 loops=1) │
│                     Index Cond: (stars IS NOT NULL)                                                                                                              │
│               ->  Index Scan using releases_github_repo_idx on releases  (cost=0.42..0.52 rows=1 width=564) (actual time=0.002..0.002 rows=1 loops=85)           │
│                     Index Cond: ((github_repo)::text = (github_repos.id)::text)                                                                                  │
│         ->  Index Scan using crates_latest_version_idx on crates  (cost=0.29..0.32 rows=1 width=15) (actual time=0.001..0.001 rows=0 loops=85)                   │
│               Index Cond: (latest_version_id = releases.id)                                                                                                      │
│ Planning Time: 0.481 ms                                                                                                                                          │
│ Execution Time: 0.419 ms                                                                                                                                         │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(11 rows)
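
For readers who want to reproduce a plan like the one above, a query of roughly the following shape should do it. This is a sketch reconstructed from the join and index conditions visible in the plan, not the query from the PR; the selected columns and the exact WHERE clause are assumptions.

```sql
-- Sketch only: reconstructed from the plan above, not copied from the PR.
-- The column list and the WHERE clause are assumptions.
EXPLAIN ANALYZE
SELECT releases.id, github_repos.stars
FROM github_repos
INNER JOIN releases ON releases.github_repo = github_repos.id
INNER JOIN crates ON crates.latest_version_id = releases.id
WHERE github_repos.stars IS NOT NULL
ORDER BY github_repos.stars DESC
LIMIT 25;
```

The inner join on github_repos is what drops releases without a GitHub repository. Ordering on the indexed stars column should let PostgreSQL stop after 25 matching rows instead of sorting the whole join result.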

old query plan, releases by release-time


┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                        QUERY PLAN                                                                         │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit  (cost=20504.68..20504.69 rows=1 width=557) (actual time=3248.766..3249.370 rows=25 loops=1)                                                        │
│   ->  Sort  (cost=20504.68..20504.69 rows=1 width=557) (actual time=3248.752..3249.080 rows=25 loops=1)                                                   │
│         Sort Key: releases.release_time DESC NULLS LAST                                                                                                   │
│         Sort Method: top-N heapsort  Memory: 26kB                                                                                                         │
│         ->  Nested Loop Left Join  (cost=3099.40..20504.67 rows=1 width=557) (actual time=652.129..2926.578 rows=51776 loops=1)                           │
│               Join Filter: ((releases.github_repo)::text = (github_repos.id)::text)                                                                       │
│               Rows Removed by Join Filter: 51761                                                                                                          │
│               ->  Gather  (cost=3099.40..20503.65 rows=1 width=582) (actual time=652.087..995.741 rows=51776 loops=1)                                     │
│                     Workers Planned: 2                                                                                                                    │
│                     Workers Launched: 2                                                                                                                   │
│                     ->  Hash Join  (cost=2099.40..19503.55 rows=1 width=582) (actual time=649.110..2065.535 rows=17259 loops=3)                           │
│                           Hash Cond: ((releases.crate_id = crates.id) AND (releases.id = crates.latest_version_id))                                       │
│                           ->  Parallel Seq Scan on releases  (cost=0.00..16716.26 rows=131026 width=579) (actual time=0.122..655.740 rows=104821 loops=3) │
│                           ->  Hash  (cost=1322.76..1322.76 rows=51776 width=19) (actual time=648.415..648.438 rows=51776 loops=3)                         │
│                                 Buckets: 65536  Batches: 1  Memory Usage: 3198kB                                                                          │
│                                 ->  Seq Scan on crates  (cost=0.00..1322.76 rows=51776 width=19) (actual time=0.030..322.152 rows=51776 loops=3)          │
│               ->  Seq Scan on github_repos  (cost=0.00..1.01 rows=1 width=33) (actual time=0.006..0.013 rows=1 loops=51776)                               │
│ Planning Time: 2.768 ms                                                                                                                                   │
│ Execution Time: 3249.898 ms                                                                                                                               │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

new query plan, releases by release-time

┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                            QUERY PLAN                                                                            │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit  (cost=1.01..69.63 rows=25 width=557) (actual time=0.265..1.983 rows=25 loops=1)                                                                           │
│   ->  Nested Loop Left Join  (cost=1.01..143130.67 rows=52146 width=557) (actual time=0.264..1.975 rows=25 loops=1)                                              │
│         ->  Nested Loop  (cost=0.59..116157.63 rows=52146 width=571) (actual time=0.252..1.214 rows=25 loops=1)                                                  │
│               ->  Index Scan using releases_release_time_idx on releases  (cost=0.30..11573.32 rows=316801 width=564) (actual time=0.030..0.074 rows=85 loops=1) │
│                     Index Cond: (release_time IS NOT NULL)                                                                                                       │
│               ->  Index Scan using crates_latest_version_idx on crates  (cost=0.29..0.32 rows=1 width=15) (actual time=0.011..0.013 rows=0 loops=85)             │
│                     Index Cond: (latest_version_id = releases.id)                                                                                                │
│         ->  Index Scan using github_repos_pkey on github_repos  (cost=0.42..0.52 rows=1 width=22) (actual time=0.029..0.029 rows=1 loops=25)                     │
│               Index Cond: ((id)::text = (releases.github_repo)::text)                                                                                            │
│ Planning Time: 1.614 ms                                                                                                                                          │
│ Execution Time: 2.046 ms                                                                                                                                         │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(11 rows)

@syphar
Member Author

syphar commented Jan 2, 2021

I'm not sure which way to go now, probably a decision for the maintainers.

Easiest approach is reducing the results for the stars-views.

@Nemo157
Member

Nemo157 commented Jan 4, 2021

Can we make releases.release_time non-nullable? We appear to just default it to now if we fail to get it from crates.io.

@syphar
Member Author

syphar commented Jan 4, 2021

> Can we make releases.release_time non-nullable? We appear to just default it to now if we fail to get it from crates.io.

I'll check it

@Nemo157
Member

Nemo157 commented Jan 4, 2021

It actually is already NOT NULL; the NULLS LAST is only applied when sorting by stars (but why is sorting by release_time so slow then 🤔).
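
One way to confirm the constraint on a given database is to ask the catalog. A minimal sketch using the standard information_schema.columns view; the table and column names are the ones discussed in this thread.

```sql
-- Check whether releases.release_time carries a NOT NULL constraint.
SELECT table_schema, column_name, is_nullable
FROM information_schema.columns
WHERE table_name = 'releases'
  AND column_name = 'release_time';
-- is_nullable = 'NO' means the column is known to be non-nullable.
```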

@syphar
Member Author

syphar commented Jan 4, 2021

NULLS LAST is also used for the release time.

Also, some other indexes were missing.

@syphar
Member Author

syphar commented Jan 4, 2021

For example, one on releases.github_repo, which is used for the join to github_repos.id.

I'll finish everything up, then you'll see.
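
To see which indexes already exist on the tables involved in these queries, the pg_indexes system view can be queried. A minimal sketch; it assumes the tables live in the public schema.

```sql
-- List existing indexes and their definitions for the three tables.
-- Assumption: the tables are in the "public" schema.
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public'
  AND tablename IN ('crates', 'releases', 'github_repos')
ORDER BY tablename, indexname;
```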

@syphar syphar marked this pull request as ready for review January 4, 2021 12:48
@syphar
Member Author

syphar commented Jan 4, 2021

@Nemo157 could be ready for another round of checks :)

@syphar syphar changed the title from "fix performance regression in all releases-views (Draft for now)" to "fix performance regression in all releases-views" Jan 4, 2021
@Nemo157
Member

Nemo157 commented Jan 4, 2021

From a very brief skim it looks good. I'm too 😪 to properly review the tests today, though; I'll take a look tomorrow if jyn514 doesn't before then.

"create indexes for crates, github_repos and releases",
// upgrade
"
CREATE INDEX crates_latest_version_idx ON crates (latest_version_id);
Member

latest_version_id is denormalized data, I think. We should be able to recalculate it from release_time, although I guess if we ever fix #708 that would have to change.

No need to change it here, mostly just musing.
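
If the field ever needs to be checked against release_time, a query along these lines could list crates where the two disagree. It is a sketch that assumes "latest" means the release with the greatest release_time per crate, which may not match how the field is actually maintained. It uses only columns that appear in the query plans above.

```sql
-- Sketch: find crates whose latest_version_id is not their newest release.
-- Assumption: "newest" is defined purely by release_time (ties broken by id).
SELECT
    c.id AS crate_id,
    c.latest_version_id,
    newest.id AS newest_release_id
FROM crates AS c
JOIN (
    -- DISTINCT ON keeps the first row per crate_id in the given sort order.
    SELECT DISTINCT ON (crate_id) crate_id, id
    FROM releases
    ORDER BY crate_id, release_time DESC, id DESC
) AS newest ON newest.crate_id = c.id
WHERE c.latest_version_id IS DISTINCT FROM newest.id;
```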

Member Author

I checked: latest_version_id is set in db::add_package::add_package_into_database.

So we should be fine.

If we ever change that for #708 (which seems to be not only a technical change, but also a change in the meaning of the lists), we will either drop the field (which will make the tests here fail) or fill the field with the (then) correct version.

Member Author

In the end, we already used the field in the WHERE clause before, so that doesn't change :)

Comment on lines +612 to +613
CREATE INDEX releases_github_repo_idx ON releases (github_repo);
CREATE INDEX github_repos_stars_idx ON github_repos(stars DESC);
Member

Is there a way to measure how much disk space these take up? Anything less than ~500 MB is probably fine if it speeds up queries this much.

Member Author

Full index size for the tables locally (all indexes):

cratesfyi=# \e
┌──────────────┬───────────────┬──────────────┐
│     name     │ relation_size │ indexes_size │
├──────────────┼───────────────┼──────────────┤
│ crates       │ 6488 kB       │ 8184 kB      │
│ releases     │ 39 MB         │ 19 MB        │
│ github_repos │ 29 MB         │ 19 MB        │
└──────────────┴───────────────┴──────────────┘

If you want to check production, you can run this (quickly hacked together; there is likely a nicer way, probably one that also breaks it down by index):

SELECT
    name,
    pg_size_pretty(pg_relation_size(name)) AS relation_size,
    pg_size_pretty(pg_indexes_size(name)) AS indexes_size
FROM (
    SELECT 'crates' AS name UNION ALL
    SELECT 'releases' AS name UNION ALL
    SELECT 'github_repos' AS name
) AS ii;
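
A per-index breakdown is also possible. The sketch below uses the pg_stat_user_indexes statistics view together with pg_relation_size; it was not part of the PR and assumes the same three table names as the query above.

```sql
-- Size of each individual index on the three tables, largest first.
SELECT
    relname AS table_name,
    indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname IN ('crates', 'releases', 'github_repos')
ORDER BY pg_relation_size(indexrelid) DESC;
```

Running it before and after the migration would show how much disk space each new index adds.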

Member

     name     | relation_size | indexes_size 
--------------+---------------+--------------
 crates       | 2872 kB       | 4336 kB
 releases     | 398 MB        | 23 MB
 github_repos | 4392 kB       | 1584 kB
(3 rows)

Yeah that seems fine.

@Nemo157
Member

Nemo157 commented Jan 5, 2021

LGTM :shipit:

@jyn514 jyn514 added the S-waiting-on-deploy This PR is ready to be merged, but is waiting for an admin to have time to deploy it label Jan 7, 2021
@jyn514 jyn514 merged commit ca8176c into rust-lang:master Jan 7, 2021
@syphar syphar deleted the releases-speed branch January 8, 2021 06:42
@pietroalbini pietroalbini removed the S-waiting-on-deploy This PR is ready to be merged, but is waiting for an admin to have time to deploy it label Jan 11, 2021