Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

/devices page with filters cause DB load with thousands of devices #1501

Closed
jjcarstens opened this issue Sep 9, 2024 · 4 comments
Closed
Assignees
Labels
bug Something isn't working

Comments

@jjcarstens
Copy link
Collaborator

jjcarstens commented Sep 9, 2024

Describe the bug
First let me say the feature to support saved and shareable search params in #1458 is awesome! 🎉

But I do think it is causing us issues 😱

We've recently had a situation where our DB load was very high causing lots of alerts and these 2 queries were the big offenders:

SQL Things

Filtering /devices

SELECT 
d0."id", d0."org_id", d0."product_id", d0."deployment_id", d0."firmware_metadata", d0."identifier", d0."description", d0."updates_enabled", d0."tags", d0."deleted_at", d0."update_attempts", d0."updates_blocked_until", d0."connection_status", d0."connection_established_at", d0."connection_disconnected_at", d0."connection_last_seen_at", d0."connection_types", d0."connecting_code", d0."connection_metadata", d0."inserted_at", d0."updated_at", p2."id", p2."org_id", p2."name", p2."deleted_at", p2."delta_updatable", p2."inserted_at", p2."updated_at", d3."id", d3."firmware_id", d3."product_id", d3."org_id", d3."archive_id", d3."conditions", d3."device_failure_threshold", d3."device_failure_rate_seconds", d3."device_failure_rate_amount", d3."failure_threshold", d3."failure_rate_seconds", d3."failure_rate_amount", d3."is_active", d3."name", d3."healthy", d3."penalty_timeout_minutes", d3."connecting_code", d3."concurrent_updates", d3."total_updating_devices", d3."current_updated_devices", d3."inflight_update_expiration_minutes", d3."inserted_at", d3."updated_at", f4."id", f4."org_id", f4."product_id", f4."org_key_id", f4."architecture", f4."author", f4."delta_updatable", f4."description", f4."misc", f4."platform", f4."size", f4."upload_metadata", f4."uuid", f4."vcs_identifier", f4."version", f4."inserted_at", f4."updated_at", o1."id", o1."name", o1."deleted_at", o1."audit_log_days_to_keep", o1."inserted_at", o1."updated_at" 
FROM "devices" AS d0 
LEFT OUTER JOIN "orgs" AS o1 ON (o1."id" = d0."org_id") AND (o1."deleted_at" IS NULL) 
LEFT OUTER JOIN "products" AS p2 ON (p2."id" = d0."product_id") AND (p2."deleted_at" IS NULL) 
LEFT OUTER JOIN "deployments" AS d3 ON d3."id" = d0."deployment_id" 
LEFT OUTER JOIN "firmwares" AS f4 ON f4."id" = d3."firmware_id" 
WHERE (d0."org_id" = $1) AND (d0."product_id" = $2) AND (d0."deleted_at" IS NULL) AND (d0."identifier" ILIKE $3) 
ORDER BY d0."identifier" 
LIMIT $4 OFFSET $5

EXPLAIN ANALYZE with tags

                                                                                    QUERY PLAN                                                                                    
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..10249.86 rows=25 width=2107) (actual time=1455.228..1455.231 rows=0 loops=1)
   ->  Nested Loop Left Join  (cost=0.43..400548.42 rows=977 width=2107) (actual time=1455.227..1455.230 rows=0 loops=1)
         Join Filter: (f4.id = d3.firmware_id)
         ->  Nested Loop Left Join  (cost=0.43..399959.11 rows=977 width=839) (actual time=1455.226..1455.228 rows=0 loops=1)
               Join Filter: (d3.id = d0.deployment_id)
               ->  Nested Loop Left Join  (cost=0.43..399870.30 rows=977 width=646) (actual time=1455.226..1455.227 rows=0 loops=1)
                     Join Filter: (p2.id = d0.product_id)
                     ->  Nested Loop Left Join  (cost=0.43..399854.46 rows=977 width=594) (actual time=1455.225..1455.227 rows=0 loops=1)
                           Join Filter: (o1.id = d0.org_id)
                           ->  Index Scan using devices_identifier_index on devices d0  (cost=0.43..399838.54 rows=977 width=547) (actual time=1455.225..1455.225 rows=0 loops=1)
                                 Filter: ((deleted_at IS NULL) AND (org_id = 2) AND (product_id = 13) AND (array_to_string(tags, ','::text) ~~* '%howdy%'::text))
                                 Rows Removed by Filter: 1079528
                           ->  Materialize  (cost=0.00..1.27 rows=1 width=47) (never executed)
                                 ->  Seq Scan on orgs o1  (cost=0.00..1.26 rows=1 width=47) (never executed)
                                       Filter: ((deleted_at IS NULL) AND (id = 2))
                     ->  Materialize  (cost=0.00..1.18 rows=1 width=52) (never executed)
                           ->  Seq Scan on products p2  (cost=0.00..1.18 rows=1 width=52) (never executed)
                                 Filter: ((deleted_at IS NULL) AND (id = 13))
               ->  Materialize  (cost=0.00..1.09 rows=6 width=193) (never executed)
                     ->  Seq Scan on deployments d3  (cost=0.00..1.06 rows=6 width=193) (never executed)
         ->  Materialize  (cost=0.00..2.62 rows=41 width=1268) (never executed)
               ->  Seq Scan on firmwares f4  (cost=0.00..2.41 rows=41 width=1268) (never executed)
 Planning Time: 0.618 ms
 Execution Time: 1455.312 ms
(24 rows)

EXPLAIN ANALYZE with identifier

                                                                         QUERY PLAN                                                                          
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=102317.82..102320.74 rows=25 width=2107) (actual time=443.508..444.857 rows=0 loops=1)
   ->  Gather Merge  (cost=102317.82..102323.89 rows=52 width=2107) (actual time=443.507..444.856 rows=0 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=101317.79..101317.86 rows=26 width=2107) (actual time=436.906..436.910 rows=0 loops=3)
               Sort Key: d0.identifier
               Sort Method: quicksort  Memory: 25kB
               Worker 0:  Sort Method: quicksort  Memory: 25kB
               Worker 1:  Sort Method: quicksort  Memory: 25kB
               ->  Hash Left Join  (cost=6.35..101317.18 rows=26 width=2107) (actual time=436.860..436.864 rows=0 loops=3)
                     Hash Cond: (d0.product_id = p2.id)
                     ->  Hash Left Join  (cost=5.16..101315.64 rows=26 width=2055) (actual time=436.859..436.862 rows=0 loops=3)
                           Hash Cond: (d0.org_id = o1.id)
                           ->  Hash Left Join  (cost=3.89..101314.01 rows=26 width=2008) (actual time=436.859..436.861 rows=0 loops=3)
                                 Hash Cond: (d0.deployment_id = d3.id)
                                 ->  Parallel Seq Scan on devices d0  (cost=0.00..101310.01 rows=26 width=547) (actual time=436.859..436.859 rows=0 loops=3)
                                       Filter: ((deleted_at IS NULL) AND ((identifier)::text ~~* 'SQ%'::text) AND (org_id = 2) AND (product_id = 13))
                                       Rows Removed by Filter: 359843
                                 ->  Hash  (cost=3.81..3.81 rows=6 width=1461) (never executed)
                                       ->  Hash Right Join  (cost=1.14..3.81 rows=6 width=1461) (never executed)
                                             Hash Cond: (f4.id = d3.firmware_id)
                                             ->  Seq Scan on firmwares f4  (cost=0.00..2.41 rows=41 width=1268) (never executed)
                                             ->  Hash  (cost=1.06..1.06 rows=6 width=193) (never executed)
                                                   ->  Seq Scan on deployments d3  (cost=0.00..1.06 rows=6 width=193) (never executed)
                           ->  Hash  (cost=1.26..1.26 rows=1 width=47) (never executed)
                                 ->  Seq Scan on orgs o1  (cost=0.00..1.26 rows=1 width=47) (never executed)
                                       Filter: ((deleted_at IS NULL) AND (id = 2))
                     ->  Hash  (cost=1.18..1.18 rows=1 width=52) (never executed)
                           ->  Seq Scan on products p2  (cost=0.00..1.18 rows=1 width=52) (never executed)
                                 Filter: ((deleted_at IS NULL) AND (id = 13))
 Planning Time: 0.980 ms
 Execution Time: 445.040 ms
(32 rows)

Counting devices (somewhere?)

SELECT count('*') FROM "devices" AS d0 LEFT OUTER JOIN "orgs" AS o1 ON (o1."id" = d0."org_id") AND (o1."deleted_at" IS NULL) LEFT OUTER JOIN "products" AS p2 ON (p2."id" = d0."product_id") AND (p2."deleted_at" IS NULL) LEFT OUTER JOIN "deployments" AS d3 ON d3."id" = d0."deployment_id" LEFT OUTER JOIN "firmwares" AS f4 ON f4."id" = d3."firmware_id" WHERE (d0."org_id" = $1) AND (d0."product_id" = $2) AND (d0."deleted_at" IS NULL) AND (array_to_string(d0."tags", ',') ILIKE $3)

These seem to be coming from the /devices index page when filtering. My hunch is multiple people have left the /devices index page open in tabs and something is triggering the handle_params callback to cause research quite frequently. Though I really have no knowledge or proof to back that

To Reproduce (my guess?)
Steps to reproduce the behavior:

  1. Create thousands of device records
  2. Leave some tabs open filtering on at least identifier and/or tags
  3. Wait for Ecto queue time to start increasing
@jjcarstens jjcarstens added the bug Something isn't working label Sep 9, 2024
@joshk
Copy link
Collaborator

joshk commented Sep 9, 2024

In 99.99999% of cases, a few indexes will fix these issues.

First off, the second count('*') query is done for pagination purposes, to figure out how many pages of results there are. Once we fix the indexes it will fixes both queries.

Also, because of how LiveView works, these queries are being called twice each, once on dead view load, and once on mount. We can also fix this by only loading the data on mount using something like start_async, plus a nice 'loading devices' message. Its worth looking into once we see how fast we can make these queries.

I can also see, from checking the indexes in the DB, that we are missing two indexes for the devices table: org_id and product_id.

I think our big culprit is the recent change to the double wildcards for identifier. https://stackoverflow.com/questions/1566717/postgresql-like-query-performance-variations

We will need to optimize the index used for it to be performant.

@joshk
Copy link
Collaborator

joshk commented Sep 9, 2024

Reading over https://www.cybertec-postgresql.com/en/postgresql-more-performance-for-like-and-ilike-statements/, we would need to switch the index used by identifier from btree to gin, and add the pg_trgm extension.

CREATE EXTENSION pg_trgm;

@jjcarstens
Copy link
Collaborator Author

Updated with some EXPLAIN ANALYZE

Before we get too lost on a red-herring, we are currently deployed to 1684dcf which is before the double wildcard on identifier, so that won't be our only problem (though could be coming 😉 )

@joshk
Copy link
Collaborator

joshk commented Sep 10, 2024

I think #1503 and #1504 will fix this issue

@joshk joshk self-assigned this Sep 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants