Client unable to query entry number 1 in the API #3954

Closed
mlissner opened this issue Apr 8, 2024 · 16 comments · Fixed by #3957

@mlissner
Member

mlissner commented Apr 8, 2024

They're doing a pretty straightforward API request (don't click this, it's timing out):

https://www.courtlistener.com/api/rest/v3/docket-entries/?entry_number=1&docket__court=nysd

Either it's slow because of the join (I hope not) or because of the filter on the entry number. I suspect the latter, because that field doesn't have an index on it.
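A quick way to confirm what's indexed, via the standard pg_indexes view:

-- Sanity check: list the indexes that already exist on the table.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'search_docketentry';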

A simple solution would be to add an index on the field:

models.Index(
    fields=["entry_number"],
    name='entry_number_idx',
),

But I can't think of a reason why anybody would ever look up an entry other than #1, as above, so I'm thinking we could save some DB memory with:

models.Index(
    fields=["entry_number"],
    name='entry_number_idx',
    condition=Q(entry_number=1),
),

But then the cardinality of the index gets really small, making me wonder if it would do us much good. I admit I'm outside my comfort zone here.
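To get a feel for how selective the condition is, one could count the matching rows first (note this is itself a full scan, so only worth running off-peak):

-- Fraction of rows the partial index would cover; FILTER needs Postgres 9.4+.
SELECT
    count(*) FILTER (WHERE entry_number = 1) AS first_entries,
    count(*) AS total_entries
FROM search_docketentry;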

Next step:

  • Test the query before and after creating the second index; a before/after sketch follows below.

If performance is fixed by the index with the very narrow condition, great, we're off to the races.
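A sketch of that before/after test (EXPLAIN ANALYZE actually executes the query, so the "before" run will take the full slow time):

-- Before: capture the baseline plan and timing (this runs the query).
EXPLAIN ANALYZE
SELECT COUNT(*) FROM search_docketentry
WHERE entry_number = 1
  AND docket_id IN (SELECT id FROM search_docket WHERE court_id = 'nysd');

-- Create the conditioned index without blocking writes.
CREATE INDEX CONCURRENTLY entry_number_idx
ON search_docketentry (entry_number)
WHERE entry_number = 1;

-- After: re-run the EXPLAIN ANALYZE above and compare plans and timings.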

@mlissner
Member Author

mlissner commented Apr 8, 2024

Here are the two queries that are run when you do this request:

SELECT "search_court"."id", "search_court"."parent_court_id", "search_court"."pacer_court_id", "search_court"."pacer_has_rss_feed", "search_court"."pacer_rss_entry_types", "search_court"."date_last_pacer_contact", "search_court"."fjc_court_id", "search_court"."date_modified", "search_court"."in_use", "search_court"."has_opinion_scraper", "search_court"."has_oral_argument_scraper", "search_court"."position", "search_court"."citation_string", "search_court"."short_name", "search_court"."full_name", "search_court"."url", "search_court"."start_date", "search_court"."end_date", "search_court"."jurisdiction", "search_court"."notes" FROM "search_court" WHERE "search_court"."id" = 'nysd' LIMIT 21; args=('nysd',);
SELECT COUNT(*) AS "__count" FROM "search_docketentry" WHERE ("search_docketentry"."entry_number" = 1 AND "search_docketentry"."docket_id" IN (SELECT U0."id" FROM "search_docket" U0 WHERE U0."court_id" = 'nysd')); args=(Int8(1), 'nysd');

@mlissner
Member Author

mlissner commented Apr 8, 2024

And here's the EXPLAIN:

courtlistener=> EXPLAIN SELECT COUNT(*) AS "__count" FROM "search_docketentry" WHERE (  "search_docketentry"."entry_number" = 1 AND "search_docketentry"."docket_id" IN (
    SELECT U0."id" FROM "search_docket" U0 WHERE U0."court_id" = 'nysd'
  )
);
                                                                         QUERY PLAN                                                                          
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=10030541.37..10030541.38 rows=1 width=8)
   ->  Gather  (cost=10030541.16..10030541.37 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=10029541.16..10029541.17 rows=1 width=8)
               ->  Parallel Hash Join  (cost=147930.60..10029459.27 rows=32753 width=0)
                     Hash Cond: (search_docketentry.docket_id = u0.id)
                     ->  Parallel Seq Scan on search_docketentry  (cost=0.00..9843506.00 rows=3547304 width=4)
                           Filter: (entry_number = 1)
                     ->  Parallel Hash  (cost=143744.28..143744.28 rows=255146 width=4)
                           ->  Parallel Index Only Scan using search_dock_court_i_a043ae_idx on search_docket u0  (cost=0.56..143744.28 rows=255146 width=4)
                                 Index Cond: (court_id = 'nysd'::text)

That's for the count. I imagine there'd be a third query if we had some content in here.

Note that we're doing a Seq Scan on the search_docketentry table even though I created the index with the entry_number = 1 condition using this query:

CREATE INDEX CONCURRENTLY "entry_number_idx" ON "search_docketentry" ("entry_number") WHERE "entry_number" = 1;

So that index didn't do what I'd hoped. On the other hand, it occurs to me that we might not have updated the stats on this DB since it was created from a replica. Maybe we need to do that.
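One way to check, via the standard pg_stat_user_tables view:

-- If last_analyze and last_autoanalyze are NULL, the planner has never
-- gathered statistics for these tables since the replica was created.
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('search_docketentry', 'search_docket');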

@mlissner
Member Author

mlissner commented Apr 8, 2024

OK, currently running ANALYZE on the DB. We might want to make this a standard part of our replica process, since it:

Collects statistics about the contents of tables in the database, and stores the results in the pg_statistic system catalog. Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries.

https://www.postgresql.org/docs/current/sql-analyze.html
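If a database-wide ANALYZE turns out to be too slow to make routine, it can also be scoped to just the affected tables:

-- Analyze only the hot tables rather than the whole database.
ANALYZE VERBOSE search_docketentry;
ANALYZE VERBOSE search_docket;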

@mlissner
Member Author

mlissner commented Apr 8, 2024

Lame, no change. But I've updated the replication docs to run ANALYZE. Lesson learned.

Not sure of next steps. Probably need to either:

  1. Figure out an index that would help; or
  2. Figure out how to make Django generate a smarter query (possible??)

@mlissner
Member Author

mlissner commented Apr 9, 2024

@ERosendo or @albertisfu, do you have suggestions for next steps here? I'm trying to figure out whether this is even something we can solve.

@albertisfu
Contributor

Yeah, this one seems complicated. Perhaps we can try a composite index that includes docket_id, something like:

models.Index(
    fields=["entry_number", "docket_id"],
    name='entry_number_idx',
    condition=Q(entry_number=1),
)

Since the query uses both fields to retrieve the docket entries:
"search_docketentry"."entry_number" = 1 AND "search_docketentry"."docket_id" IN (...)

@mlissner
Member Author

mlissner commented Apr 9, 2024

Great point. I'll give that a try.

@mlissner
Member Author

mlissner commented Apr 9, 2024

OK, new index in place, and...

 Finalize Aggregate  (cost=10047819.26..10047819.27 rows=1 width=8)
   ->  Gather  (cost=10047819.04..10047819.25 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=10046819.04..10046819.05 rows=1 width=8)
               ->  Parallel Hash Join  (cost=164551.64..10046726.68 rows=36945 width=0)
                     Hash Cond: (search_docketentry.docket_id = u0.id)
                     ->  Parallel Seq Scan on search_docketentry  (cost=0.00..9843506.00 rows=3598492 width=4)
                           Filter: (entry_number = 1)
                     ->  Parallel Hash  (cost=159895.50..159895.50 rows=283771 width=4)
                           ->  Parallel Index Only Scan using search_dock_court_i_a043ae_idx on search_docket u0  (cost=0.56..159895.50 rows=283771 width=4)
                                 Index Cond: (court_id = 'nysd'::text)

So...no change.

@albertisfu
Contributor

albertisfu commented Apr 9, 2024

Got it. It's strange that the planner is not using the index, which seems to be due to how the query is structured.

I did try the following query:

EXPLAIN SELECT COUNT(*) AS "__count"
FROM "search_docketentry"
INNER JOIN "search_docket" ON "search_docketentry"."docket_id" = "search_docket"."id"
WHERE "search_docketentry"."entry_number" = 1
AND "search_docket"."court_id" = 'nysd';

Which uses a JOIN instead of a subquery.

With this query, the composite index is used:

models.Index(
    fields=["entry_number", "docket_id"],
    name='entry_number_idx',
)

"Aggregate  (cost=8.41..8.42 rows=1 width=8)"
"  ->  Nested Loop  (cost=0.29..8.40 rows=2 width=0)"
"        ->  Index Only Scan using search_dock_court_i_a043ae_idx on search_docket  (cost=0.14..4.16 rows=1 width=4)"
"              Index Cond: (court_id = 'nysd'::text)"
"        ->  Index Only Scan using entry_number_idx on search_docketentry  (cost=0.15..4.21 rows=3 width=4)"
"              Index Cond: ((entry_number = 1) AND (docket_id = search_docket.id))"

However, note that the index that worked is the one without the entry_number = 1 condition. I also tested the conditioned index, and in that case it is not used; the planner falls back to the docket_id index alone.

@ERosendo
Contributor

ERosendo commented Apr 9, 2024

@albertisfu did you run that test in the dev DB or in your local DB?

@albertisfu
Contributor

I tested it on my local database, so it is not representative in terms of the number of records.

@ERosendo
Contributor

ERosendo commented Apr 9, 2024

@albertisfu thanks for clarifying!

In this scenario, the queries are generated by the serializer class, so we have limited control over the exact SQL. We can explore alternative approaches if we need finer control in the future.

I'm still hoping a VACUUM can resolve the issue.
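A targeted run would cover both concerns at once, since VACUUM maintains the visibility map that index-only scans rely on and ANALYZE refreshes the planner statistics:

-- Vacuum and re-analyze just the big table in one pass.
VACUUM (VERBOSE, ANALYZE) search_docketentry;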

@mlissner
Member Author

mlissner commented Apr 9, 2024

Well, I ran VACUUM and got:

courtlistener=> EXPLAIN SELECT COUNT(*) AS "__count" FROM "search_docketentry" WHERE ("search_docketentry"."entry_number" = 1 AND "search_docketentry"."docket_id" IN (SELECT U0."id" FROM "search_docket" U0 WHERE U0."court_id" = 'nysd'));
                                                                  QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=926629.48..926629.49 rows=1 width=8)
   ->  Gather  (cost=926629.27..926629.48 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=925629.27..925629.28 rows=1 width=8)
               ->  Merge Join  (cost=6.59..925538.36 rows=36364 width=0)
                     Merge Cond: (search_docketentry.docket_id = u0.id)
                     ->  Parallel Index Only Scan using entry_number_idx on search_docketentry  (cost=0.43..751069.76 rows=3541920 width=4)
                     ->  Index Only Scan using search_dock_court_i_a043ae_idx on search_docket u0  (cost=0.56..163868.29 rows=681050 width=4)
                           Index Cond: (court_id = 'nysd'::text)

Which is a lot better. Running it with an EXPLAIN ANALYZE took about 2.5s.

So either the vacuum alone or the new index plus the vacuum helped. (Notably, the new plan uses index-only scans, which depend on the visibility map that VACUUM maintains, so the vacuum itself plausibly mattered.) I had always assumed that tweaking vacuums by hand wasn't necessary; I guess it's time to start doing that.

@mlissner
Member Author

mlissner commented Apr 9, 2024

And just for completeness, I removed the index and tried the query again. It took 186s:

 Finalize Aggregate  (cost=10045179.49..10045179.50 rows=1 width=8) (actual time=186228.666..186265.362 rows=1 loops=1)
   ->  Gather  (cost=10045179.28..10045179.49 rows=2 width=8) (actual time=186225.954..186265.352 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=10044179.28..10044179.29 rows=1 width=8) (actual time=186219.153..186219.245 rows=1 loops=3)
               ->  Parallel Hash Join  (cost=164551.64..10044086.40 rows=37152 width=0) (actual time=185144.922..186212.839 rows=50207 loops=3)
                     Hash Cond: (search_docketentry.docket_id = u0.id)
                     ->  Parallel Seq Scan on search_docketentry  (cost=0.00..9840654.67 rows=3618702 width=4) (actual time=1.060..183194.963 rows=2942904 loops=3)
                           Filter: (entry_number = 1)
                           Rows Removed by Filter: 119907470
                     ->  Parallel Hash  (cost=159895.50..159895.50 rows=283771 width=4) (actual time=233.946..233.946 rows=208445 loops=3)
                           Buckets: 131072  Batches: 16  Memory Usage: 2592kB
                           ->  Parallel Index Only Scan using search_dock_court_i_a043ae_idx on search_docket u0  (cost=0.56..159895.50 rows=283771 width=4) (actual time=0.033..149.048 rows=208445 loops=3)
                                 Index Cond: (court_id = 'nysd'::text)
                                 Heap Fetches: 56469
 Planning Time: 0.826 ms
 Execution Time: 186265.414 ms

@mlissner
Member Author

mlissner commented Apr 9, 2024

Cool, so I think the way forward here is:

  1. Add the index

  2. See if it works by running the EXPLAIN ANALYZE command.

    I'm tempted to believe that replicas have trouble vacuuming properly, because they receive their data in ways that don't trigger autovacuum. Maybe it's better in prod.

  3. If that fails, tune the autovacuum settings. It appears the defaults may not work very well for busy or large tables; there are a few references on tuning them, and a per-table sketch follows below.
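For the per-table route, storage parameters along these lines are one option; the scale factors below are illustrative, not recommendations:

-- Hypothetical per-table autovacuum tuning: lower scale factors make
-- autovacuum and autoanalyze kick in after a smaller fraction of the
-- table changes, which suits very large tables.
ALTER TABLE search_docketentry SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_analyze_scale_factor = 0.01
);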

@mlissner
Member Author

OK, so this worked fine in prod after creating the index. I think that means we should run VACUUM ANALYZE on our replicas as a final step, since they seem to have stale statistics after being created. Progress.
