Client unable to query entry number 1 in the API #3954

Closed
mlissner opened this issue Apr 8, 2024 · 16 comments · Fixed by #3957

@mlissner
Member

mlissner commented Apr 8, 2024

They're doing a pretty straightforward API request (don't click this, it's timing out):

https://www.courtlistener.com/api/rest/v3/docket-entries/?entry_number=1&docket__court=nysd

Either it's slow because of the join (I hope not) or because of the filter on the entry number. I suspect the latter, because that field doesn't have an index on it.
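A quick way to confirm what's indexed, via the standard pg_indexes view:

-- Sanity check: list the indexes that already exist on the table.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'search_docketentry';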

A simple solution would be to add an index on the field:

models.Index(
    fields=["entry_number"],
    name='entry_number_idx',
),

But I can't think of a reason why anybody would ever look up an entry other than #1, as above, so I'm thinking we could save some DB memory with:

models.Index(
    fields=["entry_number"],
    name='entry_number_idx',
    condition=Q(entry_number=1),
),

But then the cardinality of the index gets really small, making me wonder if it would do us much good. I admit I'm outside my comfort zone here.
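To get a feel for how selective the condition is, one could count the matching rows first (note this is itself a full scan, so only worth running off-peak):

-- Fraction of rows the partial index would cover; FILTER needs Postgres 9.4+.
SELECT
    count(*) FILTER (WHERE entry_number = 1) AS first_entries,
    count(*) AS total_entries
FROM search_docketentry;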

Next step:

  • Test the query before and after creating the second index; a before/after sketch follows below.

If performance is fixed by the index with the very narrow condition, great, we're off to the races.
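A sketch of that before/after test (EXPLAIN ANALYZE actually executes the query, so the "before" run will take the full slow time):

-- Before: capture the baseline plan and timing (this runs the query).
EXPLAIN ANALYZE
SELECT COUNT(*) FROM search_docketentry
WHERE entry_number = 1
  AND docket_id IN (SELECT id FROM search_docket WHERE court_id = 'nysd');

-- Create the conditioned index without blocking writes.
CREATE INDEX CONCURRENTLY entry_number_idx
ON search_docketentry (entry_number)
WHERE entry_number = 1;

-- After: re-run the EXPLAIN ANALYZE above and compare plans and timings.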

@mlissner
Member Author

mlissner commented Apr 8, 2024

Here are the two queries that are run when you do this request:

SELECT "search_court"."id", "search_court"."parent_court_id", "search_court"."pacer_court_id", "search_court"."pacer_has_rss_feed", "search_court"."pacer_rss_entry_types", "search_court"."date_last_pacer_contact", "search_court"."fjc_court_id", "search_court"."date_modified", "search_court"."in_use", "search_court"."has_opinion_scraper", "search_court"."has_oral_argument_scraper", "search_court"."position", "search_court"."citation_string", "search_court"."short_name", "search_court"."full_name", "search_court"."url", "search_court"."start_date", "search_court"."end_date", "search_court"."jurisdiction", "search_court"."notes" FROM "search_court" WHERE "search_court"."id" = 'nysd' LIMIT 21; args=('nysd',);
SELECT COUNT(*) AS "__count" FROM "search_docketentry" WHERE ("search_docketentry"."entry_number" = 1 AND "search_docketentry"."docket_id" IN (SELECT U0."id" FROM "search_docket" U0 WHERE U0."court_id" = 'nysd')); args=(Int8(1), 'nysd');

@mlissner
Member Author

mlissner commented Apr 8, 2024

And here's the EXPLAIN:

courtlistener=> EXPLAIN SELECT COUNT(*) AS "__count" FROM "search_docketentry" WHERE (  "search_docketentry"."entry_number" = 1 AND "search_docketentry"."docket_id" IN (
    SELECT U0."id" FROM "search_docket" U0 WHERE U0."court_id" = 'nysd'
  )
);
                                                                         QUERY PLAN                                                                          
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=10030541.37..10030541.38 rows=1 width=8)
   ->  Gather  (cost=10030541.16..10030541.37 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=10029541.16..10029541.17 rows=1 width=8)
               ->  Parallel Hash Join  (cost=147930.60..10029459.27 rows=32753 width=0)
                     Hash Cond: (search_docketentry.docket_id = u0.id)
                     ->  Parallel Seq Scan on search_docketentry  (cost=0.00..9843506.00 rows=3547304 width=4)
                           Filter: (entry_number = 1)
                     ->  Parallel Hash  (cost=143744.28..143744.28 rows=255146 width=4)
                           ->  Parallel Index Only Scan using search_dock_court_i_a043ae_idx on search_docket u0  (cost=0.56..143744.28 rows=255146 width=4)
                                 Index Cond: (court_id = 'nysd'::text)

That's for the count. I imagine there'd be a third query if we had some content in here.

Note that we're doing a Seq Scan on the search_docketentry table even though I created the index with the entry_number = 1 condition using this query:

CREATE INDEX CONCURRENTLY "entry_number_idx" ON "search_docketentry" ("entry_number") WHERE "entry_number" = 1;

So that index didn't do what I'd hoped. On the other hand, it occurs to me that we might not have updated the stats on this DB since it was created from a replica. Maybe we need to do that.
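One way to check, via the standard pg_stat_user_tables view:

-- If last_analyze and last_autoanalyze are NULL, the planner has never
-- gathered statistics for these tables since the replica was created.
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('search_docketentry', 'search_docket');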

@mlissner
Member Author

mlissner commented Apr 8, 2024

OK, currently running ANALYZE on the DB. We might want to make this a standard part of our replica process, since it:

Collects statistics about the contents of tables in the database, and stores the results in the pg_statistic system catalog. Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries.

https://www.postgresql.org/docs/current/sql-analyze.html
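If a database-wide ANALYZE turns out to be too slow to make routine, it can also be scoped to just the affected tables:

-- Analyze only the hot tables rather than the whole database.
ANALYZE VERBOSE search_docketentry;
ANALYZE VERBOSE search_docket;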

@mlissner
Member Author

mlissner commented Apr 8, 2024

Lame, no change. But I've updated the replication docs to run ANALYZE. Lesson learned.

Not sure of next steps. Probably need to either:

  1. Figure out an index that would help; or
  2. Figure out how to make Django generate a smarter query (possible??)

@mlissner
Member Author

mlissner commented Apr 9, 2024

@ERosendo or @albertisfu, do you have suggestions for next steps here? I'm trying to figure out whether this is even something we can solve.

@albertisfu
Contributor

Yeah, this one seems complicated. Perhaps we can try a composite index that includes docket_id, something like:

models.Index(
    fields=["entry_number", "docket_id"],
    name='entry_number_idx',
    condition=Q(entry_number=1),
)

Since the query uses both fields to retrieve the docket entries:
"search_docketentry"."entry_number" = 1 AND "search_docketentry"."docket_id" IN (...)

@mlissner
Member Author

mlissner commented Apr 9, 2024

Great point. I'll give that a try.

@mlissner
Member Author

mlissner commented Apr 9, 2024

OK, new index in place, and...

 Finalize Aggregate  (cost=10047819.26..10047819.27 rows=1 width=8)
   ->  Gather  (cost=10047819.04..10047819.25 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=10046819.04..10046819.05 rows=1 width=8)
               ->  Parallel Hash Join  (cost=164551.64..10046726.68 rows=36945 width=0)
                     Hash Cond: (search_docketentry.docket_id = u0.id)
                     ->  Parallel Seq Scan on search_docketentry  (cost=0.00..9843506.00 rows=3598492 width=4)
                           Filter: (entry_number = 1)
                     ->  Parallel Hash  (cost=159895.50..159895.50 rows=283771 width=4)
                           ->  Parallel Index Only Scan using search_dock_court_i_a043ae_idx on search_docket u0  (cost=0.56..159895.50 rows=283771 width=4)
                                 Index Cond: (court_id = 'nysd'::text)

So...no change.

@albertisfu
Contributor

albertisfu commented Apr 9, 2024

Got it. It's strange that the planner is not using the index, which seems to be due to how the query is structured.

I did try the following query:

EXPLAIN SELECT COUNT(*) AS "__count"
FROM "search_docketentry"
INNER JOIN "search_docket" ON "search_docketentry"."docket_id" = "search_docket"."id"
WHERE "search_docketentry"."entry_number" = 1
AND "search_docket"."court_id" = 'nysd';

Which uses a JOIN instead of a subquery.

With this query, the composite index is used:

models.Index(
    fields=["entry_number", "docket_id"],
    name='entry_number_idx',
)

"Aggregate  (cost=8.41..8.42 rows=1 width=8)"
"  ->  Nested Loop  (cost=0.29..8.40 rows=2 width=0)"
"        ->  Index Only Scan using search_dock_court_i_a043ae_idx on search_docket  (cost=0.14..4.16 rows=1 width=4)"
"              Index Cond: (court_id = 'nysd'::text)"
"        ->  Index Only Scan using entry_number_idx on search_docketentry  (cost=0.15..4.21 rows=3 width=4)"
"              Index Cond: ((entry_number = 1) AND (docket_id = search_docket.id))"

However, note that the index that worked is the one without the entry_number = 1 condition. I also tested the conditioned index, and in that case it is not used; the planner falls back to the docket_id index alone.

@ERosendo
Contributor

ERosendo commented Apr 9, 2024

@albertisfu did you run that test in the dev DB or in your local DB?

@albertisfu
Contributor

I tested it on my local database, so it is not representative in terms of the number of records.

@ERosendo
Contributor

ERosendo commented Apr 9, 2024

@albertisfu thanks for clarifying!

In this scenario, the queries are generated by the serializer class, so we have limited control over the exact SQL. We can explore alternative approaches if we need finer control in the future.

I'm still hoping a VACUUM can resolve the issue.
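A targeted run would cover both concerns at once, since VACUUM maintains the visibility map that index-only scans rely on and ANALYZE refreshes the planner statistics:

-- Vacuum and re-analyze just the big table in one pass.
VACUUM (VERBOSE, ANALYZE) search_docketentry;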

@mlissner
Member Author

mlissner commented Apr 9, 2024

Well, I ran VACUUM and got:

courtlistener=> EXPLAIN SELECT COUNT(*) AS "__count" FROM "search_docketentry" WHERE ("search_docketentry"."entry_number" = 1 AND "search_docketentry"."docket_id" IN (SELECT U0."id" FROM "search_docket" U0 WHERE U0."court_id" = 'nysd'));
                                                                  QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=926629.48..926629.49 rows=1 width=8)
   ->  Gather  (cost=926629.27..926629.48 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=925629.27..925629.28 rows=1 width=8)
               ->  Merge Join  (cost=6.59..925538.36 rows=36364 width=0)
                     Merge Cond: (search_docketentry.docket_id = u0.id)
                     ->  Parallel Index Only Scan using entry_number_idx on search_docketentry  (cost=0.43..751069.76 rows=3541920 width=4)
                     ->  Index Only Scan using search_dock_court_i_a043ae_idx on search_docket u0  (cost=0.56..163868.29 rows=681050 width=4)
                           Index Cond: (court_id = 'nysd'::text)

Which is a lot better. Running it with an EXPLAIN ANALYZE took about 2.5s.

So either the vacuum alone or the new index plus the vacuum helped. (Notably, the new plan uses index-only scans, which depend on the visibility map that VACUUM maintains, so the vacuum itself plausibly mattered.) I had always assumed that tweaking vacuums by hand wasn't necessary; I guess it's time to start doing that.

@mlissner
Member Author

mlissner commented Apr 9, 2024

And just for completeness, I removed the index and tried the query again. It took 186s:

 Finalize Aggregate  (cost=10045179.49..10045179.50 rows=1 width=8) (actual time=186228.666..186265.362 rows=1 loops=1)
   ->  Gather  (cost=10045179.28..10045179.49 rows=2 width=8) (actual time=186225.954..186265.352 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=10044179.28..10044179.29 rows=1 width=8) (actual time=186219.153..186219.245 rows=1 loops=3)
               ->  Parallel Hash Join  (cost=164551.64..10044086.40 rows=37152 width=0) (actual time=185144.922..186212.839 rows=50207 loops=3)
                     Hash Cond: (search_docketentry.docket_id = u0.id)
                     ->  Parallel Seq Scan on search_docketentry  (cost=0.00..9840654.67 rows=3618702 width=4) (actual time=1.060..183194.963 rows=2942904 loops=3)
                           Filter: (entry_number = 1)
                           Rows Removed by Filter: 119907470
                     ->  Parallel Hash  (cost=159895.50..159895.50 rows=283771 width=4) (actual time=233.946..233.946 rows=208445 loops=3)
                           Buckets: 131072  Batches: 16  Memory Usage: 2592kB
                           ->  Parallel Index Only Scan using search_dock_court_i_a043ae_idx on search_docket u0  (cost=0.56..159895.50 rows=283771 width=4) (actual time=0.033..149.048 rows=208445 loops=3)
                                 Index Cond: (court_id = 'nysd'::text)
                                 Heap Fetches: 56469
 Planning Time: 0.826 ms
 Execution Time: 186265.414 ms

@mlissner
Member Author

mlissner commented Apr 9, 2024

Cool, so I think the way forward here is:

  1. Add the index

  2. See if it works by running the EXPLAIN ANALYZE command.

    I'm tempted to believe that replicas have trouble vacuuming properly, because they receive their data in ways that don't trigger autovacuum. Maybe it's better in prod.

  3. If that fails, tune the autovacuum settings. It appears the defaults may not work very well for busy or large tables; there are a few references on tuning them, and a per-table sketch follows below.
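For the per-table route, storage parameters along these lines are one option; the scale factors below are illustrative, not recommendations:

-- Hypothetical per-table autovacuum tuning: lower scale factors make
-- autovacuum and autoanalyze kick in after a smaller fraction of the
-- table changes, which suits very large tables.
ALTER TABLE search_docketentry SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_analyze_scale_factor = 0.01
);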

@mlissner
Member Author

OK, so this worked fine in prod after creating the index. I think that means we should run VACUUM ANALYZE on our replicas as a final step, since they seem to have stale statistics after being created. Progress.
