Big performance boost on faceting: skip the inner order by #1394

simonw · 2021-07-14T23:32:29Z

I just noticed something that could make for a huge performance improvement in faceting.

The default query used by Datasette when faceting looks like this:

select
  country_long,
  count(*)
from (
  select * from [global-power-plants] order by rowid
)
where
  country_long is not null
group by
  country_long
order by
  count(*) desc

Here it takes 53ms: https://global-power-plants.datasettes.com/global-power-plants?sql=select%0D%0A++country_long%2C%0D%0A++count%28*%29%0D%0Afrom+%28%0D%0A++select+*+from+%5Bglobal-power-plants%5D+order+by+rowid%0D%0A%29%0D%0Awhere%0D%0A++country_long+is+not+null%0D%0Agroup+by%0D%0A++country_long%0D%0Aorder+by%0D%0A++count%28*%29+desc

Note that there's a order by rowid in there which isn't necessary - the order on that inner query doesn't matter since we're grouping and counting.

I had assumed SQLite would optimize this away - but it turns out it doesn't! Consider this version of the query, with that pointless order by removed:

select
  country_long,
  count(*)
from (
  select * from [global-power-plants]
)
where
  country_long is not null
group by
  country_long
order by
  count(*) desc

https://global-power-plants.datasettes.com/global-power-plants?sql=select%0D%0A++country_long%2C%0D%0A++count%28*%29%0D%0Afrom+%28%0D%0A++select+*+from+%5Bglobal-power-plants%5D%0D%0A%29%0D%0Awhere%0D%0A++country_long+is+not+null%0D%0Agroup+by%0D%0A++country_long%0D%0Aorder+by%0D%0A++count%28*%29+desc runs in 7.2ms!

I tried this optimization on a table with 2.5m rows in it - without the optimization it took 5 seconds, with the optimization it took 450ms. So this is a very significant improvement!

The text was updated successfully, but these errors were encountered:

simonw · 2021-07-14T23:35:18Z

The challenge here is that faceting doesn't currently modify the inner SQL at all - it wraps it so that it can work against any SQL statement (though Datasette itself does not yet take advantage of that ability, only offering faceting on table pages).

So just removing the order by wouldn't be appropriate if the inner query looked something like this:

select * from items order by created desc limit 100

Since the intent there would be to return facet counts against only the most recent 100 items.

In SQLite the limit has to come after the order by though, so the fix here could be as easy as using a regular expression to identify queries that end with order by COLUMN (desc)? and stripping off that clause.

simonw · 2021-07-15T00:01:47Z

I wrote this code:

_order_by_re = re.compile(r"(^.*) order by [a-zA-Z_][a-zA-Z0-9_]+( desc)?$", re.DOTALL)
_order_by_braces_re = re.compile(r"(^.*) order by \[[^\]]+\]( desc)?$", re.DOTALL)


def strip_order_by(sql):
    for regex in (_order_by_re, _order_by_braces_re):
        match = regex.match(sql)
        if match is not None:
            return match.group(1)
    return sql

@pytest.mark.parametrize(
    "sql,expected",
    [
        ("blah", "blah"),
        ("select * from foo", "select * from foo"),
        ("select * from foo order by bah", "select * from foo"),
        ("select * from foo order by bah desc", "select * from foo"),
        ("select * from foo order by [select]", "select * from foo"),
        ("select * from foo order by [select] desc", "select * from foo"),
    ],
)
def test_strip_order_by(sql, expected):
    assert strip_order_by(sql) == expected

But it turns out I don't need it! The SQL that is passed to the facet class is created by this code:

datasette/datasette/views/table.py

Lines 677 to 684 in ba11ef2

    
           sql_no_limit = ( 
        
               "select {select_all_columns} from {table_name} {where}{order_by}".format( 
        
                   select_all_columns=select_all_columns, 
        
                   table_name=escape_sqlite(table), 
        
                   where=where_clause, 
        
                   order_by=order_by, 
        
               ) 
        
           )

And the only place that uses that sql_no_limit variable is here:

datasette/datasette/views/table.py

Lines 733 to 745 in ba11ef2

    
           for klass in facet_classes: 
        
               facet_instances.append( 
        
                   klass( 
        
                       self.ds, 
        
                       request, 
        
                       database, 
        
                       sql=sql_no_limit, 
        
                       params=params, 
        
                       table=table, 
        
                       metadata=table_metadata, 
        
                       row_count=filtered_table_rows_count, 
        
                   ) 
        
               )

So I can change that to sql_no_limit_no_order and fix the bug that way instead.

Refs simonw/datasette#1394

Refs #1365, #1371, #1377, #1384, #1387, #1388, #1389, #1394

simonw · 2021-07-15T17:58:03Z

Started a conversation about this on the SQLite forum: https://sqlite.org/forum/forumpost/2d76f2bcf65d256a?t=h

simonw · 2021-07-16T02:23:32Z

Wrote about this in the annotated release notes for 0.58: https://simonwillison.net/2021/Jul/16/datasette-058/

simonw added faceting performance labels Jul 14, 2021

simonw added a commit that referenced this issue Jul 15, 2021

Big performance boost for faceting, closes #1394

57c294e

simonw closed this as completed in a6c8e7f Jul 15, 2021

simonw added a commit to simonw/project-pelican that referenced this issue Jul 15, 2021

Use main branch to get faceting performance boost

75b3f96

Refs simonw/datasette#1394

simonw added a commit that referenced this issue Jul 15, 2021

Release 0.58

e27dd7c

Refs #1365, #1371, #1377, #1384, #1387, #1388, #1389, #1394

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Big performance boost on faceting: skip the inner order by #1394

Big performance boost on faceting: skip the inner order by #1394

simonw commented Jul 14, 2021

simonw commented Jul 14, 2021

simonw commented Jul 15, 2021

simonw commented Jul 15, 2021

simonw commented Jul 16, 2021

Big performance boost on faceting: skip the inner order by #1394

Big performance boost on faceting: skip the inner order by #1394

Comments

simonw commented Jul 14, 2021

simonw commented Jul 14, 2021

simonw commented Jul 15, 2021

simonw commented Jul 15, 2021

simonw commented Jul 16, 2021