Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Big performance boost on faceting: skip the inner order by #1394

Closed
simonw opened this issue Jul 14, 2021 · 4 comments
Closed

Big performance boost on faceting: skip the inner order by #1394

simonw opened this issue Jul 14, 2021 · 4 comments

Comments

@simonw
Copy link
Owner

simonw commented Jul 14, 2021

I just noticed something that could make for a huge performance improvement in faceting.

The default query used by Datasette when faceting looks like this:

select
  country_long,
  count(*)
from (
  select * from [global-power-plants] order by rowid
)
where
  country_long is not null
group by
  country_long
order by
  count(*) desc

Here it takes 53ms: https://global-power-plants.datasettes.com/global-power-plants?sql=select%0D%0A++country_long%2C%0D%0A++count%28*%29%0D%0Afrom+%28%0D%0A++select+*+from+%5Bglobal-power-plants%5D+order+by+rowid%0D%0A%29%0D%0Awhere%0D%0A++country_long+is+not+null%0D%0Agroup+by%0D%0A++country_long%0D%0Aorder+by%0D%0A++count%28*%29+desc

Note that there's a order by rowid in there which isn't necessary - the order on that inner query doesn't matter since we're grouping and counting.

I had assumed SQLite would optimize this away - but it turns out it doesn't! Consider this version of the query, with that pointless order by removed:

select
  country_long,
  count(*)
from (
  select * from [global-power-plants]
)
where
  country_long is not null
group by
  country_long
order by
  count(*) desc

https://global-power-plants.datasettes.com/global-power-plants?sql=select%0D%0A++country_long%2C%0D%0A++count%28*%29%0D%0Afrom+%28%0D%0A++select+*+from+%5Bglobal-power-plants%5D%0D%0A%29%0D%0Awhere%0D%0A++country_long+is+not+null%0D%0Agroup+by%0D%0A++country_long%0D%0Aorder+by%0D%0A++count%28*%29+desc runs in 7.2ms!

I tried this optimization on a table with 2.5m rows in it - without the optimization it took 5 seconds, with the optimization it took 450ms. So this is a very significant improvement!

@simonw
Copy link
Owner Author

simonw commented Jul 14, 2021

The challenge here is that faceting doesn't currently modify the inner SQL at all - it wraps it so that it can work against any SQL statement (though Datasette itself does not yet take advantage of that ability, only offering faceting on table pages).

So just removing the order by wouldn't be appropriate if the inner query looked something like this:

select * from items order by created desc limit 100

Since the intent there would be to return facet counts against only the most recent 100 items.

In SQLite the limit has to come after the order by though, so the fix here could be as easy as using a regular expression to identify queries that end with order by COLUMN (desc)? and stripping off that clause.

@simonw
Copy link
Owner Author

simonw commented Jul 15, 2021

I wrote this code:

_order_by_re = re.compile(r"(^.*) order by [a-zA-Z_][a-zA-Z0-9_]+( desc)?$", re.DOTALL)
_order_by_braces_re = re.compile(r"(^.*) order by \[[^\]]+\]( desc)?$", re.DOTALL)


def strip_order_by(sql):
    for regex in (_order_by_re, _order_by_braces_re):
        match = regex.match(sql)
        if match is not None:
            return match.group(1)
    return sql

@pytest.mark.parametrize(
    "sql,expected",
    [
        ("blah", "blah"),
        ("select * from foo", "select * from foo"),
        ("select * from foo order by bah", "select * from foo"),
        ("select * from foo order by bah desc", "select * from foo"),
        ("select * from foo order by [select]", "select * from foo"),
        ("select * from foo order by [select] desc", "select * from foo"),
    ],
)
def test_strip_order_by(sql, expected):
    assert strip_order_by(sql) == expected

But it turns out I don't need it! The SQL that is passed to the facet class is created by this code:

sql_no_limit = (
"select {select_all_columns} from {table_name} {where}{order_by}".format(
select_all_columns=select_all_columns,
table_name=escape_sqlite(table),
where=where_clause,
order_by=order_by,
)
)

And the only place that uses that sql_no_limit variable is here:

for klass in facet_classes:
facet_instances.append(
klass(
self.ds,
request,
database,
sql=sql_no_limit,
params=params,
table=table,
metadata=table_metadata,
row_count=filtered_table_rows_count,
)
)

So I can change that to sql_no_limit_no_order and fix the bug that way instead.

@simonw simonw closed this as completed in a6c8e7f Jul 15, 2021
simonw added a commit to simonw/project-pelican that referenced this issue Jul 15, 2021
simonw added a commit that referenced this issue Jul 15, 2021
@simonw
Copy link
Owner Author

simonw commented Jul 15, 2021

Started a conversation about this on the SQLite forum: https://sqlite.org/forum/forumpost/2d76f2bcf65d256a?t=h

@simonw
Copy link
Owner Author

simonw commented Jul 16, 2021

Wrote about this in the annotated release notes for 0.58: https://simonwillison.net/2021/Jul/16/datasette-058/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant