-
-
Notifications
You must be signed in to change notification settings - Fork 691
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Big performance boost on faceting: skip the inner order by #1394
Comments
The challenge here is that faceting doesn't currently modify the inner SQL at all - it wraps it so that it can work against any SQL statement (though Datasette itself does not yet take advantage of that ability, only offering faceting on table pages). So just removing the order by wouldn't be appropriate if the inner query looked something like this: select * from items order by created desc limit 100 Since the intent there would be to return facet counts against only the most recent 100 items. In SQLite the |
I wrote this code: _order_by_re = re.compile(r"(^.*) order by [a-zA-Z_][a-zA-Z0-9_]+( desc)?$", re.DOTALL)
_order_by_braces_re = re.compile(r"(^.*) order by \[[^\]]+\]( desc)?$", re.DOTALL)
def strip_order_by(sql):
for regex in (_order_by_re, _order_by_braces_re):
match = regex.match(sql)
if match is not None:
return match.group(1)
return sql
@pytest.mark.parametrize(
"sql,expected",
[
("blah", "blah"),
("select * from foo", "select * from foo"),
("select * from foo order by bah", "select * from foo"),
("select * from foo order by bah desc", "select * from foo"),
("select * from foo order by [select]", "select * from foo"),
("select * from foo order by [select] desc", "select * from foo"),
],
)
def test_strip_order_by(sql, expected):
assert strip_order_by(sql) == expected But it turns out I don't need it! The SQL that is passed to the facet class is created by this code: datasette/datasette/views/table.py Lines 677 to 684 in ba11ef2
And the only place that uses that datasette/datasette/views/table.py Lines 733 to 745 in ba11ef2
So I can change that to |
Started a conversation about this on the SQLite forum: https://sqlite.org/forum/forumpost/2d76f2bcf65d256a?t=h |
Wrote about this in the annotated release notes for 0.58: https://simonwillison.net/2021/Jul/16/datasette-058/ |
I just noticed something that could make for a huge performance improvement in faceting.
The default query used by Datasette when faceting looks like this:
Here it takes 53ms: https://global-power-plants.datasettes.com/global-power-plants?sql=select%0D%0A++country_long%2C%0D%0A++count%28*%29%0D%0Afrom+%28%0D%0A++select+*+from+%5Bglobal-power-plants%5D+order+by+rowid%0D%0A%29%0D%0Awhere%0D%0A++country_long+is+not+null%0D%0Agroup+by%0D%0A++country_long%0D%0Aorder+by%0D%0A++count%28*%29+desc
Note that there's a
order by rowid
in there which isn't necessary - the order on that inner query doesn't matter since we're grouping and counting.I had assumed SQLite would optimize this away - but it turns out it doesn't! Consider this version of the query, with that pointless order by removed:
https://global-power-plants.datasettes.com/global-power-plants?sql=select%0D%0A++country_long%2C%0D%0A++count%28*%29%0D%0Afrom+%28%0D%0A++select+*+from+%5Bglobal-power-plants%5D%0D%0A%29%0D%0Awhere%0D%0A++country_long+is+not+null%0D%0Agroup+by%0D%0A++country_long%0D%0Aorder+by%0D%0A++count%28*%29+desc runs in 7.2ms!
I tried this optimization on a table with 2.5m rows in it - without the optimization it took 5 seconds, with the optimization it took 450ms. So this is a very significant improvement!
The text was updated successfully, but these errors were encountered: