Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_facet_array should work against views #448

Closed
simonw opened this issue May 3, 2019 · 12 comments
Closed

_facet_array should work against views #448

simonw opened this issue May 3, 2019 · 12 comments

Comments

@simonw
Copy link
Owner

simonw commented May 3, 2019

I created this view: https://json-view-facet-bug-demo-j7hipcg4aq-uc.a.run.app/russian-ads-8dbda00/ads_with_targets

CREATE VIEW ads_with_targets as select ads.*, json_group_array(targets.name) as target_names from ads
  join ad_targets on ad_targets.ad_id = ads.id
  join targets on ad_targets.target_id = targets.id
  group by ad_targets.ad_id

When I try to apply faceting by array it appears to work at first: https://json-view-facet-bug-demo-j7hipcg4aq-uc.a.run.app/russian-ads/ads_with_targets?_facet_array=target_names

But actually it's doing the wrong thing - the SQL for the facets uses rowid, but rowid is not present on views at all! These results are incorrect, and clicking to select a facet will fail to produce any rows: https://json-view-facet-bug-demo-j7hipcg4aq-uc.a.run.app/russian-ads/ads_with_targets?_facet_array=target_names&target_names__arraycontains=people_who_match%3Ainterests%3AAfrican-American+Civil+Rights+Movement+%281954%E2%80%9468%29

Here's the SQL it should be using when you select a facet (note that it does not use a rowid):

https://json-view-facet-bug-demo-j7hipcg4aq-uc.a.run.app/russian-ads?sql=select+*+from+ads_with_targets+where+id+in+%28%0D%0A++++++++++++select+ads_with_targets.id+from+ads_with_targets%2C+json_each%28ads_with_targets.target_names%29+j%0D%0A++++++++++++where+j.value+%3D+%3Ap0%0D%0A++++++++%29+limit+101&p0=people_who_match%3Ainterests%3ABlack+%28Color%29

So we need to do something a lot smarter here. I'm not sure what the fix will look like, or even if it's feasible given that views don't have a rowid to hook into so the JSON faceting SQL may have to be completely rewritten.

datasette publish cloudrun \
    russian-ads.db \
    --name json-view-facet-bug-demo \
    --branch master \
    --extra-options "--config sql_time_limit_ms:5000 --config facet_time_limit_ms:5000"
@simonw
Copy link
Owner Author

simonw commented May 3, 2019

It may be that some facet implementations (ArrayFacet in this case) need a way to detect if they are supported by the thing they are running against (must be a rowid table in this case) and avoid suggesting themselves if they are not compatible. This may require a change to the information we make available to the suggest() method (information passed to the Facet class constructor).

@simonw simonw added this to the Datasette 1.0 milestone Jun 6, 2020
@simonw
Copy link
Owner Author

simonw commented Nov 15, 2021

@simonw
Copy link
Owner Author

simonw commented Nov 15, 2021

Applied that fix to the arraycontains filter but I'm still getting bad results for the faceting:

russian-ads__ads_with_targets__172_rows_where_where_target_names_contains__people_who_match_interests_African-American_culture__and_datasette_—_pipenv_shell_▸_python_—_80×24

Should never get 182 results on a page that faceting against only 172 items.

@simonw
Copy link
Owner Author

simonw commented Nov 15, 2021

I think this code is wrong in the ArrayFacet class:

facet_sql = """
select j.value as value, count(*) as count from (
{sql}
) join json_each({col}) j
group by j.value order by count desc, value limit {limit}
""".format(
col=escape_sqlite(column), sql=self.sql, limit=facet_size + 1
)

@simonw
Copy link
Owner Author

simonw commented Nov 15, 2021

It looks like the problem here is that some of the tags occur more than once in the documents:

russian-ads__ads_with_targets__172_rows_where_where_target_names_contains__people_who_match_interests_African-American_culture_

So they get counted more than once, hence the 182 count for something that couldn't possibly return more than 172 documents.

@simonw
Copy link
Owner Author

simonw commented Nov 15, 2021

@simonw
Copy link
Owner Author

simonw commented Nov 16, 2021

This looks like it might work:

with inner as (
  select
    *
  from
    ads_with_targets
  where
    :p0 in (
      select
        value
      from
        json_each([ads_with_targets].[target_names])
    )
),
deduped_array_items as (
  select
    distinct j.value,
    inner.*
  from
    json_each([inner].[target_names]) j
    join inner
)
select
  value,
  count(*)
from
  deduped_array_items
group by
  value
order by
  count(*) desc

@simonw
Copy link
Owner Author

simonw commented Nov 16, 2021

It uses a CTE which were introduced in SQLite 3.8 - and AWS Lambda Python 3.9 still provides 3.7 - but I've checked and I can use pysqlite3-binary to work around that there so I'm OK relying on CTEs for this.

@simonw
Copy link
Owner Author

simonw commented Nov 16, 2021

I tried this and it seems to work correctly:

        for source_and_config in self.get_configs():
            config = source_and_config["config"]
            source = source_and_config["source"]
            column = config.get("column") or config["simple"]
            facet_sql = """
                with inner as ({sql}),
                deduped_array_items as (
                    select
                        distinct j.value,
                        inner.*
                    from
                        json_each([inner].{col}) j
                        join inner
                )
                select
                    value as value,
                    count(*) as count
                from
                    deduped_array_items
                group by
                    value
                order by
                    count(*) desc limit {limit}
            """.format(
                col=escape_sqlite(column), sql=self.sql, limit=facet_size + 1
            )

The queries are very slow though - I had to bump up to 2s time limit even against only a view returning 3,499 rows.

@simonw
Copy link
Owner Author

simonw commented Nov 16, 2021

Actually with the cache warmed up it looks like the facet query is taking 150ms which is good enough.

@simonw
Copy link
Owner Author

simonw commented Nov 16, 2021

Also note that this demo data is using a SQL view to create the JSON arrays - the view is defined as such:

CREATE VIEW ads_with_targets as
select
  ads.*,
  json_group_array(targets.name) as target_names
from
  ads
  join ad_targets on ad_targets.ad_id = ads.id
  join targets on ad_targets.target_id = targets.id
group by
  ad_targets.ad_id;

So running JSON faceting on top of that view is a pretty big ask!

@simonw simonw closed this as completed in 55024b5 Nov 16, 2021
@simonw
Copy link
Owner Author

simonw commented Nov 16, 2021

Tests are failing and I think it's because the facets come back in different orders, need a tie-breaker. https://github.com/simonw/datasette/runs/4219325197?check_suite_focus=true

simonw added a commit that referenced this issue Nov 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant