Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Breaks on column called "Representative's Address" #43

Closed
simonw opened this issue Nov 4, 2023 · 4 comments
Closed

Breaks on column called "Representative's Address" #43

simonw opened this issue Nov 4, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@simonw
Copy link
Owner

simonw commented Nov 4, 2023

Trying this against a CSV exported from https://www.regulations.gov/bulkdownload I got an exception visiting /-/edit-schema/data/tablename - error was:

OperationalError: near "Address": syntax error

One of the columns in the table was called Representative's Address and I think the ' broke it.

The traceback highlighted examples_for_columns:

CleanShot 2023-11-04 at 14 02 05@2x

https://datasette-cloud.sentry.io/issues/4380834663/?project=2050376&query=is%3Aunresolved&referrer=issue-stream&stream_index=0

@simonw simonw added the bug Something isn't working label Nov 4, 2023
@simonw
Copy link
Owner Author

simonw commented Nov 4, 2023

Function:

def examples_for_columns(conn, table_name):
columns = sqlite_utils.Database(conn)[table_name].columns_dict.keys()
ctes = [f'rows as (select * from "{table_name}" limit 1000)']
unions = []
for i, column in enumerate(columns):
ctes.append(
f'col{i} as (select distinct "{column}" from rows '
f'where ("{column}" is not null and "{column}" != "") limit 5)'
)
unions.append(f"select '{column}' as label, \"{column}\" as value from col{i}")
ctes.append("strings as ({})".format("\nunion all\n".join(unions)))
ctes.append(
"""
truncated_strings as (
select
label,
case
when length(value) > 30 then substr(value, 1, 30) || '...'
else value
end as value
from strings
where typeof(value) != 'blob'
)
"""
)
sql = (
"with {ctes} ".format(ctes=",\n".join(ctes))
+ "select label, json_group_array(value) as examples "
"from truncated_strings group by label"
)
output = {}
for column, examples in conn.execute(sql).fetchall():
output[column] = list(map(str, json.loads(examples)))
return output

@simonw
Copy link
Owner Author

simonw commented Nov 4, 2023

Here's the bug:

unions.append(f"select '{column}' as label, \"{column}\" as value from col{i}")

Resulted in:

select 'photo's' as label, "photo's" as value from col4),

@simonw
Copy link
Owner Author

simonw commented Nov 4, 2023

Another error:

  File "/Users/simon/Dropbox/Development/datasette-edit-schema/datasette_edit_schema/utils.py", line 66, in potential_primary_keys
    cursor.execute(sql)
sqlite3.OperationalError: near "s": syntax error

@simonw
Copy link
Owner Author

simonw commented Nov 4, 2023

def potential_primary_keys(conn, table_name, columns, max_string_len=128):
# First we run a query to check the max length of each column + if it has any nulls
selects = []
for column in columns:
selects.append("max(length(\"{}\")) as 'maxlen.{}'".format(column, column))
selects.append(
"sum(case when \"{}\" is null then 1 else 0 end) as 'nulls.{}'".format(
column, column
)
)
sql = 'select {} from "{}"'.format(", ".join(selects), table_name)
cursor = conn.cursor()
cursor.execute(sql)
row = cursor.fetchone()
potential_columns = []
for i, column in enumerate(columns):
maxlen = row[i * 2] or 0
nulls = row[i * 2 + 1] or 0
if maxlen < max_string_len and nulls == 0:
potential_columns.append(column)
if not potential_columns:
return []
# Count distinct values in each of our candidate columns
selects = ["count(*) as _count"]
for column in potential_columns:
selects.append("count(distinct \"{}\") as 'distinct.{}'".format(column, column))
sql = 'select {} from "{}"'.format(", ".join(selects), table_name)
cursor.execute(sql)
row = cursor.fetchone()
count = row[0]
potential_pks = []
for i, column in enumerate(potential_columns):
distinct = row[i + 1]
if distinct == count:
potential_pks.append(column)
return potential_pks

That generated SQL like this:

select count(*) as _count,
count(distinct "Document ID") as 'distinct.Document ID',
count(distinct "Agency ID") as 'distinct.Agency ID',
count(distinct "Representative's Address") as 'distinct.Representative's Address'
from "lok-imd8-3w1z"

@simonw simonw closed this as completed in 6cfb10a Nov 4, 2023
simonw added a commit that referenced this issue Nov 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant