Breaks on column called "Representative's Address" #43

simonw · 2023-11-04T21:03:14Z

Trying this against a CSV exported from https://www.regulations.gov/bulkdownload, I got an exception when visiting /-/edit-schema/data/tablename. The error was:

OperationalError: near "Address": syntax error

One of the columns in the table was called Representative's Address and I think the ' broke it.

The traceback highlighted examples_for_columns:

https://datasette-cloud.sentry.io/issues/4380834663/?project=2050376&query=is%3Aunresolved&referrer=issue-stream&stream_index=0

simonw · 2023-11-04T21:04:09Z

Function:

datasette-edit-schema/datasette_edit_schema/utils.py

Lines 92 to 125 in 05a155e

    
    def examples_for_columns(conn, table_name):
        columns = sqlite_utils.Database(conn)[table_name].columns_dict.keys()
        ctes = [f'rows as (select * from "{table_name}" limit 1000)']
        unions = []
        for i, column in enumerate(columns):
            ctes.append(
                f'col{i} as (select distinct "{column}" from rows '
                f'where ("{column}" is not null and "{column}" != "") limit 5)'
            )
            unions.append(f"select '{column}' as label, \"{column}\" as value from col{i}")
        ctes.append("strings as ({})".format("\nunion all\n".join(unions)))
        ctes.append(
            """
        truncated_strings as (
        select
            label,
            case
            when length(value) > 30 then substr(value, 1, 30) || '...'
            else value
            end as value
        from strings
        where typeof(value) != 'blob'
        )
        """
        )
        sql = (
            "with {ctes} ".format(ctes=",\n".join(ctes))
            + "select label, json_group_array(value) as examples "
            "from truncated_strings group by label"
        )
        output = {}
        for column, examples in conn.execute(sql).fetchall():
            output[column] = list(map(str, json.loads(examples)))
        return output

simonw · 2023-11-04T21:06:20Z

Here's the bug:

datasette-edit-schema/datasette_edit_schema/utils.py

Line 101 in 05a155e

unions.append(f"select '{column}' as label, \"{column}\" as value from col{i}")

Resulted in:

select 'photo's' as label, "photo's" as value from col4),
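
For illustration, here is a minimal sketch of one way to build that select so a ' in the column name cannot end the string literal early. It is not necessarily what the fix commit does. The helpers quote_literal and quote_identifier and the table t are hypothetical names made up for this example; they rely on the standard SQL rule that a quote character inside a quoted token is escaped by doubling it.

```python
import sqlite3


def quote_literal(value):
    # Hypothetical helper: wrap in single quotes, doubling any embedded '
    return "'" + value.replace("'", "''") + "'"


def quote_identifier(name):
    # Hypothetical helper: wrap in double quotes, doubling any embedded "
    return '"' + name.replace('"', '""') + '"'


conn = sqlite3.connect(":memory:")
column = "photo's"
conn.execute("create table t ({} text)".format(quote_identifier(column)))
conn.execute("insert into t values (?)", ("cat.jpg",))

# Same shape as the failing select, with both uses of the column name escaped
sql = "select {} as label, {} as value from t".format(
    quote_literal(column), quote_identifier(column)
)
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```

The label comes back as the original column name, so code that keys its output on the label would not need to change.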

simonw · 2023-11-04T21:09:30Z

Another error:

  File "/Users/simon/Dropbox/Development/datasette-edit-schema/datasette_edit_schema/utils.py", line 66, in potential_primary_keys
    cursor.execute(sql)
sqlite3.OperationalError: near "s": syntax error

simonw · 2023-11-04T21:13:57Z

datasette-edit-schema/datasette_edit_schema/utils.py

Lines 54 to 89 in 05a155e

    
    def potential_primary_keys(conn, table_name, columns, max_string_len=128):
        # First we run a query to check the max length of each column + if it has any nulls
        selects = []
        for column in columns:
            selects.append("max(length(\"{}\")) as 'maxlen.{}'".format(column, column))
            selects.append(
                "sum(case when \"{}\" is null then 1 else 0 end) as 'nulls.{}'".format(
                    column, column
                )
            )
        sql = 'select {} from "{}"'.format(", ".join(selects), table_name)
        cursor = conn.cursor()
        cursor.execute(sql)
        row = cursor.fetchone()
        potential_columns = []
        for i, column in enumerate(columns):
            maxlen = row[i * 2] or 0
            nulls = row[i * 2 + 1] or 0
            if maxlen < max_string_len and nulls == 0:
                potential_columns.append(column)
        if not potential_columns:
            return []
        # Count distinct values in each of our candidate columns
        selects = ["count(*) as _count"]
        for column in potential_columns:
            selects.append("count(distinct \"{}\") as 'distinct.{}'".format(column, column))
        sql = 'select {} from "{}"'.format(", ".join(selects), table_name)
        cursor.execute(sql)
        row = cursor.fetchone()
        count = row[0]
        potential_pks = []
        for i, column in enumerate(potential_columns):
            distinct = row[i + 1]
            if distinct == count:
                potential_pks.append(column)
        return potential_pks

That generated SQL like this:

select count(*) as _count,
count(distinct "Document ID") as 'distinct.Document ID',
count(distinct "Agency ID") as 'distinct.Agency ID',
count(distinct "Representative's Address") as 'distinct.Representative's Address'
from "lok-imd8-3w1z"
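
The problem in that SQL is the single-quoted alias: the ' in the column name closes it early. As a rough sketch (an assumption about one possible approach, not a record of the actual fix), the alias can be written as a double-quoted identifier with any embedded " doubled. The quote_identifier helper, the docs table and its rows are all hypothetical and exist only for this example.

```python
import sqlite3


def quote_identifier(name):
    # Hypothetical helper: wrap in double quotes, doubling any embedded "
    return '"' + name.replace('"', '""') + '"'


conn = sqlite3.connect(":memory:")
columns = ["Document ID", "Representative's Address"]
conn.execute(
    "create table docs ({})".format(
        ", ".join(quote_identifier(c) + " text" for c in columns)
    )
)
conn.executemany(
    "insert into docs values (?, ?)",
    [("a1", "1 Main St"), ("a2", "1 Main St")],
)

# Same query shape as above, but every alias is a double-quoted identifier
selects = ["count(*) as _count"]
for column in columns:
    selects.append(
        "count(distinct {}) as {}".format(
            quote_identifier(column), quote_identifier("distinct." + column)
        )
    )
sql = "select {} from docs".format(", ".join(selects))
row = conn.execute(sql).fetchone()
print(row)
```

Since potential_primary_keys reads its results by position (row[0], row[i + 1]), dropping the aliases altogether would be another option.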

Refs #43

simonw added the bug label Nov 4, 2023

simonw added a commit that referenced this issue Nov 4, 2023

Handle columns with ' in name in examples_for_columns, refs #43

e1f6108

simonw closed this as completed in 6cfb10a Nov 4, 2023

simonw added a commit that referenced this issue Nov 4, 2023

Release 0.7.1

1e48732

Refs #43
