Column names for SQL queries on CSV files should not be case sensitive #1710

OscarTHZhang · 2022-01-30T19:01:01Z

Describe the bug
If the column names in a CSV file are uppercase, then typing lowercase column names in SQL queries will result in an Error: Invalid identifier for schema.

To Reproduce
Here is a simple example program that runs a query on 2 tables.

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let mut ctx = ExecutionContext::new();
    ctx.register_csv("lineorder", "data/lineorder.csv", CsvReadOptions::new()).await?;
    ctx.register_csv("date", "data/date.csv", CsvReadOptions::new()).await?;

    let df = ctx.sql("
        select sum(lo_extendedprice * lo_discount) as revenue
        from lineorder,
            date
        where lo_orderdate = d_datekey
        and d_year = 1993
        and (lo_discount between 1 and 3)
        and lo_quantity < 25;
      ").await?;

    df.show().await?;
    Ok(())
}

This will result in an error:

Error: Plan("Invalid identifier '#lo_orderdate' for schema lineorder.LO_ORDERKEY, lineorder.LO_LINENUMBER, lineorder.LO_CUSTKEY, lineorder.LO_PARTKEY, lineorder.LO_SUPPKEY, lineorder.LO_ORDERDATE, lineorder.LO_ORDERPRIORITY, lineorder.LO_SHIPPRIORITY, lineorder.LO_QUANTITY, lineorder.LO_EXTENDEDPRICE, lineorder.LO_ORDTOTALPRICE, lineorder.LO_DISCOUNT, lineorder.LO_REVENUE, lineorder.LO_SUPPLYCOST, lineorder.LO_TAX, lineorder.LO_COMMITDATE, lineorder.LO_SHIPMODE, date.D_DATEKEY, date.D_DATE, date.D_DAYOFWEEK, date.D_MONTH, date.D_YEAR, date.D_YEARMONTHNUM, date.D_YEARMONTH, date.D_DAYNUMINWEEK, date.D_DAYNUMINMONTH, date.D_DAYNUMINYEAR, date.D_MONTHNUMINYEAR, date.D_WEEKNUMINYEAR, date.D_SELLINGSEASON, date.D_LASTDAYINWEEKFL, date.D_LASTDAYINMONTHFL, date.D_HOLIDAYFL, date.D_WEEKDAYFL")

Expected behavior
The column names should not be case-sensitive. The query should execute normally to produce the query result.

alamb · 2022-01-31T16:47:35Z

Thank you for the report @OscarTHZhang

mkmik · 2022-02-04T11:21:27Z

IIUC, Arrow schemas are case sensitive. This is true for CSV and for other sources (e.g. Parquet).

Not sure what's the right approach here. SQL engines are traditionally case insensitive, but here we have a SQL engine that builds on top of an existing schema model that doesn't follow that tradition.

alamb · 2022-02-04T12:05:12Z

Yes, Arrow schemas are case sensitive, and SQL itself has stranger semantics for case insensitivity.

Here is an example from postgres showing how SQL interprets mixed case identifiers. Basically, if the identifier is not double quoted, it is simply lower cased prior to processing.

create table bar ("foo" int, "Foo" int, "FoO" int);
CREATE TABLE

alamb=# insert into bar values (1,2,3);
INSERT 0 1

alamb=# select * from bar;
 foo | Foo | FoO 
-----+-----+-----
   1 |   2 |   3
(1 row)

alamb=# select foo from bar;
 foo 
-----
   1
(1 row)

-- Note result has foo, not Foo
alamb=# select Foo from bar;
 foo 
-----
   1
(1 row)

alamb=# select "Foo" from bar;
 Foo 
-----
   2
(1 row)

foo always matches the lower case column name "foo" even when it is not the first column

alamb=# create table baz("Foo" int, foo int);
CREATE TABLE
alamb=# insert into baz values (100,200);
INSERT 0 1
alamb=# select foo from baz;
 foo 
-----
 200
(1 row)

If you create a table, the identifier is also lower cased at create time

alamb=# create table blarg(Foo int);
CREATE TABLE

alamb=# select * from blarg;
 foo 
-----
(0 rows)

alamb · 2022-02-04T12:06:32Z

Apparently datafusion already does this conversion for function names

https://github.com/apache/arrow-datafusion/blob/940d4eb60e76a3d4062489e872bf241dbfe0031a/datafusion/src/sql/planner.rs#L1636-L1649

alamb · 2022-02-04T12:26:55Z

So my interpretation is "SQL should lowercase all identifiers unless they are double quoted"

So if the Arrow schema has fields foo and Foo, a query for select Foo should return the foo (lowercase) field, and a query for select "Foo" should return Foo (the capitalized one).
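The rule described above can be sketched in a few lines of Rust. This is only an illustration of the proposed semantics, not DataFusion's planner code; `normalize_ident` is a hypothetical helper name, and the sketch assumes the identifier arrives as a single token that is either bare or wrapped in double quotes.

```rust
/// Normalize a SQL identifier as proposed in this thread:
/// unquoted identifiers are lowercased, double-quoted ones keep their case.
/// Hypothetical helper, not a DataFusion API.
fn normalize_ident(raw: &str) -> String {
    let quoted = raw.len() >= 2 && raw.starts_with('"') && raw.ends_with('"');
    if quoted {
        // Strip the surrounding quotes and unescape doubled quotes ("" -> ").
        raw[1..raw.len() - 1].replace("\"\"", "\"")
    } else {
        // Bare identifier: fold to lowercase before any schema lookup.
        raw.to_lowercase()
    }
}
```

Under this sketch, `normalize_ident("Foo")` yields `foo`, while `normalize_ident("\"Foo\"")` yields `Foo`, which matches the postgres behaviour shown earlier.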

mkmik · 2022-02-04T13:52:02Z

tmp1=> create table bar ("Foo" int, "FoO" int);
CREATE TABLE
tmp1=> insert into bar values (2,3);
INSERT 0 1
tmp1=> select * from bar;
 Foo | FoO
-----+-----
   2 |   3
(1 row)

tmp1=> select foo from bar;
ERROR:  column "foo" does not exist
LINE 1: select foo from bar;
               ^
HINT:  Perhaps you meant to reference the column "bar.Foo".

mkmik · 2022-02-04T13:56:21Z

> If you create a table, the identifier is also lower cased at create time

The identifier is lowercased only if it is not quoted:

tmp1=> create table blarg(Foo int);
CREATE TABLE
tmp1=> \d blarg
               Table "public.blarg"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 foo    | integer |           |          |

tmp1=> create table blarg2("Foo" int);
CREATE TABLE
tmp1=> \d blarg2
               Table "public.blarg2"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 Foo    | integer |           |          |

To summarize, postgres schemas are case sensitive, but in the SQL syntax the identifiers are lowercased when unquoted.
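To make that summary concrete, here is a minimal Rust sketch of a lookup against a case-sensitive schema. It is an illustration under assumptions: the schema is modelled as a plain list of field names, and `resolve_column` is a hypothetical function, not something taken from postgres or DataFusion.

```rust
/// Look up a column in a case-sensitive schema (a list of field names)
/// after applying the "lowercase unless double quoted" rule to the identifier.
/// Returns the index of the matching field, if any. Hypothetical helper.
fn resolve_column(schema: &[&str], ident: &str) -> Option<usize> {
    let name = if ident.len() >= 2 && ident.starts_with('"') && ident.ends_with('"') {
        // Quoted: keep the case exactly as written.
        ident[1..ident.len() - 1].to_string()
    } else {
        // Unquoted: fold to lowercase.
        ident.to_lowercase()
    };
    // The schema comparison itself is exact and case sensitive.
    schema.iter().position(|field| *field == name)
}
```

With a schema of `["Foo", "FoO"]`, the unquoted `foo` finds nothing (the "column does not exist" case above), while `"Foo"` and `"FoO"` resolve to the first and second field respectively.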

mkmik · 2022-02-04T14:00:44Z

This issue however is about a CSV file that has a header that contains non-lowercased column names, and the expectation that datafusion will make them referenceable using lowercase identifiers.

Perhaps we could add an option to CsvReadOptions to lowercase all columns?

alamb · 2022-02-04T14:39:33Z

> Perhaps we could add an option to CsvReadOptions to lowercase all columns?

Indeed -- both changes (lowercasing identifiers when unquoted, and an option on CsvReadOptions if one is not already available) sound good to me.

mkmik · 2022-02-04T15:13:45Z

I factored out the identifier quoting issue into #1746.

I think this issue should be a feature request for datafusion to parse a CSV file while converting its column names to lowercase.
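As a rough idea of what such a feature would do to the schema, the sketch below lowercases the names in a CSV header line. It is not an existing DataFusion option: `lowercase_header` is a hypothetical helper, and it assumes a simple comma-separated header with no quoted fields or embedded commas.

```rust
/// Lowercase the column names found in a CSV header line.
/// Hypothetical helper: assumes a plain comma-separated header without
/// quoted fields, which a real CSV reader would have to handle.
fn lowercase_header(header: &str) -> Vec<String> {
    header
        .trim() // drop the trailing newline and surrounding whitespace
        .split(',')
        .map(|name| name.trim().to_lowercase())
        .collect()
}
```

One design caveat: lowercasing can make two distinct columns collide (for example, Foo and foo would both become foo), so an option like this would need a policy for duplicate names.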

OscarTHZhang · 2022-02-13T00:44:40Z

Looks like I can close this one

alamb · 2022-02-14T15:11:49Z

Thanks @OscarTHZhang. Note that, as @mkmik points out, we haven't changed how identifiers are named from CSV files. So if your CSV file has a column named Foo, you will have to use "Foo" to query it; foo will not work.

> This issue however is about a CSV file that has a header that contains non-lowercased column names, and the expectation that datafusion will make them referenceable using lowercase identifiers.
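For the original report, that means the uppercase column names have to be double quoted in the query text. A small Rust sketch of building such a predicate follows; `quote_ident` and `join_predicate` are hypothetical helpers for illustration (not DataFusion functions), and the escaping follows the standard SQL convention of doubling embedded double quotes.

```rust
/// Wrap a column name in double quotes so the SQL parser keeps its case.
/// Embedded double quotes are escaped by doubling them.
/// Hypothetical helper, not part of DataFusion.
fn quote_ident(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}

/// Build an equality predicate between two quoted column names,
/// e.g. the join condition from the query in the original report.
fn join_predicate(left: &str, right: &str) -> String {
    format!("{} = {}", quote_ident(left), quote_ident(right))
}
```

For example, `join_predicate("LO_ORDERDATE", "D_DATEKEY")` produces `"LO_ORDERDATE" = "D_DATEKEY"`, which can be substituted for the unquoted `lo_orderdate = d_datekey` condition.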

OscarTHZhang added the bug label Jan 30, 2022

houqp added the good first issue and help wanted labels Jan 30, 2022

OscarTHZhang closed this as completed Feb 13, 2022

alamb mentioned this issue May 2, 2022

Clarify in docs that Identifiers are made lower-case in SQL query #2374

Closed

trueleo mentioned this issue Feb 15, 2023

It is not possible to query fields with capitals names parseablehq/parseable#47

Closed

gruuya mentioned this issue Aug 7, 2024

Case-Sensitive Column Names in External Parquet Tables splitgraph/seafowl#571

Closed
