Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column names for SQL queries on CSV files should not be case sensitive #1710

Closed
OscarTHZhang opened this issue Jan 30, 2022 · 12 comments
Closed
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed

Comments

@OscarTHZhang
Copy link

Describe the bug
If the column names in a CSV file are uppercase, then typing lowercase column names in SQL queries will result in an Error: Invalid identifier for schema.

To Reproduce
Here is a simple example program that runs a query on 2 tables.

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let mut ctx = ExecutionContext::new();
    ctx.register_csv("lineorder", "data/lineorder.csv", CsvReadOptions::new()).await?;
    ctx.register_csv("date", "data/date.csv", CsvReadOptions::new()).await?;

    let df = ctx.sql("
        select sum(lo_extendedprice * lo_discount) as revenue
        from lineorder,
            date
        where lo_orderdate = d_datekey
        and d_year = 1993
        and (lo_discount between 1 and 3)
        and lo_quantity < 25;
      ").await?;

    df.show().await?;
    Ok(())
}

This will result in an error:

Error: Plan("Invalid identifier '#lo_orderdate' for schema lineorder.LO_ORDERKEY, lineorder.LO_LINENUMBER, lineorder.LO_CUSTKEY, lineorder.LO_PARTKEY, lineorder.LO_SUPPKEY, lineorder.LO_ORDERDATE, lineorder.LO_ORDERPRIORITY, lineorder.LO_SHIPPRIORITY, lineorder.LO_QUANTITY, lineorder.LO_EXTENDEDPRICE, lineorder.LO_ORDTOTALPRICE, lineorder.LO_DISCOUNT, lineorder.LO_REVENUE, lineorder.LO_SUPPLYCOST, lineorder.LO_TAX, lineorder.LO_COMMITDATE, lineorder.LO_SHIPMODE, date.D_DATEKEY, date.D_DATE, date.D_DAYOFWEEK, date.D_MONTH, date.D_YEAR, date.D_YEARMONTHNUM, date.D_YEARMONTH, date.D_DAYNUMINWEEK, date.D_DAYNUMINMONTH, date.D_DAYNUMINYEAR, date.D_MONTHNUMINYEAR, date.D_WEEKNUMINYEAR, date.D_SELLINGSEASON, date.D_LASTDAYINWEEKFL, date.D_LASTDAYINMONTHFL, date.D_HOLIDAYFL, date.D_WEEKDAYFL")

Expected behavior
The column names should not be case-sensitive. The query should execute normally to produce the query result.

Additional context
Add any other context about the problem here.

@OscarTHZhang OscarTHZhang added the bug Something isn't working label Jan 30, 2022
@houqp houqp added good first issue Good for newcomers help wanted Extra attention is needed labels Jan 30, 2022
@alamb
Copy link
Contributor

alamb commented Jan 31, 2022

Thank you for the report @OscarTHZhang

@mkmik
Copy link
Contributor

mkmik commented Feb 4, 2022

IIUC arrow schemas are case sensitive. This is true for CSV and for other sources (e.g. parquet).

Not sure what's the right approach here. SQL engines are traditionally case insensitive, but here we have a SQL engine that builds on top of an existing schema model that doesn't follow that tradition.

@alamb
Copy link
Contributor

alamb commented Feb 4, 2022

Yes, Arrow schema's are case sensitive and SQL itself has stranger semantics for case insensitive

Here is an example from postgres showing how SQL interprets mixed case identifiers. Basically if the identifier is not double quoted, it is is simply lower cased prior to processing

create table bar ("foo" int, "Foo" int, "FoO" int);
CREATE TABLE

alamb=# insert into bar values (1,2,3);
INSERT 0 1

alamb=# select * from bar;
 foo | Foo | FoO 
-----+-----+-----
   1 |   2 |   3
(1 row)

alamb=# select foo from bar;
 foo 
-----
   1
(1 row)

-- Note result has foo, not Foo
alamb=# select Foo from bar;
 foo 
-----
   1
(1 row)

alamb=# select "Foo" from bar;
 Foo 
-----
   2
(1 row)

foo always matches the lower case column name "foo" even when it is not the first column

alamb=# create table baz("Foo" int, foo int);
CREATE TABLE
alamb=# insert into baz values (100,200);
INSERT 0 1
alamb=# select foo from baz;
 foo 
-----
 200
(1 row)

If you create a table, the identifier is also lower cased at create time

alamb=# create table blarg(Foo int);
CREATE TABLE

alamb=# select * from blarg;
 foo 
-----
(0 rows)

@alamb
Copy link
Contributor

alamb commented Feb 4, 2022

@alamb
Copy link
Contributor

alamb commented Feb 4, 2022

So my interpretation is "SQL should lowercase all identifiers unless they are double quoted"

So if the arrow schema has foo, Foo fields, a query for select Foo should return the foo (lowercase) field. A query for select "Foo" should return Foo (the upper case one)

@mkmik
Copy link
Contributor

mkmik commented Feb 4, 2022

tmp1=> create table bar ("Foo" int, "FoO" int);
CREATE TABLE
tmp1=> insert into bar values (2,3);
INSERT 0 1
tmp1=> select * from bar;
 Foo | FoO
-----+-----
   2 |   3
(1 row)

tmp1=> select foo from bar;
ERROR:  column "foo" does not exist
LINE 1: select foo from bar;
               ^
HINT:  Perhaps you meant to reference the column "bar.Foo".

@mkmik
Copy link
Contributor

mkmik commented Feb 4, 2022

If you create a table, the identifier is also lower cased at create time

the identifier is lowercased only if not quoted.

tmp1=> create table blarg(Foo int);
CREATE TABLE
tmp1=> \d blarg
               Table "public.blarg"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 foo    | integer |           |          |

tmp1=> create table blarg2("Foo" int);
CREATE TABLE
tmp1=> \d blarg2
               Table "public.blarg2"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 Foo    | integer |           |          |

To summarize, postgres schemas are case sensitive, but in the SQL syntax the identifiers are lowercased when unquoted.

@mkmik
Copy link
Contributor

mkmik commented Feb 4, 2022

This issue however is about a CSV file that has a header that contains non-lowercased column names, and the expectation that datafusion will make them referenceable using lowercase identifiers.

Perhaps we could add an option CsvReadOptions to lowercase all columns?

@alamb
Copy link
Contributor

alamb commented Feb 4, 2022

Perhaps we could add an option CsvReadOptions to lowercase all columns?

Indeed -- both changes (lowercase identifiers when unquoted) and an option to CsvReadOption if not already available sound good to me

@mkmik
Copy link
Contributor

mkmik commented Feb 4, 2022

I factored out the identifier quoting issue in #1746.

I think this issue should be a feature request for datafusion to parse a CSV file while converting its column names to lowercase.

@OscarTHZhang
Copy link
Author

Looks like I can close this one

@alamb
Copy link
Contributor

alamb commented Feb 14, 2022

Thanks @OscarTHZhang . Note that as @mkmik points out, we haven't changed how identifiers are named from csv files. So if your CSV file has a column named Foo to query it you will have to use "Foo". foo will not work.

This issue however is about a CSV file that has a header that contains non-lowercased column names, and the expectation that datafusion will make them referenceable using lowercase identifiers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

4 participants