Column names for SQL queries on CSV files should not be case sensitive #1710
Comments
Thank you for the report @OscarTHZhang
IIUC, Arrow schemas are case sensitive. This is true for CSV and for other sources (e.g. Parquet). I'm not sure what the right approach is here. SQL engines are traditionally case insensitive, but here we have a SQL engine built on top of an existing schema model that doesn't follow that tradition.
Yes, Arrow schemas are case sensitive, and SQL itself has stranger semantics for case insensitivity. Here is an example from postgres showing how SQL interprets mixed-case identifiers. Basically, if the identifier is not double quoted, it is simply lowercased prior to processing:
alamb=# create table bar ("foo" int, "Foo" int, "FoO" int);
CREATE TABLE
alamb=# insert into bar values (1,2,3);
INSERT 0 1
alamb=# select * from bar;
foo | Foo | FoO
-----+-----+-----
1 | 2 | 3
(1 row)
alamb=# select foo from bar;
foo
-----
1
(1 row)
-- Note result has foo, not Foo
alamb=# select Foo from bar;
foo
-----
1
(1 row)
alamb=# select "Foo" from bar;
Foo
-----
2
(1 row)
alamb=# create table baz("Foo" int, foo int);
CREATE TABLE
alamb=# insert into baz values (100,200);
INSERT 0 1
alamb=# select foo from baz;
foo
-----
200
(1 row)
If you create a table, the identifier is also lowercased at create time.
Apparently datafusion already does this conversion for function names.
So my interpretation is "SQL should lowercase all identifiers unless they are double quoted". So if the arrow schema has
the identifier is lowercased only if not quoted.
To summarize, postgres schemas are case sensitive, but in the SQL syntax the identifiers are lowercased when unquoted.
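For illustration, the rule summarized above (lowercase unless double quoted) could be sketched roughly like this. `normalize_ident` is a hypothetical helper written for this thread, not a DataFusion or sqlparser API, and it assumes the identifier has already been isolated as a single token:

```rust
/// Normalize a SQL identifier following the postgres behavior shown earlier:
/// unquoted identifiers are folded to lowercase, double-quoted identifiers
/// keep their exact case (the surrounding quotes are removed).
/// Hypothetical helper, not part of DataFusion.
fn normalize_ident(raw: &str) -> String {
    let is_quoted = raw.len() >= 2 && raw.starts_with('"') && raw.ends_with('"');
    if is_quoted {
        // Strip the outer quotes; a doubled quote ("") inside stands for one quote.
        raw[1..raw.len() - 1].replace("\"\"", "\"")
    } else {
        raw.to_lowercase()
    }
}
```

Under this sketch, `Foo` and `foo` both resolve to the column `foo`, while `"Foo"` still refers to the column named `Foo`, which matches the `select` results in the transcript.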
This issue, however, is about a CSV file whose header contains non-lowercased column names, and the expectation that datafusion will make them referenceable using lowercase identifiers. Perhaps we could add an option
Indeed -- both changes (lowercase identifiers when unquoted) and an option to
I factored out the identifier quoting issue in #1746. I think this issue should be a feature request for datafusion to parse a CSV file while converting its column names to lowercase.
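A minimal sketch of what "converting column names to lowercase" could mean for a CSV header, assuming a plain comma-separated header line with no quoted fields. `lowercase_header` is a made-up name for illustration and does not correspond to any existing DataFusion option:

```rust
use std::collections::HashSet;

/// Lowercase every column name in a CSV header line.
/// Returns an error if two columns collide after lowercasing (e.g. `Foo` and `foo`),
/// since they could no longer be told apart.
/// Hypothetical helper; assumes no quoted or escaped fields in the header.
fn lowercase_header(header: &str) -> Result<Vec<String>, String> {
    let mut seen = HashSet::new();
    let mut names = Vec::new();
    for field in header.trim_end().split(',') {
        let name = field.trim().to_lowercase();
        // `insert` returns false when the name was already present.
        if !seen.insert(name.clone()) {
            return Err(format!("duplicate column name after lowercasing: {}", name));
        }
        names.push(name);
    }
    Ok(names)
}
```

The collision check is there because a header such as `Foo,foo` is valid in a case-sensitive schema but becomes ambiguous once everything is lowercased.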
Looks like I can close this one
Thanks @OscarTHZhang. Note that, as @mkmik points out, we haven't changed how identifiers are named from CSV files. So if your CSV file has a column named
Describe the bug
If the column names in a CSV file are uppercase, then typing lowercase column names in SQL queries will result in an error: Invalid identifier for schema.
To Reproduce
Here is a simple example program that runs a query on 2 tables.
This will result in an error:
Error: Plan("Invalid identifier '#lo_orderdate' for schema lineorder.LO_ORDERKEY, lineorder.LO_LINENUMBER, lineorder.LO_CUSTKEY, lineorder.LO_PARTKEY, lineorder.LO_SUPPKEY, lineorder.LO_ORDERDATE, lineorder.LO_ORDERPRIORITY, lineorder.LO_SHIPPRIORITY, lineorder.LO_QUANTITY, lineorder.LO_EXTENDEDPRICE, lineorder.LO_ORDTOTALPRICE, lineorder.LO_DISCOUNT, lineorder.LO_REVENUE, lineorder.LO_SUPPLYCOST, lineorder.LO_TAX, lineorder.LO_COMMITDATE, lineorder.LO_SHIPMODE, date.D_DATEKEY, date.D_DATE, date.D_DAYOFWEEK, date.D_MONTH, date.D_YEAR, date.D_YEARMONTHNUM, date.D_YEARMONTH, date.D_DAYNUMINWEEK, date.D_DAYNUMINMONTH, date.D_DAYNUMINYEAR, date.D_MONTHNUMINYEAR, date.D_WEEKNUMINYEAR, date.D_SELLINGSEASON, date.D_LASTDAYINWEEKFL, date.D_LASTDAYINMONTHFL, date.D_HOLIDAYFL, date.D_WEEKDAYFL")
Expected behavior
The column names should not be case-sensitive. The query should execute normally to produce the query result.
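To make the expectation concrete, here is a rough sketch of a case-insensitive column lookup. It is purely illustrative: `resolve_column` is a hypothetical function, not DataFusion's resolver, and the comments in this thread lean toward lowercasing unquoted identifiers instead. The sketch prefers an exact match and only then falls back to an ASCII case-insensitive match, which has to be unique:

```rust
/// Resolve `requested` against the schema's column names.
/// An exact match wins; otherwise fall back to a case-insensitive match,
/// which must be unique. Hypothetical helper, not DataFusion's resolver.
fn resolve_column<'a>(columns: &[&'a str], requested: &str) -> Result<&'a str, String> {
    if let Some(exact) = columns.iter().find(|c| **c == requested) {
        return Ok(*exact);
    }
    // Collect every column that matches when ASCII case is ignored.
    let matches: Vec<&'a str> = columns
        .iter()
        .copied()
        .filter(|c| c.eq_ignore_ascii_case(requested))
        .collect();
    match matches.as_slice() {
        [only] => Ok(*only),
        [] => Err(format!("no column named '{}'", requested)),
        _ => Err(format!("column '{}' is ambiguous", requested)),
    }
}
```

With the schema from the error message, `lo_orderdate` would resolve to `LO_ORDERDATE`. A schema containing both `Foo` and `FOO` shows why plain case insensitivity is tricky on top of a case-sensitive schema: `foo` would then be rejected as ambiguous.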