Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify in docs that Identifiers are made lower-case in SQL query #2374

Closed
dbr opened this issue Apr 29, 2022 · 6 comments · Fixed by #3385
Closed

Clarify in docs that Identifiers are made lower-case in SQL query #2374

dbr opened this issue Apr 29, 2022 · 6 comments · Fixed by #3385
Assignees
Labels
documentation Improvements or additions to documentation

Comments

@dbr
Copy link

dbr commented Apr 29, 2022

Describe the bug
With (I think) the most up-to-date version of the arrow2 branch, the idenfitiers in query seem to be made lower-case

For example if I modify the memtable example by renaming the bank_acount field to Bank_Account then the query will fail,

$ gd
diff --git a/datafusion-examples/examples/memtable.rs b/datafusion-examples/examples/memtable.rs
index 4c635205..16bd247d 100644
--- a/datafusion-examples/examples/memtable.rs
+++ b/datafusion-examples/examples/memtable.rs
@@ -37,7 +37,7 @@ async fn main() -> Result<()> {
     // Register the in-memory table containing the data
     ctx.register_table("users", Arc::new(mem_table))?;
 
-    let dataframe = ctx.sql("SELECT * FROM users;").await?;
+    let dataframe = ctx.sql("SELECT Bank_Account FROM users;").await?;
 
     timeout(Duration::from_secs(10), async move {
         let result = dataframe.collect().await.unwrap();
@@ -72,6 +72,6 @@ fn create_record_batch() -> Result<RecordBatch> {
 fn get_schema() -> SchemaRef {
     SchemaRef::new(Schema::new(vec![
         Field::new("id", DataType::UInt8, false),
-        Field::new("bank_account", DataType::UInt64, true),
+        Field::new("Bank_Account", DataType::UInt64, true),
     ]))
 }
$ cargo run --example memtable
    Finished dev [unoptimized + debuginfo] target(s) in 0.35s
     Running `./target/debug/examples/memtable`
Error: Plan("Invalid identifier '#bank_account' for schema fields:[users.id, users.Bank_Account], metadata:{}")

Meaning, if the source table has any non-lowercase characters, the queries cannot work

Additional context
I'm using the arrow2 branch at this revision:

744b262

@dbr dbr added the bug Something isn't working label Apr 29, 2022
@dbr
Copy link
Author

dbr commented Apr 29, 2022

Oh this is acting same in latest master branch! 7a9b865

$ git rev-parse HEAD
7a9b86526e2bd12deccb9ff30cead2ba6323409a
$ git diff
diff --git a/datafusion-examples/examples/memtable.rs b/datafusion-examples/examples/memtable.rs
index 11793a83..80cb65fc 100644
--- a/datafusion-examples/examples/memtable.rs
+++ b/datafusion-examples/examples/memtable.rs
@@ -36,7 +36,7 @@ async fn main() -> Result<()> {
     // Register the in-memory table containing the data
     ctx.register_table("users", Arc::new(mem_table))?;
 
-    let dataframe = ctx.sql("SELECT * FROM users;").await?;
+    let dataframe = ctx.sql("SELECT Bank_Account FROM users;").await?;
 
     timeout(Duration::from_secs(10), async move {
         let result = dataframe.collect().await.unwrap();
@@ -71,6 +71,6 @@ fn create_record_batch() -> Result<RecordBatch> {
 fn get_schema() -> SchemaRef {
     SchemaRef::new(Schema::new(vec![
         Field::new("id", DataType::UInt8, false),
-        Field::new("bank_account", DataType::UInt64, true),
+        Field::new("Bank_Account", DataType::UInt64, true),
     ]))
 }
$ cargo run --example memtable
    Finished dev [unoptimized + debuginfo] target(s) in 0.36s
     Running `./target/debug/examples/memtable`
Error: Plan("Invalid identifier '#bank_account' for schema fields:[users.id, users.Bank_Account], metadata:{}")

@dbr dbr changed the title Arrow2: Identifiers are made lower-case in SQL query Identifiers are made lower-case in SQL query Apr 29, 2022
@andygrove andygrove self-assigned this Apr 29, 2022
@andygrove
Copy link
Member

@dbr Unquoted SQL identifiers are case-insensitive by design (to match ANSI and Postgres). SQL identifiers in double quotes are case-sensitive. If you put double quotes around the identifier in the SQL query then it works as expected.

diff --git a/datafusion-examples/examples/memtable.rs b/datafusion-examples/examples/memtable.rs
index 11793a837f..b04f5424a1 100644
--- a/datafusion-examples/examples/memtable.rs
+++ b/datafusion-examples/examples/memtable.rs
@@ -36,7 +36,7 @@ async fn main() -> Result<()> {
     // Register the in-memory table containing the data
     ctx.register_table("users", Arc::new(mem_table))?;
 
-    let dataframe = ctx.sql("SELECT * FROM users;").await?;
+    let dataframe = ctx.sql("SELECT \"Bank_Account\" FROM users;").await?;
 
     timeout(Duration::from_secs(10), async move {
         let result = dataframe.collect().await.unwrap();
@@ -71,6 +71,6 @@ fn create_record_batch() -> Result<RecordBatch> {
 fn get_schema() -> SchemaRef {
     SchemaRef::new(Schema::new(vec![
         Field::new("id", DataType::UInt8, false),
-        Field::new("bank_account", DataType::UInt64, true),
+        Field::new("Bank_Account", DataType::UInt64, true),
     ]))
 }

@dbr
Copy link
Author

dbr commented Apr 30, 2022

Ah interesting! Quoting the identifiers works, thanks

Just to be clear, is it still a bug that the original example fails? I would expect:

  • Given the field named bank_account the SELECT BaNk_AcCoUnT FROM users does work as expected (case is ignored)
  • Given the field named Bank_Account, I would expect SELECT BaNk_AcCoUnT FROM users to work (but currently this only looks for a field named bank_account so the query errors, and the error contains a lower-cased version of what is present in the query)

@andygrove
Copy link
Member

@dbr My view is that this is not a bug. If a non-lowercase table or column name is created via the low-level Schema/DataFrame APIs then I would expect to have to use double-quotes in SQL to reference it. However, we are probably lacking in documentation around this so lets at least use this issue to drive that. Also would like to hear what others think, including @alamb and @yjshen

@alamb
Copy link
Contributor

alamb commented May 2, 2022

Given the field named Bank_Account, I would expect SELECT BaNk_AcCoUnT FROM users to work (but currently this only looks for a field named bank_account so the query errors, and the error contains a lower-cased version of what is present in the query)

I would not expect this to work (as the field is named Bank_Account)

Here is an example in case that is helpful: #1710 (comment)

So in other words, I agree with @andygrove 's conclusion that this is not a bug but rather "working as intended"

@dbr
Copy link
Author

dbr commented May 3, 2022

Ahh I see - linked comment is interesting, I wasn't aware Postgres had these rules around case-insensitivity!

I was thinking maybe the error message could be clearer, but even postgres appears to do the same thing (e.g error message containing the lower-cased field name, not the one the user supplied). So just noting this in the docs as mentioned seems like a good solution

@andygrove andygrove added documentation Improvements or additions to documentation and removed bug Something isn't working labels May 6, 2022
@alamb alamb changed the title Identifiers are made lower-case in SQL query Clarify in docs that Identifiers are made lower-case in SQL query May 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants