Enhance the formatting for Column #11724

Closed
wants to merge 6 commits into from
57 changes: 56 additions & 1 deletion datafusion/common/src/column.rs
@@ -18,6 +18,7 @@
//! Column

use arrow_schema::{Field, FieldRef};
use std::borrow::Cow;

use crate::error::_schema_err;
use crate::utils::{parse_identifiers_normalized, quote_identifier};
@@ -156,6 +157,17 @@ impl Column {
        }
    }

    fn quoted_flat_name_if_contain_dot(&self) -> String {
        match &self.relation {
            Some(r) => format!(
                "{}.{}",
                table_reference_to_quoted_string(r),
                quoted_if_contain_dot(&self.name)
            ),
            None => quoted_if_contain_dot(&self.name).to_string(),
        }
    }
Comment on lines +160 to +169
Contributor Author

@jayzhan211 Before I fix the other tests, I want to check whether this behavior makes sense. (It touches a lot of tests 😢.)
Now we only quote an identifier if it contains a dot. However, names like sum(t1.c1) will also be quoted, even though they are function calls. I don't think it's worth adding more checks to exclude this kind of case. What do you think?
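
For illustration, here is a standalone sketch of the quoting rule described above. It copies the logic of `quoted_if_contain_dot` from this diff into a std-only program (no DataFusion types), so the inputs and the `main` function are assumptions made for the example rather than code from the PR.

```rust
use std::borrow::Cow;

// Standalone copy of the helper's logic, so it runs without DataFusion.
fn quoted_if_contain_dot(s: &str) -> Cow<'_, str> {
    if s.contains('.') {
        // Escape embedded double quotes by doubling them, then wrap in quotes.
        Cow::Owned(format!("\"{}\"", s.replace('"', "\"\"")))
    } else {
        // No dot: hand the input back without allocating.
        Cow::Borrowed(s)
    }
}

fn main() {
    // No dot: returned unchanged.
    assert_eq!(quoted_if_contain_dot("max(a)"), "max(a)");
    // Plain identifier containing a dot: quoted, as intended.
    assert_eq!(quoted_if_contain_dot("t1.c1"), "\"t1.c1\"");
    // Function-call name containing a dot: also quoted, which is the open question.
    assert_eq!(quoted_if_contain_dot("sum(t1.c1)"), "\"sum(t1.c1)\"");
    println!("ok");
}
```

The last assertion is the case under discussion: the rule only looks at the string, so it cannot tell a dotted identifier from a function call whose argument is qualified.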

Contributor

@jayzhan211 jayzhan211 Aug 3, 2024

I don't think it's ideal for sum(t1.c1) to be quoted 🤔. I'd like the change to be as small as possible, so I would prefer that functions and other Exprs stay the same and only identifiers containing a dot are quoted.

We could also hold off and wait for more input from others, given that this change is not trivial.

Contributor

Maybe instead of modifying Column, we should modify display_name for Expr, so that if we find a column inside a ScalarFunction, we can skip the double quotes (with something like a boolean flag?).
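
A minimal sketch of what this suggestion could look like, assuming a heavily simplified, hypothetical `Expr` enum and a `display_name` function with a `quote` flag. None of these definitions are DataFusion's real types or signatures; they only illustrate the idea of switching quoting off once rendering descends into a function call.

```rust
// Hypothetical, simplified stand-in for DataFusion's `Expr`; not the real type.
enum Expr {
    Column(String),
    ScalarFunction { name: String, args: Vec<Expr> },
}

// Render a name, quoting dotted column names only while `quote` is true.
fn display_name(expr: &Expr, quote: bool) -> String {
    match expr {
        Expr::Column(name) if quote && name.contains('.') => {
            format!("\"{}\"", name.replace('"', "\"\""))
        }
        Expr::Column(name) => name.clone(),
        Expr::ScalarFunction { name, args } => {
            // Columns nested inside a function call are rendered without quotes.
            let args: Vec<String> = args.iter().map(|a| display_name(a, false)).collect();
            format!("{}({})", name, args.join(", "))
        }
    }
}

fn main() {
    let col = Expr::Column("t1.c1".to_string());
    let func = Expr::ScalarFunction {
        name: "sum".to_string(),
        args: vec![Expr::Column("t1.c1".to_string())],
    };
    // A top-level dotted column is quoted.
    assert_eq!(display_name(&col, true), "\"t1.c1\"");
    // The same column inside a function call is not.
    assert_eq!(display_name(&func, true), "sum(t1.c1)");
    println!("ok");
}
```

This only works while the function call is still a structured expression; it does not help once the whole call has been flattened into a single column name string.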

Contributor Author

> Maybe instead of modifying Column, we should modify the display_name for Expr, so if we found column inside ScalarFunction, we could skip the double quote anyway. (by something like boolean flag?)

I'm not sure if it's that simple 🤔. In my experience, the column might look like this:

Column { relation: None, name: "sum(t1.c1)" }

I think it's hard to find a consistent pattern for it because we use many Column::from_name calls to create projections. For example, in

.map(|x| Column::from_name(self.ident_normalizer.normalize(x)))

the column name could be complex and unruly.


/// Qualify column if not done yet.
///
/// If this column already has a [relation](Self::relation), it will be returned as is and the given parameters are
@@ -328,6 +340,37 @@ impl Column {
}
}

fn quoted_if_contain_dot(s: &str) -> Cow<str> {
    if s.contains(".") {
        Cow::Owned(format!("\"{}\"", s.replace('"', "\"\"")))
    } else {
        Cow::Borrowed(s)
    }
}

fn table_reference_to_quoted_string(table_ref: &TableReference) -> String {
    match table_ref {
        TableReference::Bare { table } => quoted_if_contain_dot(table).to_string(),
        TableReference::Partial { schema, table } => {
            format!(
                "{}.{}",
                quoted_if_contain_dot(schema),
                quoted_if_contain_dot(table)
            )
        }
        TableReference::Full {
            catalog,
            schema,
            table,
        } => format!(
            "{}.{}.{}",
            quoted_if_contain_dot(catalog),
            quoted_if_contain_dot(schema),
            quoted_if_contain_dot(table)
        ),
    }
}

impl From<&str> for Column {
fn from(c: &str) -> Self {
Self::from_qualified_name(c)
@@ -372,7 +415,7 @@ impl FromStr for Column {

 impl fmt::Display for Column {
     fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
-        write!(f, "{}", self.flat_name())
+        write!(f, "{}", self.quoted_flat_name_if_contain_dot())
     }
 }

@@ -455,4 +498,16 @@ mod tests {

        Ok(())
    }

    #[test]
    fn test_display() {
        let col = Column::new(Some("t1"), "a");
        assert_eq!(col.to_string(), "t1.a");
        let col = Column::new(TableReference::none(), "t1.a");
        assert_eq!(col.to_string(), r#""t1.a""#);
        let col = Column::new(Some(TableReference::full("a.b", "c.d", "e.f")), "g.h");
        assert_eq!(col.to_string(), r#""a.b"."c.d"."e.f"."g.h""#);
        let col = Column::new(TableReference::none(), "max(a)");
        assert_eq!(col.to_string(), "max(a)")
    }
}
51 changes: 45 additions & 6 deletions datafusion/expr/src/logical_plan/plan.rs
@@ -2752,11 +2752,10 @@ fn calc_func_dependencies_for_project(
         .iter()
         .filter_map(|expr| {
             let expr_name = match expr {
-                Expr::Alias(alias) => {
-                    format!("{}", alias.expr)
-                }
-                _ => format!("{}", expr),
-            };
+                Expr::Alias(alias) => alias.expr.display_name(),
+                _ => expr.display_name(),
+            }
+            .ok()?;
Comment on lines +2755 to +2758
Contributor Author

I don't know whether this makes sense, but we need a name for the functional dependencies, so I chose a name that isn't affected by Display.

By the way, I'm confused about why we have so many different names and similar display methods for Expr. Maybe we should organize them or name them more clearly.
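
A rough sketch of why the lookup name presumably has to be independent of `Display` once `Display` starts quoting. The helpers `display`, `plain_name`, and `positions` are hypothetical stand-ins written for this example (std only); they mimic the shape of the `filter_map` in the diff but are not DataFusion functions.

```rust
// Hypothetical stand-in for the quoting `Display` output introduced in this PR.
fn display(name: &str) -> String {
    if name.contains('.') {
        format!("\"{}\"", name.replace('"', "\"\""))
    } else {
        name.to_string()
    }
}

// Hypothetical stand-in for a fallible name that quoting does not affect.
fn plain_name(name: &str) -> Result<String, String> {
    Ok(name.to_string())
}

// Map each projected name to its position among the input field names,
// silently skipping names that fail to resolve or do not match any field.
fn positions(input_fields: &[&str], exprs: &[&str], use_display: bool) -> Vec<usize> {
    exprs
        .iter()
        .filter_map(|e| {
            let expr_name = if use_display {
                display(e)
            } else {
                plain_name(e).ok()?
            };
            input_fields.iter().position(|item| *item == expr_name)
        })
        .collect()
}

fn main() {
    let input_fields = ["id", "t.id"];
    let exprs = ["t.id", "id"];
    // With the quoting display string, "t.id" becomes "\"t.id\"" and no longer matches.
    assert_eq!(positions(&input_fields, &exprs, true), vec![0]);
    // With the unquoted name, both expressions resolve.
    assert_eq!(positions(&input_fields, &exprs, false), vec![1, 0]);
    println!("ok");
}
```

Because `filter_map` drops misses silently, a mismatch like this would not raise an error; the dependency would simply be lost.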

Contributor

#11782 filed

             input_fields.iter().position(|item| *item == expr_name)
         })
         .collect::<Vec<_>>();
@@ -2906,7 +2905,7 @@ mod tests {
use super::*;
use crate::builder::LogicalTableSource;
use crate::logical_plan::table_scan;
-    use crate::{col, exists, in_subquery, lit, placeholder, GroupingSet};
+    use crate::{col, exists, ident, in_subquery, lit, placeholder, GroupingSet};

use datafusion_common::tree_node::{TransformedResult, TreeNodeVisitor};
use datafusion_common::{not_impl_err, Constraint, ScalarValue};
@@ -3512,4 +3511,44 @@ digraph {
        let actual = format!("{}", plan.display_indent());
        assert_eq!(expected.to_string(), actual)
    }

    #[test]
    fn test_display_unqualified_ident() {
        let schema = Schema::new(vec![
            Field::new("max(id)", DataType::Int32, false),
            Field::new("state", DataType::Utf8, false),
        ]);

        let plan = table_scan(Some("t"), &schema, None)
            .unwrap()
            .filter(col("state").eq(lit("CO")))
            .unwrap()
            .project(vec![col("max(id)")])
            .unwrap()
            .build()
            .unwrap();

        let expected =
            "Projection: t.max(id)\n  Filter: t.state = Utf8(\"CO\")\n    TableScan: t";
        let actual = format!("{}", plan.display_indent());
        assert_eq!(expected.to_string(), actual);

        let schema = Schema::new(vec![
            Field::new("id", DataType::Int32, false),
            Field::new("t.id", DataType::Int32, false),
        ]);

        let plan = table_scan(Some("t"), &schema, None)
            .unwrap()
            .build()
            .unwrap();
        let projection = LogicalPlan::Projection(
            Projection::try_new(vec![col("t.id"), ident("t.id")], Arc::new(plan))
                .unwrap(),
        );

        let expected = "Projection: t.id, \"t.id\"\n  TableScan: t";
        let actual = format!("{}", projection.display_indent());
        assert_eq!(expected.to_string(), actual);
    }
}
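
The second half of the test above relies on `col("t.id")` and `ident("t.id")` producing different columns. The sketch below illustrates that difference with a hypothetical, simplified `Column` (the relation is a plain `String`, and `from_qualified_name` naively splits on the first dot, whereas the real code goes through `parse_identifiers_normalized`). It assumes `ident` keeps the whole string as the column name, as the expected output in the test suggests.

```rust
use std::fmt;

// Hypothetical, simplified stand-in for `Column`; the relation is a plain string here.
struct Column {
    relation: Option<String>,
    name: String,
}

impl Column {
    // Keep the whole string as the column name (what `ident` is assumed to do).
    fn from_name(name: &str) -> Self {
        Column { relation: None, name: name.to_string() }
    }

    // Naive split on the first dot; an assumption, the real parser is SQL-aware.
    fn from_qualified_name(s: &str) -> Self {
        match s.split_once('.') {
            Some((relation, name)) => Column {
                relation: Some(relation.to_string()),
                name: name.to_string(),
            },
            None => Self::from_name(s),
        }
    }
}

// Quote a single name part only if it contains a dot.
fn quote_if_dotted(s: &str) -> String {
    if s.contains('.') {
        format!("\"{}\"", s.replace('"', "\"\""))
    } else {
        s.to_string()
    }
}

impl fmt::Display for Column {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match &self.relation {
            Some(r) => write!(f, "{}.{}", quote_if_dotted(r), quote_if_dotted(&self.name)),
            None => write!(f, "{}", quote_if_dotted(&self.name)),
        }
    }
}

fn main() {
    // Qualified: relation `t`, column `id`; neither part contains a dot.
    assert_eq!(Column::from_qualified_name("t.id").to_string(), "t.id");
    // Unqualified: a single column literally named `t.id`, so it is quoted.
    assert_eq!(Column::from_name("t.id").to_string(), "\"t.id\"");
    println!("ok");
}
```

Without the quoting, both columns would print as `t.id`, which is the ambiguity the PR title refers to.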