Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support tuples as types #11896

Merged
merged 3 commits into from
Aug 12, 2024
Merged

Support tuples as types #11896

merged 3 commits into from
Aug 12, 2024

Conversation

samuelcolvin
Copy link
Contributor

@samuelcolvin samuelcolvin commented Aug 8, 2024

Which issue does this PR close?

Closes #6635, continues from #8590.

Rationale for this change

See #6635 — I want to be able to use tuples as per that issue

What changes are included in this PR?

  • Add tuples planning
  • add tuple coercion
  • tests

Are these changes tested?

yes.

Are there any user-facing changes?

More SQL supported.

(Fixed, ignore this)

This is mostly working and (a, b) in (('x', 'y')) is working correctly, however if you have more than one value in the lookup, e.g. ('x', 'y') IN (('x', 'y'), ('y', 'x')) an error occurs:

External error: query failed: DataFusion error: Arrow error: Invalid argument error: Nested comparison: Struct([Field { name: "c0", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c1", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) == Struct([Field { name: "c0", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c1", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) (hint: use make_comparator instead)
[SQL] select ('x', 'y') IN (('x', 'y'), ('y', 'x'));
at test_files/struct.slt:256

There are two tests that should fail.

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Aug 8, 2024
Copy link
Contributor Author

@samuelcolvin samuelcolvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why

@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 9, 2024

DataFusion error: Arrow error: Invalid argument error: Nested comparison: Struct\(\[Field \{ name: "c0", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}, Field \{ name: "c1", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\]\) == Struct\(\[Field \{ name: "c0", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}, Field \{ name: "c1", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\]\) \(hint: use make_comparator instead\)

make_comparator is introduced recently for comparing nested type.

However, InListExpr is not able to compare with nested type.

fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
        let num_rows = batch.num_rows();
        let value = self.expr.evaluate(batch)?;
        let r = match &self.static_filter {
            Some(f) => f.contains(value.into_array(num_rows)?.as_ref(), self.negated)?,
            None => {
                let value = value.into_array(num_rows)?;
                let found = self.list.iter().map(|expr| expr.evaluate(batch)).try_fold(
                    BooleanArray::new(BooleanBuffer::new_unset(num_rows), None),
                    |result, expr| -> Result<BooleanArray> {
                        Ok(or_kleene(
                            &result,
                            &eq(&value, &expr?.into_array(num_rows)?)?,
                        )?)
                    },
                )?;

                if self.negated {
                    not(&found)?
                } else {
                    found
                }
            }
        };
        Ok(ColumnarValue::Array(Arc::new(r)))
    }

We need to change the eq to new function that calls compare_op_for_nested for nested type, and fallback to eq for non-nested type. I think there is similar issue left somewhere for other expressions, maybe we can have a general comparison function (nested + non-nested) and adapt it widely in datafusion

@github-actions github-actions bot added the physical-expr Physical Expressions label Aug 9, 2024
@samuelcolvin
Copy link
Contributor Author

Thanks so much for the pointer @jayzhan211, this should now be fixed.

@samuelcolvin samuelcolvin marked this pull request as ready for review August 9, 2024 09:17
@samuelcolvin samuelcolvin changed the title [HELP NEEDED] support tuples as types Support tuples as types Aug 9, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @samuelcolvin -- this looks very impressive to me. Beautifully simple

I had some small suggestions, but I don't think they are required to merge

Very nicely done

@@ -780,6 +781,31 @@ fn coerce_numeric_type_to_decimal256(numeric_type: &DataType) -> Option<DataType
}
}

fn struct_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please document what the intended rules for coercion are, in English, as a comment? That would help me understand if the code implements the intent correctly

I think the result is if all the types can be cocerced, then the output is a struct with fields named c1 , c2, etc

I wonder if it would be better / easier to understand if the names of the fields were preserved (aka the field names from lhs_fields)

self.parse_struct(schema, planner_context, values, vec![])
}
None => not_impl_err!("Empty tuple not supported yet"),
_ => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to figure out how to write a negative test for this case, however anything I could try seems to work

DataFusion CLI v41.0.0

create table foo as values (1);

> select * from foo where (column1) IN ((column1+1), (2));
+---------+
| column1 |
+---------+
+---------+
0 row(s) fetched.
Elapsed 0.008 seconds.

> select * from foo where (column1) IN ((column1+1), (2), (1));
+---------+
| column1 |
+---------+
| 1       |
+---------+
1 row(s) fetched.
Elapsed 0.010 seconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query error DataFusion error: This feature is not implemented: Only identifiers and literals are supported in tuples
select a from values where (a, c) IN ((abs(a), 1));

Query success in duckdb

D select a from t where (a, b) in ((1, abs(a)));
┌────────┐
│   a    │
│ int32  │
├────────┤
│ 0 rows │
└────────┘

We could support abs() and other complex type in tuple in follow PR, but we should check not only first element of the values but all the values in this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will file a ticket to track the feature

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query error DataFusion error: This feature is not implemented: Only identifiers and literals are supported in tuples
select a from values where (a, c) IN ((abs(a), 1));

Query success in duckdb

D select a from t where (a, b) in ((1, abs(a)));
┌────────┐
│   a    │
│ int32  │
├────────┤
│ 0 rows │
└────────┘

We could support abs() and other complex type in tuple in follow PR, but we should check not only first element of the values but all the values in this PR

I am not sure about whether our evaulation support struct array like {(struct(Expr1{column1}, 1)), struct (Expr2{column2}, 2))}?

true

query I
select a from values where (a, c) = (1, 'a');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend we also do a negative test, like

select a from values where (a, c) = (100, 'a');

@alamb
Copy link
Contributor

alamb commented Aug 9, 2024

FYI @my-vegetable-has-exploded

.into_iter()
.enumerate()
.map(|(i, datatype)| {
Arc::new(Field::new(format!("c{i}"), datatype, true))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should preserve field name and nullable, only type is updated

planner_context: &mut PlannerContext,
values: Vec<SQLExpr>,
) -> Result<Expr> {
match values.first() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check all

@my-vegetable-has-exploded
Copy link
Contributor

Thanks @samuelcolvin. This is exactly what I wanted to do.

@github-actions github-actions bot removed the logical-expr Logical plan and expressions label Aug 12, 2024
@alamb
Copy link
Contributor

alamb commented Aug 12, 2024

I took the liberty of merging up from main to resolve a conflict -- I think we could merge this PR and address the comments as a follow on PR, or we can do it prior to merging this one.

Please let me know what you prefer @samuelcolvin

@alamb
Copy link
Contributor

alamb commented Aug 12, 2024

Thanks again everyone 🚀

@alamb alamb merged commit 140f7ce into apache:main Aug 12, 2024
24 checks passed
@samuelcolvin samuelcolvin deleted the tuples branch August 12, 2024 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support multi column IN lists like (c1, c2) IN ((c1, c2), ,,,)
4 participants