Fix Count_Distinct in Snowflake #10818

Merged: 9 commits, Aug 16, 2024
@@ -444,7 +444,7 @@ agg_count_distinct_include_null args =
The columns are converted to VARIANT type because of that, which may incur some overhead.
But there seems to be no other reliable way to handle this for columns like numeric where no non-NULL value exists that can be guaranteed to be unused.
replace_null_with_marker expr =
-        SQL_Builder.code "COALESCE(" ++ expr ++ ", {'enso-null-replacement-marker':'" ++ Random.uuid ++ "'}::variant)"
+        SQL_Builder.code "COALESCE(" ++ expr ++ "::variant, {'enso-null-replacement-marker':'" ++ Random.uuid ++ "'}::variant)"
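The marker-based NULL replacement can be illustrated with a small Python simulation of the SQL this helper generates (the function name and values here are illustrative, not part of the codebase): replacing NULL with a marker that cannot collide with real data lets a distinct count treat NULL as one more value.

```python
import uuid

def count_distinct_include_null(values):
    # Stand-in for the generated SQL's COALESCE trick: replace None (NULL)
    # with a marker value guaranteed not to occur in the data, then count
    # distinct values as usual. A random UUID makes collisions implausible.
    marker = ("enso-null-replacement-marker", str(uuid.uuid4()))
    replaced = [marker if v is None else v for v in values]
    return len(set(replaced))

print(count_distinct_include_null([1, 1, None, None, 3]))  # 3: values 1, 3, and NULL
```

Both NULLs map to the same marker within one call, so NULL is counted as exactly one distinct value, mirroring the `include_null` semantics.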

## PRIVATE
A helper for `lookup_and_replace`, and perhaps other operations.
8 changes: 4 additions & 4 deletions test/Snowflake_Tests/src/Snowflake_Spec.enso
@@ -184,11 +184,11 @@ snowflake_specific_spec suite_builder default_connection db_name setup =
# The integer column is treated as NUMBER(38, 0) in Snowflake so the value type reflects that:
i.at "Value Type" . to_vector . should_equal [Value_Type.Char, Value_Type.Decimal 38 0, Value_Type.Boolean, Value_Type.Float]

-    group_builder.specify "should return Table information, also for aggregated results" pending="TODO: fix https://github.com/enso-org/enso/issues/10611" <|
+    group_builder.specify "should return Table information, also for aggregated results" <|
i = data.t.aggregate columns=[Aggregate_Column.Concatenate "strs", Aggregate_Column.Sum "ints", Aggregate_Column.Count_Distinct "bools"] . column_info
i.at "Column" . to_vector . should_equal ["Concatenate strs", "Sum ints", "Count Distinct bools"]
i.at "Items Count" . to_vector . should_equal [1, 1, 1]
-        i.at "Value Type" . to_vector . should_equal [Value_Type.Char, Value_Type.Decimal 38 0, Value_Type.Decimal 38 0]
+        i.at "Value Type" . to_vector . should_equal [Value_Type.Char, Value_Type.Decimal 38 0, Value_Type.Decimal 18 0]

group_builder.specify "should infer standard types correctly" <|
data.t.at "strs" . value_type . is_text . should_be_true
@@ -394,7 +394,7 @@ snowflake_specific_spec suite_builder default_connection db_name setup =
table_name = Name_Generator.random_name "TimestampTZ"
table = default_connection.get.create_table table_name [Column_Description.Value "A" (Value_Type.Date_Time with_timezone=True), Column_Description.Value "rowid" Value_Type.Integer] primary_key=[]
table.should_succeed
-        Panic.with_finalizer default_connection.get.drop_table table.name <|
+        Panic.with_finalizer (default_connection.get.drop_table table.name) <|
Contributor: Does this not give a compile-time error, having too many arguments?

Member Author: No, such calls are currently only resolved at runtime. Once #9812 is implemented, we'd get a warning if static type analysis were enabled.
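As a rough analogy in Python (not Enso; the function here is hypothetical): passing too many arguments is only rejected when the call is actually evaluated, while a static checker such as mypy would flag it ahead of time, which is the kind of diagnostic static type analysis would add.

```python
def drop_table(name):
    # Hypothetical one-argument function standing in for the finalizer above.
    return "DROP TABLE " + name

try:
    # Two arguments to a one-argument function: nothing is rejected until
    # the call is actually executed at runtime.
    drop_table("TimestampTZ", "extra")
except TypeError as err:
    print("runtime error:", err)
```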

dt1 = Date_Time.new 2022 05 04 15 30 zone=(Time_Zone.utc)
dt2 = Date_Time.new 2022 05 04 15 30 zone=(Time_Zone.parse "US/Hawaii")
dt3 = Date_Time.new 2022 05 04 15 30 zone=(Time_Zone.parse "Europe/Warsaw")
@@ -419,7 +419,7 @@ snowflake_specific_spec suite_builder default_connection db_name setup =
table_name = Name_Generator.random_name "Timestamp"
table = default_connection.get.create_table table_name [Column_Description.Value "A" (Value_Type.Date_Time with_timezone=False)] primary_key=[]
Problems.assume_no_problems table
-        Panic.with_finalizer default_connection.get.drop_table table.name <|
+        Panic.with_finalizer (default_connection.get.drop_table table.name) <|
dt1 = Date_Time.new 2022 05 04 15 30
dt2 = Date_Time.new 2022 05 04 15 30 zone=(Time_Zone.utc)
dt3 = Date_Time.new 2022 05 04 15 30 zone=(Time_Zone.parse "US/Hawaii")
@@ -1061,8 +1061,8 @@ add_specs suite_builder setup =
m1.columns.first.name . should_equal "Count Distinct A B"
m1.columns.first.to_vector . should_equal [3]

-        group_builder.specify "should work correctly with Boolean columns" pending=(if prefix.contains "Snowflake" then "TODO: fix https://github.com/enso-org/enso/issues/10611") <|
-            table = table_builder [["A", [True, True, True]], ["B", [False, False, False]], ["C", [True, False, True]], ["D", [Nothing, False, True]]]
+        group_builder.specify "should work correctly with Boolean columns" <|
+            table = table_builder [["A", [True, True, True]], ["B", [False, False, False]], ["C", [True, False, True]], ["D", [Nothing, False, True]], ["E", [1, 1, 2]]]

t_with_nulls = table.aggregate columns=[..Count_Distinct "A", ..Count_Distinct "B", ..Count_Distinct "C", ..Count_Distinct "D"]
m1 = materialize t_with_nulls
@@ -1081,6 +1081,13 @@ add_specs suite_builder setup =
# The NULL is ignored, and not counted towards the total
m2.at "Count Distinct D" . to_vector . should_equal [2]

+            # It should also work if multiple columns are used for the criteria and if the columns are mixed boolean-and-non-boolean
+            t_many_and_mixed = table.aggregate columns=[..Count_Distinct ["A", "B"], ..Count_Distinct ["C", "E"]]
+            m3 = materialize t_many_and_mixed
+            m3.column_count . should_equal 2
+            m3.at "Count Distinct A B" . to_vector . should_equal [1]
+            m3.at "Count Distinct C E" . to_vector . should_equal [3]
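The expected counts in these new assertions can be checked with a quick hand simulation (Python, illustrative only): multi-column `Count_Distinct` takes the distinct count over row tuples, so mixed boolean and numeric columns behave the same way.

```python
# Columns from the test table above.
a = [True, True, True]
b = [False, False, False]
c = [True, False, True]
e = [1, 1, 2]

# Distinct over (A, B) row pairs: only (True, False) occurs.
count_distinct_ab = len(set(zip(a, b)))
# Distinct over (C, E) row pairs: (True, 1), (False, 1), (True, 2).
count_distinct_ce = len(set(zip(c, e)))
print(count_distinct_ab, count_distinct_ce)  # 1 3
```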

suite_builder.group prefix+"Table.aggregate Standard_Deviation" pending=(resolve_pending test_selection.std_dev) group_builder->
group_builder.specify "should correctly handle single elements" <|
r1 = table_builder [["X", [1]]] . aggregate columns=[Standard_Deviation "X" (population=False), Standard_Deviation "X" (population=True)]