Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove duplicate function name in its aliases list #10661

Merged
merged 15 commits into from
May 28, 2024

Conversation

goldmedal
Copy link
Contributor

Which issue does this PR close?

Closes #10658

Rationale for this change

What changes are included in this PR?

Are these changes tested?

yes
Also checked starting the CLI with the DEBUG level.

RUST_LOG=DEBUG ./target/debug/datafusion-cli

Are there any user-facing changes?

No

@github-actions github-actions bot added core Core DataFusion crate optimizer Optimizer rules labels May 25, 2024
@@ -65,7 +65,6 @@ impl OptimizerRule for PushDownLimit {
};

let Limit { skip, fetch, input } = limit;
let input = input;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why the clippy check in CI didn't detect this but it's showed when I ran clippy in my local.

}
}

for function in all_default_aggregate_functions() {
let udaf = new_state.register_udaf(function).unwrap();
if let Some(udaf) = udaf {
assert!(false, "Function {} already registered", udaf.name());
unreachable!("Function {} already registered", udaf.name());
Copy link
Contributor Author

@goldmedal goldmedal May 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clippy suggests that I use panic or unreachable instead of assert. I'm not sure which one is better.

function_factory: None,
};

for function in all_default_functions() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice test -- though it is basically a copy of SessionContext::new

I wonder if we could make a test that is a bit simpler for example by creating a new
https://docs.rs/datafusion/latest/datafusion/execution/registry/struct.MemoryFunctionRegistry.html

Something like (untested)

let registry = MemoryFunctionRegistry::new();
for function in all_default_array_functions() {
  let existing_function = new_state.register_udf(function);
  assert!(existing_function.is_none(), "{} was already registered", function.name()
}
// and similarly for the aggregate and array functins

🤔

@alamb
Copy link
Contributor

alamb commented May 25, 2024

Thank you @goldmedal 🙏

@goldmedal
Copy link
Contributor Author

Thanks, @alamb!

Actually, I tried to add some tests for MemoryFunctionRegistry before I fixed it, but I found everything was fine. I guess the reason is that we don't insert aliases when using MemoryFunctionRegistry#register_udf.

fn register_udf(&mut self, udf: Arc<ScalarUDF>) -> Result<Option<Arc<ScalarUDF>>> {
Ok(self.udfs.insert(udf.name().to_string(), udf))
}

Then, I found that the aliases are only inserted by SessionState in

fn register_udf(&mut self, udf: Arc<ScalarUDF>) -> Result<Option<Arc<ScalarUDF>>> {
udf.aliases().iter().for_each(|alias| {
self.scalar_functions.insert(alias.clone(), udf.clone());
});
Ok(self.scalar_functions.insert(udf.name().into(), udf))
}

That's why I added tests for SessionState.

I'm not sure, but maybe we should also insert aliases in MemoryFunctionRegistry. What do you think?

@alamb
Copy link
Contributor

alamb commented May 26, 2024

I'm not sure, but maybe we should also insert aliases in MemoryFunctionRegistry. What do you think?

I think this would make sense -- though maybe we can do it as a follow on PR

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this find @goldmedal -- very cool. I think we can simplify the test a bit which would be good to consider.

@@ -2860,6 +2863,57 @@ mod tests {
Ok(())
}

#[tokio::test]
async fn test_register_default_functions() -> Result<()> {
let config = SessionConfig::new();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I understand the proposal to test with code like SessionState::new , I think it is problematic because if SessionState::new is changed, then this test will no longer match what it is doing.

I think the core of this test is verifying that there are no duplicate names in the alias lists.

Thus, perhaps we could write tests for all_default_functions, etc like

let mut names = HashSet::new();
for func in all_default_functions() {
  assert(names.insert(func.name()).is_none(), "func.name duplicated")
  for alias in func.aliases() {
    assert(names.insert(alias).is_none(), "alias duplicated")
   }
}

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I'll add this test and remove the test for SessionState::new. Thanks

@goldmedal goldmedal force-pushed the bugfix/10658-udf-multiple-register branch from 340329d to 8c2c828 Compare May 26, 2024 13:03
@github-actions github-actions bot removed the core Core DataFusion crate label May 26, 2024
let mut names = HashSet::new();
for func in all_default_aggregate_functions() {
assert!(
names.insert(func.name().to_string().to_lowercase()),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the function name and alias should be case-insensitive. Therefore, lowercase them here.

@goldmedal
Copy link
Contributor Author

goldmedal commented May 26, 2024

I'm confused about the new median function. I'm not sure why the name is case-sensitive.

Planning query: Plan("Invalid function 'median'.\nDid you mean 'MEDIAN'?\n\n

I removed the duplicate name as I did for other functions in 4086688. However, the planner can't resolve the lowercase function name. I have tried and confirmed that other functions (e.g., first_value and approx_median) are case-insensitive. Only median is case-sensitive.

Do you have any idea? @alamb

I found this function was created in #10644.
Gentle ping @jayzhan211 , do you have any idea about it?
Thanks

@goldmedal
Copy link
Contributor Author

I think I got it now. In #10644, MEDIAN was removed from the AggregationFunction list and reimplemented as a UDAF. It looks like the list of aggregation functions handles some name mapping for case insensitivity.

See:

FirstValue => "FIRST_VALUE",

"first_value" => AggregateFunction::FirstValue,

That's why we don't need to add a lowercase alias for FIRST_VALUE in first_last.rs. It seems FIRST_VALUE has been redesigned as a UDAF but hasn't been removed from the AggregationFunction list. I'm not sure if this is an issue.

I reverted the change for median and removed the name testing for case insensitivity. However, I think adding a lowercase alias for MEDIAN is a workaround to handle the function name case insensitivity issue for UDAF. Maybe we need a formal way to handle this, but I'm not sure.

@jayzhan211
Copy link
Contributor

jayzhan211 commented May 27, 2024

That's why we don't need to add a lowercase alias for FIRST_VALUE in first_last.rs. It seems FIRST_VALUE has been redesigned as a UDAF but hasn't been removed from the AggregationFunction list. I'm not sure if this is an issue.

first_value is not added in alias because the function is not fully converted to UDAF yet, so it will fail some tests. #10648 has alias for lower case

so as approx_median since it is not UDAF yet.

I think adding a lowercase alias for MEDIAN is a workaround to handle the function name case insensitivity issue for UDAF

Why adding a lowercase alias is a workaround for UDAF? Without alias, we will need to check the lowercase in many places.

@goldmedal
Copy link
Contributor Author

first_value is not added in alias because the function is not fully converted to UDAF yet, so it will fail some tests. #10648 has alias for lower case

so as approx_median since it is not UDAF yet.

Thanks for the information! @jayzhan211

I think adding a lowercase alias for MEDIAN is a workaround to handle the function name case insensitivity issue for UDAF

Why adding a lowercase alias is a workaround for UDAF? Without alias, we will need to check the lowercase in many places.

Since I thought we needed to list all case patterns for the function name (e.g., Median, MEDian), I called it a workaround. However, I found I was wrong. After some testing, I discovered something interesting. For DataFusion, if we query a function

SELECT Median(xx) FROM tt

The function call will be transferred to median. Even query it use

SELECT MEDIAN(xx) FROM tt

It will also be changed to median. The uppercase name is only be matched when invoke this function with qutoes

SELECT "MEDIAN"(xx) FROM tt

So, how about setting UDAF names to lowercase (e.g., median, first_value)? This way, we don't need another uppercase alias. What do you think?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tahnk you @goldmedal and @jayzhan211 -- I think this is a really nice improvement. Not only does it fix the bug it also makes the code easier to work with and avoids introducing the same regression again

I took the liberty of merging up from main to resolve a merge conflict

@alamb
Copy link
Contributor

alamb commented May 27, 2024

So, how about setting UDAF names to lowercase (e.g., median, first_value)? This way, we don't need another uppercase alias. What do you think?

This makes sense to me

Shall we file a follow on ticket for it? I don't think we need to fix it in this PR (as this PR doesn't make the situation any better or worse, from what I can tell).

@goldmedal
Copy link
Contributor Author

goldmedal commented May 27, 2024

Shall we file a follow on ticket for it? I don't think we need to fix it in this PR (as this PR doesn't make the situation any better or worse, from what I can tell).

Thanks @alamb, I agree with you. Fixing it in a follow-up ticket is better.

@jayzhan211
Copy link
Contributor

Thanks, @goldmedal and @alam. I think we can merge this

@jayzhan211 jayzhan211 merged commit 98cd19e into apache:main May 28, 2024
23 checks passed
@goldmedal goldmedal deleted the bugfix/10658-udf-multiple-register branch May 28, 2024 12:50
@goldmedal
Copy link
Contributor Author

Thanks again @alamb @jayzhan211

@alamb
Copy link
Contributor

alamb commented May 28, 2024

Shall we file a follow on ticket for it? I don't think we need to fix it in this PR (as this PR doesn't make the situation any better or worse, from what I can tell).

Thanks @alamb, I agree with you. Fixing it in a follow-up ticket is better.

Filed #10695 to track

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* remove duplicate name in function name aliases for array-function

* add test for registering default functions

* rename test case

* add tests for agg and core function and refactor the test

* remove unused mut

* add comments for public function

* cargo fmt

* fix clippy

* fix missing list_element

* fix clippy

* remove duplicate aliase name for median

* add test for function list and remove the test for SessionState

* remove the debug message

* revert the change for medain and remove case insensitive test

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Many DEBUG datafusion_functions_array] Overwrite existing UDF: array_to_string messages in log
3 participants