Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: duplicate declaration of min:timestamp & max:timestamp #631

Merged
merged 1 commit into from
Apr 24, 2024

Conversation

amol-
Copy link
Contributor

@amol- amol- commented Apr 18, 2024

Addresses a duplication of min and max function overloads
for timestamp types.

The functions are declared in arithmetic extensions:

but are also declared in datetime extensions:

This seems to be a source of confusion for a system loading those extensions definition, which one of the two should be considered valid?

The PR addresses this by preserving only the definitions in datetime for those argument types.

@CLAassistant
Copy link

CLAassistant commented Apr 18, 2024

CLA assistant check
All committers have signed the CLA.

Copy link

ACTION NEEDED

Substrait follows the Conventional Commits
specification
for
release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@amol- amol- changed the title fix: Duplicate declaration of min:timestamp,max:timestamp fix: duplicate declaration of min:timestamp,max:timestamp Apr 18, 2024
@vbarua
Copy link
Member

vbarua commented Apr 18, 2024

The duplicated names prevent substrait-java from being updated.

However, there is a question around whether names need to unique across ALL extensions, or just within a single extension file. While these functions were added in error (I think) you could make the case that:

  • functions_arithmetic/min:timestamp,max:timestamp
  • functions_datetime/min:timestamp,max:timestamp

should be treated as different functions.

@amol-
Copy link
Contributor Author

amol- commented Apr 18, 2024

  • functions_arithmetic/min:timestamp,max:timestamp
  • functions_datetime/min:timestamp,max:timestamp

should be treated as different functions.

That might make sense from a certain point of view, but it would make very hard to resolve function invocations.
Suppose you have

  • functions_arithmetic/add:int32_int32
  • functions_integer/add:int32_int32

If you have to parse "3 + 2" to translate it from SQL (for example in Isthmus),
you know the function which is add, you know the types which are two literal int.
But how are you supposed to know which one of the two namespaces is the one to use?

I think that introducing namespaces is not as simple as allowing duplicated declarations and requires a dedicated discussion.

@vbarua
Copy link
Member

vbarua commented Apr 18, 2024

The case I'm think of is more like:

  1. functions_arithmetic/add:int32_int32
  2. custom_functions_for_fancy_engine/add:int32_int32

Which is a name collision with names outside of the core spec, because users can choose to provide their own functions. My engine might provide 2 and disallow 1 because it differs from the Substrait semantics for add somehow.

But how are you supposed to know which one of the two namespaces is the one to use?

Within a Substrait plan, these two functions are distinguishable because we have access to the extension information.

Outside of Substrait, less so. You're right that for something like 1 + 2, whatever is parsing and generating the Substrait plan needs to decide on which add version to use. For Ishmus, we choose to use the standard functions_arithmetic version and provide a (relatively*) explicit mapping from Calcite to Substrait. The choice of which functions to use use is generally up to the plan producer, and will depend on the consumer that they are targeting and what they support.

I think that introducing namespaces is not as simple as allowing duplicated declarations and requires a dedicated discussion.

I agree with this, figuring out duplicating declarations is out of the scope of this PR.

* I say relatively because the mapping doesn't include the name of the extension, yet.

nullability: DECLARED_OUTPUT
decomposable: MANY
intermediate: timestamp_tz?
return: timestamp_tz?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is technically a breaking change. Plans may exist that use this version of the function, especially because it predates the version in functions_datetime.yaml.

API wise, it's probably better to remove these versions because the leaves all the time related min and max functions in functions_datetime.yaml`.

In terms of minimising breakage, it's probably better to remove the definition of these functions in functions_datetime.yaml as those have been around for less time.

I have a preference for removing them from functions_datetime, but it would be good to have other folks weigh in on this.

@westonpace
Copy link
Member

I've added my thoughts around function names and uniqueness in #634

@vbarua
Copy link
Member

vbarua commented Apr 24, 2024

From substrait sync, we've decided to remove the old ones or keep the new ones.

@vbarua vbarua changed the title fix: duplicate declaration of min:timestamp,max:timestamp fix: duplicate declaration of min:timestamp & max:timestamp Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants