Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Sequence comparisons eg Array.Intersection(), Array.jaccard(), String.jaccard(), String.Levenshtein(), etc #6719

Closed
5 tasks done
NickCrews opened this issue Jul 28, 2023 · 7 comments · Fixed by #7119
Labels
feature Features or general enhancements

Comments

@NickCrews
Copy link
Contributor

NickCrews commented Jul 28, 2023

Is your feature request related to a problem?

I am writing a record linkage framework on top of ibis. Part of this is the need to compare two records to determine if they are the same. Examples of this:

  • Take the jaccard similarity of tags, eg ["shoes", "blue"] and ["shoes", "green"]. I can currently do this with some custom functions that use Array.union() and Array.length(), but some backends such as duckdb support jaccard natively so that should be faster. duckdb also supports this with strings as well as arrays, so we could do that here as well. At the least, ibis has Array.union(), so I think for symmetry it would be nice if it had Array.intersection(). Then at the least jaccard() would be trivial for a user, since that is just intersection().length() / union().length() (but still probably a bit less performant than a builtin jaccard).
  • Compute the string edit distance such as levenshtein, damerau-levenshtein, jaro-winkler. Duckdb supports many of these. I could see how supporting ALL of these could quickly get out of hand since there are so many different variations, but curious what you think.

I can look up other backends if the number of supporting ones is important for you to decide if its worth it.

Describe the solution you'd like

Order of desire:

What version of ibis are you running?

main

What backend(s) are you using, if any?

duckdb, some other backends support this but not all.

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NickCrews NickCrews added the feature Features or general enhancements label Jul 28, 2023
@jcrist
Copy link
Member

jcrist commented Jul 28, 2023

Hmmm, I wonder if these would be a good use case for our new UDF support. This isn't finished yet, but there's a future where the following would work:

con  = ibis.duckdb.connect()

# This would query the duckdb engine to automatically determine the signature of `jaccard`,
# so we can error nicely on misuse at expression build time rather than at run time.
# This works in some backends, but not in duckdb yet.
jaccard = con.function("jaccard")

t = con.table(...)
res = jaccard(t.x, t.y)

# We're also thinking about making a general way to define udf functions as expressions alone,
# which would make the expression-level versions backend agnostic. Something like (don't focus
# too much on the spelling).

@ibis.udf.builtin
def jaccard(x: str, y: str) -> float:
    ...  # this isn't eliding details, the function body is intended to be empty here

# If you're familiar with ibis's unbound tables, you might think of these as unbound functions.
# We just assume that they exist in the backend and that the specified schema is correct.

res = jaccard(t.x, t.y)

This might be a nice way to let you expose this functionality without adding a bunch of new methods to the various ibis types. Thoughts?

@NickCrews
Copy link
Contributor Author

NickCrews commented Jul 28, 2023

jaccard = con.function("jaccard")

This requires me to have access to the connection, which is tricky in a framework: The user just gives me inputs as Tables/Expressions, and I don't have access to the backend directly. So I would have to use the private method expr._find_backend(), which feels dirty. If you made this an official API, then I think this would work for me. Could we make it so you register on the entire class of backend, eg jaccard = ibis.duckdb.function("jaccard")? I guess you actually need a physical connection to do the lookup of the signature...

I like the second API, I think that would work! I'm not sure about calling it a UDF though, that gives me the impression that this python function is actually going to get called, like with other UDFs. What about registering it with ibis more directly (even if under the hood it uses the UDF machinery)?

@ibis.register_builtin
def jaccard(x: str, y: str) -> float:
    ...

@jcrist
Copy link
Member

jcrist commented Jul 28, 2023

Agreed that for your use case the second option would be a better fit. I wouldn't focus too much on the naming here (there's lots of ways we could spell this), just wanted to see if conceptually this would solve your use case.

@NickCrews
Copy link
Contributor Author

Correction: jaccard() and levenshtein(), while they appeared to work on arrays, actually don't: duckdb/duckdb#8414

So this request should be dialed back to Array.intersection() and String.jaccard() and String.levenshtein() for now

@cpcloud
Copy link
Member

cpcloud commented Aug 4, 2023

Levenshtein might be a good candidate for a top level API, it's supported in at least 6 backends:

@cpcloud
Copy link
Member

cpcloud commented Aug 6, 2023

Ok, we've added levenshtein and array intersection.

I think Jaccard similarity is uncommon enough across our backends that it is a good candidate for testing the builtin function API as @jcrist suggests.

@cpcloud
Copy link
Member

cpcloud commented Sep 13, 2023

Closed by #7119.

Docs here: https://ibis-project.org/how-to/extending/builtin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants