-
Notifications
You must be signed in to change notification settings - Fork 603
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Sequence comparisons eg Array.Intersection(), Array.jaccard(), String.jaccard(), String.Levenshtein(), etc #6719
Comments
Hmmm, I wonder if these would be a good use case for our new UDF support. This isn't finished yet, but there's a future where the following would work: con = ibis.duckdb.connect()
# This would query the duckdb engine to automatically determine the signature of `jaccard`,
# so we can error nicely on misuse at expression build time rather than at run time.
# This works in some backends, but not in duckdb yet.
jaccard = con.function("jaccard")
t = con.table(...)
res = jaccard(t.x, t.y)
# We're also thinking about making a general way to define udf functions as expressions alone,
# which would make the expression-level versions backend agnostic. Something like (don't focus
# too much on the spelling).
@ibis.udf.builtin
def jaccard(x: str, y: str) -> float:
... # this isn't eliding details, the function body is intended to be empty here
# If you're familiar with ibis's unbound tables, you might think of these as unbound functions.
# We just assume that they exist in the backend and that the specified schema is correct.
res = jaccard(t.x, t.y) This might be a nice way to let you expose this functionality without adding a bunch of new methods to the various ibis types. Thoughts? |
This requires me to have access to the connection, which is tricky in a framework: The user just gives me inputs as Tables/Expressions, and I don't have access to the backend directly. So I would have to use the private method I like the second API, I think that would work! I'm not sure about calling it a UDF though, that gives me the impression that this python function is actually going to get called, like with other UDFs. What about registering it with ibis more directly (even if under the hood it uses the UDF machinery)? @ibis.register_builtin
def jaccard(x: str, y: str) -> float:
... |
Agreed that for your use case the second option would be a better fit. I wouldn't focus too much on the naming here (there's lots of ways we could spell this), just wanted to see if conceptually this would solve your use case. |
Correction: jaccard() and levenshtein(), while they appeared to work on arrays, actually don't: duckdb/duckdb#8414 So this request should be dialed back to Array.intersection() and String.jaccard() and String.levenshtein() for now |
Ok, we've added levenshtein and array intersection. I think Jaccard similarity is uncommon enough across our backends that it is a good candidate for testing the builtin function API as @jcrist suggests. |
Closed by #7119. Docs here: https://ibis-project.org/how-to/extending/builtin |
Is your feature request related to a problem?
I am writing a record linkage framework on top of ibis. Part of this is the need to compare two records to determine if they are the same. Examples of this:
I can look up other backends if the number of supporting ones is important for you to decide if its worth it.
Describe the solution you'd like
Order of desire:
Array.intersection()
: feat(api): addArrayIntersect
operation and correspondingArrayValue.intersect
API #6782Array.jaccard()
String.levenshtein()
: feat(api): addlevenshtein
edit distance API #6773String.jaccard
What version of ibis are you running?
main
What backend(s) are you using, if any?
duckdb, some other backends support this but not all.
Code of Conduct
The text was updated successfully, but these errors were encountered: