JAVA-3118: Add support for vector data type in Schema Builder, QueryBuilder #1931
base: 4.x
I can't get this to compile. Am I doing something wrong?
Is there a separate ticket for vector similarity functions?
@@ -146,6 +146,8 @@ default Select orderBy(@NonNull String columnName, @NonNull ClusteringOrder orde
    return orderBy(CqlIdentifier.fromCql(columnName), order);
  }

  @NonNull
  Select orderBy(@NonNull Ann ann);
I would consider adding here the direction (ASC, DESC) parameter. Currently we do not support DESC vector ordering, but this may be available in future and CQL syntax allows it.
Nice work @SiyaoIsHiding! This is basically what I was expecting to see with this change. We can have a conversation about the comments about the API but otherwise there's just a few things to clean up here.
core/src/main/java/com/datastax/oss/driver/internal/core/type/DefaultVectorType.java
public static Ann annOf(@NonNull String cqlIdentifier, @NonNull CqlVector<Number> vector) {
  return new DefaultAnn(CqlIdentifier.fromCql(cqlIdentifier), vector);
}
These will need to be updated when the PR for JAVA-3143 is merged; the CqlVector constraint won't apply once that's in.
I tested this on C* 5.0.1, and it says ANN only supports float vectors.
cqlsh:default_keyspace> insert INTO vt (key, v ) VALUES ( 1, ['a','b']) ;
cqlsh:default_keyspace> select * from vt order by v ann of ['a', 'c'];
InvalidRequest: Error from server: code=2200 [Invalid query] message="ANN ordering is only supported on float vector indexes"
Nice catch! Yes, you're correct; Apache Cassandra 5.0.x supports vectors of any subtype as a type but the ANN index used there only supports floats.
@@ -146,6 +146,8 @@ default Select orderBy(@NonNull String columnName, @NonNull ClusteringOrder orde
    return orderBy(CqlIdentifier.fromCql(columnName), order);
  }

  @NonNull
  Select orderBy(@NonNull Ann ann);
Maybe I'm missing something but it seems more natural to support something like the following:
Select orderByAnnOf(CqlIdentifier columnId, CqlVector ann);
Select orderByAnnOf(String columnName, CqlVector ann);
Advantage is that with this approach you don't even need to introduce an Ann type... which kinda seems right as that type isn't really doing much for you here.
You could also perhaps add a notion of type checking the specified column to make sure it's a vector type (and to make sure it matches the type of the input CqlVector).
To the point made by @lukasz-antoniak above we could add directionality here (and throw warnings if the user tries to use a DESC order before there's server-side support for it) but I'm not sure it's worth it. There's no mention of ordering in the relevant Cassandra docs so my intuition says to just leave it out for now and add it when it becomes more of a thing.
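To make the orderByAnnOf suggestion above concrete, here is a minimal sketch of how such a method could render the clause without a separate Ann type. AnnSelectSketch is a hypothetical stand-in written only for illustration; it is not the driver's Select or DefaultSelect, and it uses a plain List where the real API would take a CqlVector.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for the query builder's Select; not the driver's actual class.
final class AnnSelectSketch {
  private final String table;
  private final String orderBy; // null until an ORDER BY clause is added

  private AnnSelectSketch(String table, String orderBy) {
    this.table = table;
    this.orderBy = orderBy;
  }

  static AnnSelectSketch selectFrom(String table) {
    return new AnnSelectSketch(table, null);
  }

  // Takes the column and the query vector directly, so no Ann wrapper type is needed.
  // A List stands in for CqlVector here (assumption made to keep the sketch self-contained).
  AnnSelectSketch orderByAnnOf(String columnName, List<? extends Number> vector) {
    String literal =
        vector.stream().map(Object::toString).collect(Collectors.joining(", ", "[", "]"));
    return new AnnSelectSketch(table, columnName + " ANN OF " + literal);
  }

  String asCql() {
    StringBuilder cql = new StringBuilder("SELECT * FROM ").append(table);
    if (orderBy != null) {
      cql.append(" ORDER BY ").append(orderBy);
    }
    return cql.toString();
  }

  public static void main(String[] args) {
    System.out.println(selectFrom("vt").orderByAnnOf("v", List.of(0.1f, 0.2f, 0.3f)).asCql());
  }
}
```

The sketch stays immutable (each call returns a new instance), which matches the general style of the query builder, but the rendering details here are assumptions.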
I also agree to save DESC ordering for later because:
- I find it weird to add a feature that does not work yet and we cannot even test
- Neither Bret nor I found relevant doc saying they may support DESC of vector search. @lukasz-antoniak did you find any? I find it hard to imagine in what cases we want to find the approximate farthest neighbor...?
- If we want to add DESC later, we just need to add another function overload, Select orderByAnnOf(String columnName, CqlVector ann, ClusteringOrder order);. I assume this is not hard.
query-builder/src/test/java/com/datastax/oss/driver/api/querybuilder/schema/AlterTableTest.java
...uilder/src/test/java/com/datastax/oss/driver/api/querybuilder/delete/DeleteSelectorTest.java
...builder/src/test/java/com/datastax/oss/driver/api/querybuilder/insert/RegularInsertTest.java
query-builder/src/test/java/com/datastax/oss/driver/api/querybuilder/schema/AlterTypeTest.java
...y-builder/src/test/java/com/datastax/oss/driver/api/querybuilder/schema/CreateTableTest.java
query-builder/src/test/java/com/datastax/oss/driver/api/querybuilder/schema/CreateTypeTest.java
One other thing worth mentioning: the Cassandra impl also supports a way to get "the similarity calculation of the best scoring node closest to the query data as part of the results". Take a look at the similarity_dot_product() function (and the other choices as well) in the relevant Cassandra docs. The query builder should have support for those as well.
The revapi thing is fixed and the vector similarity function is already supported by the existing Lines 235 to 274 in 19148d5
In terms of the spring-ai downstream, as we won't actually break any API, is there anything we should test or how?
Good call on the similarity_* functions @SiyaoIsHiding!
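For readers following the similarity_* discussion, a rough sketch of the kind of CQL involved may help. This is plain string building in a hypothetical SimilaritySketch class, not the driver's function selector API; the table and column names are taken from the cqlsh example earlier in the thread, and the exact query shape is an assumption.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; the real query builder renders
// function calls through its own selector classes.
final class SimilaritySketch {
  // Renders a vector literal such as [0.1, 0.2].
  static String vectorLiteral(List<? extends Number> vector) {
    return vector.stream().map(Object::toString).collect(Collectors.joining(", ", "[", "]"));
  }

  // Renders a call such as similarity_dot_product(v, [0.1, 0.2]).
  static String similarity(String function, String column, List<? extends Number> query) {
    return function + "(" + column + ", " + vectorLiteral(query) + ")";
  }

  // Combines the similarity score selector with ANN ordering on the same query vector.
  static String annQueryWithScore(
      String table, String column, List<? extends Number> query, int limit) {
    return "SELECT "
        + similarity("similarity_dot_product", column, query)
        + " FROM "
        + table
        + " ORDER BY "
        + column
        + " ANN OF "
        + vectorLiteral(query)
        + " LIMIT "
        + limit;
  }

  public static void main(String[] args) {
    System.out.println(annQueryWithScore("vt", "v", List.of(0.1f, 0.2f), 1));
  }
}
```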
/** Adds the ORDER BY ... ANN OF ... clause */
@NonNull
Select orderByAnnOf(@NonNull CqlIdentifier columnId, @NonNull CqlVector<? extends Number> ann);
I think I'm changing my answer on this: we should remove the type bound here and just make this a CqlVector<?>. My rationale goes as follows.
The fact that ANN comparisons only support float vectors actually isn't a constraint on the underlying Java type used here. In theory any Java type whose codec will generate a serialized value that can be understood as a float when the server receives the message would work here. That means we could pass in a CqlVector of floats, decimals or doubles and still have the server handle that without issue. I'll also note that there's no common supertype for Float, BigDecimal or Double (the corresponding Java types based on the docs) other than Number... and we can't use that here because it includes types other than these three. So there's no meaningful supertype we can use for a type bound at this point.
A second point: the query builder is built around the notion of generating CQL based on whatever the user passes in; there's almost no checking whether types correlate to things the user specified. So, for example, OngoingValues doesn't have any kind of type bounds around its various value() methods. I accept that this isn't exactly the same as the float constraint we're discussing here... but it is an indication that the query builder largely doesn't concern itself with type checking with the expectation that the server will handle that when it receives the query.
Finally, I'll note that avoiding the type bound here at least exposes the idea (at an API level) of using external/custom types which serialize to CQL float values by way of their own codecs. Scala users, for example, might want to support using Spire numerics for their CQL queries. This actually won't work for now since we constrain the codec registry used to the (immutable) default which doesn't include any support for Spire types... but that's an implementation detail. By avoiding type bounds at an API level we've left that door open for such a change without having to make a (potentially breaking) API change and without sacrificing our design principles elsewhere.
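The difference between the two signatures can be shown with a small sketch. Vec and Rational below are hypothetical stand-ins (for CqlVector and for an external numeric type such as a Spire numeric), and rendering through toString() is an assumption; the driver would go through codecs instead.

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.stream.Collectors;

final class TypeBoundSketch {
  // Hypothetical stand-in for CqlVector<T>.
  static final class Vec<T> {
    final List<T> values;

    Vec(List<T> values) {
      this.values = List.copyOf(values);
    }
  }

  // Hypothetical custom numeric type that does NOT extend java.lang.Number.
  static final class Rational {
    final long numerator;
    final long denominator;

    Rational(long numerator, long denominator) {
      this.numerator = numerator;
      this.denominator = denominator;
    }

    @Override
    public String toString() {
      return String.valueOf((double) numerator / denominator);
    }
  }

  private static String render(String column, List<?> values) {
    return "ORDER BY "
        + column
        + " ANN OF "
        + values.stream().map(Object::toString).collect(Collectors.joining(", ", "[", "]"));
  }

  // Bounded signature: only vectors of Number subtypes are accepted.
  static String boundedAnn(String column, Vec<? extends Number> vector) {
    return render(column, vector.values);
  }

  // Unbounded signature: any element type is accepted; validation is left to the server.
  static String unboundedAnn(String column, Vec<?> vector) {
    return render(column, vector.values);
  }

  public static void main(String[] args) {
    System.out.println(boundedAnn("v", new Vec<>(List.of(new BigDecimal("0.5")))));
    System.out.println(unboundedAnn("v", new Vec<>(List.of(new Rational(1, 2)))));
    // boundedAnn("v", new Vec<>(List.of(new Rational(1, 2)))); // would not compile
  }
}
```

With the bound in place, admitting a type like Rational later would require changing the method signature; without it, only the codec handling would need to change.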
We're almost there @SiyaoIsHiding! Let's sort out the type bounds on Select.orderByAnnOf() and then I think this guy is (finally) ready to be merged!
I assume @lukasz-antoniak comment here about adding
I'm not 💯 sure what the accessors on DefaultSelect are intended to do but I suppose it does make sense to keep it consistent and add an accessor for the Ann object. Either way I'm satisfied with where this stands now... 👍 from me!
Currently, the SchemaBuilder works with vector like this:
Or
Please let me know if you want something like .withColumn("v", DataTypes.vector(DataTypes.FLOAT, 3)).
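As a rough illustration of what a vector column definition has to render to, the sketch below builds the CQL type string vector<float, 3> and a CREATE TABLE statement around it. VectorColumnSketch and its methods are hypothetical and written only for this example; they are not the SchemaBuilder API.

```java
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not the driver's SchemaBuilder.
final class VectorColumnSketch {
  // Renders the CQL vector type, e.g. vector<float, 3>.
  static String vectorType(String subtype, int dimensions) {
    if (dimensions <= 0) {
      throw new IllegalArgumentException("dimensions must be positive: " + dimensions);
    }
    return "vector<" + subtype + ", " + dimensions + ">";
  }

  // Renders a CREATE TABLE statement with a single-column primary key.
  static String createTable(
      String table, String keyColumn, String keyType, Map<String, String> otherColumns) {
    String rest =
        otherColumns.entrySet().stream()
            .map(e -> ", " + e.getKey() + " " + e.getValue())
            .collect(Collectors.joining());
    return "CREATE TABLE " + table + " (" + keyColumn + " " + keyType + " PRIMARY KEY" + rest + ")";
  }

  public static void main(String[] args) {
    System.out.println(createTable("vt", "key", "int", Map.of("v", vectorType("float", 3))));
  }
}
```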