JAVA-3118: Add support for vector data type in Schema Builder, QueryBuilder #1931

Open
wants to merge 13 commits into
base: 4.x

SiyaoIsHiding
Contributor

@SiyaoIsHiding SiyaoIsHiding commented May 6, 2024

Currently, the SchemaBuilder works with vector like this:

    assertThat(
            createTable("foo")
                .withPartitionKey("k", DataTypes.INT)
                .withColumn("v", new DefaultVectorType(DataTypes.FLOAT, 3)))
        .hasCql("CREATE TABLE foo (k int PRIMARY KEY,v VECTOR<FLOAT, 3>)");

Or

    assertThat(
            createTable("foo")
                .withPartitionKey("k", DataTypes.INT)
                .withColumn(
                    "v",
                    DataTypes.custom(
                        "org.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.FloatType,3)")))
        .hasCql("CREATE TABLE foo (k int PRIMARY KEY,v VECTOR<FLOAT, 3>)");

Please let me know if you want something like .withColumn("v", DataTypes.vector(DataTypes.FLOAT, 3)).
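
For illustration, here is a minimal self-contained sketch of what such a shortcut would need to produce. `VectorTypeSketch` and its methods are hypothetical names, not driver API, and a plain string stands in for the subtype; the CQL shape is taken from the assertions above.

```java
// Hypothetical stand-in for a vector type descriptor; not part of the driver API.
final class VectorTypeSketch {
  private final String subtypeCql; // e.g. "FLOAT"
  private final int dimensions;

  private VectorTypeSketch(String subtypeCql, int dimensions) {
    this.subtypeCql = subtypeCql;
    this.dimensions = dimensions;
  }

  // Mirrors the shape of the proposed DataTypes.vector(subtype, dimensions) shortcut.
  static VectorTypeSketch vector(String subtypeCql, int dimensions) {
    return new VectorTypeSketch(subtypeCql, dimensions);
  }

  // Renders the type the way the expected CQL in the assertions above spells it.
  String asCql() {
    return "VECTOR<" + subtypeCql + ", " + dimensions + ">";
  }

  public static void main(String[] args) {
    System.out.println(
        "CREATE TABLE foo (k int PRIMARY KEY,v " + vector("FLOAT", 3).asCql() + ")");
  }
}
```

A factory of this shape would keep callers away from both the `DefaultVectorType` constructor and the marshal class-name string.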

@absurdfarce absurdfarce self-requested a review May 29, 2024 17:37
@michaelsembwever
Member

I can't get this to compile:

[ERROR] Failed to execute goal org.revapi:revapi-maven-plugin:0.10.5:check (default) on project java-driver-query-builder: The following API problems caused the build to fail:
[ERROR] java.method.addedToInterface: method com.datastax.oss.driver.api.querybuilder.select.Select com.datastax.oss.driver.api.querybuilder.select.Select::orderBy(com.datastax.oss.driver.api.querybuilder.select.Ann): Method was added to an interface.
[ERROR]

Am I doing something wrong?
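
For context on the error above: revapi reports `java.method.addedToInterface` because a new abstract method on a published interface breaks every implementation that lives outside the project. The sketch below uses hypothetical stand-in types (not the driver's `Select`) to show the mechanism and how a `default` method sidesteps it. How this PR actually resolved the check is not shown in this thread.

```java
final class AddedToInterfaceSketch {
  // Hypothetical stand-in for a published builder interface.
  interface QuerySketch {
    String asCql();

    // Added later as a default method, so existing implementations keep compiling.
    // Declaring it abstract instead would break ThirdPartyQuery below.
    default QuerySketch orderByAnnOf(String column, String vectorLiteral) {
      String base = asCql();
      return () -> base + " ORDER BY " + column + " ANN OF " + vectorLiteral;
    }
  }

  // Written against the original interface; knows nothing about orderByAnnOf.
  static final class ThirdPartyQuery implements QuerySketch {
    @Override
    public String asCql() {
      return "SELECT * FROM vt";
    }
  }

  public static void main(String[] args) {
    System.out.println(new ThirdPartyQuery().orderByAnnOf("v", "[0.1, 0.2]").asCql());
  }
}
```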

@michaelsembwever michaelsembwever self-requested a review June 11, 2024 16:48
@michaelsembwever
Member

Is there a separate ticket for vector similarity functions?
https://cassandra.apache.org/doc/latest/cassandra/developing/cql/functions.html#vector-similarity-functions

    @@ -146,6 +146,8 @@ default Select orderBy(@NonNull String columnName, @NonNull ClusteringOrder orde
        return orderBy(CqlIdentifier.fromCql(columnName), order);
      }

      @NonNull
      Select orderBy(@NonNull Ann ann);
Member

I would consider adding here the direction (ASC, DESC) parameter. Currently we do not support DESC vector ordering, but this may be available in future and CQL syntax allows it.

Contributor

@absurdfarce absurdfarce left a comment

Nice work @SiyaoIsHiding! This is basically what I was expecting to see with this change. We can have a conversation about the comments on the API, but otherwise there are just a few things to clean up here.


    public static Ann annOf(@NonNull String cqlIdentifier, @NonNull CqlVector<Number> vector) {
      return new DefaultAnn(CqlIdentifier.fromCql(cqlIdentifier), vector);
    }
Contributor

These will need to be updated when the PR for JAVA-3143 is merged; the CqlVector constraint won't apply once that's in.

Contributor Author

I tested this on C* 5.0.1, and the server reports that ANN only supports float vectors:

    cqlsh:default_keyspace> insert INTO vt (key, v ) VALUES ( 1, ['a','b']) ;
    cqlsh:default_keyspace> select * from vt order by v ann of ['a', 'c'];
    InvalidRequest: Error from server: code=2200 [Invalid query] message="ANN ordering is only supported on float vector indexes"

Contributor

Nice catch! Yes, you're correct; Apache Cassandra 5.0.x supports vectors of any subtype as a type but the ANN index used there only supports floats.

    @@ -146,6 +146,8 @@ default Select orderBy(@NonNull String columnName, @NonNull ClusteringOrder orde
        return orderBy(CqlIdentifier.fromCql(columnName), order);
      }

      @NonNull
      Select orderBy(@NonNull Ann ann);
Contributor

Maybe I'm missing something but it seems more natural to support something like the following:

    Select orderByAnnOf(CqlIdentifier columnId, CqlVector ann);
    Select orderByAnnOf(String columnName, CqlVector ann);

Advantage is that with this approach you don't even need to introduce an Ann type... which kinda seems right as that type isn't really doing much for you here.

You could also perhaps add a notion of type checking the specified column to make sure it's a vector type (and to make sure it matches the type of the input CqlVector).

To the point made by @lukasz-antoniak above we could add directionality here (and throw warnings if the user tries to use a DESC order before there's server-side support for it) but I'm not sure it's worth it. There's no mention of ordering in the relevant Cassandra docs so my intuition says to just leave it out for now and add it when it becomes more of a thing.
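
To make the suggestion concrete, here is a rough, self-contained sketch of the `String` overload rendering the clause directly from a column name and a vector, with no intermediate `Ann` object. `SelectSketch` is hypothetical and a plain `List<?>` stands in for `CqlVector`; none of this is the driver's implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, immutable stand-in for the query builder's Select.
final class SelectSketch {
  private final String table;
  private final String orderBy; // null until an ordering is set

  private SelectSketch(String table, String orderBy) {
    this.table = table;
    this.orderBy = orderBy;
  }

  static SelectSketch selectFrom(String table) {
    return new SelectSketch(table, null);
  }

  // The column and vector go straight into the clause; no Ann type is needed.
  SelectSketch orderByAnnOf(String columnName, List<?> vector) {
    String literal =
        vector.stream()
            .map(element -> String.valueOf(element))
            .collect(Collectors.joining(", ", "[", "]"));
    return new SelectSketch(table, columnName + " ANN OF " + literal);
  }

  String asCql() {
    return "SELECT * FROM " + table + (orderBy == null ? "" : " ORDER BY " + orderBy);
  }

  public static void main(String[] args) {
    System.out.println(
        selectFrom("vt").orderByAnnOf("v", Arrays.asList(0.1f, 0.15f, 0.3f)).asCql());
  }
}
```

A `CqlIdentifier` overload would follow the same pattern, and a direction parameter could be added later as a further overload without touching these signatures.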

Contributor Author

I also agree that we should save DESC ordering for later because:

  1. I find it weird to add a feature that does not work yet and that we cannot even test.
  2. Neither Bret nor I found any documentation saying that DESC ordering may be supported for vector search. @lukasz-antoniak did you find any? I find it hard to imagine a case where we would want the approximate farthest neighbor.
  3. If we want to add DESC later, we just need to add another overload, Select orderByAnnOf(String columnName, CqlVector ann, ClusteringOrder order);. I assume this is not hard.

@absurdfarce
Contributor

One other thing worth mentioning: the Cassandra impl also supports a way to get "the similarity calculation of the best scoring node closest to the query data as part of the results". Take a look at the similarity_dot_product() function (and the other choices as well) in the relevant Cassandra docs. The query builder should have support for those as well.

@SiyaoIsHiding
Contributor Author

The revapi issue is fixed, and the vector similarity functions are already supported by the existing Function term. I added tests for them as examples:

    public void should_generate_similarity_functions() {
      Select similarity_cosine_clause =
          selectFrom("cycling", "comments_vs")
              .column("comment")
              .function(
                  "similarity_cosine",
                  Selector.column("comment_vector"),
                  literal(CqlVector.newInstance(0.2, 0.15, 0.3, 0.2, 0.05)))
              .orderByAnnOf("comment_vector", CqlVector.newInstance(0.1, 0.15, 0.3, 0.12, 0.05))
              .limit(1);
      assertThat(similarity_cosine_clause)
          .hasCql(
              "SELECT comment,similarity_cosine(comment_vector,[0.2, 0.15, 0.3, 0.2, 0.05]) FROM cycling.comments_vs ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05] LIMIT 1");
      Select similarity_euclidean_clause =
          selectFrom("cycling", "comments_vs")
              .column("comment")
              .function(
                  "similarity_euclidean",
                  Selector.column("comment_vector"),
                  literal(CqlVector.newInstance(0.2, 0.15, 0.3, 0.2, 0.05)))
              .orderByAnnOf("comment_vector", CqlVector.newInstance(0.1, 0.15, 0.3, 0.12, 0.05))
              .limit(1);
      assertThat(similarity_euclidean_clause)
          .hasCql(
              "SELECT comment,similarity_euclidean(comment_vector,[0.2, 0.15, 0.3, 0.2, 0.05]) FROM cycling.comments_vs ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05] LIMIT 1");
      Select similarity_dot_product_clause =
          selectFrom("cycling", "comments_vs")
              .column("comment")
              .function(
                  "similarity_dot_product",
                  Selector.column("comment_vector"),
                  literal(CqlVector.newInstance(0.2, 0.15, 0.3, 0.2, 0.05)))
              .orderByAnnOf("comment_vector", CqlVector.newInstance(0.1, 0.15, 0.3, 0.12, 0.05))
              .limit(1);
      assertThat(similarity_dot_product_clause)
          .hasCql(
              "SELECT comment,similarity_dot_product(comment_vector,[0.2, 0.15, 0.3, 0.2, 0.05]) FROM cycling.comments_vs ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05] LIMIT 1");
    }

As for the spring-ai downstream: since we won't actually break any API, is there anything we should test there, and if so, how?

@absurdfarce
Contributor

Good call on the similarity_* functions @SiyaoIsHiding!


    /** Adds the ORDER BY ... ANN OF ... clause */
    @NonNull
    Select orderByAnnOf(@NonNull CqlIdentifier columnId, @NonNull CqlVector<? extends Number> ann);
Contributor

I think I'm changing my answer on this: we should remove the type bound here and just make this a CqlVector<?>. My rationale goes as follows.

The fact that ANN comparisons only support float vectors actually isn't a constraint on the underlying Java type used here. In theory any Java type whose codec will generate a serialized value that can be understood as a float when the server receives the message would work here. That means we could pass in a CqlVector of floats, decimals or doubles and still have the server handle that without issue. I'll also note that there's no common supertype for Float, BigDecimal or Double (the corresponding Java types based on the docs) other than Number... and we can't use that here because it includes types other than these three. So there's no meaningful supertype we can use for a type bound at this point.

A second point: the query builder is built around the notion of generating CQL based on whatever the user passes in; there's almost no checking whether types correlate to things the user specified. So, for example, OngoingValues doesn't have any kind of type bounds around its various value() methods. I accept that this isn't exactly the same as the float constraint we're discussing here... but it is an indication that the query builder largely doesn't concern itself with type checking, on the expectation that the server will handle that when it receives the query.

Finally, I'll note that avoiding the type bound here at least exposes the idea (at an API level) of using external/custom types which serialize to CQL float values by way of their own codecs. Scala users, for example, might want to support using Spire numerics for their CQL queries. This actually won't work for now since we constrain the codec registry used to the (immutable) default which doesn't include any support for Spire types... but that's an implementation detail. By avoiding type bounds at an API level we've left that door open for such a change without having to make a (potentially breaking) API change and without sacrificing our design principles elsewhere.
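
The difference between the two signatures can be seen in a small self-contained sketch. Here `List` stands in for `CqlVector`, and `Rational` is a made-up custom numeric type playing the role of something like a Spire numeric; both method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class TypeBoundSketch {
  // Made-up numeric type that does not extend java.lang.Number.
  static final class Rational {
    final int numerator;
    final int denominator;

    Rational(int numerator, int denominator) {
      this.numerator = numerator;
      this.denominator = denominator;
    }

    @Override
    public String toString() {
      // A real codec would serialize this; here it is just rendered as a float literal.
      return String.valueOf((float) numerator / denominator);
    }
  }

  // Bounded variant: accepts Integer or Long vectors too, yet rejects Rational.
  static String boundedAnnOf(String column, List<? extends Number> vector) {
    return column + " ANN OF " + vector;
  }

  // Unbounded variant: accepts anything and leaves validation to the server.
  static String unboundedAnnOf(String column, List<?> vector) {
    return column + " ANN OF " + vector;
  }

  public static void main(String[] args) {
    System.out.println(boundedAnnOf("v", Arrays.asList(1, 2)));
    System.out.println(
        unboundedAnnOf("v", Arrays.asList(new Rational(1, 2), new Rational(1, 4))));
    // boundedAnnOf("v", Arrays.asList(new Rational(1, 2))); // does not compile
  }
}
```

The bound is both too wide (integers pass) and too narrow (a float-serializable custom type is rejected), which is the argument for dropping it.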

Contributor

@absurdfarce absurdfarce left a comment

We're almost there @SiyaoIsHiding! Let's sort out the type bounds on Select.orderByAnnOf() and then I think this guy is (finally) ready to be merged!

@SiyaoIsHiding
Contributor Author

I assume @lukasz-antoniak's comment here about adding getAnn() in DefaultSelect should be addressed in this PR. Thank you for the suggestion! It's added.

@absurdfarce
Contributor

I'm not 💯 sure what the accessors on DefaultSelect are intended to do but I suppose it does make sense to keep it consistent and add an accessor for the Ann object.

Either way I'm satisfied with where this stands now... 👍 from me!

@absurdfarce absurdfarce requested review from tolbertam and removed request for lukasz-antoniak October 14, 2024 16:00
absurdfarce added a commit to absurdfarce/cassandra-java-driver that referenced this pull request Oct 15, 2024
5 participants