Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JAVA-3061 Remove CqlVector, represent CQL vector types as Lists #1656

Merged
merged 14 commits into from
Jun 27, 2023

Conversation

absurdfarce
Copy link
Contributor

@absurdfarce absurdfarce commented Jun 22, 2023

While also playing nice with CqlVectorCodec. Also set out to add some Javadoc in order to make usage a bit clearer.

Largely derived from previous work done by @jshook in PR 1641

@absurdfarce
Copy link
Contributor Author

My original comment reviewing what became this change I mentioned that I didn't want to use both the newInstance() methods and the builder in the same impl. After reading comments from the other PR (and in other spots) I realized this wasn't really a sensible position... so I decided to leave both in.

Copy link

@jshook jshook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only have one issue which I think is critical to address, which is the direct values accessor. Please see my comment for details.

* <p>Instances may be created iteratively using {@link CqlVector.Builder} or in a single method
* call via {@link #newInstance(Object[])} or {@link #newInstance(Collection)}.
*/
public class CqlVector<T> implements Iterable<T> {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is still a critical need to have simple access to the underlying values. The iterative accessor is simply not sufficient for our application's needs.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the pros/cons, my understanding is:

  • Iterable access provides a generic approach and could be used where lazy evaluation and large datasets are known.
  • List commonly used interface, easy way to work with collections, random access, specific list operations. This also allows for getting an idea of the size.

Since our full set of vector search use cases are somewhat vague at this point, +1 for having both options.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I attempted to address this by adding an export(Collection) method which dumps the vectors content into a collection type of your choosing. This is significantly more flexible than exporting a single collection type; whatever type we export may have properties (list vs. set vs. queue, immutable vs. immutable, linked lists vs. array lists and so on) which almost certainly will not apply in even a majority of situations. This approach allows API users to create the list in whatever way they see fit (and even perform some initial operations on said list if they choose to do so) and then hand it over to our vector for supplying the data.

It's my hope that this addresses your immediate concern @jshook while also maximizing our flexibility for the many Java driver users who will have use cases with different parameters. If I outright missed the boat here and didn't understand something please let me know.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an ergonomic improvement, and quite flexible. However, it doesn't improve the situation around the heap allocation acrobatics for large (assume 2K nominal) vectors. I'm working up some microbenches to quantify this better. It seems worth it at least for something that, once settled, will be how CqlVector works for a long time.

…SequenceType interface and

make both ListType and VectorType extend from that.  Also converted VectorType to a pure interface
and moved impl to DefaultVectorType to better match driver conventions.
@absurdfarce absurdfarce changed the title JAVA-3061 Improve CqlVector usability JAVA-3061 Remove CqlVector, represent CQL vector types as Lists Jun 23, 2023
@absurdfarce
Copy link
Contributor Author

absurdfarce commented Jun 23, 2023

I spent some time looking at this last night, trying out different ideas like supporting custom codecs for mutable vector types. There just wasn't a clean way to do this which generalized nicely, but fortunately while digging into these ideas I hit upon some comments which inverted my original way of thinking. CqlDuration is the exception here, it isn't the model to be followed... which in turn means we have no particular need to have a custom type for vectors. With that in mind it was much easier to simply represent vectors in Java as a List with elements of some subtype, a change implemented by the most recent set of commits.

The change was quite intrusive, changing the direction of this PR and making it quite a bit bigger than it was. But after this change you now do things like the following:

            ResultSet rs = session.execute("select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1");
            for (Row row : rs) {
                List<Float> v = row.getVector(0, Float.class);
                System.out.println(v)
            }

And this:

            PreparedStatement preparedInsert = session.prepare("insert into test.foo (i, j) values (?,?)");
            List<Float> vector = Lists.newArrayList(1.4f, 2.5f, 3.6f);
            session.execute(preparedInsert.bind(3, vector));
            session.execute(preparedInsert.bind(4).setVector(1, vector, Float.class));
            session.execute(preparedInsert.bind(5).setVector("j", vector, Float.class));
            session.execute(preparedInsert.bind(6).setVector(CqlIdentifier.fromInternal("j"), vector, Float.class));

This avoids all the Sturm und Drang brought on by the prior implementation while also behaving in a way that fits more naturally with the Java drivers re-use of existing Java types wherever possible. Finally, it also more closely mirrors what we do with the Python driver, which isn't necessary but is a sign that we're on a good path.

A lot of this PR now covers driver internals, so really my interest for the reviewers who don't work on the Java driver on a regular basis is to make sure they're okay with the idea of doing away with CqlVector and using a List instead. We can handle the review of these changes from the perspective of the Java driver code base separately.

@jeffbanks
Copy link

@absurdfarce as you stated, the internal changes are less of a concern for the majority of folks on this PR interested in the general use of what was CqlVector. I'm 👍🏼 for this implementation, it seems to address the concerns raised by @jshook. I do like the fact that this aligns with the Python driver implementation.

Copy link

@jeffbanks jeffbanks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm for the cqlvector usage; the driver specific bits, I'm leaving that detail to those more familiar with the internals.

Copy link

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 very nice

@absurdfarce absurdfarce requested a review from jshook June 26, 2023 15:04
Copy link

@jshook jshook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.We'll try it out as soon as we can in a release. It looks like you also added some APIs to help manage collection/container types more cleanly and efficently too, from what little I know of the driver internals. Thanks for taking this to the next level!

@absurdfarce
Copy link
Contributor Author

A quick note on this: @hhughes and I will be walking through these changes (including the changes to the driver innards) in detail later this afternoon. If that goes well this will get merged and the set of other fixes dependent on the outcome of this PR will get addressed.

Copy link
Contributor

@hhughes hhughes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment about edge case checking when encoding vectors and a few inconsistencies in javadocs. LGTM

@@ -423,8 +423,9 @@ default SelfT setCqlDuration(int i, @Nullable CqlDuration v) {
*/
@NonNull
@CheckReturnValue
default SelfT setCqlVector(int i, @Nullable CqlVector<?> v) {
return set(i, v, CqlVector.class);
default <ElementT> SelfT setVector(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* Sets the {@code i}th value to the provided duration.

Should this be vector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should indeed... good catch @hhughes!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -570,10 +570,11 @@ default SelfT setCqlDuration(@NonNull String name, @Nullable CqlDuration v) {
*/
@NonNull
@CheckReturnValue
default SelfT setCqlVector(@NonNull String name, @Nullable CqlVector<?> v) {
default <ElementT> SelfT setVector(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace duration with vector in the javadoc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, this one too!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if (value == null || cqlType.getDimensions() <= 0) {
return null;
}
ByteBuffer[] valueBuffs = new ByteBuffer[cqlType.getDimensions()];
Iterator<SubtypeT> values = value.getValues().iterator();
Iterator<SubtypeT> values = value.iterator();
int allValueBuffsSize = 0;
for (int i = 0; i < cqlType.getDimensions(); ++i) {
ByteBuffer valueBuff = this.subtypeCodec.encode(values.next(), protocolVersion);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ListCodec and MapCodec have some handling around null elements and classcastexceptions, do we want to have the same checks here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original thought was that this wasn't necessary because the codecs were pretty good about not meeting these error cases. Then I went back and looked at the code and realized that recollection was entirely wrong. Added that code in there (as well as check for a few other error conditions) and beefed up the codec test a bit to detect them.

@absurdfarce
Copy link
Contributor Author

Okay, we have buy-in from consumers on the API changes and some level of confidence for the driver internals (hat tip to @hhughes). I'm gonna call this good and merge so that I can start working on the downstream dependencies of this one.

@absurdfarce absurdfarce merged commit 6cf075f into 4.x Jun 27, 2023
@absurdfarce absurdfarce deleted the java3061 branch June 27, 2023 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants