JAVA-3061 CqlVector refinements #1641

jshook · 2023-06-06T18:09:55Z

This PR makes the following refinements to the CqlVector type:

allows simple ctor construction and data access.
moves the values iterator to the base type.
removes unnecessary dependencies. Now only java.util is needed.
preserves immutability of the data.
provides a varargs method of(T... values) distinct from the list-based ctor.
preserves the builder interface

This does provide multiple ctor patterns, but I think this is ok because 1) users will find a simple interface of whatever style they are looking for, and 2) they are all idiomatic and uncomplicated.

This also simplifies dependencies down to just java.util;
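For readers without the diff at hand, the three construction styles listed above could look roughly like the sketch below. `SketchVector` and its members are hypothetical stand-ins written for illustration, not the driver's actual class; the only assumption carried over from the description is that nothing beyond java.util is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the type discussed in this PR; not the driver's class.
final class SketchVector<T> implements Iterable<T> {
  private final List<T> values;

  // List-based constructor: copy first so later changes to the caller's
  // list cannot leak into the vector, then wrap as unmodifiable.
  SketchVector(List<T> values) {
    this.values = Collections.unmodifiableList(new ArrayList<>(values));
  }

  // Varargs factory, distinct from the list-based constructor.
  @SafeVarargs
  static <T> SketchVector<T> of(T... values) {
    return new SketchVector<>(Arrays.asList(values));
  }

  // Builder entry point, kept alongside the simpler styles.
  static <T> Builder<T> builder() {
    return new Builder<>();
  }

  // Read-only access to the values as a List.
  List<T> getValues() {
    return values;
  }

  @Override
  public Iterator<T> iterator() {
    return values.iterator();
  }

  static final class Builder<T> {
    private final List<T> pending = new ArrayList<>();

    Builder<T> add(T value) {
      pending.add(value);
      return this;
    }

    SketchVector<T> build() {
      return new SketchVector<>(pending);
    }
  }
}
```

All three paths end in the same copying constructor, so immutability is enforced in one place regardless of which style the caller picks.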

core/src/main/java/com/datastax/oss/driver/api/core/data/CqlVector.java

jeffbanks · 2023-06-06T20:07:55Z

core/src/main/java/com/datastax/oss/driver/api/core/data/CqlVector.java

-import com.datastax.oss.driver.shaded.guava.common.collect.ImmutableList;
-import com.datastax.oss.driver.shaded.guava.common.collect.Iterators;
-import java.util.Arrays;
+import java.util.*;


nit; consider specific types

MikeYaacoubStax

The logic in all those builder/constructor paths LGTM, so does the string conversion. This is good to go.

One question though, do we want the CqlVector object to be able to house a list of ANY type of data?

jeffbanks

lgtm

ShaunakDas88 · 2023-06-06T20:35:59Z

core/src/main/java/com/datastax/oss/driver/api/core/data/CqlVector.java

@@ -56,36 +62,48 @@ public int hashCode() {

  @Override
  public String toString() {
+    StringBuilder builder = new StringBuilder("CqlVector{");
+    for (T value : values) {
+      builder.append(value).append(", ");


nit: Consider String.join() for the values? This will prevent having to strip-off the trailing ,

I tried this, but the generic parameter broke it, as there is no bound on T which makes it a CharSequence.
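One way around the unbounded `T`, sketched here as an assumption rather than as what the PR ended up doing, is to convert each element with `String.valueOf` and let `Collectors.joining` supply the separator, prefix, and suffix. `JoinSketch` and `describe` are hypothetical names. A side effect is that an empty list needs no special case, whereas stripping a fixed-length trailing separator would cut into the prefix if the list were ever empty.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; shows joining elements of an unbounded generic type.
final class JoinSketch {
  // Each element is converted to a String first, so no CharSequence bound on T is needed.
  static <T> String describe(List<T> values) {
    return values.stream()
        .map(v -> String.valueOf(v))
        .collect(Collectors.joining(", ", "CqlVector{", "}"));
  }
}
```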

jeffbanks

lgtm

jshook · 2023-06-20T18:59:29Z

We've been using this in a branch build for a while. A related tracking issue: nosqlbench/nosqlbench#1325

absurdfarce

In general I think we're on the same page but this needs some refinement to play nicely with the existing Java driver code. I know you guys need this and this has been delayed on our side, so in the interest of expediency I've undertaken an effort to adapt what's here into something that works with the existing driver code; PR for that to follow shortly.

absurdfarce · 2023-06-22T14:31:52Z

core/src/main/java/com/datastax/oss/driver/api/core/data/CqlVector.java


-  private final ImmutableList<T> values;
+  private final List<T> values;


Mentioned in my original review of what became this PR: I don't see any reason for this change. We're giving up informing the type system that the list in question is immutable for no obvious benefit.

Sorry, I wasn't looking in chat for review content after our first convo. I thought it would be here. This is a design choice. Here are the reasons for my suggestion:

I've preferred to avoid dependencies on Guava due to the instability, unnecessary complexity, library bloat, and packaging problems (not strictly backwards compatible, etc.) that have arisen in other projects that depend on it. In this case, informing the type system doesn't really do anything, since the user doesn't see this in the CqlVector contract, and the effective behavior is the same. (The implemented type is effectively immutable.) Depending on this library also puts deeper roots into the Guava type system, which is starting to overlap significantly with more modern Java idioms. Based on having to disentangle Guava from other projects for the reasons above, I've taken a stricter approach for if and when I would rely on it.

But this isn't something that is important enough to block a PR for, so I'll just follow your lead on the actual change.

Are we already using Guava in the driver today?

I'll only mention that my comment about having the immutability of that collection reflected in the type system had to do with code internal to CqlVector rather than consumers of that library. Which leads beautifully into my answer to the question posed by @jbellis :)

The Java driver does use Guava fairly extensively. In the 3.x driver there are a few spots that returned Guava types directly from methods but this practice has been discontinued for the 4.x driver. In 4.x we use a shaded version of Guava (an older one actually) and we only use it internally.

Understood, but the desire to disentangle Guava as a form of incremental simplification still stands for me, not to protect users directly, but more to make the code easier to maintain over time. This item is mostly an academic discussion on this point, but I wanted to be transparent on my reasons for suggesting the change.

absurdfarce · 2023-06-22T14:34:03Z

core/src/main/java/com/datastax/oss/driver/api/core/data/CqlVector.java

  }

-  public Iterable<T> getValues() {
+  /** @return the (immutable) list of values in the vector */
+  public List<T> getValues() {
    return values;
  }


Upon reflection I'm not sure we need a method to just return the contained values here. Supporting Iterable should cover the overwhelming majority of cases; my inclination is to just go with that until we see we need something else.

[EDIT: I improved this comment to take out the unnecessarily strong tone and to focus it on the accessor]

This, I emphatically disagree with. This is the most important aspect of the change we need. Here is a scenario to illustrate:

We need to run an intensive performance test that is logically and operationally repeatable.

We need to run a variation of the performance test which has the simplest possible change, from un-normalized vectors to normalized vectors. The way to do that is, arguably, to simply allow for a CqlVector to be normalized.

We provide a basic function to do that ^, but because we are forced to use an iterator interface (with no size data, even), we must:

create a collection on heap, which will have to be dynamically resized and reallocated each time we exceed its backing capacity, which is incredibly wasteful for larger vectors.

iteratively walk the data and store it into the collection

run our computation against the separate copy of the data to compute the normalization factor

run another scaling computation against those values

~~(and with the forced builder pattern) construct a whole new object, install the values into, and ask it for our new CqlVector~~ The newInstance approach can cover this.

Aspects of our testing mean that we're spending significant cycles on just creating and modifying vectors.

If we can access the vector data as a raw value, we could do this instead:

create an array of values of the correct size, once as a copy of the original data.

scan over it to compute the normalization factor.

Update that array in place with the recalculated component lengths.

(and with the newInstance pattern elsewhere) just create a new instance from the array.
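The two procedures above can be put side by side in a rough sketch. Everything here is illustrative: `NormalizeSketch` and both method names are hypothetical, the element type is assumed to be `Float`, and the input is assumed to be a non-zero vector (a zero vector would produce NaN values).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical comparison of the two access styles; not driver code.
final class NormalizeSketch {

  // Iterable-only access: the size is unknown up front, so the copy
  // starts at default capacity and is reallocated as it grows.
  static List<Float> normalizeViaIterable(Iterable<Float> source) {
    List<Float> copy = new ArrayList<>();
    for (Float v : source) {
      copy.add(v);
    }
    double sumOfSquares = 0.0;
    for (Float v : copy) {
      sumOfSquares += (double) v * v;
    }
    float norm = (float) Math.sqrt(sumOfSquares);
    List<Float> result = new ArrayList<>(copy.size());
    for (Float v : copy) {
      result.add(v / norm);
    }
    return result;
  }

  // List access: the size is known, so one array is allocated once,
  // scanned for the normalization factor, and updated in place.
  static float[] normalizeViaArray(List<Float> source) {
    float[] out = new float[source.size()];
    for (int i = 0; i < out.length; i++) {
      out[i] = source.get(i);
    }
    double sumOfSquares = 0.0;
    for (float v : out) {
      sumOfSquares += (double) v * v;
    }
    float norm = (float) Math.sqrt(sumOfSquares);
    for (int i = 0; i < out.length; i++) {
      out[i] /= norm;
    }
    return out;
  }
}
```

The array version allocates exactly once and never boxes the intermediate values, which is the difference the scenario is pointing at; actual savings would need to be measured.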

As an application developer using the vector search capability for the first time, I would want the datatype to be reasonably forgiving, meaning easy to create and access, and not a hot spot for typical data sizes (1K elements or more). In short, the closer it is to List or Float[], the happier I'll be as a user. I would ask you to consider our usage as a meaningful application study on what others will do. We are, after all, running an application.

In order to overcome the testing impact of the iterative accessor, we've had to create our own branches and build from them. We could jump through some hoops to prove out the impacts, but I thought they were evident enough that we wouldn't need to spend hours or days to justify such a direct and simple improvement.

Understood, this explanation is helpful... thanks @jshook!

FTR in my working PR I've tried to address this by adding a method to export data from a vector to an input collection. If I understand correctly (always a question) this will allow users to do the "could do this instead" section above while still making no assumptions about the type of collection involved or any of its properties (other than the fact that it's, you know, a Collection).

We'll see if that flies.
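The working PR's actual method is not shown in this thread, so the following is only a guess at the shape of an "export into a caller-supplied collection" API. `ExportSketch`, `size`, and `copyInto` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of exporting vector data into a caller-owned collection.
final class ExportSketch<T> {
  private final List<T> values;

  ExportSketch(List<T> values) {
    this.values = Collections.unmodifiableList(new ArrayList<>(values));
  }

  // Exposing the size lets callers presize whatever they pass to copyInto.
  int size() {
    return values.size();
  }

  // Appends every element to the target and returns the same target,
  // making no assumption about it beyond being a Collection.
  <C extends Collection<? super T>> C copyInto(C target) {
    target.addAll(values);
    return target;
  }
}
```

A caller could pass `new ArrayList<Float>(vector.size())` to avoid incremental growth, though this still copies the data once.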

without knowing the details -- maybe?

the core issue is we need to be able to modify it w/o copying it

copies of large collections like vectors really hurt perf

And copying them incrementally with a dynamically sized backing store, one element at a time, is a good way to test your GC, because this is a lot of small allocations.

As elements are added to an ArrayList, its capacity grows automatically. The details of the growth policy are not specified beyond the fact that adding an element has constant amortized time cost.

And the nitty gritty details are almost always "the existing backing memory is copied into a newly allocated region", so you can see how this would create a heap allocation trampoline.

I believe this entire discussion is now rendered moot by virtue of the most recent round of changes to my working PR for JAVA-3061. I won't repeat the discussion in my most recent comment on the topic but the upshot is that CqlVector no longer exists and the Java driver now just represents CQL vectors as Java Lists.

absurdfarce · 2023-06-22T14:43:52Z

core/src/main/java/com/datastax/oss/driver/api/core/data/CqlVector.java

-  public static Builder builder() {
-    return new Builder();
+  public static <T> CqlVector of(T... values) {
+    return new CqlVector(Collections.unmodifiableList(Arrays.asList(values)));


This should be newInstance() in order to match up more closely with the other custom types with impls defined in the driver. CqlDuration is an excellent example here.

sure, that works too
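Whatever the factory ends up being called, one detail of the diff line above may be worth a second look: `Arrays.asList` returns a list backed by the array it is given, and `Collections.unmodifiableList` is only a read-only view, so a caller who passes an existing array and later mutates it would see the change through the vector. The sketch below contrasts that with a copying variant; `FactorySketch`, `wrapOnly`, and `copyThenWrap` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical factories illustrating array aliasing; not driver code.
final class FactorySketch {

  // Mirrors the diff: an unmodifiable view over a list that is itself
  // backed by the caller's array, so writes to the array show through.
  @SafeVarargs
  static <T> List<T> wrapOnly(T... values) {
    return Collections.unmodifiableList(Arrays.asList(values));
  }

  // Copies the elements first, so the result is independent of the caller's array.
  @SafeVarargs
  static <T> List<T> copyThenWrap(T... values) {
    return Collections.unmodifiableList(new ArrayList<>(Arrays.asList(values)));
  }
}
```

When the arguments are passed inline, as in `of(1.0f, 2.0f)`, no one else holds the array and the difference is not observable; it only matters when a caller hands in an array it keeps using.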

absurdfarce · 2023-06-22T14:49:19Z

core/src/main/java/com/datastax/oss/driver/api/core/data/CqlVector.java

+    builder.setLength(builder.length() - ", ".length());
+    builder.append("}");
+    return builder.toString();
+  }


Note that the TypeCodec interface provides methods for converting a specific type to a String and back. Upshot is that whatever is used here has to play nice with what's in CqlVectorCodec.

CqlDurationCodec provides a useful example of the interaction here.
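The constraint being described is a round trip: text produced for a value should be readable back into an equal value. A self-contained sketch of that property follows. The bracketed text form, the `Float` element type, and the names `VectorTextSketch`, `format`, and `parse` are all assumptions made for illustration; the format actually used by CqlVectorCodec is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical text round trip; the real codec's format may differ.
final class VectorTextSketch {

  // Writes values as "[v1, v2, ...]" without a trailing separator.
  static String format(List<Float> values) {
    StringBuilder sb = new StringBuilder("[");
    for (int i = 0; i < values.size(); i++) {
      if (i > 0) {
        sb.append(", ");
      }
      sb.append(values.get(i));
    }
    return sb.append("]").toString();
  }

  // Reads back exactly what format produces.
  static List<Float> parse(String text) {
    String body = text.trim();
    if (!body.startsWith("[") || !body.endsWith("]")) {
      throw new IllegalArgumentException("Expected bracketed vector text: " + text);
    }
    body = body.substring(1, body.length() - 1).trim();
    List<Float> out = new ArrayList<>();
    if (body.isEmpty()) {
      return out;
    }
    for (String part : body.split(",")) {
      out.add(Float.parseFloat(part.trim()));
    }
    return out;
  }
}
```

If `toString()` on the value type and the codec's string conversion disagree, users may paste one into a place that expects the other, which is presumably the interaction this comment is flagging.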

absurdfarce · 2023-06-22T14:54:22Z

As mentioned above in the interest of expediency I put together a PR wrapping the work @jshook already put in here into something that integrates with the existing Java driver code. That PR lives here.

absurdfarce · 2023-06-27T15:54:43Z

Closing this out now that #1656 has been merged. My thanks to everybody involved in this discussion for their efforts in making the Java driver better. Special thanks to @jshook for kicking off the discussion and for all his work to make this change happen!

jshook · 2023-06-27T20:22:37Z

Thanks for the shout out, Bret! I really appreciate the amount of due diligence and care you put into this one, since it will be a long-standing usage pattern for myself and many other users.

refine CqlVector for direct construction, builder and varargs options.

534c209

This also simplifies dependencies down to just java.util;

jshook added 2 commits June 6, 2023 13:17

fix toString and add test coverage over builder

1f0f4f7

reformat to add two missing empty lines to unbreak the build

c62b0f8

jeffbanks reviewed Jun 6, 2023

MikeYaacoubStax approved these changes Jun 6, 2023

jeffbanks approved these changes Jun 6, 2023

ShaunakDas88 reviewed Jun 6, 2023

re-reformatted to appease the fmt-maven-plugin

989299f

absurdfarce changed the title ~~CqlVector refinements~~ JAVA-3061 CqlVector refinements Jun 6, 2023

add revapi exclusion for non-affecting change

b8461e2

msmygit mentioned this pull request Jun 10, 2023

Add support for vector type datastax/dsbulk#475

Merged

msmygit approved these changes Jun 13, 2023

jeffbanks approved these changes Jun 13, 2023

absurdfarce requested changes Jun 22, 2023

absurdfarce mentioned this pull request Jun 22, 2023

JAVA-3061 Remove CqlVector, represent CQL vector types as Lists #1656

Merged

absurdfarce closed this Jun 27, 2023
