Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JAVA-3061 CqlVector refinements #1641

Closed
wants to merge 5 commits into from

Conversation

jshook
Copy link

@jshook jshook commented Jun 6, 2023

This PR makes the following refinements to the CqlVector type:

  • allows simple ctor construction and data access.
  • moves the values iterator to the base type.
  • removes unnecessary dependencies. Now only java.util is needed.
  • preserves immutability of the data.
  • provides a varargs method of(T... values) distinct from the list-based ctor.
  • preserves the builder interface

This does provide multiple ctor patterns, but I think this is ok because 1) users will find a simple interface of whatever style they are looking for, and 2) they are all idomatic and uncomplicated.

This also simplifies dependencies down to just java.util;
import com.datastax.oss.driver.shaded.guava.common.collect.ImmutableList;
import com.datastax.oss.driver.shaded.guava.common.collect.Iterators;
import java.util.Arrays;
import java.util.*;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit; consider specific types

Copy link

@MikeYaacoubStax MikeYaacoubStax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic in all those builder/constructor paths LGTM, so does the string conversion. This is good to go.

One question though, do we want the CqlVector object to be able to house a list of ANY type of data?

Copy link

@jeffbanks jeffbanks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@@ -56,36 +62,48 @@ public int hashCode() {

@Override
public String toString() {
StringBuilder builder = new StringBuilder("CqlVector{");
for (T value : values) {
builder.append(value).append(", ");
Copy link

@ShaunakDas88 ShaunakDas88 Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Consider String.join() for the values? This will prevent having to strip-off the trailing ,

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this, but the generic parameter broke it, as there is no bound on T which makes it a CharSequence.

@absurdfarce absurdfarce changed the title CqlVector refinements JAVA-3061 CqlVector refinements Jun 6, 2023
Copy link

@jeffbanks jeffbanks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jshook
Copy link
Author

jshook commented Jun 20, 2023

We've been using this in a branch build for awhile. A related tracking issue: nosqlbench/nosqlbench#1325

Copy link
Contributor

@absurdfarce absurdfarce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I think we're on the same page but this needs some refinement to play nicely with the existing Java driver code. I now you guys need this and this has been delayed on our side so in the interest of expediency I've undertaken an effort to adapt what's here into something that works with the existing driver code; PR for that to follow shortly.


private final ImmutableList<T> values;
private final List<T> values;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mentioned in my original review of what became this PR: I don't see any reason for this change. We're giving up informing the type system that the list in question is immutable for no obvious benefit.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I wasn't looking in chat for review content after our first convo. I thought it would be here. This is a design choice. Here are the reasons for my suggestion:

I've preferred to avoid dependencies on Guava due to the instability, unneccesary complexity, library bloat, and packaging problems (not strictly backwards compatible, etc) that have arisin in other projects that depend on it. In this case, informing the type system doesn't really do anything, since the users doesn't see this in the CqlVector contract, and the effective behavior is the same. (The implemented type is effectively immutable.) Depending on this library also puts deeper roots into the Guava type system which is starting to overlap significantly with more modern Java idioms. Based on having to disentagle Guava from other projects for the reasons above, I've taken a stricter approach for if and when I would rely on it.

But this isn't something that is important enough to block a PR for, so I'll just follow your lead on the actual change.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we already using Guava in the driver today?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll only mention that my comment about having the immutability of that collection reflected in the type system had to do with code internal to CqlVector rather than consumers of that library. Which leads beautifully into my answer to the question posed by @jbellis :)

The Java driver does use Guava fairly extensively. In the 3.x driver there are a few spots that returned Guava types directly from methods but this practice has been discontinued for the 4.x driver. In 4.x we use a shaded version of Guava (an older one actually) and we only use it internally.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, but the desire to disentangle Guava as a form of incremental simplification still stands for me, not to protect users directly, but more to make the code easier to maintain over time. This item is mostly an academic discussion on this point, but I wanted to be transparent on my reasons for suggesting the change.

}

public Iterable<T> getValues() {
/** @return the (immutable) list of values in the vector */
public List<T> getValues() {
return values;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon reflection I'm not sure we need a method to just return the contained values here. Supporting Iterable should cover the overwhelming majority of cases; my inclination is to just go with that until we see we need something else.

Copy link
Author

@jshook jshook Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[EDIT: I improved this comment to take out the unnecessarily strong tone and to focus it on the accessor]

This, I emphatically disagree with. This is the most important aspect of the change we need. Here is a scenario to illustrate:

  • We need to run an intensive performance test that is logically and operationally repeatable.
  • We need to run a variation of the performance test which has the simplest possible change, from un-normalized vectors to normalized vectors. The way to do that is, arguably, to simply allow for a CqlVector to be normalized.
  • We provide a basic function to do that ^, but because we are forced to use an iterator interface with (with no size data, even), we must:
    • create a collection on heap, which will have to be dynamically resized and reallocated each time we exceed its backing capacity, which is incredibly wasteful for larger vectors.
    • iteratively walk the data and store it into the collection
    • run our computation against the separate copy of the data to compute the normalization factor
    • run another scaling computation against those values
    • (and with the forced builder pattern) construct a whole new object, install the values into, and ask it for our new CqlVector The newInstance approach can cover this.
  • Aspects of our testing mean that we're spending significant cycles on just creating and modifying vectors.

If we can access the vector data as a raw value, we could do this instead:

  • create an array of values of the correct size, once as a copy of the original data.
  • scan over it to compute the normalization factor.
  • Update that array in place with the recalculated component lengths.
  • (and with the newInstance pattern elsewhere) just create a new instance from the array.

As an application developer using the vector search capability for the first time, I would want the datatype to be reasonably forgiving, meaning easy to create and access, not a hot spot for typical data sizes (over 1K or higher elements). In short, the closest thing to List or Float[] you can get, the happier I'll be as a user. I would ask you to consider our usage as a meaningful application study on what others will do. We are, after all, running an application.

In order to overcome the testing impact of the iterative accessor, we've had to create our own branches and build from them. We could jump through some hoops to prove out the impacts, but I thought they were evident enough that we wouldn't need to spend hours or days to justify such a direct and simple improvement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, this explanation is helpful... thanks @jshook!

FTR in my working PR I've tried to address this by adding a method to export data from a vector to an input collection. If I understand correctly (always a question) this will allow users to do the "could to this instead" section above while still making no assumptions about the type of collection involved or any of it's properties (other than the fact that it's, you know, a Collection).

We'll see if that flies.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without knowing the details -- maybe?

the core issue is we need to be able to modify it w/o copying it

copies of large collections like vectors really hurt perf

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And copying them incrementally with a dynamically sized backing store, one element at a time, is a good way to test your GC, because this is a lot of small allocations.

As elements are added to an ArrayList, its capacity grows automatically. The details of the growth policy are not specified beyond the fact that adding an element has constant amortized time cost.

And the nitty gritty details are almost always "the existing backing memory is copied into a newly allocated region", so you can see how this would create a heap allocation trampoline.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this entire discussion is now rendered moot by virtue of the most recent round of changes to my working PR for JAVA-3061. I won't repeat the discussion in my most recent comment on the topic but the upshot is that CqlVector no longer exists and the Java driver now just represents CQL vectors as Java Lists.

public static Builder builder() {
return new Builder();
public static <T> CqlVector of(T... values) {
return new CqlVector(Collections.unmodifiableList(Arrays.asList(values)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be newInstance() in order to match up more closely with the other custom types with impls defined in the driver. CqlDuration is an excellent example here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, that works too

builder.setLength(builder.length() - ", ".length());
builder.append("}");
return builder.toString();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the TypeCodec interface provides methods for converting a specific type to a String and back. Upshot is that whatever is used here has to play nice with what's in CqlVectorCodec.

CqlDurationCodec provides a useful example of the interaction here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noted

@absurdfarce
Copy link
Contributor

As mentioned above in the interest of expediency I put together a PR wrapping the work @jshook already put in here into something that integrates with the existing Java driver code. That PR lives here.

@absurdfarce
Copy link
Contributor

Closing this out now that #1656 has been merged. My thanks to everybody involved in this discussion for their efforts in making the Java driver better. Special thanks to @jshook for kicking off the discussion and for all his work to make this change happen!

@jshook
Copy link
Author

jshook commented Jun 27, 2023

Thanks for the shout out, Bret! I really appreciate the amount of due diligence and care you put into this one, since it will be a long-standing usage pattern for myself and many other users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants