-
Notifications
You must be signed in to change notification settings - Fork 873
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JAVA-3061 CqlVector refinements #1641
Conversation
This also simplifies dependencies down to just java.util;
import com.datastax.oss.driver.shaded.guava.common.collect.ImmutableList; | ||
import com.datastax.oss.driver.shaded.guava.common.collect.Iterators; | ||
import java.util.Arrays; | ||
import java.util.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit; consider specific types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic in all those builder/constructor paths LGTM, so does the string conversion. This is good to go.
One question though, do we want the CqlVector object to be able to house a list of ANY type of data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@@ -56,36 +62,48 @@ public int hashCode() { | |||
|
|||
@Override | |||
public String toString() { | |||
StringBuilder builder = new StringBuilder("CqlVector{"); | |||
for (T value : values) { | |||
builder.append(value).append(", "); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Consider String.join()
for the values
? This will prevent having to strip-off the trailing ,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this, but the generic parameter broke it, as there is no bound on T which makes it a CharSequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
We've been using this in a branch build for awhile. A related tracking issue: nosqlbench/nosqlbench#1325 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I think we're on the same page but this needs some refinement to play nicely with the existing Java driver code. I now you guys need this and this has been delayed on our side so in the interest of expediency I've undertaken an effort to adapt what's here into something that works with the existing driver code; PR for that to follow shortly.
|
||
private final ImmutableList<T> values; | ||
private final List<T> values; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mentioned in my original review of what became this PR: I don't see any reason for this change. We're giving up informing the type system that the list in question is immutable for no obvious benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I wasn't looking in chat for review content after our first convo. I thought it would be here. This is a design choice. Here are the reasons for my suggestion:
I've preferred to avoid dependencies on Guava due to the instability, unneccesary complexity, library bloat, and packaging problems (not strictly backwards compatible, etc) that have arisin in other projects that depend on it. In this case, informing the type system doesn't really do anything, since the users doesn't see this in the CqlVector contract, and the effective behavior is the same. (The implemented type is effectively immutable.) Depending on this library also puts deeper roots into the Guava type system which is starting to overlap significantly with more modern Java idioms. Based on having to disentagle Guava from other projects for the reasons above, I've taken a stricter approach for if and when I would rely on it.
But this isn't something that is important enough to block a PR for, so I'll just follow your lead on the actual change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we already using Guava in the driver today?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll only mention that my comment about having the immutability of that collection reflected in the type system had to do with code internal to CqlVector rather than consumers of that library. Which leads beautifully into my answer to the question posed by @jbellis :)
The Java driver does use Guava fairly extensively. In the 3.x driver there are a few spots that returned Guava types directly from methods but this practice has been discontinued for the 4.x driver. In 4.x we use a shaded version of Guava (an older one actually) and we only use it internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, but the desire to disentangle Guava as a form of incremental simplification still stands for me, not to protect users directly, but more to make the code easier to maintain over time. This item is mostly an academic discussion on this point, but I wanted to be transparent on my reasons for suggesting the change.
} | ||
|
||
public Iterable<T> getValues() { | ||
/** @return the (immutable) list of values in the vector */ | ||
public List<T> getValues() { | ||
return values; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon reflection I'm not sure we need a method to just return the contained values here. Supporting Iterable should cover the overwhelming majority of cases; my inclination is to just go with that until we see we need something else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[EDIT: I improved this comment to take out the unnecessarily strong tone and to focus it on the accessor]
This, I emphatically disagree with. This is the most important aspect of the change we need. Here is a scenario to illustrate:
- We need to run an intensive performance test that is logically and operationally repeatable.
- We need to run a variation of the performance test which has the simplest possible change, from un-normalized vectors to normalized vectors. The way to do that is, arguably, to simply allow for a CqlVector to be normalized.
- We provide a basic function to do that ^, but because we are forced to use an iterator interface with (with no size data, even), we must:
- create a collection on heap, which will have to be dynamically resized and reallocated each time we exceed its backing capacity, which is incredibly wasteful for larger vectors.
- iteratively walk the data and store it into the collection
- run our computation against the separate copy of the data to compute the normalization factor
- run another scaling computation against those values
(and with the forced builder pattern) construct a whole new object, install the values into, and ask it for our new CqlVectorThe newInstance approach can cover this.
- Aspects of our testing mean that we're spending significant cycles on just creating and modifying vectors.
If we can access the vector data as a raw value, we could do this instead:
- create an array of values of the correct size, once as a copy of the original data.
- scan over it to compute the normalization factor.
- Update that array in place with the recalculated component lengths.
- (and with the newInstance pattern elsewhere) just create a new instance from the array.
As an application developer using the vector search capability for the first time, I would want the datatype to be reasonably forgiving, meaning easy to create and access, not a hot spot for typical data sizes (over 1K or higher elements). In short, the closest thing to List or Float[] you can get, the happier I'll be as a user. I would ask you to consider our usage as a meaningful application study on what others will do. We are, after all, running an application.
In order to overcome the testing impact of the iterative accessor, we've had to create our own branches and build from them. We could jump through some hoops to prove out the impacts, but I thought they were evident enough that we wouldn't need to spend hours or days to justify such a direct and simple improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, this explanation is helpful... thanks @jshook!
FTR in my working PR I've tried to address this by adding a method to export data from a vector to an input collection. If I understand correctly (always a question) this will allow users to do the "could to this instead" section above while still making no assumptions about the type of collection involved or any of it's properties (other than the fact that it's, you know, a Collection).
We'll see if that flies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without knowing the details -- maybe?
the core issue is we need to be able to modify it w/o copying it
copies of large collections like vectors really hurt perf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And copying them incrementally with a dynamically sized backing store, one element at a time, is a good way to test your GC, because this is a lot of small allocations.
As elements are added to an ArrayList, its capacity grows automatically. The details of the growth policy are not specified beyond the fact that adding an element has constant amortized time cost.
And the nitty gritty details are almost always "the existing backing memory is copied into a newly allocated region", so you can see how this would create a heap allocation trampoline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this entire discussion is now rendered moot by virtue of the most recent round of changes to my working PR for JAVA-3061. I won't repeat the discussion in my most recent comment on the topic but the upshot is that CqlVector no longer exists and the Java driver now just represents CQL vectors as Java Lists.
public static Builder builder() { | ||
return new Builder(); | ||
public static <T> CqlVector of(T... values) { | ||
return new CqlVector(Collections.unmodifiableList(Arrays.asList(values))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be newInstance() in order to match up more closely with the other custom types with impls defined in the driver. CqlDuration is an excellent example here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, that works too
builder.setLength(builder.length() - ", ".length()); | ||
builder.append("}"); | ||
return builder.toString(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the TypeCodec interface provides methods for converting a specific type to a String and back. Upshot is that whatever is used here has to play nice with what's in CqlVectorCodec.
CqlDurationCodec provides a useful example of the interaction here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noted
Thanks for the shout out, Bret! I really appreciate the amount of due diligence and care you put into this one, since it will be a long-standing usage pattern for myself and many other users. |
This PR makes the following refinements to the CqlVector type:
of(T... values)
distinct from the list-based ctor.This does provide multiple ctor patterns, but I think this is ok because 1) users will find a simple interface of whatever style they are looking for, and 2) they are all idomatic and uncomplicated.