Add support for vector type #475

absurdfarce · 2023-06-08T22:24:28Z

Given the following table:

cqlsh> describe test.foo;

CREATE TABLE test.foo (
    i int PRIMARY KEY,
    j vector<float, 3>
);

CREATE CUSTOM INDEX ann_index ON test.foo (j) USING 'StorageAttachedIndex';
cqlsh> select * from test.foo;

 i | j
---+---

(0 rows)

the changes in this PR enable the following:

$ bin/dsbulk unload -k test -t foo 2> /dev/null
...
total | failed | rows/s | p50ms | p99ms | p999ms
    0 |      0 |      0 |  0.00 |  0.00 |   0.00
...
$ cat ../vector_test_data.csv 
i,j
1,"[8, 2.3, 58]"
2,"[1.2, 3.4, 5.6]"
5,"[23, 18, 3.9]"

$ bin/dsbulk load -url "./../vector_test_data.csv" -k test -t foo
...
total | failed | rows/s | p50ms | p99ms | p999ms | batches
    3 |      0 |     22 |  5.10 |  6.91 |   6.91 |    1.00
...
$ bin/dsbulk unload -k test -t foo
...
i,j
5,"[23.0, 18.0, 3.9]"
2,"[1.2, 3.4, 5.6]"
1,"[8.0, 2.3, 58.0]"
total | failed | rows/s | p50ms | p99ms | p999ms
    3 |      0 |     16 |  2.25 |  2.97 |   2.97
...
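The load/unload round trip above can be sketched as a parse step and a format step. This is a minimal illustration in Python, not the dsbulk implementation (which is Java and relies on the driver's vector support); `parse_vector` and `format_vector` are hypothetical names, and the dimension check is an assumption based on the column type `vector<float, 3>`.

```python
from typing import List


def parse_vector(text: str, dimension: int) -> List[float]:
    """Parse a CSV field such as "[8, 2.3, 58]" into a list of floats.

    Hypothetical helper: rejects input that is not bracketed or whose
    element count differs from the declared vector dimension.
    """
    stripped = text.strip()
    if not (stripped.startswith("[") and stripped.endswith("]")):
        raise ValueError(f"not a vector literal: {text!r}")
    inner = stripped[1:-1].strip()
    values = [float(part) for part in inner.split(",")] if inner else []
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    return values


def format_vector(values: List[float]) -> str:
    """Render floats the way the unload output above shows them."""
    return "[" + ", ".join(repr(v) for v in values) + "]"


loaded = parse_vector("[23, 18, 3.9]", 3)
unloaded = format_vector(loaded)
```

Integer-looking inputs such as `23` come back with a trailing `.0`, which is consistent with the difference between the loaded CSV and the unloaded rows.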

It also adds support for the new syntax to the built-in minimal CQL parser, which allows for this kind of operation:

$ bin/dsbulk unload -query "select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1"
...
j
"[1.2, 3.4, 5.6]"
total | failed | rows/s | p50ms | p99ms | p999ms
    1 |      0 |      7 |  8.21 |  8.22 |   8.22
...

Data on the server side matches up to what we'd expect:

cqlsh> select * from test.foo;

 i | j
---+-----------------
 5 |   [23, 18, 3.9]
 1 |    [8, 2.3, 58]
 2 | [1.2, 3.4, 5.6]

(3 rows)
cqlsh> select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1;

 j
-----------------
 [1.2, 3.4, 5.6]

(1 rows)
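As a sanity check on the `ann of` result, the three rows can be ranked by brute force. This sketch assumes cosine similarity; the PR does not state which similarity function the index uses, and a real ANN index returns approximate rather than exact neighbours. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query: Sequence[float], rows: List[List[float]], limit: int = 1) -> List[List[float]]:
    """Exact nearest neighbours by cosine similarity (assumed metric)."""
    ranked = sorted(rows, key=lambda v: cosine_similarity(query, v), reverse=True)
    return ranked[:limit]


# The three vectors loaded into test.foo above.
rows = [[23.0, 18.0, 3.9], [8.0, 2.3, 58.0], [1.2, 3.4, 5.6]]
result = nearest([3.4, 7.8, 9.1], rows, limit=1)
```

For this data set the same row also wins under Euclidean distance, so the check does not depend strongly on the assumed metric.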

msmygit · 2023-06-10T15:51:42Z

@absurdfarce does this cover the JSON-based converter too? Courtesy: @eolivelli

msmygit · 2023-06-10T15:53:34Z

pom.xml

@@ -59,7 +59,7 @@
    make sure the resulting binary tarball contains only
    required jars, and that no jar has an offending license.
    -->
-    <driver.version>4.14.1</driver.version>
+    <driver.version>4.16.0</driver.version>


[Optional] It may be better to incorporate these Java driver enhancements

It is definitely required to use 4.16.0, and I would need 4.16.1 with this patch:
apache/cassandra-java-driver#1643

Re: question from @msmygit ... we can move to a Java driver that includes the changes in the PR you mention but they certainly aren't essential for what we're doing here. It would be a bit nicer to not have to create lots of builder objects for every vector we parse from input but that's a relatively small cost to pay; the JVM is very good at creating (and reaping) lots of small objects. :)

Re: question from @eolivelli .... I agree that 4.16.0 is absolutely required for this change. The constraint on anything beyond 4.16.0 refers to another PR along similar lines. This PR uses 4.16.0 as it stands.

absurdfarce · 2023-06-10T16:23:57Z

Good question @msmygit. If I were to guess I would say it does not support any JSON-specific functionality. I'm not sure if those represent different code paths or not; I'll dig in some more and check on it.

Thanks for the suggestion! This is exactly the kind of feedback I was hoping for from this PR!

eolivelli · 2023-06-12T07:09:33Z

In the Cassandra Pulsar Sink we need JSON capabilities.
I am running some tests, and I will provide feedback.

eolivelli · 2023-06-12T10:55:29Z

I added support for the JSON stuff here
#476

eolivelli · 2023-06-12T10:55:57Z

@absurdfarce Travis CI is dead,
if you want I can help setting up GH Actions

absurdfarce · 2023-06-14T06:05:09Z

Implementation of JSON support added here. This is largely based on work done by @eolivelli here, although I wound up making a number of additional changes as well.
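On the JSON path a vector would presumably arrive as an array node rather than a bracketed string. A minimal sketch of that conversion, assuming a record shape the PR does not show; `vector_from_json` and the sample record are hypothetical, and the actual codecs live in dsbulk's Java code.

```python
import json
from typing import Any, List


def vector_from_json(node: Any, dimension: int) -> List[float]:
    """Convert a decoded JSON array into a fixed-size list of floats."""
    if not isinstance(node, list):
        raise TypeError("vector must be a JSON array")
    if len(node) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(node)}")
    values = []
    for item in node:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(item, bool) or not isinstance(item, (int, float)):
            raise TypeError(f"non-numeric vector element: {item!r}")
        values.append(float(item))
    return values


# Hypothetical JSON record mirroring a row of test.foo.
record = json.loads('{"i": 2, "j": [1.2, 3.4, 5.6]}')
vector = vector_from_json(record["j"], 3)
```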

msmygit · 2023-06-14T12:42:11Z

@absurdfarce Travis CI is dead, if you want I can help setting up GH Actions

we should definitely do this.

absurdfarce · 2023-06-14T14:43:13Z

TravisCI is still a going concern for a number of projects. I'm not sure why it isn't working for dsbulk but it is very definitely still used by DataStax projects. We also have the Jenkins build, which is really canonical for most cases. We can have a discussion at some point in the future about replacing TravisCI with GH Actions but we have the Jenkins build to handle the immediate need... so I'd definitely say this is future work.

absurdfarce · 2023-06-15T04:51:33Z

The recent merge of 1.x into this branch included a fix to get Jenkins working again. The result of that build is indeed very green, although it looks like by default Jenkins largely runs unit tests and a few integration tests. I'll try a re-run that includes all integration tests. It appears that this build will take several hours to complete.

absurdfarce · 2023-06-15T21:03:24Z

I'm seeing repeated failures when running the integration tests, not with the tests themselves but with odd download errors when grabbing dependencies. At least this run made it through all the actual test runs, and the absence of a failure of any kind there leaves me pretty confident that we're good on the integration tests front.

weideng1

After the latest commit 5f23551, all Jenkins tests are passing, so I'm going to approve this PR and merge it.

Will work on generating a new release shortly after.

absurdfarce added 4 commits June 8, 2023 16:16

Java driver version upgrade (to get vector support)

98877c1

Initial impl of vector support for loading

ffb338e

Formatting changes

6e052c4

Add vector support to minimal internal CQL parser

908212f

absurdfarce linked an issue Jun 8, 2023 that may be closed by this pull request

dsbulk compat with vector type #474

Closed

absurdfarce mentioned this pull request Jun 8, 2023

Upgrade DataStax Java Driver to 4.15.0 #472

Closed

msmygit reviewed Jun 10, 2023

eolivelli mentioned this pull request Jun 12, 2023

Add JSON Codec for Vector type #476

Closed

Adding support for JSON codecs. This work is significantly inspired b…

7874ae7

…y work by @eolivelli (#476)

absurdfarce mentioned this pull request Jun 14, 2023

Automatically register the codecs for CqlVectorType in CachingCodecRegistry apache/cassandra-java-driver#1643

Closed

msmygit approved these changes Jun 14, 2023

Merge branch '1.x' into vector_support

0aa9fae

eolivelli mentioned this pull request Jun 16, 2023

Support CQL Vector type (upgrade Core Driver to 4.16.0, DSBulk to 1.10.0 and fork DSBulk Text Codec datastax/messaging-connectors-commons#16

Merged

weideng1 added 2 commits June 21, 2023 06:45

workaround for maven connection reset problem

24462e0

use newer maven version to workaround WAGON-545

5f23551

weideng1 self-requested a review June 22, 2023 02:55

weideng1 approved these changes Jun 22, 2023

weideng1 merged commit 903ab58 into 1.x Jun 22, 2023

weideng1 deleted the vector_support branch June 22, 2023 03:07

absurdfarce linked an issue Jul 13, 2023 that may be closed by this pull request

Add support for loading/unloading vector type data #481

Closed

absurdfarce mentioned this pull request Jul 13, 2023

Update vector support to work done in JAVA-3061, add JSON codecs #480

Merged

absurdfarce added this to the 1.11.0 milestone Jul 13, 2023
