Add support for vector type #475

Merged
merged 8 commits into from
Jun 22, 2023


Collaborator

@absurdfarce absurdfarce commented Jun 8, 2023

Given the following table:

cqlsh> describe test.foo;

CREATE TABLE test.foo (
    i int PRIMARY KEY,
    j vector<float, 3>
);

CREATE CUSTOM INDEX ann_index ON test.foo (j) USING 'StorageAttachedIndex';
cqlsh> select * from test.foo;

 i | j
---+---

(0 rows)

the changes in this PR enable the following:

$ bin/dsbulk unload -k test -t foo 2> /dev/null
...
total | failed | rows/s | p50ms | p99ms | p999ms
    0 |      0 |      0 |  0.00 |  0.00 |   0.00
...
$ cat ../vector_test_data.csv 
i,j
1,"[8, 2.3, 58]"
2,"[1.2, 3.4, 5.6]"
5,"[23, 18, 3.9]"

$ bin/dsbulk load -url "./../vector_test_data.csv" -k test -t foo
...
total | failed | rows/s | p50ms | p99ms | p999ms | batches
    3 |      0 |     22 |  5.10 |  6.91 |   6.91 |    1.00
...
$ bin/dsbulk unload -k test -t foo
...
i,j
5,"[23.0, 18.0, 3.9]"
2,"[1.2, 3.4, 5.6]"
1,"[8.0, 2.3, 58.0]"
total | failed | rows/s | p50ms | p99ms | p999ms
    3 |      0 |     16 |  2.25 |  2.97 |   2.97
...
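
For readers who want to sanity-check the round trip above without a cluster, here is a minimal Python sketch of the text-to-vector conversion the CSV output implies. It is not the dsbulk codec: `parse_vector`, `format_vector`, and `load_rows` are hypothetical helper names, and the fixed dimension of 3 is taken from the `vector<float, 3>` column in the table definition.

```python
import csv
import io
import json

# Same rows as vector_test_data.csv in the description above.
CSV_TEXT = '''i,j
1,"[8, 2.3, 58]"
2,"[1.2, 3.4, 5.6]"
5,"[23, 18, 3.9]"
'''

DIMENSION = 3  # assumption: matches vector<float, 3> in test.foo


def parse_vector(text, dimension=DIMENSION):
    """Parse a bracketed list such as "[8, 2.3, 58]" into a list of floats."""
    values = json.loads(text)
    if not isinstance(values, list) or len(values) != dimension:
        raise ValueError(f"expected a list of {dimension} numbers, got {text!r}")
    return [float(v) for v in values]


def format_vector(values):
    """Render floats the way the unload output shows them, e.g. 8 -> 8.0."""
    return "[" + ", ".join(repr(v) for v in values) + "]"


def load_rows(csv_text):
    """Map each primary key i to its parsed vector j."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row["i"]): parse_vector(row["j"]) for row in reader}


rows = load_rows(CSV_TEXT)
for key in sorted(rows):
    print(f'{key},"{format_vector(rows[key])}"')
```

The point of the sketch is only that integer-looking inputs such as `8` come back as `8.0` once they have passed through a float type, which is the difference visible between the loaded file and the unloaded rows.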

It also adds support for the new syntax to the built-in minimal CQL parser, which allows for this kind of operation:

$ bin/dsbulk unload -query "select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1"
...
j
"[1.2, 3.4, 5.6]"
total | failed | rows/s | p50ms | p99ms | p999ms
    1 |      0 |      7 |  8.21 |  8.22 |   8.22
...

Data on the server side matches up to what we'd expect:

cqlsh> select * from test.foo;

 i | j
---+-----------------
 5 |   [23, 18, 3.9]
 1 |    [8, 2.3, 58]
 2 | [1.2, 3.4, 5.6]

(3 rows)
cqlsh> select j from test.foo where j ann of [3.4, 7.8, 9.1] limit 1;

 j
-----------------
 [1.2, 3.4, 5.6]

(1 rows)
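
As a cross-check on the `ann of` result above, here is a brute-force nearest-neighbor sketch in Python over the three loaded rows. It assumes cosine similarity as the ranking function, which this PR does not state, and it is an exact scan rather than the approximate search the index performs. The names `cosine_similarity` and `nearest` are hypothetical.

```python
import math

# The three rows shown in the cqlsh output above, keyed by i.
ROWS = {
    5: [23.0, 18.0, 3.9],
    1: [8.0, 2.3, 58.0],
    2: [1.2, 3.4, 5.6],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(rows, query, limit=1):
    """Return the `limit` (key, vector) pairs most similar to `query`."""
    ranked = sorted(
        rows.items(),
        key=lambda item: cosine_similarity(item[1], query),
        reverse=True,
    )
    return ranked[:limit]


# Same query vector as the `ann of [3.4, 7.8, 9.1] limit 1` statement.
print(nearest(ROWS, [3.4, 7.8, 9.1], limit=1))
```

With this data the exact scan picks row 2, `[1.2, 3.4, 5.6]`, which agrees with what both dsbulk and cqlsh return. Ranking by Euclidean distance gives the same row here, so the comparison does not depend on which of those two measures the index uses.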

@absurdfarce absurdfarce linked an issue Jun 8, 2023 that may be closed by this pull request
@msmygit
Collaborator

msmygit commented Jun 10, 2023

@absurdfarce does this cover the JSON-based converter too? Courtesy: @eolivelli

@@ -59,7 +59,7 @@
 make sure the resulting binary tarball contains only
 required jars, and that no jar has an offending license.
 -->
-<driver.version>4.14.1</driver.version>
+<driver.version>4.16.0</driver.version>
Collaborator


[Optional] It may be better to incorporate these Java Driver enhancements


It is definitely required to use 4.16.0, and I would need to have 4.16.1 with this patch:
apache/cassandra-java-driver#1643

Collaborator Author

@absurdfarce absurdfarce Jun 14, 2023


Re: question from @msmygit ... we can move to a Java driver that includes the changes in the PR you mention but they certainly aren't essential for what we're doing here. It would be a bit nicer to not have to create lots of builder objects for every vector we parse from input but that's a relatively small cost to pay; the JVM is very good at creating (and reaping) lots of small objects. :)

Re: question from @eolivelli .... I agree that 4.16.0 is absolutely required for this change. The constraint on anything beyond 4.16.0 refers to another PR along similar lines. This PR uses 4.16.0 as it stands.

@absurdfarce
Collaborator Author

Good question @msmygit. If I were to guess I would say it does not support any JSON-specific functionality. I'm not sure if those represent different code paths or not; I'll dig in some more and check on it.

Thanks for the suggestion! This is exactly the kind of feedback I was hoping for from this PR!

@eolivelli

In the Cassandra Pulsar Sink we need JSON capabilities.
I am running some tests, and I will provide feedback.

@eolivelli

I added support for the JSON stuff here
#476

@eolivelli

@absurdfarce Travis CI is dead,
if you want I can help setting up GH Actions

@absurdfarce
Collaborator Author

Implementation of JSON support added here. This is largely based on work done by @eolivelli here, although I wound up making a number of additional changes as well.

@msmygit
Collaborator

msmygit commented Jun 14, 2023

> @absurdfarce Travis CI is dead, if you want I can help setting up GH Actions

We should definitely do this.

@absurdfarce
Collaborator Author

TravisCI is still a going concern for a number of projects. I'm not sure why it isn't working for dsbulk but it is very definitely still used by DataStax projects. We also have the Jenkins build, which is really canonical for most cases. We can have a discussion at some point in the future about replacing TravisCI with GH Actions but we have the Jenkins build to handle the immediate need... so I'd definitely say this is future work.

@absurdfarce
Collaborator Author

The recent merge of 1.x into this branch included a fix to get Jenkins working again. The result of that build is indeed very green, although it looks like by default Jenkins largely runs unit tests and a few integration tests. I'll try a re-run that includes all integration tests. It appears that this build will take several hours to complete.

@absurdfarce
Collaborator Author

I'm seeing repeated failures when running the integration tests, not with the tests themselves but with odd download errors when grabbing dependencies. At least this run made it through all the actual test runs, and the absence of a failure of any kind there leaves me pretty confident that we're good on the integration tests front.

@weideng1 weideng1 self-requested a review June 22, 2023 02:55
Collaborator

@weideng1 weideng1 left a comment


After the latest commit 5f23551, all Jenkins tests are passing, so I'm going to approve this PR and merge it.

Will work on generating a new release shortly after.

@weideng1 weideng1 merged commit 903ab58 into 1.x Jun 22, 2023
@weideng1 weideng1 deleted the vector_support branch June 22, 2023 03:07
@absurdfarce absurdfarce linked an issue Jul 13, 2023 that may be closed by this pull request
Successfully merging this pull request may close these issues.

Add support for loading/unloading vector type data
dsbulk compat with vector type