-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cassandra 5.0 vector type CREATE/INSERT support #1020
Conversation
See the following report for details: cargo semver-checks output
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I see, looks good, thank you for contribution!
However, what I see in this PR is support for parsing schema containing Vector type. What is completely missing is serialization and deserialization support (the latter, you said, you leave for a follow-up). See #1014 (comment) for hints.
cassandra1: | ||
image: cassandra | ||
image: cassandra:5.0-beta1 | ||
healthcheck: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unsure if we are OK with testing only with newest beta cassandra instead of stable. WDYT @Lorak-mmk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use this in addition to stable, but not instead of stable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved cassandra-5.0-beta1 into a separate compose file.
9e4d642
to
ff6ce03
Compare
scylla/src/transport/topology.rs
Outdated
/// as per <https://cassandra.apache.org/doc/latest/cassandra/reference/vector-data-type.html> | ||
/// vectors are limited to a size of 8192 | ||
size: u16, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I can see that this is the size limit of a Vector, but what interests me is how Cassandra represents Vector in the system tables. It surely does not use unsigned int, because AFAIK unsigned numbers are not supported by neither ScyllaDB nor Cassandra.
The idea of PreCqlType
in topology.rs
is to represent the raw type definition read from system tables. Only later can we convert it to more convenient representation, involving unsigned ints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it makes more sense to store size in the same type as Cassandra does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The java driver uses an int, so I'll do the same: https://github.com/apache/cassandra-java-driver/blob/85bb4065098b887d2dda26eb14423ce4fc687045/core/src/main/java/com/datastax/oss/driver/api/core/type/DataTypes.java#L77
I believe the cassandra code also uses an int internally: https://github.com/apache/cassandra/blob/7f1c0e9e76a0bdfc23ef6a655ec7774bc0f77e18/src/java/org/apache/cassandra/db/marshal/VectorType.java#L49
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I adjusted the code to use an i32 to match the above findings.
I also renamed the size
to dimensions
as that is a more accurate name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unresolving this comment because we talked about that with @wprzytula
I'm not aware of any case were size of vector is stored in some table (as a cql type). IIUC we always parse type names from strings - so the argument about system tables doesn't apply.
Is it possible to create vector with negative length? I'd be very surprised if it was possible.
If it is not we can and should use unsigned integer for the size (and simplify the parsing function thanks to that).
Should it be u16? I'm not sure. If documentation says that size is limited to 8192 then yes, but I can't find this information is the link in the comment. Could you link the source?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The limit of 2^13 = 8192
is taken from https://cassandra.apache.org/doc/latest/cassandra/reference/vector-data-type.html
However it can be observed that the documentation is incorrect.
From functionally testing, the highest value accepted when creating a table is 2**31 -1 = 2147483647
.
The reason would be that its the highest value representable by a signed 32 bit integer
From functional testing it can also be seen that negative numbers and 0
are rejected.
Do you want me to swap the type from an i32 to a u32?
a0906fd
to
4be5e19
Compare
/// * The first character is not a digit or '-' | ||
/// * The the integer is larger than i32 | ||
pub(crate) fn parse_i32(self) -> ParseResult<(i32, Self)> { | ||
let (digits, p) = self.take_while(|c| c.is_ascii_digit() || c == '-'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has a minor flaw of accepting strings such as 00-21-37
, but such strings will be rejected below anyway, so no problem here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution! It looks good, I left a few minor comments (and responded to a past discussion about a type of vector size).
I believe that the PR title and description are a bit imprecise.
The PR does not add CREATE or INSERT support for Vector type. CREATE is and always was possible, so was INSERT with hardcoded values.
The PR enables the driver to fetch and represent metadata containing Vector type, which is required to fully connect to cluster (because if the driver can't fetch metadata then it also won't fetch peer list).
scylla/src/transport/session_test.rs
Outdated
async fn test_vector_type() { | ||
setup_tracing(); | ||
let session = create_new_session_builder().build().await.unwrap(); | ||
let ks = unique_keyspace_name(); | ||
|
||
session.query(format!("CREATE KEYSPACE IF NOT EXISTS {} WITH REPLICATION = {{'class' : 'NetworkTopologyStrategy', 'replication_factor' : 1}}", ks), &[]).await.unwrap(); | ||
session | ||
.query( | ||
format!( | ||
"CREATE TABLE IF NOT EXISTS {}.t (a int PRIMARY KEY, b vector<int, 4>, c vector<text, 2>)", | ||
ks | ||
), | ||
&[], | ||
) | ||
.await | ||
.unwrap(); | ||
|
||
session | ||
.query( | ||
format!( | ||
"INSERT INTO {}.t (a, b, c) VALUES (1, [1, 2, 3, 4], ['foo', 'bar'])", | ||
ks | ||
), | ||
&[], | ||
) | ||
.await | ||
.unwrap(); | ||
|
||
let prepared_statement = session | ||
.prepare(format!( | ||
"INSERT INTO {}.t (a, b, c) VALUES (?, [11, 12, 13, 14], ['afoo', 'abar'])", | ||
ks | ||
)) | ||
.await | ||
.unwrap(); | ||
session.execute(&prepared_statement, &(2,)).await.unwrap(); | ||
let metadata = session.get_cluster_data(); | ||
let columns = &metadata.keyspaces[&ks].tables["t"].columns; | ||
assert_eq!( | ||
columns["b"].type_, | ||
CqlType::Vector { | ||
type_: Box::new(CqlType::Native(NativeType::Int)), | ||
dimensions: 4, | ||
}, | ||
); | ||
assert_eq!( | ||
columns["c"].type_, | ||
CqlType::Vector { | ||
type_: Box::new(CqlType::Native(NativeType::Text)), | ||
dimensions: 2, | ||
}, | ||
); | ||
|
||
// TODO: Implement and test SELECT statements and bind values (`?`) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test should test 1 thing, this tests 3:
- unprepared insert query with vector
- prepared insert query with vector
- cluster metadata containing vector type
It should be split into 3 separate tests.
Also I don't think this is the best place for this test, I don't see tests for other types in this file. @wprzytula where do you think those tests should be placed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have split up the tests as requested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I don't think this is the best place for this test, I don't see tests for other types in this file. @wprzytula where do you think those tests should be placed?
I think cql_types_test.rs
suits this case best.
scylla/src/transport/topology.rs
Outdated
/// as per <https://cassandra.apache.org/doc/latest/cassandra/reference/vector-data-type.html> | ||
/// vectors are limited to a size of 8192 | ||
size: u16, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unresolving this comment because we talked about that with @wprzytula
I'm not aware of any case were size of vector is stored in some table (as a cql type). IIUC we always parse type names from strings - so the argument about system tables doesn't apply.
Is it possible to create vector with negative length? I'd be very surprised if it was possible.
If it is not we can and should use unsigned integer for the size (and simplify the parsing function thanks to that).
Should it be u16? I'm not sure. If documentation says that size is limited to 8192 then yes, but I can't find this information is the link in the comment. Could you link the source?
.github/workflows/cassandra.yml
Outdated
|
||
# TODO: delete the below and move RUSTFLAGS into the above test run when cassandra 5.0.0 releases. | ||
- name: Setup 3-node Cassandra cluster | ||
run: | | ||
docker compose -f test/cluster/cassandra5/docker-compose.yml up -d --wait | ||
- name: Run tests on cassandra 5 beta | ||
run: | | ||
CDC='disabled' RUSTFLAGS="--cfg cassandra_tests" RUST_LOG=trace SCYLLA_URI=172.42.0.2:9042 SCYLLA_URI2=172.42.0.3:9042 SCYLLA_URI3=172.42.0.4:9042 cargo test --verbose --features "full-serialization" -- --skip test_views_in_schema_info --skip test_large_batch_statements | ||
- name: Stop the cluster | ||
if: ${{ always() }} | ||
run: docker compose -f test/cluster/cassandra5/docker-compose.yml stop | ||
- name: Print the cluster logs | ||
if: ${{ always() }} | ||
run: docker compose -f test/cluster/cassandra5/docker-compose.yml logs | ||
- name: Remove cluster | ||
run: docker compose -f test/cluster/cassandra5/docker-compose.yml down |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be done sequentially - first we perform tests on normal cassandra, then on beta.
Cassandra tests are already longest part of our CI. Maybe it would be better to create a separate workflow for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have moved it into a separate workflow as requested.
I'm not quite following what you want. |
Hi, any news on this one? |
Sorry for the delay, both @Lorak-mmk and I have been on vacation. Now let @Lorak-mmk take a fresh look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR looks much better now. I am however reluctant to merge it before Cassandra version with Vector support is officially released. Mostly because:
- implementation could change before release
- cql specification seems incomplete w.r.t vectors. They are mentioned in section 5.2 of protocol v5 docs, so we know their serialized form. I don't however see anything about them in section 4.2.5.2 which describes prepared metadata.
Regarding the above: I totally forgot that prepared metadata is a place where we parse the type not from string but from binary form, sorry about that :(
This is also a reason why I favor a bit larger PRs - this would come up when trying to add support for selecting vectors using a prepared statement.
@wprzytula because spec is incomplete, and type used for size depends on it, I think we should wait until C* releases the feature. wdyt?
Agreed. |
cassandra 5 is released, any interest in landing this? |
oh right, theres some TODO's in here for when cassandra 5 releases. I'll address those |
8ffcf60
to
57b1511
Compare
The TODO's have been addressed: |
oh right, a recent rust release has disallowed that kind of feature check, I'll look into alternatives |
c22e512
to
49e68e7
Compare
I applied the same fix as #1049 and CI is green now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left one minor comment - after you fix it we can merge I think, and fix potential issue when introducing ser/deser
c70a946
to
1b6c244
Compare
makes progress towards: scylladb#1014 The vector type is introduced by the currently in beta cassandra 5. See: https://cassandra.apache.org/doc/latest/cassandra/reference/vector-data-type.html Scylla does not support vector types and so the tests are setup to only compile/run with a new cassandra_tests config. This commit does not add support for retrieving the data via a SELECT. That was omitted to reduce scope and will be implemented in follow up work.
1b6c244
to
2685ab9
Compare
makes progress towards: #1014
The vector type is introduced by the currently in beta cassandra 5.
See: https://cassandra.apache.org/doc/latest/cassandra/reference/vector-data-type.html
Scylla does not support vector types and so the tests are setup to only compile/run with a new cassandra_tests config.
This commit does not add support for retrieving the data via a SELECT.
That was omitted to reduce scope and will be implemented in follow up work.
I have split off support for CREATE/INSERT since I value the early feedback of small PRs.
I didnt realize that @pkolaczk had shown interest in implementing this until after I'd went to submit my PR.
Hopefully I've not stepped on any toes.
The cassandra docker-compose.yaml is changed to 5.0 beta cassandra, if this is undesirable I can introduce a second docker-compose.yaml specifically for 5.0
Confirmed that the test is running in CI
Pre-review checklist
./docs/source/
.Fixes:
annotations to PR description.