Parsing vector data from JSON fails for "floats" with too many digits (aka doubles) #484

Open
hemidactylus opened this issue Sep 28, 2023 · 1 comment · May be fixed by #496

@hemidactylus

When ingesting VECTOR<FLOAT, n> data from JSON, dsbulk (v1.11) fails for "floats" that are represented with too many digits. They end up being parsed as doubles, which then seems to cause unrecoverable problems.

Notes:

  1. JSON files produced by dsbulk itself are fine, i.e. their floats are proper floats (a low number of digits).
  2. But for people loading datasets generated elsewhere (e.g. in Python, which has no clear float/double distinction), this limitation can get in the way (see the short sketch after this list).
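
To illustrate the failure mode, here is a small Jackson-based sketch (Jackson is what appears in the stack trace further down; this is illustration only, not dsbulk code). With default settings the JSON numbers are parsed as double-backed nodes either way; the difference is only whether the decimal value survives narrowing to float.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class VectorPrecisionDemo {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    JsonNode good = mapper.readTree("[6.64632, 4.49715]");
    JsonNode bad  = mapper.readTree("[6.646329843, 4.4971533213]");

    // Both arrays are read as double-backed nodes; only the high-precision
    // values lose information when narrowed to float.
    System.out.println(good.get(0).floatValue());          // ~6.64632 (fits in a float)
    System.out.println((float) bad.get(0).doubleValue());  // ~6.64633 (precision lost)
  }
}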

Minimal reproducible case

create table mini_table (id text primary key, embedding vector<float, 2>);
java -jar dsbulk-1.11.0.jar load -k $KEYSPACE -t mini_table -u "token" -p $TOKEN -b $BUNDLEZIP --dsbulk.connector.json.mode SINGLE_DOCUMENT --connector.json.url GOOD_OR_BAD.json -c json
$> cat good.json 
[
 {
  "id": "my_row",
  "embedding": [
   6.64632,
   4.49715
  ]
 }
]

$> cat bad.json 
[
 {
  "id": "my_row",
  "embedding": [
   6.646329843,
   4.4971533213
  ]
 }
]
@absurdfarce
Collaborator

I couldn't reproduce this, at least not with JSON inputs.

$ cat ../vector_test_data_json_tooprecise/one.json 
{
    "i":1,
    "j":[6.646329843, 4.4971533213, 58]
}
$ bin/dsbulk load -url "./../vector_test_data_json_tooprecise" -k test -t bar -c json
Operation directory: /work/git/dsbulk/dist_test/dsbulk-1.11.0/logs/LOAD_20240626-210637-895657
At least 1 record does not match the provided schema.mapping or schema.query. Please check that the connector configuration and the schema configuration are correct.
total | failed | rows/s | p50ms | p99ms | p999ms | batches
    3 |      1 |     16 |  4.62 |  5.93 |   5.93 |    1.00
Operation LOAD_20240626-210637-895657 completed with 1 errors in less than one second.
$ cat logs/LOAD_20240626-210637-895657/mapping-errors.log 
Resource: file:/work/git/dsbulk/dist_test/vector_test_data_json_tooprecise/one.json
Position: 1
Source: {"i":1,"j":[6.646329843,4.4971533213,58]}
com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException: Could not map field j to variable j; conversion from Java type com.fasterxml.jackson.databind.JsonNode to CQL type Vector(FLOAT, 3) failed for raw value: [6.646329843,4.4971533213,58].
        at com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException.encodeFailed(InvalidMappingException.java:90)
        at com.datastax.oss.dsbulk.workflow.commons.schema.DefaultRecordMapper.bindColumn(DefaultRecordMapper.java:182)
        at com.datastax.oss.dsbulk.workflow.commons.schema.DefaultRecordMapper.bindStatement(DefaultRecordMapper.java:158)
        at com.datastax.oss.dsbulk.workflow.commons.schema.DefaultRecordMapper.map(DefaultRecordMapper.java:127)
        at java.lang.Thread.run(Thread.java:750) [19 skipped]
Caused by: java.lang.ArithmeticException: Cannot convert 6.646329843 from BigDecimal to Float
        at com.datastax.oss.dsbulk.codecs.api.util.CodecUtils.conversionFailed(CodecUtils.java:610)
        at com.datastax.oss.dsbulk.codecs.api.util.CodecUtils.toFloatValueExact(CodecUtils.java:537)
        at com.datastax.oss.dsbulk.codecs.api.util.CodecUtils.convertNumber(CodecUtils.java:333)
        at com.datastax.oss.dsbulk.codecs.api.util.CodecUtils.narrowNumber(CodecUtils.java:191)
        at com.datastax.oss.dsbulk.codecs.text.json.JsonNodeToNumberCodec.narrowNumber(JsonNodeToNumberCodec.java:84)
        at com.datastax.oss.dsbulk.codecs.text.json.JsonNodeToFloatCodec.externalToInternal(JsonNodeToFloatCodec.java:78)
        at com.datastax.oss.dsbulk.codecs.text.json.JsonNodeToFloatCodec.externalToInternal(JsonNodeToFloatCodec.java:34)
        at com.datastax.oss.dsbulk.codecs.text.json.JsonNodeToVectorCodec.lambda$externalToInternal$0(JsonNodeToVectorCodec.java:50)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.ArrayList$Itr.forEachRemaining(ArrayList.java:901)

This is consistent with the code; the JSON-to-vector codec already leverages dsbulk's converting codecs when reading its input, and those codecs already perform overflow checks.
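
For reference, a minimal sketch (with made-up names, not the actual CodecUtils implementation) of the kind of exact-narrowing check that accepts 6.64632 but rejects 6.646329843:

import java.math.BigDecimal;

public class ExactNarrowingSketch {

  // Illustrative only: accept a BigDecimal only if its decimal representation
  // survives a round trip through float; otherwise fail the way CodecUtils does
  // in the log above. This is a sketch, not the real dsbulk code.
  static float toFloatValueExactSketch(BigDecimal value) {
    float narrowed = value.floatValue();
    if (new BigDecimal(Float.toString(narrowed)).compareTo(value) != 0) {
      throw new ArithmeticException(
          "Cannot convert " + value + " from BigDecimal to Float");
    }
    return narrowed;
  }

  public static void main(String[] args) {
    System.out.println(toFloatValueExactSketch(new BigDecimal("6.64632")));     // accepted
    System.out.println(toFloatValueExactSketch(new BigDecimal("6.646329843"))); // throws
  }
}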

It was a different story on the string side, however. There we were re-using CqlVector.from() to handle strings, which obviously doesn't allow for the insertion of additional (possibly more rigorous) policies. To support something more rigorous, a version of this logic was moved into the dsbulk codecs. This solves the problem, but it also makes more sense logically; dsbulk should be in charge of the formats it's willing to accept rather than relying on CqlVector to define that for it.
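
Roughly, the shape of that string-side change (hypothetical names and a deliberately naive parser, just to illustrate the design point that dsbulk now owns the accepted format):

import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

public class StringVectorParsingSketch {

  // Naive sketch: parse a "[x, y, ...]" literal ourselves instead of delegating to
  // CqlVector.from(), so each element can be routed through the same (stricter)
  // conversion policy used on the JSON side. Not the actual dsbulk codec.
  static List<Float> parseFloatVector(String raw) {
    String trimmed = raw.trim();
    if (!trimmed.startsWith("[") || !trimmed.endsWith("]")) {
      throw new IllegalArgumentException("Not a vector literal: " + raw);
    }
    List<Float> result = new ArrayList<>();
    for (String element : trimmed.substring(1, trimmed.length() - 1).split(",")) {
      BigDecimal value = new BigDecimal(element.trim());
      float narrowed = value.floatValue();
      // Same exactness policy as above: dsbulk, not CqlVector, decides what to accept.
      if (new BigDecimal(Float.toString(narrowed)).compareTo(value) != 0) {
        throw new ArithmeticException("Cannot convert " + value + " from BigDecimal to Float");
      }
      result.add(narrowed);
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(parseFloatVector("[6.64632, 4.49715]"));          // accepted
    System.out.println(parseFloatVector("[6.646329843, 4.4971533213]")); // throws
  }
}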
