Parsing vector data from JSON fails for "floats" with too many digits (aka doubles) #484

Open
hemidactylus opened this issue Sep 28, 2023 · 1 comment · May be fixed by #496

@hemidactylus

When ingesting VECTOR<FLOAT, n> data from JSON, dsbulk (v1.11) fails for "floats" that are represented with too many digits. They end up being parsed as doubles, which then seems to cause unrecoverable problems.

Notes:

  1. JSON files produced by dsbulk itself are fine, i.e. their floats are proper floats (a low number of digits).
  2. But for people loading datasets generated elsewhere (e.g. in Python, which has no clear float/double distinction), this limitation can get in the way (see the short sketch after this list).
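
To illustrate the failure mode, here is a small Jackson-based sketch (Jackson is what appears in the stack trace further down; this is illustration only, not dsbulk code). With default settings the JSON numbers are parsed as double-backed nodes either way; the difference is only whether the decimal value survives narrowing to float.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class VectorPrecisionDemo {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    JsonNode good = mapper.readTree("[6.64632, 4.49715]");
    JsonNode bad  = mapper.readTree("[6.646329843, 4.4971533213]");

    // Both arrays are read as double-backed nodes; only the high-precision
    // values lose information when narrowed to float.
    System.out.println(good.get(0).floatValue());          // ~6.64632 (fits in a float)
    System.out.println((float) bad.get(0).doubleValue());  // ~6.64633 (precision lost)
  }
}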

Minimal reproducible case

create table mini_table (id text primary key, embedding vector<float, 2>);
java -jar dsbulk-1.11.0.jar load -k $KEYSPACE -t mini_table -u "token" -p $TOKEN -b $BUNDLEZIP --dsbulk.connector.json.mode SINGLE_DOCUMENT --connector.json.url GOOD_OR_BAD.json -c json
$> cat good.json 
[
 {
  "id": "my_row",
  "embedding": [
   6.64632,
   4.49715
  ]
 }
]

$> cat bad.json 
[
 {
  "id": "my_row",
  "embedding": [
   6.646329843,
   4.4971533213
  ]
 }
]
@absurdfarce
Collaborator

I couldn't reproduce this, at least not with JSON inputs.

$ cat ../vector_test_data_json_tooprecise/one.json 
{
    "i":1,
    "j":[6.646329843, 4.4971533213, 58]
}
$ bin/dsbulk load -url "./../vector_test_data_json_tooprecise" -k test -t bar -c json
Operation directory: /work/git/dsbulk/dist_test/dsbulk-1.11.0/logs/LOAD_20240626-210637-895657
At least 1 record does not match the provided schema.mapping or schema.query. Please check that the connector configuration and the schema configuration are correct.
total | failed | rows/s | p50ms | p99ms | p999ms | batches
    3 |      1 |     16 |  4.62 |  5.93 |   5.93 |    1.00
Operation LOAD_20240626-210637-895657 completed with 1 errors in less than one second.
$ cat logs/LOAD_20240626-210637-895657/mapping-errors.log 
Resource: file:/work/git/dsbulk/dist_test/vector_test_data_json_tooprecise/one.json
Position: 1
Source: {"i":1,"j":[6.646329843,4.4971533213,58]}
com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException: Could not map field j to variable j; conversion from Java type com.fasterxml.jackson.databind.JsonNode to CQL type Vector(FLOAT, 3) failed for raw value: [6.646329843,4.4971533213,58].
        at com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException.encodeFailed(InvalidMappingException.java:90)
        at com.datastax.oss.dsbulk.workflow.commons.schema.DefaultRecordMapper.bindColumn(DefaultRecordMapper.java:182)
        at com.datastax.oss.dsbulk.workflow.commons.schema.DefaultRecordMapper.bindStatement(DefaultRecordMapper.java:158)
        at com.datastax.oss.dsbulk.workflow.commons.schema.DefaultRecordMapper.map(DefaultRecordMapper.java:127)
        at java.lang.Thread.run(Thread.java:750) [19 skipped]
Caused by: java.lang.ArithmeticException: Cannot convert 6.646329843 from BigDecimal to Float
        at com.datastax.oss.dsbulk.codecs.api.util.CodecUtils.conversionFailed(CodecUtils.java:610)
        at com.datastax.oss.dsbulk.codecs.api.util.CodecUtils.toFloatValueExact(CodecUtils.java:537)
        at com.datastax.oss.dsbulk.codecs.api.util.CodecUtils.convertNumber(CodecUtils.java:333)
        at com.datastax.oss.dsbulk.codecs.api.util.CodecUtils.narrowNumber(CodecUtils.java:191)
        at com.datastax.oss.dsbulk.codecs.text.json.JsonNodeToNumberCodec.narrowNumber(JsonNodeToNumberCodec.java:84)
        at com.datastax.oss.dsbulk.codecs.text.json.JsonNodeToFloatCodec.externalToInternal(JsonNodeToFloatCodec.java:78)
        at com.datastax.oss.dsbulk.codecs.text.json.JsonNodeToFloatCodec.externalToInternal(JsonNodeToFloatCodec.java:34)
        at com.datastax.oss.dsbulk.codecs.text.json.JsonNodeToVectorCodec.lambda$externalToInternal$0(JsonNodeToVectorCodec.java:50)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.ArrayList$Itr.forEachRemaining(ArrayList.java:901)

This is consistent with the code; the JSON-to-vector codec already leverages dsbulk's converting codecs when reading its input, and those codecs already perform overflow checks.
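
For reference, a minimal sketch (with made-up names, not the actual CodecUtils implementation) of the kind of exact-narrowing check that accepts 6.64632 but rejects 6.646329843:

import java.math.BigDecimal;

public class ExactNarrowingSketch {

  // Illustrative only: accept a BigDecimal only if its decimal representation
  // survives a round trip through float; otherwise fail the way CodecUtils does
  // in the log above. This is a sketch, not the real dsbulk code.
  static float toFloatValueExactSketch(BigDecimal value) {
    float narrowed = value.floatValue();
    if (new BigDecimal(Float.toString(narrowed)).compareTo(value) != 0) {
      throw new ArithmeticException(
          "Cannot convert " + value + " from BigDecimal to Float");
    }
    return narrowed;
  }

  public static void main(String[] args) {
    System.out.println(toFloatValueExactSketch(new BigDecimal("6.64632")));     // accepted
    System.out.println(toFloatValueExactSketch(new BigDecimal("6.646329843"))); // throws
  }
}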

It was a different story on the string side, however. There we were re-using CqlVector.from() to handle strings, which obviously doesn't allow for the insertion of additional (possibly more rigorous) policies. To support something more rigorous, a version of this logic was moved into the dsbulk codecs. This solves the problem, but it also makes more sense logically; dsbulk should be in charge of the formats it's willing to accept rather than relying on CqlVector to define that for it.
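
Roughly, the shape of that string-side change (hypothetical names and a deliberately naive parser, just to illustrate the design point that dsbulk now owns the accepted format):

import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

public class StringVectorParsingSketch {

  // Naive sketch: parse a "[x, y, ...]" literal ourselves instead of delegating to
  // CqlVector.from(), so each element can be routed through the same (stricter)
  // conversion policy used on the JSON side. Not the actual dsbulk codec.
  static List<Float> parseFloatVector(String raw) {
    String trimmed = raw.trim();
    if (!trimmed.startsWith("[") || !trimmed.endsWith("]")) {
      throw new IllegalArgumentException("Not a vector literal: " + raw);
    }
    List<Float> result = new ArrayList<>();
    for (String element : trimmed.substring(1, trimmed.length() - 1).split(",")) {
      BigDecimal value = new BigDecimal(element.trim());
      float narrowed = value.floatValue();
      // Same exactness policy as above: dsbulk, not CqlVector, decides what to accept.
      if (new BigDecimal(Float.toString(narrowed)).compareTo(value) != 0) {
        throw new ArithmeticException("Cannot convert " + value + " from BigDecimal to Float");
      }
      result.add(narrowed);
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(parseFloatVector("[6.64632, 4.49715]"));          // accepted
    System.out.println(parseFloatVector("[6.646329843, 4.4971533213]")); // throws
  }
}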
