
Ensure only valid unicode binaries are extracted from datatypes #616

Closed

Conversation

Vorticity-Flux

This patch ensures that only valid Unicode strings are passed to Solr.

Use case:

A CRDT map containing a mix of valid Unicode and non-Unicode binaries is being indexed by yokozuna.
Only fields that contain valid Unicode are added to the Solr schema.

Current behaviour:

Indexing fails in mochijson2 due to invalid UTF8.

Expected behaviour:

Non-indexable (non-unicode) fields should not prevent indexing of other fields that contain valid unicode.
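The idea of the patch can be sketched as follows: validate each extracted binary as UTF-8 before handing it to the JSON encoder, and drop the ones that do not decode. This is a minimal illustration in Python, not yokozuna's actual Erlang code; the function names are hypothetical.

```python
def is_valid_utf8(value: bytes) -> bool:
    """Return True if the binary decodes cleanly as UTF-8."""
    try:
        value.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def indexable_fields(fields):
    """Keep only the fields whose values are valid UTF-8.

    Non-decodable binaries are silently dropped so they cannot
    break JSON encoding of the remaining fields.
    """
    return {name: value for name, value in fields.items()
            if is_valid_utf8(value)}

fields = {
    "title": "hello".encode("utf-8"),
    "blob": b"\xff\xfe\x00",  # invalid UTF-8; would break the encoder
}
print(sorted(indexable_fields(fields)))  # prints ['title']
```

With this filtering in place, the invalid binary no longer aborts the whole document; the valid fields still reach Solr.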

@zeeshanlakhani
Contributor

@Vorticity-Flux thank you for this. After some thinking, I believe we should allow for two things. Firstly, we should only skip invalid UTF8 values/elems via an option, b/c some users may actually be trying to index invalid UTF8 values/elems and should receive an error instead. So, I'm going to close this for now, but will reference it in an issue I'm writing up. We'll A) have better error reporting instead of that mochijson2 error you got, and B) allow a configurable option to skip all invalid UTF8 values and force the ensure... as you did in your code.

For now, the fix is to apply your patch, or to fork the extractor into a new one and use that. Thanks and I'll update you once we have it in for release :).
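The two behaviours described above can be sketched together: by default, raise a clear error when a field is not valid UTF-8 (replacing the opaque mochijson2 failure), and with an opt-in flag, skip those fields instead. This is a hedged Python illustration; the `skip_invalid_utf8` option name and `prepare_fields` helper are hypothetical, not yokozuna's API.

```python
def prepare_fields(fields, skip_invalid_utf8=False):
    """Validate extracted binaries before indexing.

    Default: raise a descriptive error on the first invalid field.
    With skip_invalid_utf8=True: drop invalid fields and keep the rest.
    """
    out = {}
    for name, value in fields.items():
        try:
            value.decode("utf-8")
        except UnicodeDecodeError:
            if skip_invalid_utf8:
                continue  # silently drop the non-indexable field
            raise ValueError(f"field {name!r} is not valid UTF-8")
        out[name] = value
    return out
```

Users who expect all their data to be UTF-8 get a precise error naming the offending field; users with mixed binaries opt in to the skipping behaviour the patch proposed.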
