
Ensure only valid unicode binaries are extracted from datatypes #616

Closed

Conversation

Vorticity-Flux

This patch ensures that only valid Unicode strings are passed to Solr.

Use case:

A CRDT map containing a mix of valid Unicode and non-Unicode binaries is being indexed by yokozuna.
Only fields that contain valid Unicode are added to the Solr schema.

Current behaviour:

Indexing fails in mochijson2 due to invalid UTF8.

Expected behaviour:

Non-indexable (non-unicode) fields should not prevent indexing of other fields that contain valid unicode.
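The idea of the patch can be sketched as follows: validate each extracted binary as UTF-8 before handing it to the JSON encoder, and drop the ones that do not decode. This is a minimal illustration in Python, not yokozuna's actual Erlang code; the function names are hypothetical.

```python
def is_valid_utf8(value: bytes) -> bool:
    """Return True if the binary decodes cleanly as UTF-8."""
    try:
        value.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def indexable_fields(fields):
    """Keep only the fields whose values are valid UTF-8.

    Non-decodable binaries are silently dropped so they cannot
    break JSON encoding of the remaining fields.
    """
    return {name: value for name, value in fields.items()
            if is_valid_utf8(value)}

fields = {
    "title": "hello".encode("utf-8"),
    "blob": b"\xff\xfe\x00",  # invalid UTF-8; would break the encoder
}
print(sorted(indexable_fields(fields)))  # prints ['title']
```

With this filtering in place, the invalid binary no longer aborts the whole document; the valid fields still reach Solr.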

@zeeshanlakhani
Contributor

@Vorticity-Flux thank you for this. After some thinking, I believe we should allow for two things. Firstly, we should only skip invalid UTF8 values/elems via an option, b/c some users may actually be trying to index invalid UTF8 values/elems and should receive an error instead. So, I'm going to close this for now, but will reference it in an issue I'm writing up. We'll A) have better error reporting instead of that mochijson2 error you got, and B) allow a configurable option to skip all invalid UTF8 values and force the ensure... as you did in your code.

For now, the fix is to apply your patch, or to fork the extractor into a new one and use that. Thanks and I'll update you once we have it in for release :).
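The two behaviours described above can be sketched together: by default, raise a clear error when a field is not valid UTF-8 (replacing the opaque mochijson2 failure), and with an opt-in flag, skip those fields instead. This is a hedged Python illustration; the `skip_invalid_utf8` option name and `prepare_fields` helper are hypothetical, not yokozuna's API.

```python
def prepare_fields(fields, skip_invalid_utf8=False):
    """Validate extracted binaries before indexing.

    Default: raise a descriptive error on the first invalid field.
    With skip_invalid_utf8=True: drop invalid fields and keep the rest.
    """
    out = {}
    for name, value in fields.items():
        try:
            value.decode("utf-8")
        except UnicodeDecodeError:
            if skip_invalid_utf8:
                continue  # silently drop the non-indexable field
            raise ValueError(f"field {name!r} is not valid UTF-8")
        out[name] = value
    return out
```

Users who expect all their data to be UTF-8 get a precise error naming the offending field; users with mixed binaries opt in to the skipping behaviour the patch proposed.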
