No results on searches for languages with codes including a dash #21

Closed
decodekult opened this issue Jun 16, 2023 · 2 comments · Fixed by #22 or #25

@decodekult
Collaborator

decodekult commented Jun 16, 2023

The mechanism that provides correct results for searches in non-default languages compares against a document field, post-lang, which stores a comma-separated list of the languages in which each post should appear.

  • For posts (translations) in a given secondary language, it stores the language code of that translation.
  • For posts in the primary language, it also includes the code of every secondary language where the relevant post type is set to display as translated and no translation into that language exists. See fix: post types set to display as translated #14
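
A minimal sketch of how such a field value could be assembled, following the two rules above. The function name, its parameters, and the assumption that a primary-language post also lists its own code are illustrative only and are not taken from the plugin's code.

```python
def build_post_lang(post_language, primary_language, fallback_languages, translated_into):
    """Return the comma-separated value for the post-lang field.

    All names here are hypothetical:
    - fallback_languages: secondary languages where the post type is set
      to display as translated.
    - translated_into: languages in which this post already has a translation.
    """
    # A translation only appears in its own language.
    if post_language != primary_language:
        return post_language

    # Assumption: a primary-language post lists its own code first.
    codes = [post_language]
    for code in fallback_languages:
        # Add the fallback language only when no translation exists for it.
        if code not in translated_into and code not in codes:
            codes.append(code)
    return ",".join(codes)
```

For example, an English post with a pt-pt translation, on a site where zh-hans and pt-pt fall back to the primary language, would get the value "en,zh-hans".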

Consider languages like zh-hans or pt-pt, whose codes include a dash. Posts (translations) in those languages are not returned by searches run in that same language.

Regression of #13

@decodekult
Collaborator Author

The analyzer for this post-lang field was set to a custom one, with the default tokenizer and no filters. In theory, this should have removed the filters that were splitting language codes such as zh-hans into zh and hans tokens, but that change was not enough. See 5156190

Turning this field into a keyword will not work either, since it can contain multiple comma-separated language codes, and we need to tokenize each of them.

The most reliable solution is to use a char_group tokenizer that splits values on commas. See https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-chargroup-tokenizer.html
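
As an illustration, here is a sketch of what such an analysis configuration could look like, plus a rough emulation of the splitting behaviour. The analyzer and tokenizer names are hypothetical, and the two helper functions only approximate what Elasticsearch does; they are not part of any library.

```python
import re

# Index settings fragment. "type" and "tokenize_on_chars" are the documented
# char_group options; the analyzer and tokenizer names are made up.
POST_LANG_ANALYSIS = {
    "analysis": {
        "tokenizer": {
            "post_lang_tokenizer": {
                "type": "char_group",
                "tokenize_on_chars": [","],
            }
        },
        "analyzer": {
            "post_lang_analyzer": {
                "type": "custom",
                "tokenizer": "post_lang_tokenizer",
            }
        },
    }
}


def split_on_commas(value):
    """Rough emulation of a tokenizer that breaks only on commas."""
    return [token for token in value.split(",") if token]


def split_on_non_word(value):
    """Rough stand-in for a tokenizer that also breaks on dashes."""
    return [token for token in re.split(r"[^0-9A-Za-z]+", value) if token]


print(split_on_commas("en,zh-hans,pt-pt"))
print(split_on_non_word("zh-hans"))
```

Splitting only on commas keeps zh-hans and pt-pt as single tokens, which is what an exact match against the search language needs. Note that with no filters, stray whitespace around a code would stay in the token.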

@decodekult
Collaborator Author

decodekult commented Jun 16, 2023

Apparently, I used an Elasticsearch feature introduced in a version newer than the minimum that ElasticPress supports: #20 (comment)

Reopening and adjusting the tokenizer based on the Elasticsearch version, so we can use the more performant tokenizer when it is available.
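
A sketch of how the version check could work. Everything here is an assumption for illustration: the function names, the 6.4.0 threshold for char_group (worth confirming against the Elasticsearch release notes), and the choice of a pattern tokenizer as the fallback for older versions.

```python
import re

# Assumption: first Elasticsearch version that ships the char_group tokenizer.
CHAR_GROUP_MIN_VERSION = (6, 4, 0)


def parse_version(version):
    """Turn a string such as '7.10.2' into a comparable tuple of three ints."""
    numbers = [int(part) for part in re.findall(r"\d+", version)[:3]]
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def post_lang_tokenizer(es_version):
    """Pick a comma-splitting tokenizer definition for the given version."""
    if parse_version(es_version) >= CHAR_GROUP_MIN_VERSION:
        return {"type": "char_group", "tokenize_on_chars": [","]}
    # Hypothetical fallback: a regex-based pattern tokenizer splitting on commas.
    return {"type": "pattern", "pattern": ","}
```

Comparing parsed tuples rather than raw strings avoids ordering mistakes such as treating "6.10.0" as older than "6.4.0".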

@decodekult decodekult reopened this Jun 16, 2023
@decodekult decodekult linked a pull request Jun 22, 2023 that will close this issue