Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix post filter for lang codes with a dash (zh-hans) #20

Closed
wants to merge 1 commit into from

Conversation

Dehax
Copy link
Contributor

@Dehax Dehax commented Jun 12, 2023

Added keyword for post_lang in order to fix filtering by "zh-hans".

Language codes with dashes give multiple tokens when analyzed:

POST /:index/_analyze
{
    "analyzer": "post_lang_field",
    "text": "zh-hans"
}
{
    "tokens": [
        {
            "token": "zh",
            "start_offset": 0,
            "end_offset": 2,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "hans",
            "start_offset": 3,
            "end_offset": 7,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

And this filter always gives 0 results:

'term' => [
	'post_lang' => 'zh-hans',
],

So in order to fix the post filtering issue for such language codes we need to either use additional "keyword" field with "keyword" data type or use "keyword" analyzer for post_lang field or just map post_lang field to "keyword" data type.

Added `keyword` for `post_lang` in order to fix filtering by "zh-hans".
@decodekult
Copy link
Collaborator

Hi there, @Dehax

Thanks for the feedback, and the time you took to craft this pull request.

I am glad to inform you that we have this already solved in #13 (and an extra, small refinement is on its way).

We will be having a new release for this plugin relatively soon, including this fix among other changes from recent pull requests, and some other surprises.

If you do not mind, I am closing this as the defect fixed here was already fixed in the link above.

Thanks again!

@decodekult decodekult closed this Jun 12, 2023
@Dehax
Copy link
Contributor Author

Dehax commented Jun 13, 2023

Hi @decodekult ,

I'm using the newest version from master branch (commit b559d0b) which already includes changes from #13 and this issue is not resolved in it yet. This version still always returns 0 search results when filtering by "zh-hans" language although Elasticsearch index contains many documents with "post_lang": "zh-hans". I hope the small refinement you are mentioning will fix this "dash issue". Will be waiting for the fix. Thanks!

@decodekult
Copy link
Collaborator

Hi @Dehax

I have checked this deeply, and you are right! The change mentioned above just included a new analyzer for the language field, but it still uses a standard tokenizer, which splits content by dashes. In addition, another related commit about including support for posts set to display as translated when a translation is missing also introduced some unwanted outcome.

I will be submitting a pull request addressing both things later today or early tomorrow.

In any case, we are playing with alternatives for #6 that will probably result in not having, or not needing, a language field.

I will update here once I am done. Thanks a lot!

@decodekult decodekult reopened this Jun 15, 2023
@decodekult
Copy link
Collaborator

This is gettign fixed at #22

As described in the related issue on #21 we can not turn this field into a keyword. But we can improve its analyzer.

Keeping this one open until we merge and confirm the other solution.

FYI @Dehax

@Dehax
Copy link
Contributor Author

Dehax commented Jun 16, 2023

Hi @decodekult ,

Thanks for the update!
It seems that char_group tokenizer was added in Elasticsearch 6.4.0 and thus it's not available in 5.2.2 (the version providen by ElasticPress).
I've confirmed this at "ElasticPress Sync" page. After pressing "Delete all Data and Start a Fresh Sync" button the sync process fails with the error:

Mapping failed: Unknown tokenizer type [char_group] for [post_lang_tokenizer]

@decodekult
Copy link
Collaborator

Oh well, I just merged the change. I think it opens the issue again.

Sorry for the back-and-forth.

@decodekult
Copy link
Collaborator

This one should fix the problem for good: #25

Thanks for your patience.

@Dehax
Copy link
Contributor Author

Dehax commented Jun 22, 2023

Yes, #25 works with Elasticsearch 5.2.2 and fixes #21. Thanks!

@Dehax Dehax closed this Jun 22, 2023
@decodekult
Copy link
Collaborator

Great indeed!

Thanks a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants