IK analyzer produces incorrect offsets when processing Chinese text #1022

Open
DemosHume opened this issue Sep 26, 2023 · 3 comments

Comments

DemosHume commented Sep 26, 2023

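I ran into a problem while using the IK analyzer to process Chinese text.

I have a field recommend_tags whose value is "贝尔法斯特号".
When I try to insert this record into my index, I get an error:
offsets must be non-negative, endOffset must be greater than or equal to startOffset, and offsets must not go backwards.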

The full error output is as follows:
('1 document(s) failed to index.', [{'index': {'_index': 'image_test_6_8_0', '_type': 'sql_record', '_id': 'WcgY0IoB7W0KhcCALXYf', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=2,endOffset=3,lastStartOffset=3 for field 'recommend_tags'"}, 'data': {'recommend_tags': '贝尔法斯特号'}}}])
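
For reference, the three numbers at the end of the `reason` string can be pulled out programmatically. This is only a minimal sketch in Python that assumes the message format shown above; the helper name `parse_offset_error` is made up for illustration and is not part of elasticsearch-py.

```python
import re

# Hypothetical helper: matches the "startOffset=..,endOffset=..,lastStartOffset=.." tail
# of the reason string quoted above.
OFFSET_PATTERN = re.compile(
    r"startOffset=(-?\d+),endOffset=(-?\d+),lastStartOffset=(-?\d+)"
)

def parse_offset_error(reason):
    """Return the three offsets as a dict, or None if the message has another shape."""
    match = OFFSET_PATTERN.search(reason)
    if match is None:
        return None
    start, end, last_start = (int(group) for group in match.groups())
    return {"startOffset": start, "endOffset": end, "lastStartOffset": last_start}

reason = (
    "startOffset must be non-negative, and endOffset must be >= startOffset, "
    "and offsets must not go backwards startOffset=2,endOffset=3,lastStartOffset=3 "
    "for field 'recommend_tags'"
)
offsets = parse_offset_error(reason)
print(offsets)
```

Here startOffset (2) is smaller than lastStartOffset (3), which is the "offsets must not go backwards" case.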

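When I analyzed the text manually with the analyze API, I found that the problem seems to lie in the "法" and "斯" tokens.
A "斯" token with startOffset 3 is emitted before "法", whose startOffset is 2 and endOffset is 3, so the start offset moves from 3 back to 2, which violates the rule that offsets must not go backwards.
Here is the analysis result: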
{'tokens': [{'token': '贝尔法斯特', 'start_offset': 0, 'end_offset': 5, 'type': 'CN_WORD', 'position': 0}, {'token': '贝尔法', 'start_offset': 0, 'end_offset': 3, 'type': 'CN_WORD', 'position': 1}, {'token': '贝尔', 'start_offset': 0, 'end_offset': 2, 'type': 'CN_WORD', 'position': 2}, {'token': '斯', 'start_offset': 3, 'end_offset': 4, 'type': 'CN_CHAR', 'position': 3}, {'token': '法', 'start_offset': 2, 'end_offset': 3, 'type': 'CN_CHAR', 'position': 4}, {'token': '斯', 'start_offset': 3, 'end_offset': 4, 'type': 'CN_CHAR', 'position': 5}, {'token': '特号', 'start_offset': 4, 'end_offset': 6, 'type': 'CN_WORD', 'position': 6}]}
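
The rule can be checked directly against this token list. The following is a rough Python sketch, not the actual Lucene code: it applies the three conditions named in the error message to the tokens above, comparing each token's start offset with that of the previous token. The function name `find_offset_violations` is hypothetical.

```python
# Tokens copied from the ik_max_word output above (6.8.0), reduced to (token, start, end).
tokens = [
    ("贝尔法斯特", 0, 5),
    ("贝尔法", 0, 3),
    ("贝尔", 0, 2),
    ("斯", 3, 4),
    ("法", 2, 3),
    ("斯", 3, 4),
    ("特号", 4, 6),
]

def find_offset_violations(tokens):
    """Hypothetical checker for the three rules named in the error message."""
    violations = []
    last_start = 0  # start offset of the previous token
    for position, (text, start, end) in enumerate(tokens):
        if start < 0 or end < start or start < last_start:
            violations.append((position, text, start, end, last_start))
        last_start = start
    return violations

print(find_offset_violations(tokens))  # → [(4, '法', 2, 3, 3)]
```

The single violation is the "法" token at position 4, with startOffset=2, endOffset=3 and a previous start offset of 3, the same numbers that appear in the indexing error.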

The IK analyzer version information is as follows:
description=IK Analyzer for Elasticsearch version=6.8.0

The index field mapping is as follows:

"recommend_tags": { "type": "text", "analyzer": "ik_max_word" }

DemosHume commented Sep 26, 2023

萨尔瓦多共和国
This term also triggers the problem:
{'tokens': [{'token': '萨尔瓦多', 'start_offset': 0, 'end_offset': 4, 'type': 'CN_WORD', 'position': 0}, {'token': '萨尔瓦', 'start_offset': 0, 'end_offset': 3, 'type': 'CN_WORD', 'position': 1}, {'token': '萨尔', 'start_offset': 0, 'end_offset': 2, 'type': 'CN_WORD', 'position': 2}, {'token': '瓦', 'start_offset': 2, 'end_offset': 3, 'type': 'CN_CHAR', 'position': 3}, {'token': '多', 'start_offset': 3, 'end_offset': 4, 'type': 'CN_CHAR', 'position': 4}]}

@lizongbo

I verified this on ES 8.10.2 and it works correctly there.

AnalyzeRequest: POST /_analyze {"analyzer":"ik_max_word","text":["贝尔法斯特"]}======
AnalyzeResponse: {"tokens":[{"end_offset":5,"position":0,"start_offset":0,"token":"贝尔法斯特","type":"CN_WORD"},{"end_offset":3,"position":1,"start_offset":0,"token":"贝尔法","type":"CN_WORD"},{"end_offset":2,"position":2,"start_offset":0,"token":"贝尔","type":"CN_WORD"},{"end_offset":3,"position":3,"start_offset":2,"token":"法","type":"CN_CHAR"},{"end_offset":4,"position":4,"start_offset":3,"token":"斯","type":"CN_CHAR"},{"end_offset":5,"position":5,"start_offset":4,"token":"特","type":"CN_CHAR"}]}

AnalyzeRequest: POST /_analyze {"analyzer":"ik_max_word","text":["萨尔瓦多"]}======
AnalyzeResponse: {"tokens":[{"end_offset":4,"position":0,"start_offset":0,"token":"萨尔瓦多","type":"CN_WORD"},{"end_offset":3,"position":1,"start_offset":0,"token":"萨尔瓦","type":"CN_WORD"},{"end_offset":2,"position":2,"start_offset":0,"token":"萨尔","type":"CN_WORD"},{"end_offset":3,"position":3,"start_offset":2,"token":"瓦","type":"CN_CHAR"},{"end_offset":4,"position":4,"start_offset":3,"token":"多","type":"CN_CHAR"}]}
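
As a quick sanity check, the 贝尔法斯特 response above can be parsed to confirm that its start offsets never decrease. This is a small Python sketch using only the standard library; the response text is copied from the output above and the variable names are illustrative.

```python
import json

# Response body copied from the 贝尔法斯特 request above (ES 8.10.2).
response_text = (
    '{"tokens":['
    '{"end_offset":5,"position":0,"start_offset":0,"token":"贝尔法斯特","type":"CN_WORD"},'
    '{"end_offset":3,"position":1,"start_offset":0,"token":"贝尔法","type":"CN_WORD"},'
    '{"end_offset":2,"position":2,"start_offset":0,"token":"贝尔","type":"CN_WORD"},'
    '{"end_offset":3,"position":3,"start_offset":2,"token":"法","type":"CN_CHAR"},'
    '{"end_offset":4,"position":4,"start_offset":3,"token":"斯","type":"CN_CHAR"},'
    '{"end_offset":5,"position":5,"start_offset":4,"token":"特","type":"CN_CHAR"}'
    "]}"
)

# Collect the start offsets in emission order and check they are non-decreasing.
starts = [token["start_offset"] for token in json.loads(response_text)["tokens"]]
is_ordered = all(a <= b for a, b in zip(starts, starts[1:]))
print(starts, is_ordered)  # → [0, 0, 0, 2, 3, 4] True
```

Note that this request analyzes 贝尔法斯特, without the trailing 号 that is present in the original field value, so it may not exercise exactly the same input as the failing document.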

kin122 commented Jul 29, 2024

[image] This may be a logic bug in this piece of code. Try commenting it out and recompiling.
