IK analyzer produces incorrect offsets when tokenizing Chinese #1022
Comments
萨尔瓦多共和国
Verified against ES 8.10.2, the result is normal:
AnalyzeRequest: POST /_analyze {"analyzer":"ik_max_word","text":["贝尔法斯特"]}
AnalyzeRequest: POST /_analyze {"analyzer":"ik_max_word","text":["萨尔瓦多"]}
AnalyzeResponse: {"tokens":[{"end_offset":4,"position":0,"start_offset":0,"token":"萨尔瓦多","type":"CN_WORD"},{"end_offset":3,"position":1,"start_offset":0,"token":"萨尔瓦","type":"CN_WORD"},{"end_offset":2,"position":2,"start_offset":0,"token":"萨尔","type":"CN_WORD"},{"end_offset":3,"position":3,"start_offset":2,"token":"瓦","type":"CN_CHAR"},{"end_offset":4,"position":4,"start_offset":3,"token":"多","type":"CN_CHAR"}]}
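To compare versions side by side, here is a minimal sketch (the endpoints are assumptions; both nodes need the IK plugin installed) that sends the same _analyze request to a 6.8.0 node and an 8.10.2 node and prints the resulting offsets:

import requests

# Hypothetical endpoints for the two clusters being compared.
CLUSTERS = {
    "es-6.8.0": "http://localhost:9200",
    "es-8.10.2": "http://localhost:9201",
}

BODY = {"analyzer": "ik_max_word", "text": ["贝尔法斯特号"]}

for name, url in CLUSTERS.items():
    resp = requests.post(f"{url}/_analyze", json=BODY)
    resp.raise_for_status()
    print(f"--- {name} ---")
    for tok in resp.json()["tokens"]:
        print(tok["token"], tok["start_offset"], tok["end_offset"])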
I ran into a problem when using the IK analyzer on Chinese text.
I have a field recommend_tags whose value is "贝尔法斯特号".
When I try to insert this record into my index, I get an error:
startOffset must be non-negative, endOffset must be >= startOffset, and offsets must not go backwards.
The full error output is:
('1 document(s) failed to index.', [{'index': {'_index': 'image_test_6_8_0', '_type': 'sql_record', '_id': 'WcgY0IoB7W0KhcCALXYf', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=2,endOffset=3,lastStartOffset=3 for field 'recommend_tags'"}, 'data': {'recommend_tags': '贝尔法斯特号'}}}])
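For reference, a minimal reproduction sketch, assuming a local ES 6.8.0 node with the IK plugin and the elasticsearch-py 6.x client (the index name ik_offset_repro is hypothetical; the mapping mirrors the one at the end of this report):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # hypothetical local 6.8.0 node

# Create an index whose recommend_tags field is analyzed with ik_max_word.
es.indices.create(
    index="ik_offset_repro",
    body={
        "mappings": {
            "sql_record": {
                "properties": {
                    "recommend_tags": {"type": "text", "analyzer": "ik_max_word"}
                }
            }
        }
    },
)

# Indexing this value should trigger the illegal_argument_exception about backwards offsets.
es.index(index="ik_offset_repro", doc_type="sql_record", body={"recommend_tags": "贝尔法斯特号"})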
When I analyze the text manually with the _analyze API, the problem appears to involve the tokens "法" and "斯".
The token "斯" at position 3 has start_offset 3, but the next token "法" has start_offset 2 and end_offset 3, so the offsets go backwards, violating the rule above.
Here is the analysis result:
{'tokens': [{'token': '贝尔法斯特', 'start_offset': 0, 'end_offset': 5, 'type': 'CN_WORD', 'position': 0}, {'token': '贝尔法', 'start_offset': 0, 'end_offset': 3, 'type': 'CN_WORD', 'position': 1}, {'token': '贝尔', 'start_offset': 0, 'end_offset': 2, 'type': 'CN_WORD', 'position': 2}, {'token': '斯', 'start_offset': 3, 'end_offset': 4, 'type': 'CN_CHAR', 'position': 3}, {'token': '法', 'start_offset': 2, 'end_offset': 3, 'type': 'CN_CHAR', 'position': 4}, {'token': '斯', 'start_offset': 3, 'end_offset': 4, 'type': 'CN_CHAR', 'position': 5}, {'token': '特号', 'start_offset': 4, 'end_offset': 6, 'type': 'CN_WORD', 'position': 6}]}
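The backwards jump is easy to spot programmatically. A small sketch, assuming the same local node and the elasticsearch-py client, that runs _analyze and flags any token whose start_offset is lower than that of the previous token:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # hypothetical local node with the IK plugin

resp = es.indices.analyze(body={"analyzer": "ik_max_word", "text": "贝尔法斯特号"})

last_start = -1
for tok in resp["tokens"]:
    start, end = tok["start_offset"], tok["end_offset"]
    flag = "<-- offsets go backwards" if start < last_start else ""
    print(tok["token"], start, end, flag)
    last_start = start

With the token stream above, this flags "法" (start_offset 2), which follows "斯" (start_offset 3).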
The IK analyzer version is:
description=IK Analyzer for Elasticsearch version=6.8.0
The index field mapping is:
"recommend_tags": { "type": "text", "analyzer": "ik_max_word" }