THULAC Analysis for Elasticsearch

采用THULAC实现的Elasticsearch中文分词插件。

版本

Plugin 版本	ES 版本	THULAC 版本	Link
master	7.x -> master	lite
7.9.1	7.9.1	lite	下载
6.4.1-181027	6.4.1	lite	下载
6.4.0-181027	6.4.0	lite	下载
6.3.0-181027	6.3.0	lite	下载
6.2.0-181027	6.2.0	lite	下载
6.1.0-181027	6.1.0	lite	下载

下载安装

直接下载已经打包好的插件，解压到elasticsearch的plugins目录下即可。

编译安装

1.编译打包

git clone git@github.com:microbun/elasticsearch-thulac-plugin.git
cd elasticsearch-thulac-plugin
./gradlew release

2.安装到elasticsearch

cp build/distributions/elasticsearch-thulac-plugin-7.9.1.zip ${ES_HOME}/plugins
cd ${ES_HOME}/plugins
unzip elasticsearch-thulac-plugin-7.9.1.zip
rm elasticsearch-thulac-plugin-7.9.1.zip

解压后在plugins目录下会有一个thulac文件夹。

thulac
 |-elasticsearch-thulac-plugin-7.9.1.jar
 |-models #算法模型�目录
 |-plugin-descriptor.properties
 |-plugin.xml

3.由于THULAC的模型太大，插件中没有包含模型数据，可以在THULAC 下载模型(lite)，将模型拷贝到models中。

示例

1.创建索引

1.1 使用默认分词方式

curl -H "Content-Type:application/json" -XPUT http://localhost:9200/index -d'
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "thulac"
      }
    }
  }
}
'

1.2 自定义分词器

curl -H "Content-Type:application/json" -XPUT http://localhost:9200/index -d'
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "custom_thulac_tokenizer": {
          "type": "thulac",
          "user_dict": "userdict.txt",
          "t2s": true,
          "filter": false
        }
      },
      "analyzer": {
        "custom_thulac_analyzer": {
          "tokenizer": "custom_thulac_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "custom_thulac_analyzer"
      }
    }
  }
}'

参数名称	含义	值
t2s	将句子从繁体转化为简体。默认：true	false/true
filter	使用过滤器去除一些没有意义的词语，例如“可以”。默认：false	false/true
user_dict	自定义词典路径，每一个词一行，UTF8编码，相对路径和绝对路径. 相对路径：userdict.txt 会加载 ${ES_HOME}/plugins/module/userdict.txt文件绝对路径：/home/elasticsearch/userdict.txt 默认：userdict.txt

2.查看索引

curl http://localhost:9200/index

3.测试分词效果

curl -H "Content-Type:application/json"  -XPOST http://localhost:9200/index/_analyze -d'
{
 "analyzer":"thulac", 
 "text":"我是中国人"
}
'

4.删除索引

curl -XDELETE http://localhost:9200/index

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

THULAC Analysis for Elasticsearch

版本

下载安装

编译安装

示例

1.创建索引

2.查看索引

3.测试分词效果

4.删除索引

Files

README.md

Latest commit

History

README.md

File metadata and controls

THULAC Analysis for Elasticsearch

版本

下载安装

编译安装

示例

1.创建索引

2.查看索引

3.测试分词效果

4.删除索引