6.x #884

Open: wants to merge 15 commits into base `6.x`
2 changes: 2 additions & 0 deletions .github/FUNDING.yml
@@ -0,0 +1,2 @@
patreon: medcl
custom: ["https://www.buymeacoffee.com/medcl"]
33 changes: 13 additions & 20 deletions README.md
@@ -10,16 +10,9 @@ Versions

IK version | ES version
-----------|-----------
master | 6.x -> master
6.3.0| 6.3.0
6.2.4| 6.2.4
6.1.3| 6.1.3
5.6.8| 5.6.8
5.5.3| 5.5.3
5.4.3| 5.4.3
5.3.3| 5.3.3
5.2.2| 5.2.2
5.1.2| 5.1.2
master | 7.x -> master
6.x| 6.x
5.x| 5.x
1.10.6 | 2.4.6
1.9.5 | 2.3.5
1.8.1 | 2.2.1
@@ -64,13 +57,13 @@ curl -XPUT http://localhost:9200/index
2. Create a mapping

```bash
curl -XPOST http://localhost:9200/index/fulltext/_mapping -H 'Content-Type:application/json' -d'
curl -XPOST http://localhost:9200/index/_mapping -H 'Content-Type:application/json' -d'
{
"properties": {
"content": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_max_word"
"search_analyzer": "ik_smart"
}
}

@@ -80,33 +73,33 @@ curl -XPOST http://localhost:9200/index/fulltext/_mapping -H 'Content-Type:appli
3. Index some docs

```bash
curl -XPOST http://localhost:9200/index/fulltext/1 -H 'Content-Type:application/json' -d'
curl -XPOST http://localhost:9200/index/_create/1 -H 'Content-Type:application/json' -d'
{"content":"美国留给伊拉克的是个烂摊子吗"}
'
```

```bash
curl -XPOST http://localhost:9200/index/fulltext/2 -H 'Content-Type:application/json' -d'
curl -XPOST http://localhost:9200/index/_create/2 -H 'Content-Type:application/json' -d'
{"content":"公安部:各地校车将享最高路权"}
'
```

```bash
curl -XPOST http://localhost:9200/index/fulltext/3 -H 'Content-Type:application/json' -d'
curl -XPOST http://localhost:9200/index/_create/3 -H 'Content-Type:application/json' -d'
{"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}
'
```

```bash
curl -XPOST http://localhost:9200/index/fulltext/4 -H 'Content-Type:application/json' -d'
curl -XPOST http://localhost:9200/index/_create/4 -H 'Content-Type:application/json' -d'
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}
'
```

4. Query with highlighting

```bash
curl -XPOST http://localhost:9200/index/fulltext/_search -H 'Content-Type:application/json' -d'
curl -XPOST http://localhost:9200/index/_search -H 'Content-Type:application/json' -d'
{
"query" : { "match" : { "content" : "中国" }},
"highlight" : {
@@ -248,13 +241,13 @@ curl -XGET "http://localhost:9200/your_index/_analyze" -H 'Content-Type: applica
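4. What is the difference between ik_max_word and ik_smart?


ik_max_word: splits the text at the finest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌”, exhausting every possible combination;
ik_max_word: splits the text at the finest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌”, exhausting every possible combination; suited to Term Query

ik_smart: splits the text at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,国歌”.
ik_smart: splits the text at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,国歌”; suited to Phrase queries

Changes
------
*5.0.0*
*Since v5.0.0*

- Removed the analyzer and tokenizer named `ik`; use `ik_smart` and `ik_max_word` instead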

8 changes: 8 additions & 0 deletions config/IKAnalyzer.cfg.xml
@@ -10,4 +10,12 @@
<!-- <entry key="remote_ext_dict">words_location</entry> -->
<!-- Users can configure a remote extension stopword dictionary here -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
<!-- Connection URL; if it is not configured, database sync is not enabled -->
<entry key="db_url"><![CDATA[jdbc:mysql://127.0.0.1:3306/lexicon?characterEncoding=UTF-8&useSSL=false]]></entry>
<!-- Database user name -->
<entry key="db_user">root</entry>
<!-- Database password -->
<entry key="db_password">root</entry>
<!-- Sync interval, in seconds -->
<entry key="db_reload_interval">10</entry>
</properties>
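
For reviewers, a minimal sketch of how the four `db_*` entries above could be read. It relies only on `java.util.Properties.loadFromXML`, which accepts this properties-XML format. The class `DbConfigSketch`, its field names, and the fallback interval of 10 seconds are assumptions made for illustration; the PR's own loading code is not shown in this excerpt. The "no `db_url` means no sync" rule follows the comment in the config file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

/** Hypothetical helper (not part of this PR): reads the db_* entries of IKAnalyzer.cfg.xml. */
class DbConfigSketch {
    final String url;
    final String user;
    final String password;
    final int reloadIntervalSeconds;

    private DbConfigSketch(String url, String user, String password, int reloadIntervalSeconds) {
        this.url = url;
        this.user = user;
        this.password = password;
        this.reloadIntervalSeconds = reloadIntervalSeconds;
    }

    /** Returns null when db_url is missing or blank, meaning database sync stays disabled. */
    static DbConfigSketch load(InputStream xml) throws IOException {
        Properties props = new Properties();
        props.loadFromXML(xml);
        String url = props.getProperty("db_url");
        if (url == null || url.trim().isEmpty()) {
            return null;
        }
        // Assumed fallback: 10 seconds when db_reload_interval is absent.
        int interval = Integer.parseInt(props.getProperty("db_reload_interval", "10").trim());
        return new DbConfigSketch(url.trim(),
                props.getProperty("db_user"),
                props.getProperty("db_password"),
                interval);
    }

    /** Convenience overload for trying the parser on an in-memory document. */
    static DbConfigSketch loadFromString(String xml) {
        try {
            return load(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Returning `null` for a missing URL keeps the "sync disabled" decision in one place, so callers would not need a separate flag check.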
75 changes: 75 additions & 0 deletions mysql.extend.md
@@ -0,0 +1,75 @@
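IK Analysis: MySQL as a new sync source for extension words
=============================

- Supports a full load of extension words at startup
- Supports hot updates of extension words

> MySQL extension-word table schema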


```mysql

CREATE TABLE `es_lexicon` (
`id` bigint(20) NOT NULL AUTO_INCREMENT COMMENT 'lexicon id',
`create_date` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'creation time',
`modify_date` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'modification time',
`lexicon_text` varchar(40) NOT NULL COMMENT 'entry keyword',
`lexicon_type` tinyint(1) NOT NULL DEFAULT '0' COMMENT '0 extension lexicon, 1 stopword lexicon',
`lexicon_status` tinyint(1) NOT NULL DEFAULT '0' COMMENT 'entry status: 0 normal, 1 suspended',
`del_flag` tinyint(1) NOT NULL DEFAULT '0' COMMENT 'void flag: 0 normal, 1 voided',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='ES remote extension lexicon table'
```
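
The schema above suggests how one sync pass might separate rows into words to load and words to disable. The sketch below is an assumption, not the PR's code: the `Dictionary` changes are not shown in this excerpt, so the query text, the class `LexiconSyncSketch`, and the rule "a row counts as active only when `lexicon_status = 0` and `del_flag = 0`" are inferred from the column comments and from the `fill count` / `disable count` wording in the startup log.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not part of this PR): classify es_lexicon rows for one sync pass. */
class LexiconSyncSketch {

    /** Assumed incremental query; parameters are lexicon_type and the last seen modify_date. */
    static final String INCREMENTAL_SQL =
            "SELECT lexicon_text, lexicon_status, del_flag, modify_date FROM es_lexicon "
          + "WHERE lexicon_type = ? AND modify_date > ? ORDER BY modify_date";

    /** One row of es_lexicon, reduced to the columns the classification needs. */
    static final class Row {
        final String text;
        final int status;   // lexicon_status: 0 normal, 1 suspended
        final int delFlag;  // del_flag: 0 normal, 1 voided

        Row(String text, int status, int delFlag) {
            this.text = text;
            this.status = status;
            this.delFlag = delFlag;
        }
    }

    /** Words to add to the dictionary and words to disable in it. */
    static final class Delta {
        final List<String> fill = new ArrayList<>();
        final List<String> disable = new ArrayList<>();
    }

    /** Active rows go to fill; suspended or voided rows go to disable. */
    static Delta partition(List<Row> rows) {
        Delta delta = new Delta();
        for (Row row : rows) {
            boolean active = row.status == 0 && row.delFlag == 0;
            (active ? delta.fill : delta.disable).add(row.text);
        }
        return delta;
    }
}
```

Note that an incremental query of this kind only sees edits if `modify_date` is updated whenever a row changes; the schema as written sets it on insert only, so the application (or an `ON UPDATE` clause) would have to maintain it.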



```IKAnalyzer.cfg.xml```


```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer extension configuration</comment>
<!-- Users can configure their own extension dictionary here -->
<entry key="ext_dict"></entry>
<!-- Users can configure their own extension stopword dictionary here -->
<entry key="ext_stopwords"></entry>
<!-- Users can configure a remote extension dictionary here -->
<!-- <entry key="remote_ext_dict">words_location</entry> -->
<!-- Users can configure a remote extension stopword dictionary here -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
<!-- Connection URL; if it is not configured, database sync is not enabled -->
<entry key="db_url"><![CDATA[jdbc:mysql://10.1.11.134:3306/post_bar?characterEncoding=UTF-8&zeroDateTimeBehavior=convertToNull&connectTimeout=60000&socketTimeout=60000&autoReconnect=true&failOverReadOnly=false&useSSL=true&useUnicode=true]]></entry>
<!-- Database user name -->
<entry key="db_user">root</entry>
<!-- Database password -->
<entry key="db_password">123456</entry>
<!-- Sync interval, in seconds -->
<entry key="db_reload_interval">10</entry>
</properties>

```

> Build the package yourself and place it in the Elasticsearch plugins directory.

The startup log looks like this:

```
[2021-06-02T15:16:07,593][INFO ][o.w.a.d.Dictionary ] ======start mysql to reload ik dict.======
[2021-06-02T15:16:07,828][INFO ][o.w.a.d.Dictionary ] last update mysql ext dic time :2021-05-27T14:36:05.000+0800,fill count:4843 ,disable count:0
[2021-06-02T15:16:07,837][INFO ][o.w.a.d.Dictionary ] the last reload stop word not found, the last update time :null
[2021-06-02T15:16:07,838][INFO ][o.w.a.d.Dictionary ] last update mysql stop word time :null,fill count:0 ,disable count:0
[2021-06-02T15:16:07,838][INFO ][o.w.a.d.Dictionary ] ======reload mysql ik dict finished.======
[2021-06-02T15:16:17,587][INFO ][o.w.a.d.Dictionary ] ======start mysql to reload ik dict.======
[2021-06-02T15:16:17,615][INFO ][o.w.a.d.Dictionary ] last update mysql ext dic time :2021-06-01T09:44:50.000+0800,fill count:4842 ,disable count:0
[2021-06-02T15:16:17,623][INFO ][o.w.a.d.Dictionary ] the last reload stop word not found, the last update time :null
[2021-06-02T15:16:17,624][INFO ][o.w.a.d.Dictionary ] last update mysql stop word time :null,fill count:0 ,disable count:0
[2021-06-02T15:16:17,624][INFO ][o.w.a.d.Dictionary ] ======reload mysql ik dict finished.======
[2021-06-02T15:16:27,596][INFO ][o.w.a.d.Dictionary ] ======start mysql to reload ik dict.======
[2021-06-02T15:16:27,602][INFO ][o.w.a.d.Dictionary ] the latest update record was not found, the last update time :2021-06-01T09:44:50.000+0800
[2021-06-02T15:16:27,602][INFO ][o.w.a.d.Dictionary ] last update mysql ext dic time :2021-06-01T09:44:50.000+0800,fill count:0 ,disable count:0
[2021-06-02T15:16:27,608][INFO ][o.w.a.d.Dictionary ] the last reload stop word not found, the last update time :null
[2021-06-02T15:16:27,608][INFO ][o.w.a.d.Dictionary ] last update mysql stop word time :null,fill count:0 ,disable count:0
[2021-06-02T15:16:27,608][INFO ][o.w.a.d.Dictionary ] ======reload mysql ik dict finished.======
```
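
The log above shows a reload pass roughly every 10 seconds, matching `db_reload_interval`. One way such a loop could be driven is sketched below with `ScheduledExecutorService`; this is an assumption for illustration, since the PR's scheduling code is not part of this excerpt. The class name `ReloadSchedulerSketch` and the thread name are hypothetical. Inside a real Elasticsearch plugin, thread creation and JDBC access may additionally need security-manager privileges, which this sketch does not address.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch (not part of this PR): run a reload task at a fixed delay. */
class ReloadSchedulerSketch {

    /**
     * Runs reloadTask immediately, then again intervalSeconds after each run finishes.
     * The caller owns the returned executor and should shut it down on plugin close.
     */
    static ScheduledExecutorService start(Runnable reloadTask, long intervalSeconds) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor(runnable -> {
            Thread thread = new Thread(runnable, "ik-mysql-reload");
            thread.setDaemon(true); // do not keep the JVM alive just for reloads
            return thread;
        });
        // An uncaught exception would cancel all later runs, so failures are caught here.
        Runnable guarded = () -> {
            try {
                reloadTask.run();
            } catch (RuntimeException e) {
                System.err.println("reload failed, will retry on next tick: " + e);
            }
        };
        pool.scheduleWithFixedDelay(guarded, 0, intervalSeconds, TimeUnit.SECONDS);
        return pool;
    }
}
```

A fixed delay (rather than a fixed rate) means a slow database query cannot cause reload passes to pile up behind one another.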
7 changes: 6 additions & 1 deletion pom.xml
@@ -12,7 +12,7 @@
<inceptionYear>2011</inceptionYear>

<properties>
<elasticsearch.version>6.5.0</elasticsearch.version>
<elasticsearch.version>6.2.3</elasticsearch.version>
<maven.compiler.target>1.8</maven.compiler.target>
<elasticsearch.assembly.descriptor>${project.basedir}/src/main/assemblies/plugin.xml</elasticsearch.assembly.descriptor>
<elasticsearch.plugin.name>analysis-ik</elasticsearch.plugin.name>
@@ -95,6 +95,11 @@
<artifactId>log4j-api</artifactId>
<version>2.3</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.49</version>
</dependency>

<dependency>
<groupId>org.hamcrest</groupId>
1 change: 1 addition & 0 deletions src/main/assemblies/plugin.xml
@@ -39,6 +39,7 @@
<useTransitiveFiltering>true</useTransitiveFiltering>
<includes>
<include>org.apache.httpcomponents:httpclient</include>
<include>mysql:mysql-connector-java</include>
</includes>
</dependencySet>
</dependencySets>
@@ -11,7 +11,7 @@ public class IkTokenizerFactory extends AbstractTokenizerFactory {
private Configuration configuration;

public IkTokenizerFactory(IndexSettings indexSettings, Environment env, String name, Settings settings) {
super(indexSettings, name, settings);
super(indexSettings, name,settings);
configuration=new Configuration(env,settings);
}

8 changes: 8 additions & 0 deletions src/main/java/org/wltea/analyzer/cfg/Configuration.java
@@ -24,6 +24,9 @@ public class Configuration {
//Whether to enable remote dictionary loading
private boolean enableRemoteDict=false;

//Whether to enable MySQL dictionary loading
private boolean enableMysqlDict=false;

//Whether to enable lowercase processing
private boolean enableLowercase=true;

@@ -36,6 +39,7 @@ public Configuration(Environment env,Settings settings) {
this.useSmart = settings.get("use_smart", "false").equals("true");
this.enableLowercase = settings.get("enable_lowercase", "true").equals("true");
this.enableRemoteDict = settings.get("enable_remote_dict", "true").equals("true");
this.enableMysqlDict = settings.get("enable_mysql_dict", "true").equals("true");

Dictionary.initial(this);

@@ -69,6 +73,10 @@ public boolean isEnableRemoteDict() {
return enableRemoteDict;
}

public boolean isEnableMysqlDict() {
return enableMysqlDict;
}

public boolean isEnableLowercase() {
return enableLowercase;
}
4 changes: 2 additions & 2 deletions src/main/java/org/wltea/analyzer/core/AnalyzeContext.java
@@ -268,13 +268,13 @@ void outputToResult(){
while(l != null){
this.results.add(l);
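//The dictionary has no single-character entry but lexemes conflict: emit the single characters of the earlier of the intersecting lexemes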
int innerIndex = index + 1;
/*int innerIndex = index + 1;
for (; innerIndex < index + l.getLength(); innerIndex++) {
Lexeme innerL = path.peekFirst();
if (innerL != null && innerIndex == innerL.getBegin()) {
this.outputSingleCJK(innerIndex - 1);
}
}
}*/

//Move index past the lexeme
index = l.getBegin() + l.getLength();
58 changes: 58 additions & 0 deletions src/main/java/org/wltea/analyzer/db/DBConfigProperties.java
@@ -0,0 +1,58 @@
package org.wltea.analyzer.db;

import java.io.Serializable;

/**
* @author fsren
* @date 2021-05-25
*/
public class DBConfigProperties implements Serializable {

private static final long serialVersionUID = 688310733642302993L;
private String dbUrl;
private String user;
private String password;
private Integer reloadInterval;

public String getDbUrl() {
return dbUrl;
}

public void setDbUrl(String dbUrl) {
this.dbUrl = dbUrl;
}

public String getUser() {
return user;
}

public void setUser(String user) {
this.user = user;
}

public String getPassword() {
return password;
}

public void setPassword(String password) {
this.password = password;
}

public Integer getReloadInterval() {
return reloadInterval;
}

public void setReloadInterval(Integer reloadInterval) {
this.reloadInterval = reloadInterval;
}

@Override
public String toString() {
return "DBConfigProperties{" +
"dbUrl='" + dbUrl + '\'' +
", user='" + user + '\'' +
", password='" + password + '\'' +
", reloadInterval=" + reloadInterval +
'}';
}
}
43 changes: 43 additions & 0 deletions src/main/java/org/wltea/analyzer/db/DataSourceFactory.java
@@ -0,0 +1,43 @@
package org.wltea.analyzer.db;

import com.mysql.jdbc.jdbc2.optional.MysqlConnectionPoolDataSource;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.SpecialPermission;
import org.wltea.analyzer.help.ESPluginLoggerFactory;

import javax.sql.DataSource;
import java.security.AccessController;
import java.security.PrivilegedAction;
import java.sql.SQLException;

/**
* @author fsren
* @date 2021-05-25
*/
public class DataSourceFactory {


private static final Logger logger = ESPluginLoggerFactory.getLogger(DataSourceFactory.class.getName());


public static DataSource getDataSource(DBConfigProperties configProperties) {

SpecialPermission.check();
return AccessController.doPrivileged((PrivilegedAction<DataSource>) () -> {
logger.info("load datasource start");
MysqlConnectionPoolDataSource dataSource = new MysqlConnectionPoolDataSource();
dataSource.setURL(configProperties.getDbUrl());
dataSource.setUser(configProperties.getUser());
dataSource.setPassword(configProperties.getPassword());
dataSource.setAllowMultiQueries(true);
try {
dataSource.setSocketTimeout(1000);
} catch (SQLException ignore) {
}
logger.info("load datasource end");
return dataSource;
});
}


}