一、ES built-in analyzers
1. The built-in analyzers only tokenize English text; they do not support Chinese word segmentation.
2. The built-in analyzers
- standard: the default analyzer. Text is split into words and uppercase letters are converted to lowercase.
- simple: splits on anything that is not a letter; converts uppercase to lowercase.
- whitespace: splits on whitespace only; letter case is left unchanged.
- stop: like simple, but also removes meaningless stop words such as the, a, an, is, …
- keyword: no tokenization; the whole text is treated as a single keyword.
# Example JSON
{
  "analyzer": "standard",
  "text": "My name is Peter Parker,I am a Super Hero. I don't like the Criminals."
}
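The behavior of these analyzers can be sketched in a few lines of Python. This is a rough model for illustration only; the real analyzers follow Unicode word-segmentation rules and configurable stop lists.

```python
import re

# Rough Python models of the built-in analyzers (illustration only).

def standard(text):
    # default analyzer: split into words, lowercase
    return [t for t in re.split(r"\W+", text.lower()) if t]

def simple(text):
    # split on anything that is not a letter, lowercase
    return [t for t in re.split(r"[^a-zA-Z]+", text.lower()) if t]

def whitespace(text):
    # split on whitespace only; case is kept as-is
    return text.split()

STOPWORDS = {"the", "a", "an", "is"}

def stop(text):
    # like simple, but drop stop words
    return [t for t in simple(text) if t not in STOPWORDS]

def keyword(text):
    # no tokenization at all: the whole text is one term
    return [text]

print(standard("This is a good job"))  # ['this', 'is', 'a', 'good', 'job']
print(stop("This is a good job"))      # ['this', 'good', 'job']
```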
3. Built-in analyzer example
- Request (POST)
  192.168.56.101:9200/_analyze
  Endpoint keyword: _analyze
- JSON body
  { "analyzer": "standard", "text": "This is a good job" }
  Key field: "analyzer"
二、The IK analyzer
1. Installing the IK analyzer
Mainly used for Chinese word segmentation; English is also supported.
- Download the version matching your ES release.
- Upload it to the server where ES runs.
- Unzip it into the plugins directory under the ES home, e.g.
  /usr/local/es/elasticsearch-8.4.3/plugins/ik/
- Restart ES.
2. Analyzers provided
- ik_max_word
- ik_smart
3. Examples
- Request (POST)
  Same as for the built-in analyzers:
  192.168.56.101:9200/_analyze
- JSON body, using the ik_max_word analyzer:
  { "analyzer": "ik_max_word", "text": "上下班车流量很大。" }
- Result
{
  "tokens": [
    { "token": "上下班", "start_offset": 0, "end_offset": 3, "type": "CN_WORD", "position": 0 },
    { "token": "上下", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 1 },
    { "token": "下班", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 2 },
    { "token": "班车", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 3 },
    { "token": "车流量", "start_offset": 3, "end_offset": 6, "type": "CN_WORD", "position": 4 },
    { "token": "车流", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 5 },
    { "token": "流量", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 6 },
    { "token": "很大", "start_offset": 6, "end_offset": 8, "type": "CN_WORD", "position": 7 }
  ]
}
- JSON body, using the ik_smart analyzer:
  { "analyzer": "ik_smart", "text": "上下班车流量很大。" }
- Result
{
  "tokens": [
    { "token": "上下班", "start_offset": 0, "end_offset": 3, "type": "CN_WORD", "position": 0 },
    { "token": "车流量", "start_offset": 3, "end_offset": 6, "type": "CN_WORD", "position": 1 },
    { "token": "很大", "start_offset": 6, "end_offset": 8, "type": "CN_WORD", "position": 2 }
  ]
}
4. Difference between ik_max_word and ik_smart
- ik_max_word: splits the text at the finest granularity. For example, "中华人民共和国国歌" is split into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌": it exhausts every possible combination, which suits term queries.
- ik_smart: splits at the coarsest granularity. For example, "中华人民共和国国歌" is split into "中华人民共和国, 国歌", which suits phrase queries.
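The two modes can be pictured as two matching strategies over the same dictionary. A minimal sketch, where the tiny word list and both functions are made up for illustration (IK's real algorithm is considerably more involved):

```python
# Hypothetical mini-dictionary, just enough to segment the example sentence.
DICT = {"中华人民共和国", "中华人民", "中华", "华人",
        "人民共和国", "人民", "共和国", "共和", "国歌"}

def smart_like(text):
    """Coarse segmentation: greedy forward longest match (ik_smart-style)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest candidate first
            if text[i:j] in DICT:
                tokens.append(text[i:j])
                i = j
                break
        else:                               # no dictionary hit: emit one character
            tokens.append(text[i])
            i += 1
    return tokens

def max_word_like(text):
    """Exhaustive segmentation: every dictionary word found anywhere (ik_max_word-style)."""
    n = len(text)
    return [text[i:j] for i in range(n) for j in range(i + 1, n + 1)
            if text[i:j] in DICT]

print(smart_like("中华人民共和国国歌"))    # ['中华人民共和国', '国歌']
print(max_word_like("中华人民共和国国歌"))
```

The coarse mode yields a non-overlapping segmentation, while the exhaustive mode emits overlapping sub-words, which is why ik_max_word produces many more tokens for the same text.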
5. Custom IK vocabulary
- Configuration file:
  /usr/local/es/elasticsearch-8.4.3/plugins/ik/config/IKAnalyzer.cfg.xml
  Adjust the path to match your own install directory.
- Edit the configuration:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Your own extension dictionaries -->
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
    <!-- Your own extension stop-word dictionaries -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- Remote extension dictionaries -->
    <entry key="remote_ext_dict">location</entry>
    <!-- Remote extension stop-word dictionaries -->
    <entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>
</properties>
- Create the custom dictionary (.dic) files under
  /usr/local/es/elasticsearch-8.4.3/plugins/ik/config/custom/
- Create the mydict.dic and single_word_low_freq.dic files.
- Add the following entries to mydict.dic, one word per line:
  小小小
  小小少年
  测测
  子天
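Creating the dictionary files can be scripted. A sketch that writes them to a temp directory (the paths are stand-ins; the real files belong under plugins/ik/config/custom/ as configured above):

```python
from pathlib import Path

# Stand-in directory; replace with .../plugins/ik/config/custom/ on a real install.
config_dir = Path("/tmp/ik-config-demo/custom")
config_dir.mkdir(parents=True, exist_ok=True)

# One word per line, UTF-8 encoded, as IK expects for .dic files.
words = ["小小小", "小小少年", "测测", "子天"]
(config_dir / "mydict.dic").write_text("\n".join(words) + "\n", encoding="utf-8")
(config_dir / "single_word_low_freq.dic").write_text("", encoding="utf-8")

print((config_dir / "mydict.dic").read_text(encoding="utf-8").splitlines())
```

Remember that ES must be restarted (or a remote dictionary configured) before the new words take effect.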
- Test
  Text to analyze: 小小小少年测测想成为天子的儿子天下无敌。
- JSON body
{
"analyzer": "ik_max_word",
"text": "小小小少年测测想成为天子的儿子天下无敌。"
}
- Result
{
"tokens": [
{
"token": "小小小",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": "小小",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 1
},
{
"token": "小小少年",
"start_offset": 1,
"end_offset": 5,
"type": "CN_WORD",
"position": 2
},
{
"token": "小小",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 3
},
{
"token": "少年",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 4
},
{
"token": "测测",
"start_offset": 5,
"end_offset": 7,
"type": "CN_WORD",
"position": 5
},
{
"token": "想成",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 6
},
{
"token": "成为",
"start_offset": 8,
"end_offset": 10,
"type": "CN_WORD",
"position": 7
},
{
"token": "天子",
"start_offset": 10,
"end_offset": 12,
"type": "CN_WORD",
"position": 8
},
{
"token": "的",
"start_offset": 12,
"end_offset": 13,
"type": "CN_CHAR",
"position": 9
},
{
"token": "儿子",
"start_offset": 13,
"end_offset": 15,
"type": "CN_WORD",
"position": 10
},
{
"token": "子天",
"start_offset": 14,
"end_offset": 16,
"type": "CN_WORD",
"position": 11
},
{
"token": "天下无敌",
"start_offset": 15,
"end_offset": 19,
"type": "CN_WORD",
"position": 12
},
{
"token": "天下",
"start_offset": 15,
"end_offset": 17,
"type": "CN_WORD",
"position": 13
},
{
"token": "无敌",
"start_offset": 17,
"end_offset": 19,
"type": "CN_WORD",
"position": 14
}
]
}
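When checking a result like the one above, it helps to pull out just the token strings from the _analyze response. A small sketch over a trimmed copy of the response body:

```python
import json

# Response body trimmed to a few tokens for brevity (taken from the result above).
response = '''
{"tokens": [
  {"token": "小小小", "start_offset": 0, "end_offset": 3, "type": "CN_WORD", "position": 0},
  {"token": "小小少年", "start_offset": 1, "end_offset": 5, "type": "CN_WORD", "position": 2},
  {"token": "测测", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 5}
]}
'''

# Keep only the "token" field of each entry.
tokens = [t["token"] for t in json.loads(response)["tokens"]]
print(tokens)  # ['小小小', '小小少年', '测测']
```

A quick membership check on this list confirms whether the custom dictionary entries were actually picked up by the analyzer.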