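1. Introduction to Analyzers
What is an analyzer?
An analyzer is a tool that splits a piece of text into individual terms according to a set of rules and normalizes those terms along the way. For example:
"hello tom and jerry" is split into four terms: "hello", "tom", "and", "jerry".
Normalization means, for example, converting the character "&" in "hello tom & jerry" to "and", or stripping the tags before tokenizing HTML: "<span>hello</span>" -> "hello".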
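The two normalization steps just described can be sketched in a few lines of Python. This only illustrates the idea and is not how Elasticsearch implements analysis; the function name normalize_and_split and the regular expression used to recognize tags are assumptions made for this example.

```python
import re

def normalize_and_split(text):
    """Strip HTML tags, rewrite "&" as "and", then split on whitespace."""
    text = re.sub(r"<[^>]+>", "", text)  # drop anything that looks like a tag
    text = text.replace("&", "and")      # normalize the ampersand
    return text.split()

print(normalize_and_split("hello tom & jerry"))   # ['hello', 'tom', 'and', 'jerry']
print(normalize_and_split("<span>hello</span>"))  # ['hello']
```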
Commonly used built-in analyzers
- standard analyzer
- simple analyzer
- whitespace analyzer
- stop analyzer
- language analyzer
- pattern analyzer
1.1 standard analyzer
The default analyzer: splits on characters that are neither letters nor digits and lowercases each term.
Test text: a*B!c d4e 5f 7-h
Result: a, b, c, d4e, 5f, 7, h
{
"tokens" : [
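{
"token" : "a", # the term produced by the analyzer
"start_offset" : 0, # offset in the original text where the term starts
"end_offset" : 1, # offset in the original text where the term ends (exclusive)
"type" : "<ALPHANUM>", # token type: ALPHANUM (alphanumeric), NUM (numeric)
"position" : 0 # ordinal position of the term in the token stream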
},
{
"token" : "b",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "c",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "d4e",
"start_offset" : 6,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "5f",
"start_offset" : 10,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "7",
"start_offset" : 13,
"end_offset" : 14,
"type" : "<NUM>",
"position" : 5
},
{
"token" : "h",
"start_offset" : 15,
"end_offset" : 16,
"type" : "<ALPHANUM>",
"position" : 6
}
]
}
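To make the start_offset, end_offset, and position fields concrete, here is a minimal Python sketch that reproduces them for the test text. It is an approximation under a stated assumption: the real standard analyzer follows Unicode text segmentation rules, whereas this sketch (with the hypothetical name standard_like) only looks for runs of ASCII letters and digits, which happens to be enough for this input.

```python
import re

def standard_like(text):
    """Emit one token per run of ASCII letters/digits, lowercased."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "position": position,           # ordinal of the token in the stream
        })
    return tokens

for tok in standard_like("a*B!c d4e 5f 7-h"):
    print(tok)
```

The token type is left out on purpose, since classifying tokens as ALPHANUM or NUM is specific to the real tokenizer.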
1.2 simple analyzer
Behavior: splits on any non-letter character and lowercases each term.
Test text: a*B!c d4e 5f 7-h
Result: a, b, c, d, e, f, h
GET _analyze
{
"analyzer": "simple",
"text": "a*B!c d4e 5f 7-h"
}
{
"tokens" : [
{
"token" : "a",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "b",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "c",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 2
},
{
"token" : "d",
"start_offset" : 6,
"end_offset" : 7,
"type" : "word",
"position" : 3
},
{
"token" : "e",
"start_offset" : 8,
"end_offset" : 9,
"type" : "word",
"position" : 4
},
{
"token" : "f",
"start_offset" : 11,
"end_offset" : 12,
"type" : "word",
"position" : 5
},
{
"token" : "h",
"start_offset" : 15,
"end_offset" : 16,
"type" : "word",
"position" : 6
}
]
}
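The difference from the standard analyzer is that digits also act as separators, which is why d4e becomes d and e. A rough Python equivalent, assuming ASCII input only (simple_like is a name made up for this sketch):

```python
import re

def simple_like(text):
    """Lowercase the text and keep each maximal run of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())

print(simple_like("a*B!c d4e 5f 7-h"))  # ['a', 'b', 'c', 'd', 'e', 'f', 'h']
```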
1.3 whitespace analyzer
Behavior: splits on whitespace only; case and punctuation are left untouched.
Test text: a*B!c D d4e 5f 7-h
Result: a*B!c, D, d4e, 5f, 7-h
GET _analyze
{
"analyzer": "whitespace",
"text": "a*B!c D d4e 5f 7-h"
}
{
"tokens" : [
{
"token" : "a*B!c",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "D",
"start_offset" : 6,
"end_offset" : 7,
"type" : "word",
"position" : 1
},
{
"token" : "d4e",
"start_offset" : 8,
"end_offset" : 11,
"type" : "word",
"position" : 2
},
{
"token" : "5f",
"start_offset" : 12,
"end_offset" : 14,
"type" : "word",
"position" : 3
},
{
"token" : "7-h",
"start_offset" : 15,
"end_offset" : 18,
"type" : "word",
"position" : 4
}
]
}
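In Python terms this behaves much like str.split() with no arguments, as the following sketch suggests (whitespace_like is a hypothetical name, and the comparison is only meant as an analogy):

```python
def whitespace_like(text):
    """Split on runs of whitespace; case and punctuation are preserved."""
    return text.split()

print(whitespace_like("a*B!c D d4e 5f 7-h"))  # ['a*B!c', 'D', 'd4e', '5f', '7-h']
```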
1.4 stop analyzer
Behavior: splits on non-letter characters, lowercases each term, and removes stop words (English stop words by default, such as the, a, an, this, of, at).
Test text: The apple is red
Result: apple, red
GET _analyze
{
"analyzer": "stop",
"text": "The apple is red"
}
{
"tokens" : [
{
"token" : "apple",
"start_offset" : 4,
"end_offset" : 9,
"type" : "word",
"position" : 1
},
{
"token" : "red",
"start_offset" : 13,
"end_offset" : 16,
"type" : "word",
"position" : 3
}
]
}
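Note that the surviving tokens keep positions 1 and 3: removing a stop word leaves a gap rather than renumbering the rest. The sketch below mimics that behavior. The stop-word set is an assumption (a small subset chosen for this example, not the full default list), and stop_like is a hypothetical name.

```python
import re

# Assumption: a small subset of English stop words, enough for this example.
STOP_WORDS = {"a", "an", "the", "is", "of", "at", "this"}

def stop_like(text):
    """Return (position, token) pairs, skipping stop words but keeping positions."""
    result = []
    for position, word in enumerate(re.findall(r"[a-z]+", text.lower())):
        if word not in STOP_WORDS:
            result.append((position, word))
    return result

print(stop_like("The apple is red"))  # [(1, 'apple'), (3, 'red')]
```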
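1.5 language analyzer
Behavior: tokenizes according to the rules of a specific language, such as english in the example below. There is no built-in analyzer that segments Chinese into words.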
GET _analyze
{
"analyzer": "english",
"text": "\"I'm Tony,\", he said, \"nice to meet you!\""
}
{
"tokens" : [
{
"token" : "i'm",
"start_offset" : 1,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "toni",
"start_offset" : 5,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "he",
"start_offset" : 13,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "said",
"start_offset" : 16,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "nice",
"start_offset" : 23,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "meet",
"start_offset" : 31,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "you",
"start_offset" : 36,
"end_offset" : 39,
"type" : "<ALPHANUM>",
"position" : 7
}
]
}
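1.6 pattern analyzer
Behavior: splits on a regular expression, \W+ by default (one or more non-word characters, i.e. anything other than letters, digits, and underscores), and lowercases each term.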
GET _analyze
{
"analyzer": "pattern",
"text": "The best 3-points shooter is Curry!"
}
{
"tokens" : [
{
"token" : "the",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "best",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "3",
"start_offset" : 9,
"end_offset" : 10,
"type" : "word",
"position" : 2
},
{
"token" : "points",
"start_offset" : 11,
"end_offset" : 17,
"type" : "word",
"position" : 3
},
{
"token" : "shooter",
"start_offset" : 18,
"end_offset" : 25,
"type" : "word",
"position" : 4
},
{
"token" : "is",
"start_offset" : 26,
"end_offset" : 28,
"type" : "word",
"position" : 5
},
{
"token" : "curry",
"start_offset" : 29,
"end_offset" : 34,
"type" : "word",
"position" : 6
}
]
}
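Python's re module understands the same \W+ pattern, so the default behavior can be approximated directly. This is a sketch rather than the analyzer's implementation; pattern_like and its pattern parameter are names invented for the example.

```python
import re

def pattern_like(text, pattern=r"\W+"):
    """Lowercase, split on the pattern, and drop empty strings."""
    return [term for term in re.split(pattern, text.lower()) if term]

print(pattern_like("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```

Passing a different pattern, for instance a single comma, changes where the text is cut, which is the point of this analyzer.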
2. Using Analyzers
2.1 Specifying an analyzer for an index
Create a test index:
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer_1": {
"type": "whitespace"
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"id": {
"type": "keyword"
},
"name": {
"type": "text"
},
"desc": {
"type": "text",
"analyzer": "my_analyzer_1"
}
}
}
}
}
Index a test document:
PUT my_index/_doc/1
{
"id": "001",
"name": "Curry",
"desc": "The best 3-points shooter is Curry!"
}
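Query: the desc field is analyzed with the whitespace analyzer, so the last term is indexed as Curry! with its case and punctuation intact. Searching for curry therefore finds nothing; the query has to be Curry! to match.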
GET my_index/_search
{
"query": {
"match": {
"desc": "curry"
}
}
}
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
GET my_index/_search
{
"query": {
"match": {
"desc": "Curry!"
}
}
}
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.2876821,
"_source" : {
"id" : "001",
"name" : "Curry",
"desc" : "The best 3-points shooter is Curry!"
}
}
]
}
}
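The two results above follow from the fact that a match query is analyzed with the field's analyzer and then compared term by term against what was indexed. A simplified Python model of that comparison is shown below; it ignores scoring entirely, and whitespace_terms and matches are hypothetical helper names.

```python
def whitespace_terms(text):
    """Terms produced by splitting on whitespace, as the whitespace analyzer does."""
    return set(text.split())

def matches(indexed_text, query_text):
    """A document matches when at least one query term equals an indexed term."""
    return bool(whitespace_terms(indexed_text) & whitespace_terms(query_text))

desc = "The best 3-points shooter is Curry!"
print(matches(desc, "curry"))   # False: the indexed term is "Curry!"
print(matches(desc, "Curry!"))  # True
```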
2.2 Changing analyzer settings
# Create an index whose analyzer enables stop words; the default standard analyzer does not remove stop words
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_standard": {
"type": "standard",
"stopwords": "_english_"
}
}
}
}
}
# Test
GET /my_index/_analyze
{
"analyzer": "my_standard",
"text": "a dog is in the house"
}
{
"tokens": [
{
"token": "dog",
"start_offset": 2,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "house",
"start_offset": 16,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 5
}
]
}
2.3 Custom analyzers
PUT /my_index
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": ["& => and"] # convert "&" to "and"
}
},
"filter": {
"my_filter": {
"type": "stop",
"stopwords": ["the", "a"] # two custom stop words
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"char_filter": ["html_strip", "my_char_filter"], # the built-in HTML tag stripper plus the custom my_char_filter
"tokenizer": "standard",
"filter": ["lowercase", "my_filter"] # the built-in lowercase filter plus the custom my_filter
}
}
}
}
}
GET /my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "tom&jerry are a friend in the house, <a>, HAHA!!"
}
{
"tokens": [
{
"token": "tomandjerry",
"start_offset": 0,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "are",
"start_offset": 10,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "friend",
"start_offset": 16,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "in",
"start_offset": 23,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "house",
"start_offset": 30,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "haha",
"start_offset": 42,
"end_offset": 46,
"type": "<ALPHANUM>",
"position": 7
}
]
}
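A custom analyzer runs in three stages: character filters rewrite the raw text, the tokenizer cuts it into tokens, and token filters transform the token list. The Python sketch below mirrors that order for my_analyzer. Each function is a simplified stand-in named after the component it imitates (an assumption for illustration, not the Elasticsearch implementation), and offsets and positions are not tracked.

```python
import re

STOP_WORDS = {"the", "a"}  # mirrors the stopwords of my_filter

def html_strip(text):
    """Character filter: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def my_char_filter(text):
    """Character filter: apply the mapping "& => and"."""
    return text.replace("&", "and")

def standard_tokenizer(text):
    """Tokenizer: ASCII-only approximation of the standard tokenizer."""
    return re.findall(r"[A-Za-z0-9]+", text)

def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]

def my_filter(tokens):
    """Token filter: drop the configured stop words."""
    return [token for token in tokens if token not in STOP_WORDS]

def my_analyzer(text):
    for char_filter in (html_strip, my_char_filter):
        text = char_filter(text)
    tokens = standard_tokenizer(text)
    for token_filter in (lowercase, my_filter):
        tokens = token_filter(tokens)
    return tokens

print(my_analyzer("tom&jerry are a friend in the house, <a>, HAHA!!"))
# ['tomandjerry', 'are', 'friend', 'in', 'house', 'haha']
```

Because the mapping inserts "and" without surrounding spaces, tom&jerry becomes the single token tomandjerry, exactly as in the response above.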
2.4 Setting a custom analyzer on a specific field of a specific type
PUT /my_index/_mapping/my_type
{
"properties": {
"content": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
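3. Chinese Analyzers
3.1 Introduction to Chinese analyzers
Elasticsearch's built-in analyzers cannot segment Chinese text into words. For example, the standard analyzer emits one token per character: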
GET _analyze
{
"analyzer": "standard",
"text": "火箭明年总冠军"
}
{
"tokens" : [
{
"token" : "火",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "箭",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "明",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "年",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "总",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 4
},
{
"token" : "冠",
"start_offset" : 5,
"end_offset" : 6,
"type" : "<IDEOGRAPHIC>",
"position" : 5
},
{
"token" : "军",
"start_offset" : 6,
"end_offset" : 7,
"type" : "<IDEOGRAPHIC>",
"position" : 6
}
]
}
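The result we want is 火箭, 明年, 总冠军, which requires a Chinese analyzer.
- Common Chinese analyzers
- smartCN: a simple analyzer for Chinese or mixed Chinese-English text
- IK analyzer: a smarter, more user-friendly Chinese analyzer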
3.2 Installing smartCN
bin/elasticsearch-plugin install analysis-smartcn
After installation, restart the ES cluster and test:
GET _analyze
{
"analyzer": "smartcn",
"text": "火箭明年总冠军"
}
{
"tokens" : [
{
"token" : "火箭",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "明年",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 1
},
{
"token" : "总",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 2
},
{
"token" : "冠军",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 3
}
]
}
3.3 Installing the IK analyzer
Download: https://github.com/medcl/elasticsearch-analysis-ik/releases
Download the IK release that matches your ES version: elasticsearch-analysis-ik-x.x.x.zip
- Create an ik directory under the ES plugins directory:
[giant@jd2 plugins]$ mkdir ik
- Upload elasticsearch-analysis-ik-x.x.x.zip to plugins/ik and unzip it:
[giant@jd2 ik]$ unzip elasticsearch-analysis-ik-6.6.0.zip
- Delete the elasticsearch-analysis-ik-x.x.x.zip package:
[giant@jd2 ik]$ rm -rf elasticsearch-analysis-ik-6.6.0.zip
[giant@jd2 ik]$ ll
total 1428
-rw-r--r-- 1 giant giant 263965 Jan 15 17:07 commons-codec-1.9.jar
-rw-r--r-- 1 giant giant 61829 Jan 15 17:07 commons-logging-1.2.jar
drwxr-xr-x 2 giant giant 299 Jan 15 17:07 config
-rw-r--r-- 1 giant giant 54693 Jan 15 17:07 elasticsearch-analysis-ik-6.6.0.jar
-rw-r--r-- 1 giant giant 736658 Jan 15 17:07 httpclient-4.5.2.jar
-rw-r--r-- 1 giant giant 326724 Jan 15 17:07 httpcore-4.4.4.jar
-rw-r--r-- 1 giant giant 1805 Jan 15 17:07 plugin-descriptor.properties
-rw-r--r-- 1 giant giant 125 Jan 15 17:07 plugin-security.policy
Repeat these steps on every ES node, then restart the ES cluster.
Test the IK analyzer:
GET _analyze
{
"analyzer": "ik_max_word",
"text": "火箭明年总冠军"
}
{
"tokens" : [
{
"token" : "火箭",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "明年",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "总冠军",
"start_offset" : 4,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "冠军",
"start_offset" : 5,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 3
}
]
}
The IK plugin provides two analyzers, ik_max_word and ik_smart:
- ik_max_word: splits the text at the finest granularity
- ik_smart: splits the text at the coarsest granularity
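The contrast between the two granularities can be illustrated with a toy dictionary-based segmenter. This is not IK's actual algorithm: the four-word dictionary, the function names, and the choice of forward maximum matching for the coarse case are all assumptions made to show the idea.

```python
# Assumption: a tiny hand-made dictionary standing in for IK's main.dic.
DICTIONARY = {"火箭", "明年", "总冠军", "冠军"}
MAX_LEN = max(len(word) for word in DICTIONARY)

def finest_grained(text):
    """Emit every dictionary word found at every start offset."""
    found = []
    for start in range(len(text)):
        for end in range(min(len(text), start + MAX_LEN), start, -1):
            if text[start:end] in DICTIONARY:
                found.append(text[start:end])
    return found

def coarsest_grained(text):
    """Forward maximum matching: take the longest word at each step."""
    found, start = [], 0
    while start < len(text):
        for end in range(min(len(text), start + MAX_LEN), start, -1):
            # Fall back to a single character when no dictionary word fits.
            if text[start:end] in DICTIONARY or end == start + 1:
                found.append(text[start:end])
                start = end
                break
    return found

print(finest_grained("火箭明年总冠军"))    # ['火箭', '明年', '总冠军', '冠军']
print(coarsest_grained("火箭明年总冠军"))  # ['火箭', '明年', '总冠军']
```

The fine-grained variant emits the overlapping words 总冠军 and 冠军, matching the ik_max_word response above, while the coarse variant keeps only the longest word at each step.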
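3.4 IK analyzer configuration files
- IKAnalyzer.cfg.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>IK Analyzer extension configuration</comment>
<!-- Users can configure their own extension dictionary here -->
<entry key="ext_dict"></entry>
<!-- Users can configure their own extension stop-word dictionary here -->
<entry key="ext_stopwords"></entry>
<!-- Users can configure a remote extension dictionary here -->
<!-- <entry key="remote_ext_dict">words_location</entry> -->
<!-- Users can configure a remote extension stop-word dictionary here -->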
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
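- main.dic: IK's built-in Chinese dictionary, with more than 270,000 entries; any word defined here is kept together as a single term
- quantifier.dic: unit and measure words
- suffix.dic: common suffixes
- surname.dic: Chinese surnames
- stopword.dic: English stop words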
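3.5 Custom dictionaries
- Custom dictionary: new buzzwords appear every year, such as 网红, 蓝瘦香菇, 喊麦, and 鬼畜, and they are usually missing from IK's stock dictionary. Add them to a dictionary file of your own, then reference that file in IKAnalyzer.cfg.xml.
- Custom stop-word dictionary: words such as "了", "的", "啥", and "么" are ones we may not want to index or make searchable.
<entry key="ext_dict">custom/mydict.dic</entry>
<entry key="ext_stopwords">custom/mystopdict.dic</entry>
ES must be restarted for the change to take effect.
- Test (before the custom dictionary is added)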
GET _analyze
{
"analyzer": "ik_max_word",
"text": "网红"
}
{
"tokens": [
{
"token": "网",
"start_offset": 0,
"end_offset": 1,
"type": "CN_CHAR",
"position": 0
},
{
"token": "红",
"start_offset": 1,
"end_offset": 2,
"type": "CN_CHAR",
"position": 1
}
]
}
- Create the custom dictionary
mkdir -p ${ELASTICSEARCH_HOME}/plugins/ik/config/custom
touch ${ELASTICSEARCH_HOME}/plugins/ik/config/custom/mydict.dic
# Then add the word 网红 to the file
# Then edit IKAnalyzer.cfg.xml
<entry key="ext_dict">custom/mydict.dic</entry>
- Restart ES and test again
GET _analyze
{
"analyzer": "ik_max_word",
"text": "网红"
}
{
"tokens": [
{
"token": "网红",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
}
]
}
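The before-and-after behavior can be reproduced with a small Python sketch that loads a dictionary file and segments text against it. Several things here are assumptions for illustration: the file is treated as UTF-8 with one word per line, a temporary file stands in for custom/mydict.dic, and the matching strategy (forward maximum matching with a single-character fallback) is a simplification, not IK's real algorithm.

```python
import tempfile
from pathlib import Path

def load_dictionary(path):
    """Read one word per line from a UTF-8 file, ignoring blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}

def segment(text, dictionary):
    """Forward maximum matching with a single-character fallback."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens, start = [], 0
    while start < len(text):
        for end in range(min(len(text), start + max_len), start, -1):
            if text[start:end] in dictionary or end == start + 1:
                tokens.append(text[start:end])
                start = end
                break
    return tokens

with tempfile.TemporaryDirectory() as tmp:
    dict_path = Path(tmp) / "mydict.dic"  # stand-in for custom/mydict.dic
    dict_path.write_text("", encoding="utf-8")
    before = segment("网红", load_dictionary(dict_path))
    dict_path.write_text("网红\n", encoding="utf-8")
    after = segment("网红", load_dictionary(dict_path))

print(before)  # ['网', '红']
print(after)   # ['网红']
```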