Elastic Stack

Beats: data collection
Logstash: data transformation
Elasticsearch: storage / indexing / aggregation
Kibana: data visualization

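Node roles

Node role	Setting and default value
Master-eligible	`node.master=true`
Data	`node.data=true`
Machine Learning	`node.ml=true` && `xpack.ml.enabled=true`
Ingest (pre-processing)	`node.ingest=true`
Coordinating only	No dedicated setting: set all of the role settings above to false (xpack settings aside). Every node acts as a coordinating node and this cannot be disabled; the role routes requests and gathers the partial results.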

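Every node knows about all the other nodes in the cluster and can forward a request to the appropriate node.
By default, every node can handle both HTTP and transport traffic.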

A data node holds:

Shard data
Cluster metadata and index metadata

A master-eligible node holds:

Cluster metadata and index metadata

https://www.elastic.co/guide/en/elasticsearch/reference/7.6/modules-node.html

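Shards

Primary shard	Replica shard
For: horizontal scaling. Primary shards are distributed across multiple machines.	For: high availability. A replica is a copy of a primary shard.
The number of primary shards is fixed when the index is created and cannot be changed afterwards (except by reindexing).	The number of replicas can be raised or lowered dynamically to tune availability and read throughput.

Each shard is one Lucene index instance.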

Status APIs

# Cluster health
GET /_cluster/health

# List nodes
GET /_cat/nodes?v

# List shards
GET /_cat/shards?v

# Overview of all indices
GET /_cat/indices?v

# Overview of a single index
GET /_cat/indices/movies?v

# Settings and mappings of that index
GET /movies

CRUD

To let Elasticsearch generate the ID when creating a document, use POST (POST /users/_doc); PUT requires an explicit ID.

# get: read by ID
GET /users/_doc/123
GET /users/_doc/234

# create: create only (fails if the ID already exists)
POST /users/_create/234
{
  "firstName": "BD",
  "lastName": "C",
  "tags": ["boy", "engineer"]
}

# index: create, or fully replace an existing document
PUT /users/_doc/123
{
  "firstName": "BD",
  "lastName": "C",
  "tags": ["boy", "engineer"]
}

# update: partial update (the ID must already exist)
POST /users/_update/123
{
  "doc": {
    "firstName": "Focus"
  }
}

# delete: delete by ID
DELETE /users/_doc/123
DELETE /users/_doc/234

# Bulk operations (a failed action does not stop the remaining ones)
POST /_bulk
{"delete": {"_index": "users", "_id": 123}}
{"create": {"_index": "users", "_id": 123}}
{"tags": ["boy", "engineer"]}
{"index": {"_index": "users", "_id": 123}}
{"firstName": "bulk FN", "lastName": "bulk LN"}
{"update": {"_index": "users", "_id": 123}}
{"doc": {"tags": ["PHP", "Ruby", "Java"]}}
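The bulk body above is newline-delimited JSON: one action line, followed by a source line for every action except delete, and the body has to end with a newline. As a rough sketch of how a client might assemble such a body, the helper below (`build_bulk_body` is a hypothetical name, not part of any Elasticsearch client) serializes a list of operations; sending it over HTTP is left out.

```python
import json

def build_bulk_body(operations):
    """Serialize (action, metadata, source) triples as newline-delimited JSON."""
    lines = []
    for action, metadata, source in operations:
        lines.append(json.dumps({action: metadata}))
        if source is not None:  # a delete action has no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

body = build_bulk_body([
    ("delete", {"_index": "users", "_id": "123"}, None),
    ("create", {"_index": "users", "_id": "123"}, {"tags": ["boy", "engineer"]}),
    ("update", {"_index": "users", "_id": "123"}, {"doc": {"tags": ["PHP"]}}),
])
print(body)
```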


# Multi-get
GET /_mget
{
  "docs": [
    {
      "_index": "users",
      "_id": 123
    },
    {
      "_index": "users",
      "_id": 234
    }
  ]
}

Inverted index

	Forward index	Inverted index
key	`id: 123`	`term: Ming`
value	`name: Xiao Ming Ming, age: 19`	`id: 123, term_frequency: 2, position: [1, 2], offset: [[5, 9], [10, 14]]`
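To make the table concrete, here is a minimal sketch of building such postings in plain Python. It assumes a whitespace tokenizer and zero-based token positions, and `build_inverted_index` is an illustrative name; Lucene's real data structures are far more compact.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to its postings: doc id -> frequency, positions, offsets."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # position is the token's ordinal; offsets are character spans in the text
        for position, match in enumerate(re.finditer(r"\S+", text)):
            posting = index[match.group()].setdefault(
                doc_id, {"term_frequency": 0, "position": [], "offset": []}
            )
            posting["term_frequency"] += 1
            posting["position"].append(position)
            posting["offset"].append([match.start(), match.end()])
    return dict(index)

index = build_inverted_index({123: "Xiao Ming Ming"})
print(index["Ming"])
# {123: {'term_frequency': 2, 'position': [1, 2], 'offset': [[5, 9], [10, 14]]}}
```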

Analyzer

Character Filter => Tokenizer => Token Filter

Search

Path	Scope
/_search	all indices
/index1/_search	index1
/index1,index2/_search	index1 and index2
/index*/_search	indices whose names start with index

URI search

# Search for Jack in any field of all indices
GET /_search?q=Jack

# Search for Jack in the title field of all indices
GET /_search?q=title:Jack

# Search for Jack in any field of the users index
GET /users/_search?q=Jack

# Search for Jack in the firstName field of the users index
GET /users/_search?q=firstName:Jack

# Fuzzy query across all indices (edit distance 1)
GET /_search?q=ExhalX~1

# Matches documents that contain Waiting or Exhale
GET /_search?q=Waiting Exhale

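# Matches documents that contain both Waiting and Exhale, with Exhale immediately after Waiting
# Adjacency is measured in token positions; the amount of whitespace in between does not matter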
GET /_search?q="Waiting Exhale"

Request body search

# Match all documents
GET /movies/_search
{
  "query": {
    "match_all": {}
  }
}

# `_source`: selects the fields to return
# `from` / `size`: pagination (offset and page size)
# `script_fields`: fields computed by a script
GET /movies/_search
{
  "_source": [
    "id",
    "title"
  ],
  "from": 0,
  "size": 5,
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "beautiful_title": {
      "script": {
        "lang": "painless",
        "source": "'<h1>' + doc['title.keyword'].value + '</h1>'"
      }
    }
  }
}

# Phrase match
# The text is first analyzed into terms, each with its position
# Every term must match, in the same relative order (whitespace does not matter)
GET /movies/_search
{
  "query": {
    "match_phrase": {
      "title": "Waiting to Exhale"
    }
  }
}

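# slop relaxes the strict term order
# With slop 2, two adjacent terms may be swapped and still match
# With slop 3, swapped terms still match when one word sits between them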
GET /movies/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "Exhale Waiting",
        "slop": 3,
        "analyzer": "simple"
      }
    }
  }
}
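The slop values in the comments above can be reproduced with a small model. The sketch below is a simplification, not Lucene's actual matcher: it assumes every query term occurs exactly once in the document, and `required_slop` is a hypothetical helper name. It measures how far each term is shifted relative to its place in the query and reports the spread of those shifts.

```python
def required_slop(doc_tokens, query_tokens):
    """Smallest slop at which the phrase matches, or None if a term is missing.

    Simplifying assumption: each query term occurs exactly once in the document.
    """
    if any(term not in doc_tokens for term in query_tokens):
        return None
    # shift = position in the document minus position in the query
    shifts = [doc_tokens.index(term) - i for i, term in enumerate(query_tokens)]
    return max(shifts) - min(shifts)

title = "waiting to exhale".split()
print(required_slop(title, ["waiting", "to", "exhale"]))               # 0
print(required_slop("waiting exhale".split(), ["exhale", "waiting"]))  # 2
print(required_slop(title, ["exhale", "waiting"]))                     # 3
```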

Information Retrieval

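precision: return as few irrelevant documents as possible (relevant documents returned / all documents returned)
recall: return as many of the relevant documents as possible (relevant documents returned / all relevant documents)
ranking: order the results by relevance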

true or false, positive or negative

true positive: relevant, and returned
false positive: irrelevant, but returned
true negative: irrelevant, and not returned
false negative: relevant, but not returned
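As a small worked example of these definitions, the sketch below computes precision and recall from a set of returned document IDs and a set of relevant ones. The IDs are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) from returned and relevant document IDs."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)    # relevant and returned
    false_positives = len(returned - relevant)   # irrelevant but returned
    false_negatives = len(relevant - returned)   # relevant but not returned
    precision = true_positives / (true_positives + false_positives) if returned else 0.0
    recall = true_positives / (true_positives + false_negatives) if relevant else 0.0
    return precision, recall

precision, recall = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(precision, recall)  # 0.5 0.4
```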

Mapping

-- Simple types:

Text / Keyword
Date
Integer / Floating
Boolean
IPv4 / IPv6

-- Complex types:

Object
Nested

-- Special types:

Geo point
Geo shape
Percolator

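`dynamic` setting	Document with new fields is indexed	New fields are searchable	Mapping is updated
true (default)	√	√	√
false	√	×	×
strict	×	×	×

The mapping of an existing field cannot be changed (except by reindexing).
When dynamic is false, new fields are kept in _source, but the mapping is not updated and those fields cannot be searched.
Whether a new field is searchable depends on whether the mapping was updated.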
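The three rows of the table can be restated as a tiny decision function. This is only a model of the behaviour described above; `index_document` and its return shape are invented for illustration and do not correspond to any Elasticsearch API.

```python
def index_document(mapped_fields, dynamic, doc):
    """Return (accepted, mapped_fields_after, searchable_fields) for one document.

    dynamic is one of "true", "false" or "strict".
    """
    new_fields = [name for name in doc if name not in mapped_fields]
    if new_fields and dynamic == "strict":
        return False, set(mapped_fields), set()   # the whole document is rejected
    updated = set(mapped_fields)
    if dynamic == "true":
        updated.update(new_fields)                # the mapping grows
    # only fields present in the mapping can be searched
    searchable = {name for name in doc if name in updated}
    return True, updated, searchable

doc = {"name": "Xiao Gang", "tags": ["Ruby", "Java"]}
for mode in ("true", "false", "strict"):
    print(mode, index_document({"name"}, mode, doc))
```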

DELETE /demo

PUT /demo/_doc/1
{
  "name": "Xiao Ming"
}

GET /demo/_doc/1

GET /demo/_mapping

# Change the dynamic setting of the mapping
POST /demo/_mapping
{
  "dynamic": "strict"
}

GET /demo/_mapping

# Try to add a new field (rejected with an error)
POST /demo/_update/1
{
  "doc": {
    "formats": [
      "json",
      "xml"
    ]
  }
}

# Change the dynamic setting of the mapping
POST /demo/_mapping
{
  "dynamic": "false"
}

GET /demo/_mapping

# Try to add a new field
# The new field is stored in _source; the mapping is unchanged
POST /demo/_update/1
{
  "doc": {
    "name": "Xiao Gang",
    "tags": [
      "Ruby",
      "Java"
    ]
  }
}

GET /demo/_doc/1

# Fields present in the mapping are searchable
GET /demo/_search
{
  "query": {
    "match": {
      "name": "xiao"
    }
  }
}

# The new field is not searchable (it is absent from the mapping)
GET /demo/_search
{
  "query": {
    "match": {
      "tags": "ruby"
    }
  }
}

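Mapping Definition

"index": false means no inverted index is built for the field, so it cannot be searched.
To create an index with explicit mappings, use PUT /index_name {"mappings": {}} (POST is not accepted).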

DELETE /demo

PUT /demo
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "mobile": {
        "type": "text",
        "index": false
      }
    }
  }
}

GET /demo/_mapping

POST /demo/_doc/1
{
  "name": "Xiao Ming",
  "mobile": "021-1234567"
}

GET /demo/_search
{
  "query": {
    "match": {
      "name": "Xiao"
    }
  }
}

# Cannot search on field [mobile] since it is not indexed.
GET /demo/_search
{
  "query": {
    "match": {
      "mobile": "021"
    }
  }
}

index_options levels

Level	Recorded information
docs	doc id
freqs	doc id, term frequencies
positions	doc id, term frequencies, term positions
offsets	doc id, term frequencies, term positions, character offsets

text fields default to positions; other field types default to docs.

NULL Value

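Elasticsearch cannot index or search a null value; by default, null and [] are both treated as a field with no value.
null_value substitutes a stand-in value for an explicit null at index time (an empty array contains no explicit null, so it is not replaced).
_source still shows the original value; the stand-in affects indexing only.
The stand-in must have the same data type as the field.
null_value cannot be set on a text field, but it can on a keyword field.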

DELETE /demo

PUT /demo
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword",
            "null_value": "NULLVALUE"
          }
        }
      },
      "age": {
        "type": "integer",
        "null_value": 18
      }
    }
  }
}

GET /demo/_mapping

POST /demo/_doc/1
{
  "name": null,
  "age": null
}

GET /demo/_doc/1

GET /demo/_search
{
  "query": {
    "match": {
      "name.raw": "NULLVALUE"
    }
  }
}

copy_to

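_all is deprecated in newer versions; copy_to provides similar functionality.
The copied values do not appear in _source.
The copy_to target field can also hold a value of its own.
When a source field is updated, what is indexed in the target follows.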

DELETE /demo

PUT /demo
{
  "mappings": {
    "properties": {
      "first": {
        "type": "text",
        "copy_to": "full"
      },
      "second": {
        "type": "text",
        "copy_to": "full"
      },
      "full": {
        "type": "text"
      }
    }
  }
}

GET /demo/_mapping

POST /demo/_doc/1
{
  "full": "hi",
  "first": "HELLO",
  "second": "WORLD"
}

GET /demo/_doc/1

GET /demo/_search
{
  "query": {
    "match": {
      "full": "hello"
    }
  }
}

POST /demo/_update/1
{
  "doc": {
    "first": "ruby",
    "second": "java"
  }
}

GET /demo/_doc/1

GET /demo/_search
{
  "query": {
    "match": {
      "full": "hello"
    }
  }
}

fields

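With dynamic mapping, a string is automatically mapped as a text field with a keyword sub-field, as in the mapping below.
This relies on the multi-fields feature: one field indexed as several types.

PUT /demo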
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

Analyzer

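Processing order: Character Filter => Tokenizer => Token Filter

Character Filter:
Pre-processes the raw text. Several character filters may be configured; they run in order.
For example, strip HTML tags first, then apply custom replacement rules: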

GET /_analyze
{
  "char_filter": [
    "html_strip",
    {
      "type": "mapping",
      "mappings": [
        "_ => -",
        ":( => __unhappy__",
        ":) => __happy__"
      ]
    }
  ],
  "text": [
    "<h1>Hello_world <code>:)</code> </h1>"
  ]
}

Tokenizer:
Splits the text into terms; exactly one tokenizer is allowed.
It also records each term's order, position, and offsets.

GET /_analyze
{
  "tokenizer": "whitespace",
  "text": "A big Apple."
}

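Token Filter:
Filters and otherwise transforms the tokens produced by the tokenizer (synonyms, for example), i.e. it normalizes them. Several token filters may be configured.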

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    "stop"
  ],
  "text": "A big Apple."
}
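The three stages can be mimicked in a few lines of Python to show how the output of one stage feeds the next. This is a toy model under stated assumptions: `strip_html`, `whitespace_tokenizer`, `lowercase` and `stop` are simplified stand-ins for the built-in components, the stop-word list is abbreviated, and offsets are relative to the filtered text rather than the original.

```python
import re

def strip_html(text):
    """Character filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def whitespace_tokenizer(text):
    """Tokenizer: split on whitespace, recording position and character offsets."""
    return [
        {"token": m.group(), "position": i, "start_offset": m.start(), "end_offset": m.end()}
        for i, m in enumerate(re.finditer(r"\S+", text))
    ]

def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [{**t, "token": t["token"].lower()} for t in tokens]

def stop(tokens, stopwords=frozenset({"a", "an", "the"})):
    """Token filter: remove stop words (abbreviated list, for illustration)."""
    return [t for t in tokens if t["token"] not in stopwords]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run the stages in order: character filters, one tokenizer, token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze("<b>A</b> big Apple.", [strip_html], whitespace_tokenizer, [lowercase, stop])
print([(t["token"], t["position"]) for t in tokens])  # [('big', 1), ('apple.', 2)]
```

Note that the removed stop word still consumes position 0, which is why "big" keeps position 1.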

Custom analyzer

DELETE /demo

PUT /demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "my_char_filter"
          ],
          "tokenizer": "my_tokenizer",
          "filter": [
            "lowercase",
            "my_token_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            ":) => __happy__",
            ":( => __unhappy__"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[,.!?@-]"
        }
      },
      "filter": {
        "my_token_filter": {
          "type": "stop",
          "stopwords": [
            "hi",
            "hello"
          ]
        }
      }
    }
  }
}

GET /demo/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Hello, @Big-Apple! :)"
}

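Custom synonym analyzer:

Synonyms are substituted at index time or at search time.
expand defaults to true: the synonyms in a rule are interchangeable, and each one expands to all of them;
expand set to false: every word in a rule is mapped to the first word of that rule;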

DELETE /demo

PUT /demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "my_synonym"
          ]
        }
      },
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "expand": true,
          "synonyms": [
            "IT, Internet",
            "IT, Internet Technology",
            "IT, Integration Testing"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "job": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

GET /demo/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Integration Testing"
}

POST /_bulk
{"index": {"_index": "demo", "_id": 1}}
{"job": "Testing"}
{"index": {"_index": "demo", "_id": 2}}
{"job": "IT"}
{"index": {"_index": "demo", "_id": 3}}
{"job": "Internet"}
{"index": {"_index": "demo", "_id": 4}}
{"job": "IT"}
{"index": {"_index": "demo", "_id": 5}}
{"job": "Internet Technology"}
{"index": {"_index": "demo", "_id": 6}}
{"job": "Integration Testing"}


GET /demo/_search
{
  "query": {
    "match": {
      "job": "Testing"
    }
  }
}

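Analyzers run in two phases:

Index time: analyzer
Search time: search_analyzer

Index analyzer resolution order:
1. The analyzer mapping parameter of the field
2. analysis.analyzer.default in the index settings
3. The standard analyzer, as the fallback

Search analyzer resolution order:
1. The analyzer specified in the search request
2. The search_analyzer mapping parameter of the field
3. analysis.analyzer.default_search in the index settings
4. The analyzer mapping parameter of the field
5. The standard analyzer, as the fallback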
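The two lookup orders above amount to a chain of fallbacks. The sketch below restates them as plain functions; the function names and the `defined_analyzers` argument (the set of analyzer names under analysis.analyzer in the index settings) are assumptions made for illustration only.

```python
def resolve_index_analyzer(field_mapping, defined_analyzers):
    """Pick the analyzer used when a field value is indexed."""
    if "analyzer" in field_mapping:
        return field_mapping["analyzer"]
    if "default" in defined_analyzers:          # analysis.analyzer.default
        return "default"
    return "standard"

def resolve_search_analyzer(query_analyzer, field_mapping, defined_analyzers):
    """Pick the analyzer used for the query text at search time."""
    if query_analyzer:                          # analyzer given in the search request
        return query_analyzer
    if "search_analyzer" in field_mapping:
        return field_mapping["search_analyzer"]
    if "default_search" in defined_analyzers:   # analysis.analyzer.default_search
        return "default_search"
    if "analyzer" in field_mapping:
        return field_mapping["analyzer"]
    return "standard"

job = {"type": "text", "analyzer": "my_analyzer", "search_analyzer": "standard"}
print(resolve_index_analyzer(job, {"my_analyzer"}))         # my_analyzer
print(resolve_search_analyzer(None, job, {"my_analyzer"}))  # standard
```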

https://www.elastic.co/guide/en/elasticsearch/reference/current/specify-analyzer.html#specify-index-time-analyzer

Index Template

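Acts like a global default: it is applied to every index whose name matches its patterns.
It only takes effect when an index is created (deleting a template does not affect existing indices, and neither does adding a new one).

Order in which settings are applied when an index is created:
1. The default settings / mappings;
2. The matching index templates in ascending `order`, starting from 0, each overriding the previous ones;
3. The settings / mappings given in the create-index request, which override all of the above.

Lower-order templates provide the base settings; higher-order templates provide the more specific ones.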
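The layering above can be sketched as successive dictionary updates. This is a simplified model: settings are treated as a flat dictionary, patterns are matched with shell-style wildcards, and the template contents and default values are invented for the example; `effective_settings` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

def effective_settings(defaults, templates, index_name, request_settings):
    """Merge defaults, matching templates (ascending order), then the request."""
    merged = dict(defaults)
    matching = [
        t for t in templates
        if any(fnmatchcase(index_name, pattern) for pattern in t["index_patterns"])
    ]
    for template in sorted(matching, key=lambda t: t.get("order", 0)):
        merged.update(template["settings"])     # higher order overrides lower
    merged.update(request_settings)             # the request overrides everything
    return merged

templates = [
    {"order": 1, "index_patterns": ["demo*"],
     "settings": {"number_of_replicas": 2, "refresh_interval": "30s"}},
    {"order": 0, "index_patterns": ["*"], "settings": {"number_of_replicas": 0}},
    {"order": 5, "index_patterns": ["logs-*"], "settings": {"number_of_shards": 3}},
]
defaults = {"number_of_shards": 1, "number_of_replicas": 1}
result = effective_settings(defaults, templates, "demo", {"refresh_interval": "5s"})
print(result)
# {'number_of_shards': 1, 'number_of_replicas': 2, 'refresh_interval': '5s'}
```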

# List all templates
GET /_template
GET /_template/*

# Get specific templates
GET /_template/my_template
GET /_template/my*

DELETE /_template/my_template

POST /_template/my_template
{
  "order": 1, 
  "index_patterns": [
    "*"
  ],
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "my_synonym"
          ]
        }
      },
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": [
            "IT, Internet",
            "IT, Internet Technology",
            "IT, Integration Testing"
          ]
        }
      }
    }
  },
  "mappings": {}
}

DELETE /demo

PUT /demo
{
  "mappings": {
    "properties": {
      "job": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "simple"
      }
    }
  }
}

GET /demo/_mapping

GET /demo/_analyze
{
  "field": "job",
  "text": "Internet Technology"
}

Dynamic Template

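Defined on a specific index; it offers a more convenient way to decide how dynamically added fields are mapped, by matching on field name or detected type.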

https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html

Aggregations (aggs)

GET /kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "my_aggs_dest": {
      "terms": {
        "field": "DestCountry",
        "size": 3
      },
      "aggs": {
        "my_price": {
          "stats": {
            "field": "AvgTicketPrice"
          }
        }
      }
    }
  }
}

Elastic Search Conception