Getting Started with Elasticsearch

Tags: search

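Elasticsearch is a powerful open-source search and analytics engine that is highly scalable and highly available.

The latest version at the time of writing is 1.7.2. Its search is powerful: combined with a Chinese tokenizer, it retrieves data efficiently, returns the documents that match the query, and ranks them by a weighted relevance score.

This article is only an introduction, but it touches on most of the commonly used features.

Elasticsearch is a mature product. For a deeper understanding, read the official documentation in detail.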

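Current state

Powerful as Elasticsearch is, you have to structure the data and import it yourself. Versions before 1.5 supported river plugins, which could sync data directly from a database into Elasticsearch, but data-import plugins are no longer recommended in current versions, so you need to write your own data synchronization module.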

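Chinese word segmentation is available through the elasticsearch-analysis-ik plugin (https://github.com/medcl/elasticsearch-analysis-ik). The plugin supports live updates of hot words and lets you configure different segmentation strategies.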

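Elasticsearch is written in Java, so a Java runtime must be installed to run it. The interface is RESTful: in practice, use the HTTP method that corresponds to each CRUD operation.

With the default configuration it is reachable at localhost on port 9200.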
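The CRUD-to-HTTP correspondence can be sketched in Python using only the standard library. The names `CRUD_TO_HTTP` and `build_request` are illustrative assumptions, not part of Elasticsearch, and the base URL assumes the default address mentioned above. The sketch only constructs a request; sending it requires a running node.

```python
import json
import urllib.request

# Assumption: Elasticsearch is reachable at the default address used in this article.
BASE_URL = "http://localhost:9200"

# Illustrative mapping from CRUD operations to HTTP methods.
CRUD_TO_HTTP = {
    "create": "PUT",
    "read": "GET",
    "update": "PUT",
    "delete": "DELETE",
}


def build_request(operation, path, document=None):
    """Build (but do not send) an HTTP request for a CRUD operation."""
    data = None
    if document is not None:
        data = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=CRUD_TO_HTTP[operation],
        headers={"Content-Type": "application/json"},
    )


request = build_request("create", "/index/test/1", {"title": "最新电影"})
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```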

Data structure

Documents are addressed as /{index}/{type}/{id}, a three-level hierarchy, and the original data is stored as JSON.

For example:

PUT /index/test/1
{ "title": "最新电影" }

GET /index/test/1
{
    "_index": "index",
    "_type": "test",
    "_id": "1",
    "_version": 1,
    "found": true,
    "_source": {
        "title": "最新电影"
    }
}

The original document is stored under _source, and the stored record carries additional metadata: _index, _type, _id, and _version. Issuing the PUT again updates the document and changes _version to 2.
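As a small illustration of that layout, the following Python sketch separates the original document from the metadata in the GET response shown above. The helper name `split_response` is hypothetical; the response text is copied from the example.

```python
import json

# The GET response shown above, as a JSON string.
RESPONSE_TEXT = """
{
    "_index": "index",
    "_type": "test",
    "_id": "1",
    "_version": 1,
    "found": true,
    "_source": {"title": "最新电影"}
}
"""

METADATA_KEYS = ("_index", "_type", "_id", "_version")


def split_response(response):
    """Separate the original document from the metadata Elasticsearch adds."""
    metadata = {key: response[key] for key in METADATA_KEYS}
    return response["_source"], metadata


document, metadata = split_response(json.loads(RESPONSE_TEXT))
print(document)
print(metadata)
```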

Importing data

Data is imported with POST (or PUT when you supply the id yourself).

1. Create the index

To create a plain index, no parameters are needed:

PUT http://localhost:9200/test
// Response
{
    "acknowledged": true
}

If you need Analysis (tokenizing and analyzing text), create the index with detailed settings and mappings:

PUT http://localhost:9200/test
{
  "settings": {
     "refresh_interval": "5s",
     "number_of_shards" :   1, // one primary shard (default 5)
     "number_of_replicas" : 0 // no replicas; can be raised later (default 1)
  },
  "mappings": {
    "_default_":{
      "_all": { "enabled":  false } // disable the _all field, since we only search the title field
    },
    "resource": { // this is the _type
      "dynamic": false, // disable dynamic mapping
      "properties": {
        "title": { // analyze (tokenize) the title field
          "type": "string",
          "index": "analyzed",
          "fields": { // sub-fields, one analyzer per language
            "cn": { // Chinese text uses the Chinese analyzer
              "type": "string",
              "analyzer": "ik_smart"
            },
            "en": { // English text uses the English analyzer
              "type": "string",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}
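The `//` comments in the request body above are there for explanation only. Standard JSON has no comment syntax, so it is safer not to rely on the server accepting them. One option, sketched here in Python, is to build the same body as a dictionary and serialize it. The function name `build_index_body` is an assumption for illustration; the settings and mappings are the ones from the example.

```python
import json


def build_index_body(type_name="resource"):
    """Return the index settings and mappings from the example, without comments."""
    title_field = {
        "type": "string",
        "index": "analyzed",
        "fields": {
            "cn": {"type": "string", "analyzer": "ik_smart"},  # Chinese sub-field
            "en": {"type": "string", "analyzer": "english"},   # English sub-field
        },
    }
    return {
        "settings": {
            "refresh_interval": "5s",
            "number_of_shards": 1,
            "number_of_replicas": 0,
        },
        "mappings": {
            "_default_": {"_all": {"enabled": False}},
            type_name: {
                "dynamic": False,
                "properties": {"title": title_field},
            },
        },
    }


body_text = json.dumps(build_index_body(), indent=2)
print(body_text)
```

The resulting text can be sent as the body of `PUT http://localhost:9200/test`.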

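Then import data into the index created above (test):

POST /test/resource/ { "title": "周星驰" } // the id is generated automatically
PUT /test/resource/1?op_type=create { "title": "周星驰" }
PUT /test/resource/1/_create { "title": "周星驰" }

The first form generates the _id automatically. The other two create the document with the id you supply and fail if it already exists.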

POST /_bulk or /test/_bulk or /test/resource/_bulk
{ "create": { "_index": "test", "_type": "resource", "_id": 1 } }
{ "title": "周星驰最新电影" }
{ "create": { "_index": "test", "_type": "resource", "_id": 2 } }
{ "title": "周星驰最好看的新电影" }
{ "create": { "_index": "test", "_type": "resource", "_id": 3 } }
{ "title": "周星驰最新电影，最好，新电影" }
{ "create": { "_index": "test", "_type": "resource", "_id": 4 } }
{ "title": "最最最最好的新新新新电影" }
{ "create": { "_index": "test", "_type": "resource", "_id": 5 } }
{ "title": "I'm not happy about the foxes" }

You can also put the /_bulk body in a text file (the file must end with a newline, \n):

curl -s -XPOST localhost:9200/_bulk --data-binary "@requests"

// If the document already exists, an error is returned
{
  "error" : "DocumentAlreadyExistsException[[website][4] [blog][123]:
             document already exists]",
  "status" : 409
}
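A data synchronization module would typically turn rows from the source database into a bulk body like the one above. The following Python sketch shows only that step. The function name `build_bulk_body` and the hard-coded `rows` are assumptions standing in for real database access.

```python
import json


def build_bulk_body(index, doc_type, documents):
    """Build a newline-delimited /_bulk body from (id, document) pairs."""
    lines = []
    for doc_id, document in documents:
        action = {"create": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps(document, ensure_ascii=False))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Stand-in for rows read from a database.
rows = [
    (1, {"title": "周星驰最新电影"}),
    (2, {"title": "周星驰最好看的新电影"}),
]
body = build_bulk_body("test", "resource", rows)
print(body, end="")
```

Each document takes two lines, an action line and a source line, and the body should be encoded as UTF-8 before it is posted to `/_bulk`.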

Searching data

Retrieving documents

Use HEAD to check whether a document exists:

HEAD /{index}/{type}/{id}

You can address a document down to the id level to fetch exactly that document.

// pretty formats the returned JSON
GET /{index}/{type}/{id}[?pretty][&_source=field1,field...]

The _source parameter restricts the response to the fields you name.
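The URL pattern above can be assembled as follows. This is a minimal Python sketch: `document_url` is a hypothetical helper and `BASE_URL` assumes the default address used in this article.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9200"  # assumption: default address from this article


def document_url(index, doc_type, doc_id, pretty=False, source_fields=None):
    """Build the GET URL for one document, with optional query parameters."""
    params = {}
    if pretty:
        params["pretty"] = "true"
    if source_fields:
        params["_source"] = ",".join(source_fields)
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    if params:
        # Keep commas unescaped so the field list stays readable.
        url += "?" + urlencode(params, safe=",")
    return url


print(document_url("test", "resource", 1, pretty=True, source_fields=["title"]))
```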

You can also use the _search endpoint:

GET /{index}/_search or /{index}/{type}/_search
// Sample response
{
    "took": 1, // time taken, in milliseconds
    "timed_out": false, // a timeout can be set in the request with ?timeout=10ms
    "_shards": { // shards
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": { // search results
        "total": 2,
        "max_score": 1,
        "hits": [ // the matching documents
          {
            "_index": "test",
            "_type": "resource",
            "_id": 1,
            "_score": 1,
            "_source": {
              "title": "周星驰"
            }
          },
          ...
        ]
    }
}
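Client code usually needs only a few values from this structure. The Python sketch below walks `hits.hits` and pulls out the id, score, and title of each hit. `summarize_hits` is a hypothetical helper, and `sample_response` is a trimmed-down copy of the response shape above, not real server output.

```python
def summarize_hits(response):
    """Return (id, score, title) for every hit in a _search response."""
    return [
        (hit["_id"], hit["_score"], hit["_source"].get("title"))
        for hit in response["hits"]["hits"]
    ]


# A trimmed-down response in the shape shown above (sample data only).
sample_response = {
    "took": 1,
    "timed_out": False,
    "hits": {
        "total": 1,
        "max_score": 1,
        "hits": [
            {"_index": "test", "_type": "resource", "_id": 1,
             "_score": 1, "_source": {"title": "周星驰"}},
        ],
    },
}

for doc_id, score, title in summarize_hits(sample_response):
    print(doc_id, score, title)
```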

With Analysis enabled, multi_match lets you search by keywords (tokens), and the results are sorted by _score.

POST /{index}/{type}/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "周星驰最新电影fox",
      "fields": ["title", "title.cn", "title.en"]
    }
  }
}
// Sample response
{
    "took": 7,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "hits": {
        "total": 5,
        "max_score": 1.4102149,
        "hits": [
            {
                "_index": "index",
                "_type": "test",
                "_id": "1",
                "_score": 1.4102149,
                "_source": {
                    "title": "周星驰最新电影"
                }
            },
            {
                "_index": "index",
                "_type": "test",
                "_id": "3",
                "_score": 1.1354887,
                "_source": {
                    "title": "周星驰最新电影，最好，新电影"
                }
            },
            {
                "_index": "index",
                "_type": "test",
                "_id": "2",
                "_score": 1.0024924,
                "_source": {
                    "title": "周星驰最好看的新电影"
                }
            },
            {
                "_index": "index",
                "_type": "test",
                "_id": "4",
                "_score": 0.31740457,
                "_source": {
                    "title": "最最最最好的新新新新电影"
                }
            },
            {
                "_index": "index",
                "_type": "test",
                "_id": "5",
                "_score": 0.013072087,
                "_source": {
                    "title": "I'm not happy about the foxes"
                }
            }
        ]
    }
}

You can also add pagination, highlighting, and a minimum match threshold:

POST /{index}/{type}/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields",  // the matching mode
      "query":    "周星驰最新电影fox",
      "fields": [ "title", "title.cn", "title.en" ], // the fields to search
      "minimum_should_match": "20%" // minimum match threshold
    }
  },
  "from": 0,
  "size": 10,
  "highlight" : {
    "pre_tags" : ["<strong>"],
    "post_tags" : ["</strong>"],
    "fields" : {
      "title" : {},
      "title.cn" : {},
      "title.en" : {}
    }
  }
}
// Response
{
    "took": 13,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "hits": {
        "total": 5,
        "max_score": 1.1782398,
        "hits": [
            {
                "_index": "test",
                "_type": "resource",
                "_id": "1",
                "_score": 1.1782398,
                "_source": {
                    "title": "周星驰最新电影"
                },
                "highlight": {
                    "title": [
                        "<strong>周</strong><strong>星</strong><strong>驰</strong><strong>最</strong><strong>新</strong><strong>电</strong><strong>影</strong>"
                    ],
                    "title.cn": [
                        "<strong>周星驰</strong><strong>最新</strong><strong>电影</strong>"
                    ],
                    "title.en": [
                        "<strong>周</strong><strong>星</strong><strong>驰</strong><strong>最</strong><strong>新</strong><strong>电</strong><strong>影</strong>"
                    ]
                }
            },
            {
                "_index": "test",
                "_type": "resource",
                "_id": "3",
                "_score": 0.9440402,
                "_source": {
                    "title": "周星驰最新电影，最好，新电影"
                },
                "highlight": {
                    "title": [
                        "<strong>周</strong><strong>星</strong><strong>驰</strong><strong>最</strong><strong>新</strong><strong>电</strong><strong>影</strong>，<strong>最</strong>好，<strong>新</strong><strong>电</strong><strong>影</strong>"
                    ],
                    "title.cn": [
                        "<strong>周星驰</strong><strong>最新</strong><strong>电影</strong>，最好，<strong>新</strong><strong>电影</strong>"
                    ],
                    "title.en": [
                        "<strong>周</strong><strong>星</strong><strong>驰</strong><strong>最</strong><strong>新</strong><strong>电</strong><strong>影</strong>，<strong>最</strong>好，<strong>新</strong><strong>电</strong><strong>影</strong>"
                    ]
                }
            },
            {
                "_index": "test",
                "_type": "resource",
                "_id": "2",
                "_score": 0.8302629,
                "_source": {
                    "title": "周星驰最好看的新电影"
                },
                "highlight": {
                    "title": [
                        "<strong>周</strong><strong>星</strong><strong>驰</strong><strong>最</strong>好看的<strong>新</strong><strong>电</strong><strong>影</strong>"
                    ],
                    "title.cn": [
                        "<strong>周星驰</strong><strong>最</strong>好看的<strong>新</strong><strong>电影</strong>"
                    ],
                    "title.en": [
                        "<strong>周</strong><strong>星</strong><strong>驰</strong><strong>最</strong>好看的<strong>新</strong><strong>电</strong><strong>影</strong>"
                    ]
                }
            },
            {
                "_index": "test",
                "_type": "resource",
                "_id": "4",
                "_score": 0.255055,
                "_source": {
                    "title": "最最最最好的新新新新电影"
                },
                "highlight": {
                    "title": [
                        "<strong>最</strong><strong>最</strong><strong>最</strong><strong>最</strong>好的<strong>新</strong><strong>新</strong><strong>新</strong><strong>新</strong><strong>电</strong><strong>影</strong>"
                    ],
                    "title.cn": [
                        "最最<strong>最</strong>最好的新新新新<strong>电影</strong>"
                    ],
                    "title.en": [
                        "<strong>最</strong><strong>最</strong><strong>最</strong><strong>最</strong>好的<strong>新</strong><strong>新</strong><strong>新</strong><strong>新</strong><strong>电</strong><strong>影</strong>"
                    ]
                }
            },
            {
                "_index": "test",
                "_type": "resource",
                "_id": "5",
                "_score": 0.012243208,
                "_source": {
                    "title": "I'm not happy about the foxes"
                },
                "highlight": {
                    "title.en": [
                        "I'm not happy about the <strong>foxes</strong>"
                    ]
                }
            }
        ]
    }
}

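Elasticsearch lets you assign an analyzer per language, but as the results show, the English analyzer splits Chinese text into single characters, while the Chinese analyzer does not handle English. Keep this in mind when choosing which fields to search.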
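The paginated, highlighted request above can be generated from a page number rather than a raw offset. This is a minimal Python sketch under that assumption. `build_search_body` and its parameters are illustrative names; the body fields are the ones used in the example.

```python
import json


def build_search_body(query, fields, page=1, page_size=10, minimum_should_match="20%"):
    """Build a multi_match search body with paging and highlighting."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {
        "query": {
            "multi_match": {
                "type": "most_fields",
                "query": query,
                "fields": list(fields),
                "minimum_should_match": minimum_should_match,
            }
        },
        # "from" is a zero-based offset, so page 1 starts at 0.
        "from": (page - 1) * page_size,
        "size": page_size,
        "highlight": {
            "pre_tags": ["<strong>"],
            "post_tags": ["</strong>"],
            "fields": {field: {} for field in fields},
        },
    }


body = build_search_body("周星驰最新电影fox", ["title", "title.cn", "title.en"], page=2)
print(json.dumps(body, ensure_ascii=False, indent=2))
```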

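In the example above, multi_match uses most_fields, which matches any field that satisfies the query. multi_match supports the following types:

best_fields
The default. Searches every field, but uses the _score of the single best-matching field.
most_fields
Searches every field and combines the _score of all matching fields.
cross_fields
Treats all the fields as if they were one field.
phrase or phrase_prefix
Both behave like best_fields, but the fields are split into separate queries, one match_phrase or match_phrase_prefix query per field: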

{
  "multi_match" : {
    "query":      "quick brown f",
    "type":       "phrase_prefix",
    "fields":     [ "subject", "message" ]
  }
}
// is rewritten to
{
  "dis_max": {
    "queries": [
      { "match_phrase_prefix": { "subject": "quick brown f" }},
      { "match_phrase_prefix": { "message": "quick brown f" }}
    ]
  }
}
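The rewrite shown above is mechanical, so it can be expressed as a short function. This Python sketch mirrors the example for illustration only. `expand_multi_match` is a hypothetical name, not something Elasticsearch exposes.

```python
def expand_multi_match(multi_match):
    """Rewrite a phrase / phrase_prefix multi_match into the equivalent dis_max."""
    per_field_query = {
        "phrase": "match_phrase",
        "phrase_prefix": "match_phrase_prefix",
    }[multi_match["type"]]
    return {
        "dis_max": {
            "queries": [
                {per_field_query: {field: multi_match["query"]}}
                for field in multi_match["fields"]
            ]
        }
    }


original = {
    "query": "quick brown f",
    "type": "phrase_prefix",
    "fields": ["subject", "message"],
}
print(expand_multi_match(original))
```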

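Deleting documents

DELETE removes documents.

A document is not removed immediately. It is only marked as deleted and is physically removed later, when needed.

DELETE /{index}  deletes the entire index
DELETE /{index}/{type}  deletes everything under the type
DELETE /{index}/{type}/{id}  deletes one specific document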
// 200
{
  "found" :    true,
  "_index" :   "x",
  "_type" :    "x",
  "_id" :      "x",
  "_version" : 3
}
// 404
{
  "found" :    false,
  "_index" :   "x",
  "_type" :    "x",
  "_id" :      "x",
  "_version" : 4
}

Chinese word segmentation

https://github.com/medcl/elasticsearch-analysis-ik

Configuration: use either option 1 or option 2

elasticsearch.yml
// 1
index:
  analysis:
    analyzer:
      ik:
          alias: [ik_analyzer]
          type: org.elasticsearch.index.analysis.IkAnalyzerProvider
      ik_max_word:
          type: ik
          use_smart: false
      ik_smart:
          type: ik
          use_smart: true
// 2
index.analysis.analyzer.ik.type : "ik" // = ik_max_word

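ik_max_word splits text at the finest granularity, for example

『中华人民共和国国歌』 is split into
『中华人民共和国』
『中华人民』
...
『国歌』, exhausting every possible combination

ik_smart splits text at the coarsest granularity, for example

『中华人民共和国国歌』 is split into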
『中华人民共和国』
『国歌』
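To make the difference between the two granularities concrete, here is a toy Python illustration. It is not the IK algorithm: it assumes a tiny hand-written dictionary, lists every dictionary word for the fine-grained case, and uses greedy forward longest match for the coarse-grained case. All names are hypothetical.

```python
# Toy dictionary; a real IK dictionary is far larger.
DICTIONARY = {"中华人民共和国", "中华人民", "中华", "华人", "人民共和国",
              "人民", "共和国", "共和", "国歌"}
MAX_WORD_LENGTH = max(len(word) for word in DICTIONARY)


def fine_grained(text):
    """List every dictionary word found anywhere in the text."""
    found = []
    for start in range(len(text)):
        for end in range(min(len(text), start + MAX_WORD_LENGTH), start, -1):
            if text[start:end] in DICTIONARY:
                found.append(text[start:end])
    return found


def coarse_grained(text):
    """Greedy forward longest match: take the longest word, then move past it."""
    found = []
    start = 0
    while start < len(text):
        for end in range(min(len(text), start + MAX_WORD_LENGTH), start, -1):
            if text[start:end] in DICTIONARY:
                found.append(text[start:end])
                start = end
                break
        else:
            start += 1  # no dictionary word starts here; skip one character
    return found


print(fine_grained("中华人民共和国国歌"))
print(coarse_grained("中华人民共和国国歌"))
```

The fine-grained list contains overlapping words, which helps recall at search time, while the coarse-grained list covers the text once with the longest words.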

The earlier examples already use this analysis plugin.

Last edited: 2017.11.27 03:57:09
