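Rank Feature gives Elasticsearch support for machine learning use cases, and it marks the start of Elasticsearch handling feature computation.
1. Introduction
rank_feature is a specialized query introduced in Elasticsearch 7.0. It only works on fields of type rank_feature and rank_features (both data types were also added in 7.0). It is usually placed in the should clause of a bool query to boost document scores. Note that it performs better than a function_score query, because it can efficiently skip non-competitive hits.
Here is an example: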
PUT test
{
"mappings": {
"properties": {
"pagerank": {
"type": "rank_feature"
},
"url_length": {
"type": "rank_feature",
"positive_score_impact": false
},
"topics": {
"type": "rank_features"
}
}
}
}
PUT test/_doc/1
{
"url": "http://en.wikipedia.org/wiki/2016_Summer_Olympics",
"content": "Rio 2016",
"pagerank": 50.3,
"url_length": 42,
"topics": {
"sports": 50,
"brazil": 30
}
}
PUT test/_doc/2
{
"url": "http://en.wikipedia.org/wiki/2016_Brazilian_Grand_Prix",
"content": "Formula One motor race held on 13 November 2016 at the Autódromo José Carlos Pace in São Paulo, Brazil",
"pagerank": 50.3,
"url_length": 47,
"topics": {
"sports": 35,
"formula one": 65,
"brazil": 20
}
}
PUT test/_doc/3
{
"url": "http://en.wikipedia.org/wiki/Deadpool_(film)",
"content": "Deadpool is a 2016 American superhero film",
"pagerank": 50.3,
"url_length": 37,
"topics": {
"movies": 60,
"super hero": 65
}
}
POST test/_refresh
GET test/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"content": "2016"
}
}
],
"should": [
{
"rank_feature": {
"field": "pagerank"
}
},
{
"rank_feature": {
"field": "url_length",
"boost": 0.1
}
},
{
"rank_feature": {
"field": "topics.sports",
"boost": 0.4
}
}
]
}
}
}
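2. Usage
The rank_feature query supports three functions that affect scoring: saturation (the default), Logarithm, and Sigmoid.
- saturation
The score falls in the range (0, 1). The function computes S / (S + pivot), where S is the value of the rank_feature or rank_features field and pivot is the value at which the score equals 0.5. When S is greater than pivot, the score is above 0.5; when S is less than pivot, the score is below 0.5.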
GET test/_search
{
"query": {
"rank_feature": {
"field": "pagerank",
"saturation": {
"pivot": 8
}
}
}
}
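If pivot is omitted, Elasticsearch computes a default from the values indexed for the field, approximately equal to their geometric mean. If you do not know what pivot to use, the official recommendation is to leave it unset.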
GET test/_search
{
"query": {
"rank_feature": {
"field": "pagerank",
"saturation": {}
}
}
}
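The saturation formula above can be checked by hand with a few lines of Python. This is only a sketch of the formula as stated in this post, not Elasticsearch's implementation: the function name `saturation` is made up for illustration, and it ignores boost and any rounding Elasticsearch applies internally.

```python
def saturation(value: float, pivot: float) -> float:
    """Sketch of the saturation formula S / (S + pivot).

    `saturation` is a hypothetical helper name, not an Elasticsearch API.
    """
    if value <= 0 or pivot <= 0:
        raise ValueError("value and pivot must be positive")
    return value / (value + pivot)


# pagerank 50.3 comes from the sample documents, pivot 8 from the query above.
above_pivot = saturation(50.3, 8)  # S > pivot, so the score is above 0.5
at_pivot = saturation(8, 8)        # S == pivot, so the score is exactly 0.5
below_pivot = saturation(2, 8)     # S < pivot, so the score is below 0.5

print(above_pivot, at_pivot, below_pivot)
```

Because the result approaches 1 as S grows but never reaches it, very large feature values cannot dominate the rest of the query score.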
- Logarithm
The score is unbounded. The function computes log(scaling_factor + S), where S is the value of the rank_feature or rank_features field and scaling_factor is a configurable scaling factor.
GET test/_search
{
"query": {
"rank_feature": {
"field": "pagerank",
"log": {
"scaling_factor": 4
}
}
}
}
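Note that with this function the rank_feature or rank_features value must be positive. According to the official documentation, it also only supports features with a positive score impact, so it cannot be used on a field such as url_length above.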
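A short Python sketch of the log formula makes the unbounded growth visible. It assumes the logarithm is the natural logarithm, which this post does not state explicitly; the helper name `log_score` is hypothetical, and boost is ignored.

```python
import math


def log_score(value: float, scaling_factor: float) -> float:
    """Sketch of log(scaling_factor + S), assuming a natural logarithm.

    `log_score` is a hypothetical helper name, not an Elasticsearch API.
    """
    if value <= 0:
        raise ValueError("value must be positive")
    return math.log(scaling_factor + value)


# pagerank 50.3 from the sample documents, scaling_factor 4 from the query above.
for pagerank in (50.3, 500.0, 5000.0):
    print(pagerank, log_score(pagerank, 4))
```

Unlike saturation, the score keeps increasing with S, only more and more slowly.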
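- Sigmoid
The score falls in the range (0, 1). This function extends saturation; the formula is S^exp / (S^exp + pivot^exp), which adds an exponent parameter, exponent. The exponent must be positive (it does not have to be an integer), and the suggested range is [0.5, 1]. If you do not yet know how to choose a good exponent value, the official recommendation is to start with the saturation function.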
GET test/_search
{
"query": {
"rank_feature": {
"field": "pagerank",
"sigmoid": {
"pivot": 7,
"exponent": 0.6
}
}
}
}
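The relationship between sigmoid and saturation can be seen in a small Python sketch of the formula. As before, this only mirrors the formula given in this post: `sigmoid` is a hypothetical helper name, and boost and internal rounding are ignored.

```python
def sigmoid(value: float, pivot: float, exponent: float) -> float:
    """Sketch of S^exp / (S^exp + pivot^exp).

    `sigmoid` is a hypothetical helper name, not an Elasticsearch API.
    """
    if value <= 0 or pivot <= 0 or exponent <= 0:
        raise ValueError("value, pivot and exponent must be positive")
    powered_value = value ** exponent
    return powered_value / (powered_value + pivot ** exponent)


# pivot 7 and exponent 0.6 come from the query above, pagerank 50.3 from the sample documents.
print(sigmoid(50.3, 7, 0.6))  # above 0.5, because S > pivot
print(sigmoid(7, 7, 0.6))     # exactly 0.5 at the pivot
print(sigmoid(50.3, 7, 1))    # exponent 1 reduces to saturation: S / (S + pivot)
```

An exponent below 1 flattens the curve around the pivot, so values far from the pivot are pulled closer to 0.5 than they would be under saturation.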