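Contents

  • Preparation
  • Use cases
    • 1. constant_score query: ignore document-frequency scores and return results that match more of the search keywords first
    • 2. sort: when scores are equal, order by the specified price field
    • 3. Ignoring TF/IDF document frequency, recall with a different scoring weight per field
    • 4. Ignoring TF/IDF document frequency, weight fields differently, then add the value of a specified field to get the final score, e.g. title^3 + content^1 + time
    • 5. Ignoring TF/IDF scores, weight brands differently within the same region
    • 6. How to search by geographic location, like Ziroom rentals: find nearby listings that are cheap and close, without a hard distance cutoff?
    • 7. Some scenarios need ranking driven by a configured parameter value, e.g. xiaomi phones should score highest among all phones?
    • 8. BM25 similarity tuning: disabling length normalization
    • 9. Using query_string:
    • 10. The 黄桃 (yellow peach) / 罐头 (canned food) bad case: rank products that match both terms first and partial matches after
  • Monitoring
    • _stats index monitoring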

Preparation

Index mappings:

{"shop_titled_index": {"mappings": {"properties": {"brand": {"type": "text","fields": {"keyword": {"type": "keyword","ignore_above": 256}}},"price": {"type": "long"},"region": {"type": "long"},"shopId": {"type": "long"},"skuId": {"type": "long"},"title": {"type": "text","fields": {"keyword": {"type": "keyword","ignore_above": 256}}}}}}
}

Sample data:

      {"_index": "shop_titled_index","_type": "_doc","_id": "dJAM3HYByj_ONITHr0gq","_score": 1,"_source": {"brand": "iphone","price": 8000,"title": "iphone 12 64G red 5G","skuId": 2020122201,"shopId": 2,"region": 1001}}
      {"_index": "shop_titled_index","_type": "_doc","_id": "9ZA6inYByj_ONITHT0bH","_score": 1,"_source": {"brand": "iphone","price": 8000,"title": "iphone 12 64G red 5G","skuId": 2020122201,"shopId": 1,"region": 1001}}

Use cases

1. constant_score query: ignore document-frequency scores and return results that match more of the search keywords first

{"query": {"bool": {"should": [{"constant_score": {"filter": {"match": {"title": "iphone"}},"boost": 1}},{"constant_score": {"filter": {"match": {"title": "12"}}}}]}}}
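
To make the scoring concrete, the following Python sketch mimics how this query ranks documents. It assumes that a bool `should` adds up the scores of its matching clauses and that each `constant_score` clause contributes exactly its `boost` (1.0 when omitted). The helper `constant_should_score`, the whitespace tokenization, and the second title are hypothetical and only stand in for the real analyzer and data.

```python
def constant_should_score(title, clauses):
    """Sum the boost of every (term, boost) clause whose term occurs in the title.

    Hypothetical helper: lowercasing plus whitespace splitting stands in
    for the analyzer that Elasticsearch would apply to the title field.
    """
    tokens = set(title.lower().split())
    return sum(boost for term, boost in clauses if term in tokens)


# The two constant_score clauses of the query above, both with boost 1.0.
clauses = [("iphone", 1.0), ("12", 1.0)]

both = constant_should_score("iphone 12 64G red 5G", clauses)  # matches both terms
one = constant_should_score("iphone 11 128G black", clauses)   # hypothetical title, matches one term
print(both, one)
```

Under these assumptions the document that matches both keywords scores higher than the one that matches a single keyword, regardless of how often each term occurs.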

2. sort: when scores are equal, order by the specified price field

{"query": {"bool": {"should": [{"constant_score": {"filter": {"match": {"title": "iphone"}},"boost": 1}},{"constant_score": {"filter": {"match": {"title": "12"}}}}]}},"sort": [{"_score": {"order": "desc"}},{"price": {"order": "asc"}}]
}
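
The effect of the two sort keys can be sketched in Python as a sort on the tuple (score descending, price ascending). The hit list and the `rank` helper below are hypothetical; they only illustrate the tie-breaking behaviour, not how Elasticsearch implements sorting.

```python
def rank(hits):
    """Order hits by score descending, then by price ascending for equal scores."""
    return sorted(hits, key=lambda h: (-h["score"], h["price"]))


# Hypothetical hits: two of them tie on score and differ only in price.
hits = [
    {"skuId": 1, "score": 1.0, "price": 3000},
    {"skuId": 2, "score": 2.0, "price": 8000},
    {"skuId": 3, "score": 2.0, "price": 6500},
]

ordered = [h["skuId"] for h in rank(hits)]
print(ordered)
```

The cheaper of the two tied hits comes first, and the lower-scoring hit stays last even though it is the cheapest overall.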

3. Ignoring TF/IDF document frequency, recall with a different scoring weight per field

{"query": {"bool": {"should": [{"constant_score": {"filter": {"match": {"title": "red"}},"boost": 1}},{"constant_score": {"filter": {"match": {"brand": "iphone"}},"boost":3}}]}},"sort":[{"_score":{"order":"desc"}},{"price":{"order":"asc"}}]
}

4. Ignoring TF/IDF document frequency, weight fields differently, then add the value of a specified field to get the final score, e.g. title^3 + content^1 + time

{"query": {"function_score": {"query": {"bool": {"should": [{"constant_score": {"filter": {"match": {"title": "red"}},"boost": 1}},{"constant_score": {"filter": {"match": {"brand": "iphone"}},"boost": 3}}]}},"field_value_factor": {"field": "shopId"},"boost_mode": "sum"}}
}
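
As a rough check of what this query computes, the sketch below adds the field value to the query score, which is what `boost_mode: sum` does with a `field_value_factor` function. It assumes the default `factor` of 1 and no `modifier`; the helper name `function_score_sum` is hypothetical. The inputs come from the two sample documents, whose titles contain `red` and whose brand is `iphone`.

```python
def function_score_sum(query_score, field_value, factor=1.0):
    """boost_mode "sum": final score = query score + factor * field value."""
    return query_score + factor * field_value


# Both sample documents match title:red (boost 1) and brand:iphone (boost 3).
query_score = 1.0 + 3.0

shop2 = function_score_sum(query_score, 2)  # sample document with shopId 2
shop1 = function_score_sum(query_score, 1)  # sample document with shopId 1
print(shop2, shop1)
```

With these assumptions the document from shop 2 ends up one point ahead of the otherwise identical document from shop 1.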

5. Ignoring TF/IDF scores, weight brands differently within the same region

Docs: https://www.elastic.co/guide/cn/elasticsearch/guide/current/function-score-filters.html

{"query": {"function_score": {"query": {"term": {"region":1002}},"boost": "1","functions": [{"filter": {"term": {"brand.keyword": "huawei"}},"weight": 3},{"filter":{"match":{"brand":"xiaomi"}},"weight":1}],"score_mode": "sum","boost_mode": "sum"}}
}

Note: the function_score below has no main query, so it falls back to match_all and returns every document

{"query": {"function_score": {"functions": [{"filter": {"term": {"brand.keyword": "huawei"}},"weight": 3},{"filter":{"match":{"brand":"xiaomi"}},"weight":1}],"score_mode": "sum","boost_mode": "sum"}}
}

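6. How to search by geographic location, like Ziroom rentals: find nearby listings that are cheap and close, without a hard distance cutoff?

Reference: https://www.cnblogs.com/xiaoxiaoliu/p/11054405.html

  1. Create the index
  2. Create the mappings
POST geo_index/_mappings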
{"properties": {"location": {"type": "geo_point"},"price": {"type": "double"},"name": {"type": "text"}}
}

3. Prepare the data

{"location":{"lon":"116.488781","lat":"39.950565"},"price":"4000","name":"朝阳公园 两室一厅 12m"
}
{"location":{"lon":"116.327805","lat":"39.900988"},"price":"2400","name":"北京西站 三室一厅 9m"
}
{"location": {"lon": "116.403981","lat": "39.916485"},"price": "88888","name": "故宫 无价之宝"
}
{"location": {"lon": "116.341316","lat": "39.948795"},"price": "3700","name": "北京动物园 三室一厅 19m"
}

4. geo_distance: find listings within two kilometers

GET geo_index/_search
{"query": {"constant_score": {"filter": {"geo_distance": {"distance": "2km","location": {"lat": 39.93869837,"lon": 116.48357391}}},"boost": 1.2}}
}

Output

{"took": 2,"timed_out": false,"_shards": {"total": 1,"successful": 1,"skipped": 0,"failed": 0},"hits": {"total": {"value": 1,"relation": "eq"},"max_score": 1.2,"hits": [{"_index": "geo_index","_type": "_doc","_id": "1JC14HYByj_ONITHikiw","_score": 1.2,"_source": {"location": {"lon": "116.488781","lat": "39.950565"},"price": "4000","name": "朝阳公园 两室一厅 12m"}}]}
}
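
To see why only the 朝阳公园 listing falls inside the 2 km radius, the distances can be estimated with the haversine formula. This is a sketch, not the exact computation Elasticsearch performs: the Earth radius constant and the helper `haversine_km` are assumptions of the example, while the coordinates are taken from the query and the sample data above.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Query origin from the geo_distance filter above.
origin = (39.93869837, 116.48357391)

chaoyang_park = haversine_km(*origin, 39.950565, 116.488781)  # 朝阳公园 listing
beijing_zoo = haversine_km(*origin, 39.948795, 116.341316)    # 北京动物园 listing

print(round(chaoyang_park, 2), round(beijing_zoo, 2))
print(chaoyang_park <= 2.0, beijing_zoo <= 2.0)
```

Under these assumptions the 朝阳公园 listing is roughly 1.4 km away and passes the filter, while the 北京动物园 listing is more than 10 km away and is excluded.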

5. Find listings and sort them by distance

Docs: https://www.elastic.co/guide/cn/elasticsearch/guide/current/sorting-by-distance.html

{"query": {"constant_score": {"filter": {"geo_distance": {"distance": "10km","location": {"lat": 39.93869837,"lon": 116.48357391}}},"boost": 1.2}},"sort": {"_geo_distance": {"location": [{"lat": 39.93869837,"lon": 116.48357391}],"unit": "km","distance_type": "arc","order": "asc"}}
}

6. Search rentals by proximity and price

I care more about proximity, so the location function gets the higher weight
Reference: https://www.elastic.co/guide/cn/elasticsearch/guide/current/decay-functions.html#CO119-4

{"query": {"function_score": {"query": {"range":{"price":{"gte":2000,"lte":5000}}},"functions": [{"gauss": {"location": {"origin": {"lon": "116.47464752","lat": "39.94606859"},"offset": "100m","scale": "1000m"}},"weight":2.0},{"gauss": {"price": {"origin": 3000,"offset": 100,"scale":500}}}],"score_mode": "sum","boost_mode": "replace"}}
}

Result:

{"took": 5,"timed_out": false,"_shards": {"total": 1,"successful": 1,"skipped": 0,"failed": 0},"hits": {"total": {"value": 4,"relation": "eq"},"max_score": 0.7460326,"hits": [{"_index": "geo_index","_type": "_doc","_id": "95A14XYByj_ONITHg0if","_score": 0.7460326,"_source": {"location": {"lon": "116.47155762","lat": "39.9523853"},"price": "3500","name": "亮马桥 两室一厅 12m"}},{"_index": "geo_index","_type": "_doc","_id": "1JC14HYByj_ONITHikiw","_score": 0.36586136,"_source": {"location": {"lon": "116.488781","lat": "39.950565"},"price": "4000","name": "朝阳公园 两室一厅 12m"}},{"_index": "geo_index","_type": "_doc","_id": "1ZC34HYByj_ONITHRkht","_score": 5.823735e-39,"_source": {"location": {"lon": "116.341316","lat": "39.948795"},"price": "3700","name": "北京动物园 三室一厅 19m"}},{"_index": "geo_index","_type": "_doc","_id": "1pC44HYByj_ONITHAkgJ","_score": 0,"_source": {"location": {"lon": "116.327805","lat": "39.900988"},"price": "2400","name": "北京西站 三室一厅 9m"}}]}
}
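
The `gauss` functions in the query above can be sketched in Python from the decay formula in the Elasticsearch documentation: the score is 1 inside `offset`, drops to `decay` (0.5 by default) at a distance of `offset + scale` from `origin`, and keeps falling beyond that. The helper `gauss_decay` is a hypothetical name, and the example only evaluates the price function, using the `origin`, `offset` and `scale` from the query.

```python
import math


def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    """Gaussian decay: 1.0 within offset of origin, `decay` at offset + scale."""
    distance = max(0.0, abs(value - origin) - offset)
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-distance ** 2 / (2.0 * sigma_sq))


# Price function from the query: origin 3000, offset 100, scale 500.
for price in (3000, 3500, 3700, 4000, 2400):
    print(price, round(gauss_decay(price, origin=3000, scale=500, offset=100), 4))
```

Prices within 100 of 3000 keep the full score of 1.0, a price 600 away scores 0.5, and anything further out decays smoothly instead of being cut off, which is what lets the query prefer cheap and close listings without a hard limit.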

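7. Some scenarios need ranking driven by a configured parameter value, e.g. xiaomi phones should score highest among all phones?

Ranking with function_score combined with script_score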

{"query": {"function_score": {"query": {"match_all":{}},"functions": [{"script_score": {"script": {"lang": "painless","params": {"brand": "xiaomi"},"source": "if(doc['brand.keyword'].size() == 0)return 0f; String brandStr = doc['brand.keyword'].value ?: new String();if(params.brand.compareTo(brandStr) == 0){return 1f}return 0"}}}],"score_mode":"sum","boost_mode":"replace"}}
}

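score_mode defines how the scores of the individual functions are merged into one combined function score; boost_mode defines how that combined score is applied to the score produced by the original query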
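
The two-step combination can be sketched as follows. This is a simplified model, not the Elasticsearch implementation: the function names are hypothetical, the inputs are assumed to be function scores with their weights already applied, and the weighted `avg` score_mode is left out. Both modes default to `multiply`, as in Elasticsearch.

```python
import math


def combine_functions(function_scores, score_mode="multiply"):
    """Step 1: merge the per-function scores into one function score."""
    if score_mode == "sum":
        return sum(function_scores)
    if score_mode == "multiply":
        return math.prod(function_scores)
    if score_mode == "max":
        return max(function_scores)
    if score_mode == "min":
        return min(function_scores)
    if score_mode == "first":
        return function_scores[0]
    raise ValueError(f"unsupported score_mode: {score_mode}")


def apply_boost_mode(query_score, function_score, boost_mode="multiply"):
    """Step 2: apply the merged function score to the query score."""
    if boost_mode == "multiply":
        return query_score * function_score
    if boost_mode == "replace":
        return function_score
    if boost_mode == "sum":
        return query_score + function_score
    if boost_mode == "avg":
        return (query_score + function_score) / 2.0
    if boost_mode == "max":
        return max(query_score, function_score)
    if boost_mode == "min":
        return min(query_score, function_score)
    raise ValueError(f"unsupported boost_mode: {boost_mode}")


# Scenario 7: one script_score function returning 1 for xiaomi, boost_mode "replace".
xiaomi = apply_boost_mode(1.0, combine_functions([1.0], "sum"), "replace")
other = apply_boost_mode(1.0, combine_functions([0.0], "sum"), "replace")

# Scenario 5: a huawei document matches only the weight-3 filter; the query
# score of 1.0 is an assumption made for this example.
huawei = apply_boost_mode(1.0, combine_functions([3.0], "sum"), "sum")
print(xiaomi, other, huawei)
```

With `replace` the query score is discarded, so xiaomi documents get 1.0 and everything else 0.0; with `sum` the filter weight is added on top of the query score.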

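8. BM25 similarity tuning: disabling length normalization

BM25: BM25 exposes two tuning parameters
k1: controls how quickly the term-frequency contribution rises toward saturation. The default is 1.2. A smaller value makes saturation change faster; a larger value makes it change more slowly. The term-frequency saturation curve in the official documentation plots the score against term frequency, and k1 shapes the "tf of BM25" curve.

b: controls how much field-length normalization applies. 0.0 disables normalization and 1.0 enables full normalization. The default is 0.75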

  1. Mapping settings
{"settings": {"index": {"number_of_shards": "1","provided_name": "my_sim_index","similarity": {"cbm25": {"type": "BM25","b": "0"}},"creation_date": "1610181315498","number_of_replicas": "1","uuid": "V8NhMRofQRu-oPFt6hheWA","version": {"created": "7070099"}}},"mappings": {"_doc": {"properties": {"body": {"similarity": "BM25","type": "text"},"title": {"similarity": "cbm25","type": "text"}}}}
}
  2. Prepare the data
{"title": "Elasticsearch allows you to configure a scoring algorithm or similarity per field. The similarity setting provides a simple way of choosing a similarity algorithm other than the default BM25, such as TF/IDF.","body": "Elasticsearch allows you to configure a scoring algorithm or similarity per field. The similarity setting provides a simple way of choosing a similarity algorithm other than the default BM25, such as TF/IDF."
}
{"title": "A simple boolean similarity, which is used when full-text ranking is not needed and the score should only be based on whether the query terms match or not. Boolean similarity gives terms a score equal to their query boost.","body": "A simple boolean similarity, which is used when full-text ranking is not needed and the score should only be based on whether the query terms match or not. Boolean similarity gives terms a score equal to their query boost."
}
{"title": "or similarity per field. The similarity setting provides a simple way of choosing a similarity","body": "or similarity per field. The similarity setting provides a simple way of choosing a similarity"
}
  3. Search
    The title field uses cbm25, which disables document-length normalization, so its scores do not depend on document length
GET my_sim_index/_search
{"query":{"match":{"title":"similarity"}}
}

Output:

{"took": 1,"timed_out": false,"_shards": {"total": 1,"successful": 1,"skipped": 0,"failed": 0},"hits": {"total": {"value": 3,"relation": "eq"},"max_score": 0.20983505,"hits": [{"_index": "my_sim_index","_type": "_doc","_id": "nZBO5nYByj_ONITHhknJ","_score": 0.20983505,"_source": {"title": "Elasticsearch allows you to configure a scoring algorithm or similarity per field. The similarity setting provides a simple way of choosing a similarity algorithm other than the default BM25, such as TF/IDF.","body": "Elasticsearch allows you to configure a scoring algorithm or similarity per field. The similarity setting provides a simple way of choosing a similarity algorithm other than the default BM25, such as TF/IDF."}},{"_index": "my_sim_index","_type": "_doc","_id": "oZBW5nYByj_ONITHkEli","_score": 0.20983505,"_source": {"title": "or similarity per field. The similarity setting provides a simple way of choosing a similarity","body": "or similarity per field. The similarity setting provides a simple way of choosing a similarity"}},{"_index": "my_sim_index","_type": "_doc","_id": "npBP5nYByj_ONITHK0mo","_score": 0.18360566,"_source": {"title": "A simple boolean similarity, which is used when full-text ranking is not needed and the score should only be based on whether the query terms match or not. Boolean similarity gives terms a score equal to their query boost.","body": "A simple boolean similarity, which is used when full-text ranking is not needed and the score should only be based on whether the query terms match or not. Boolean similarity gives terms a score equal to their query boost."}}]}
}

The first two hits have the same score, 0.20983505, even though their document lengths differ
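
These scores can be reproduced with the textbook BM25 formula. The sketch below assumes a single shard holding the three sample documents, all of which contain `similarity`: three occurrences in the two tied titles and two in the third. The function name `bm25_term_score` and the document lengths are hypothetical; with `b = 0` the lengths drop out of the formula entirely.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of a single query term for one document."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * freq * (k1 + 1) / (freq + k1 * length_norm)


# Hypothetical token counts for a long and a short title.
LONG_LEN, SHORT_LEN, AVG_LEN = 33, 15, 28

# cbm25 (b = 0): equal term frequency gives an equal score, whatever the length.
long_b0 = bm25_term_score(3, LONG_LEN, AVG_LEN, n_docs=3, doc_freq=3, b=0.0)
short_b0 = bm25_term_score(3, SHORT_LEN, AVG_LEN, n_docs=3, doc_freq=3, b=0.0)
third_b0 = bm25_term_score(2, LONG_LEN, AVG_LEN, n_docs=3, doc_freq=3, b=0.0)
print(round(long_b0, 6), round(short_b0, 6), round(third_b0, 6))

# Default BM25 (b = 0.75): the shorter document now scores higher.
long_b75 = bm25_term_score(3, LONG_LEN, AVG_LEN, n_docs=3, doc_freq=3)
short_b75 = bm25_term_score(3, SHORT_LEN, AVG_LEN, n_docs=3, doc_freq=3)
print(short_b75 > long_b75)
```

With `b = 0` the first two values agree with the 0.20983505 in the output and the third with 0.18360566, up to floating-point rounding. The `b = 0.75` comparison only shows the direction of the length effect, since the real token counts are not given here.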

Searching on body:

GET my_sim_index/_search
{"query":{"match":{"body":"similarity"}}
}

Here the scores are affected by document length, even for documents that match similarity the same number of times

9. Using query_string:

{"query":{"query_string":{"query":"(title:red)^1.0 AND (brand:iphone)"}}
}

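10. The 黄桃 (yellow peach) / 罐头 (canned food) bad case: rank products that match both terms first and partial matches after

Option 1: use constant_score
Add a query filter that ignores the TF/IDF score and assigns a custom score, so that products matching both terms rank first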

       "should": [{"constant_score": {"filter": {"query_string": {"query": "allWord:(+(黄桃) AND +(罐头))"}},"boost": 500}}]

Option 2
Add a filter with a weight to the functions of the existing function_score query

          "function_score" : {"query" : {"bool" : {"must" : [{"query_string" : {"query" : "(title:(+(黄桃 罐头))^2.4 OR catBrand:(+(黄桃 罐头))^0.6 OR facet:(+(黄桃 罐头))^0.6 OR allWord:(+(黄桃 罐头))^0.0)","fields" : [ ],"use_dis_max" : true,"tie_breaker" : 0.0,"default_operator" : "or","auto_generate_phrase_queries" : false,"max_determinized_states" : 10000,"enable_position_increments" : true,"fuzziness" : "AUTO","fuzzy_prefix_length" : 0,"fuzzy_max_expansions" : 50,"phrase_slop" : 0,"escape" : false,"split_on_whitespace" : true,"boost" : 1.0}}],"filter" : [{"term" : {"skuDocType" : {"value" : 1,"boost" : 1.0}}},{"bool" : {"must_not" : [{"term" : {"spMask" : {"value" : 1,"boost" : 1.0}}}],"disable_coord" : false,"adjust_pure_negative" : true,"boost" : 1.0}}],"disable_coord" : false,"adjust_pure_negative" : true,"boost" : 1.0}},"functions" : [{"filter": {"query_string": {"query":"allWord:(黄桃 AND 罐头)"}},"weight":400},{"filter" : {"match_all" : {"boost" : 1.0}},"script_score" : {"script" : {"id" : "osop_score_script","lang" : "painless","params" : {"catSearch" : false,"fakeCat" : "cat16035591","weight" : true,"topSku" : {"pop8013634719" : 300.0,"1130765898" : 300.0},"hotCatIds" : {"cat16035591" : 0.9666818804198996}}}}}],"score_mode" : "sum","boost_mode" : "sum","max_boost" : 3.4028235E38,"boost" : 1.0}

Monitoring

_stats index monitoring

Elasticsearch Index Monitoring: the Index Stats API in detail
Request:

GET <index_name>/_stats

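Field explanations (the annotated sample below has the shape of a node-level stats response, with _nodes and nodes at the top level; the metric groups under indices are the ones of interest):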

{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "ELKTEST",
  "nodes": {
    "lnlHC8yERCKXCuAc_2DPCQ": {
      "timestamp": 1534242595995,
      "name": "OPS01-ES01",
      "transport_address": "10.9.125.148:9300",
      "host": "10.9.125.148",
      "ip": "10.9.125.148:9300",
      "roles": [
        "master",
        "data",
        "ingest"
      ],
      "attributes": {
        "ml.machine_memory": "8203104256",
        "xpack.installed": "true",
        "ml.max_open_jobs": "20",
        "ml.enabled": "true"
      },
      "indices": {
        "docs": {
          "count": 8111612,   # number of documents on the node
          "deleted": 16604    # deleted documents not yet purged from the segments
        },
        "store": {
          "size_in_bytes": 2959876263  # physical storage consumed by the node
        },
        # indexing: number of documents indexed. This is a cumulative counter that does
        # not decrease when documents are deleted. It only ever increases and is
        # incremented on every internal index operation, including updates.
        "indexing": {
          "index_total": 17703152,
          "index_time_in_millis": 2801934,
          "index_current": 0,
          "index_failed": 0,
          "delete_total": 46242,
          "delete_time_in_millis": 2130,
          "delete_current": 0,
          "noop_update_total": 0,
          "is_throttled": false,
          "throttle_time_in_millis": 0    # a high value means the disk throttling limit is set too low
        },
        "get": {
          "total": 185179,
          "time_in_millis": 22341,
          "exists_total": 185178,
          "exists_time_in_millis": 22337,
          "missing_total": 1,
          "missing_time_in_millis": 4,
          "current": 0
        },
        "search": {
          "open_contexts": 0,   # number of active searches
          "query_total": 495447,    # total number of queries
          # query_time_in_millis: total time spent on queries since the node started.
          # The ratio query_time_in_millis / query_total is a rough indicator of query
          # efficiency. The larger the ratio, the more time each query takes, and the
          # more you should consider tuning or optimizing.
          "query_time_in_millis": 298344,
          "query_current": 0,
          # The fetch statistics below describe the second phase of a search (the fetch
          # in query_then_fetch). If fetch takes more time than query, your disks are
          # slow, you are fetching too many documents, or your paging parameters are
          # too large (for example, a large size).
          "fetch_total": 130194,
          "fetch_time_in_millis": 51211,
          "fetch_current": 0,
          "scroll_total": 22,
          "scroll_time_in_millis": 2196665,
          "scroll_current": 0,
          "suggest_total": 0,
          "suggest_time_in_millis": 0,
          "suggest_current": 0
        },
        # merges: information about Lucene segment merges. It tells you how many merges
        # are in progress, how many documents are involved, the total size of the
        # segments being merged, and the total time spent merging. These statistics
        # matter when the cluster is write-heavy: merges consume a lot of disk I/O and
        # CPU, and heavy indexing produces many merge operations.
        "merges": {
          "current": 0,
          "current_docs": 0,
          "current_size_in_bytes": 0,
..