Elasticsearch

Project Highlights

  1. Search-box autocomplete for both keywords and pinyin, built on a custom ES pinyin analyzer
  2. Efficient multi-condition search on ES by keyword, distance, city, price range, etc., with keyword highlighting
  3. Aggregations over the index data to render search filter options dynamically, using hotel reputation and document counts
  4. ES index and database kept in sync via RabbitMQ
  5. An ES cluster deployed with Ribbon for load balancing

The Inverted Index

The data is split into terms, and a mapping from each term to the ids of the documents containing it is stored.

A MySQL like query scans rows one by one and is slow; a simple lookup by id, on the other hand, is actually fast in the database.
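The idea can be sketched in plain Java. This is a toy illustration (a whitespace tokenizer and hypothetical document ids), not how Lucene actually stores its index:

```java
import java.util.*;

public class InvertedIndexDemo {
    // term -> sorted ids of the documents containing that term
    static Map<String, Set<Integer>> build(Map<Integer, String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            for (String term : doc.getValue().split("\\s+")) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = build(Map.of(
            1, "hilton hotel shanghai",
            2, "home inn shanghai",
            3, "hilton garden inn"));
        // a term lookup is now one map access instead of a scan over every row
        System.out.println(index.get("hilton"));   // [1, 3]
        System.out.println(index.get("shanghai")); // [1, 2]
    }
}
```

This is why a keyword search beats a like '%...%' scan: the term is looked up directly instead of being tested against every row.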

Elasticsearch architecture:

ES is mainly used for search (e.g., querying products); MySQL emphasizes data persistence and consistency (e.g., order systems, blog systems).


The IK Chinese Analysis Plugin

IK tokenization modes:

  • ik_smart: coarsest-grained splitting (fewest tokens)

  • ik_max_word: finest-grained splitting

POST /_analyze
{
  "text": "我是黑马java程序员哈哈!",
  "analyzer": "ik_max_word"
}

IK dictionary management:

Some internet slang is missing from the built-in dictionary, and some words should be blocked; both are handled through the plugin's configuration file.

Edit the IKAnalyzer.cfg.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <comment>IK Analyzer extension configuration</comment>
  <!-- extension dictionary -->
  <entry key="ext_dict">ext.dic</entry>
  <!-- extension stopword dictionary -->
  <entry key="ext_stopwords">stopword.dic</entry>
  <!-- remote extension dictionary -->
  <!-- <entry key="remote_ext_dict">words_location</entry> -->
  <!-- remote extension stopword dictionary -->
  <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

Words added to ext.dic extend the dictionary; words added to stopword.dic become stopwords and are filtered out.

Index Mappings

mapping properties:

The index property defaults to true, meaning the field participates in the inverted index and is searchable.


Creating an index (analogous to a database table):

Like a MySQL table, an index needs its fields and data types defined.

PUT /heima
{
  "mappings": {
    "properties": {
      "info":{
        "type": "text",
        "analyzer": "ik_smart"
      },
      "email":{
        "type": "keyword",
        "index": "false"
      },
      "name":{
        "properties": {
          "firstName": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

DSL syntax:

Index operations (analogous to database tables)

Existing fields cannot be modified, but new fields can be added.


# Create an index
PUT /heima
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "email": {
        "type": "keyword",
        "index": false
      }
    }
  }
}

# Get an index
GET /heima

# Add a field to the mapping
PUT /heima/_mapping
{
  "properties": {
    "isRoot": {
      "type": "boolean",
      "index": false
    }
  }
}

# Delete an index
DELETE /heima

Document operations (analogous to table rows)

Essentially plain CRUD.

# Add a document
POST /heima/_doc/1
{
  "name": "赵云",
  "email": "3065941239@qq.com",
  "isRoot": true
}

# Get a document
GET /heima/_doc/1

# Delete a document
DELETE /heima/_doc/1

# Update 1: full replacement
PUT /heima/_doc/1
{
  "name": "赵云",
  "email": "3065941239@qq.com",
  "isRoot": false
}

# Update 2: partial (single-field) update
POST /heima/_update/1
{
  "doc": {
    "isRoot": true
  }
}

Queries


# match: analyzed full-text query on a single field
GET /hotel/_search
{
  "query": {
    "match": {
      "name": "酒店"
    }
  }
}

# multi_match: query several fields at once
GET /hotel/_search
{
  "query": {
    "multi_match": {
      "query": "酒店",
      "fields": ["brand", "business", "name"]
    }
  }
}

# range query
GET /hotel/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 10,
        "lte": 20
      }
    }
  }
}

# term: exact match on a keyword field
GET /hotel/_search
{
  "query": {
    "term": {
      "business": {
        "value": "外滩"
      }
    }
  }
}

# geo_distance: documents within a radius of a point
GET /hotel/_search
{
  "query": {
    "geo_distance": {
      "distance": "15km",
      "location": "31.21,121.5"
    }
  }
}

# geo_bounding_box: documents inside a rectangle
GET /hotel/_search
{
  "query": {
    "geo_bounding_box": {
      "location": {
        "top_left": {
          "lat": 31.1,
          "lon": 121.5
        },
        "bottom_right": {
          "lat": 30.9,
          "lon": 121.7
        }
      }
    }
  }
}

Operating ES from Java (RestClient)

RestClient documentation

Designing the index from the database table


PUT /hotel
{
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "name": {
        "type": "text",
        "analyzer": "ik_max_word",
        "copy_to": "all"
      },
      "address": {
        "type": "keyword",
        "index": false
      },
      "price": {
        "type": "integer"
      },
      "score": {
        "type": "integer"
      },
      "brand": {
        "type": "keyword",
        "copy_to": "all"
      },
      "city": {
        "type": "keyword"
      },
      "starName": {
        "type": "keyword"
      },
      "business": {
        "type": "keyword",
        "copy_to": "all"
      },
      "location": {
        "type": "geo_point"
      },
      "pic": {
        "type": "keyword",
        "index": false
      },
      "all": {
        "type": "text",
        "analyzer": "ik_max_word"
      }
    }
  }
}

copy_to copies a field's tokens into the all field, so a single query against all searches every field copied into it at once.
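Conceptually, copy_to merges the tokens of several fields into one extra searchable field at index time. A rough plain-Java analogy (the Hotel record and field names here are hypothetical, and a whitespace split stands in for the analyzer):

```java
import java.util.*;

public class CopyToDemo {
    record Hotel(String name, String brand, String business) {}

    // emulate copy_to: the virtual "all" field holds the tokens of every copied field
    static Set<String> allField(Hotel h) {
        Set<String> all = new HashSet<>();
        for (String field : new String[]{h.name(), h.brand(), h.business()}) {
            all.addAll(Arrays.asList(field.split("\\s+")));
        }
        return all;
    }

    public static void main(String[] args) {
        Hotel h = new Hotel("hilton hotel", "hilton", "the bund");
        Set<String> all = allField(h);
        // one query against "all" covers name, brand and business at once
        System.out.println(all.contains("bund"));   // true
        System.out.println(all.contains("hilton")); // true
    }
}
```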

JavaRestClient Quick Start

  1. Add the dependency
<properties>
  <java.version>1.8</java.version>
  <elasticsearch.version>7.12.1</elasticsearch.version>
</properties>

<dependency>
  <groupId>org.elasticsearch.client</groupId>
  <artifactId>elasticsearch-rest-high-level-client</artifactId>
</dependency>
  2. Basic usage
RestHighLevelClient client;

@BeforeEach
void initialize() {
    this.client = new RestHighLevelClient(
        RestClient.builder(HttpHost.create("http://192.168.25.80:9200"))
    );
}

@AfterEach
void close() throws IOException {
    client.close();
}

@Test
void testClient() {
    System.out.println(client);
}

Index operations (indices)


RestHighLevelClient client;

@Test
void testCreate() throws IOException {
    CreateIndexRequest request = new CreateIndexRequest("hotel");
    request.source(HOTEL_INDEX, XContentType.JSON); // HOTEL_INDEX holds the mapping JSON above
    client.indices().create(request, RequestOptions.DEFAULT);
}

@Test
void testDelete() throws IOException {
    DeleteIndexRequest request = new DeleteIndexRequest("hotel");
    client.indices().delete(request, RequestOptions.DEFAULT);
}

@Test
void testExists() throws IOException {
    GetIndexRequest request = new GetIndexRequest("hotel");
    boolean exists = client.indices().exists(request, RequestOptions.DEFAULT);
    System.out.println(exists);
}

Document operations (the index APIs)

// Add a document (also serves as a full-document update)
@Test
void testIndexAdd() throws IOException {
    Hotel hotel = hotelMapper.selectById(36934L);
    HotelDoc hotelDoc = new HotelDoc(hotel);
    IndexRequest request = new IndexRequest("hotel").id(hotelDoc.getId().toString());
    request.source(JSON.toJSONString(hotelDoc), XContentType.JSON);
    client.index(request, RequestOptions.DEFAULT);
}

// Get a document
@Test
void testIndexGet() throws IOException {
    GetRequest request = new GetRequest("hotel", "36934");
    GetResponse response = client.get(request, RequestOptions.DEFAULT);
    String json = response.getSourceAsString();
    System.out.println(json);
}

// Update a document: partial (single-field) update
@Test
void testIndexUpdate() throws IOException {
    Hotel hotel = hotelMapper.selectById(36934L);
    HotelDoc hotelDoc = new HotelDoc(hotel);
    UpdateRequest request = new UpdateRequest("hotel", hotelDoc.getId().toString());
    request.doc("city", "北京");
    client.update(request, RequestOptions.DEFAULT);
}

// Delete a document
@Test
void testIndexDelete() throws IOException {
    Hotel hotel = hotelMapper.selectById(36934L);
    HotelDoc hotelDoc = new HotelDoc(hotel);
    DeleteRequest request = new DeleteRequest("hotel", hotelDoc.getId().toString());
    client.delete(request, RequestOptions.DEFAULT);
}

// Bulk-add documents
@Test
void testIndexBulk() throws IOException {
    List<Hotel> hotels = hotelMapper.selectList(null);
    BulkRequest request = new BulkRequest();
    for (Hotel hotel : hotels) {
        HotelDoc hotelDoc = new HotelDoc(hotel);
        request.add(new IndexRequest("hotel")
            .id(hotelDoc.getId().toString())
            .source(JSON.toJSONString(hotelDoc), XContentType.JSON));
    }
    client.bulk(request, RequestOptions.DEFAULT);
}

Relevance Scoring

  1. The TF algorithm

Scores a document purely by how often the term occurs in it.

  2. The TF-IDF algorithm

Builds on TF, additionally weighing each term by how rare it is across the whole corpus.

  3. The BM25 algorithm

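The three scoring ideas can be sketched numerically. This is a simplified illustration, not Elasticsearch's exact implementation: the IDF here is the classic log(N/(n+1)) form rather than Lucene's BM25 IDF, and k1/b use their common default values:

```java
public class ScoringDemo {
    // TF: how often the term appears relative to document length
    static double tf(int termFreq, int docLen) {
        return (double) termFreq / docLen;
    }

    // IDF: rarer terms across the corpus weigh more (simplified form)
    static double idf(int docCount, int docsWithTerm) {
        return Math.log((double) docCount / (docsWithTerm + 1));
    }

    static double tfIdf(int termFreq, int docLen, int docCount, int docsWithTerm) {
        return tf(termFreq, docLen) * idf(docCount, docsWithTerm);
    }

    // BM25 per-term score: term frequency saturates, document length is normalized
    static double bm25(int termFreq, int docLen, double avgDocLen,
                       int docCount, int docsWithTerm) {
        double k1 = 1.2, b = 0.75; // common defaults
        double norm = termFreq + k1 * (1 - b + b * docLen / avgDocLen);
        return idf(docCount, docsWithTerm) * termFreq * (k1 + 1) / norm;
    }

    public static void main(String[] args) {
        // 1000 docs, the term occurs in 10 of them, 3 times in this 100-word doc
        System.out.printf("tf-idf = %.4f%n", tfIdf(3, 100, 1000, 10));
        System.out.printf("bm25   = %.4f%n", bm25(3, 100, 120.0, 1000, 10));
    }
}
```

Note how in bm25 a huge termFreq no longer grows the score linearly, which is BM25's main improvement over TF-IDF.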

Adjusting scores with function_score


Compound queries (bool query)


GET /hotel/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "brand": {
              "value": "如家"
            }
          }
        }
      ],
      "must_not": [
        {
          "range": {
            "price": {
              "gt": 300
            }
          }
        }
      ],
      "filter": [
        {
          "geo_distance": {
            "distance": "10km",
            "location": {
              "lat": 31.3,
              "lon": 121.4
            }
          }
        }
      ]
    }
  }
}

Processing Search Results

The sort clause sits at the same level as query.

Sorting

# When score values tie, results fall back to price ascending
GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "score": "desc"
    },
    {
      "price": "asc"
    }
  ]
}

Pagination

Specify from and size:

GET /hotel/_search
{
  "query": {
    "match_all": {}
  },
  "from": 0,
  "size": 2,
  "sort": [
    {
      "price": {
        "order": "asc"
      }
    }
  ]
}

Searching with RestClient

Single-field query

@Test
void testSearch() throws IOException {
    SearchRequest request = new SearchRequest("hotel");
    // request.source().query(QueryBuilders.matchQuery("all", "如家"));
    // request.source().query(QueryBuilders.multiMatchQuery("如家", "brand", "business", "name"));
    // request.source().query(QueryBuilders.termQuery("brand", "如家"));
    request.source().query(QueryBuilders.rangeQuery("price").lte(500).gte(300));
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    handleResponse(response);
}

Compound query

@Test
void testBoolSearch() throws IOException {
    SearchRequest request = new SearchRequest("hotel");
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    boolQuery.must(QueryBuilders.termQuery("brand", "如家"));
    boolQuery.mustNot(QueryBuilders.rangeQuery("price").gt(1000));
    boolQuery.filter(QueryBuilders.geoDistanceQuery("location").point(31.1D, 121.5D).distance("100km"));
    request.source().query(boolQuery);
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    handleResponse(response);
}

Pagination and sorting

@Test
void testOrder() throws IOException {
    int page = 1;
    int size = 5;
    SearchRequest request = new SearchRequest("hotel");
    request.source().query(QueryBuilders.matchAllQuery());
    request.source().sort("price", SortOrder.ASC);
    request.source().from((page - 1) * size).size(size);
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    handleResponse(response);
}

Highlighting

requireFieldMatch is a HighlightBuilder setting that controls whether highlighting applies only to fields that actually matched the query. With requireFieldMatch set to true, Elasticsearch highlights matching text only in the fields the query explicitly targets: if you query field A but not field B, text in B is never highlighted even when it matches.

With requireFieldMatch set to false, Elasticsearch tries to highlight matching text in all fields named in the highlight request, not just the fields the query mentions.

@Test
void testHighlight() throws IOException {
    SearchRequest request = new SearchRequest("hotel");
    request.source().query(QueryBuilders.matchQuery("all", "如家"));
    request.source().highlighter(new HighlightBuilder().field("name").requireFieldMatch(false));
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    handleResponse(response);
}

The Heima Travel (黑马旅游) Case

Pinning ads to the top

  1. Add an ad flag to the document
public class HotelDoc {
    private Boolean isAD;
}
  2. Use a function_score query to re-score documents whose isAD is true
FunctionScoreQueryBuilder scoreQuery = QueryBuilders.functionScoreQuery(boolQuery,
    new FunctionScoreQueryBuilder.FilterFunctionBuilder[]{
        new FunctionScoreQueryBuilder.FilterFunctionBuilder(
            QueryBuilders.termQuery("isAD", true),
            ScoreFunctionBuilders.weightFactorFunction(10)
        )
    });
request.source().query(scoreQuery);

Rendering Page Filter Options with Aggregations

public Map<String, List<String>> getMultiAggregation() {
    Map<String, List<String>> map = new HashMap<>();
    SearchRequest request = new SearchRequest("hotel");
    request.source().size(0);
    request.source().aggregation(AggregationBuilders.terms("brandAgg").field("brand").size(20));
    request.source().aggregation(AggregationBuilders.terms("cityAgg").field("city").size(20));
    request.source().aggregation(AggregationBuilders.terms("starAgg").field("starName").size(20));
    SearchResponse response = null;
    try {
        response = client.search(request, RequestOptions.DEFAULT);
    } catch (IOException e) {
        e.printStackTrace();
    }
    Aggregations aggregations = response.getAggregations();
    map.put("品牌", getAggList(aggregations, "brandAgg"));
    map.put("城市", getAggList(aggregations, "cityAgg"));
    map.put("星级", getAggList(aggregations, "starAgg"));
    return map;
}

private List<String> getAggList(Aggregations aggregations, String aggName) {
    Terms termsAgg = aggregations.get(aggName);
    List<String> list = new ArrayList<>();
    for (Terms.Bucket bucket : termsAgg.getBuckets()) {
        list.add(bucket.getKeyAsString());
    }
    return list;
}

Aggregations

Querying documents

An aggregation can be understood as grouping documents into buckets and computing statistics over them, for example:

  1. bucketing hotels by brand
  2. the average, maximum, and minimum score per brand
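What a terms aggregation with a stats sub-aggregation computes can be reproduced with plain Java streams (a hypothetical Hotel record and toy data, purely for illustration):

```java
import java.util.*;
import java.util.stream.*;

public class AggDemo {
    record Hotel(String brand, int score) {}

    // the "terms" aggregation: bucket documents by brand
    static Map<String, List<Hotel>> termsAgg(List<Hotel> hotels) {
        return hotels.stream().collect(Collectors.groupingBy(Hotel::brand));
    }

    // the "stats" sub-aggregation: count/min/avg/max score inside one bucket
    static IntSummaryStatistics statsAgg(List<Hotel> bucket) {
        return bucket.stream().mapToInt(Hotel::score).summaryStatistics();
    }

    public static void main(String[] args) {
        List<Hotel> hotels = List.of(
            new Hotel("如家", 46), new Hotel("如家", 44), new Hotel("希尔顿", 48));
        termsAgg(hotels).forEach((brand, docs) -> {
            IntSummaryStatistics s = statsAgg(docs);
            System.out.println(brand + ": count=" + s.getCount()
                + " avg=" + s.getAverage() + " max=" + s.getMax() + " min=" + s.getMin());
        });
    }
}
```

ES does the same bucketing, but over the inverted index and distributed across shards.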

Aggregation types

Aggregation syntax

GET /hotel/_search
{
  # limit the scope of the aggregation
  "query": {
    "range": {
      "price": {
        "lte": 1000
      }
    }
  },
  # return no hits, only aggregation results
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": {
        "field": "brand",
        "size": 10,
        # order buckets by the sub-aggregation's average score
        "order": {
          "scoreAgg.avg": "desc"
        }
      },
      # sub-aggregation: score stats included with each bucket
      "aggs": {
        "scoreAgg": {
          "stats": {
            "field": "score"
          }
        }
      }
    }
  }
}

Aggregations via the RestClient

@Test
void testAggregation() throws IOException {
    // build the aggregation
    SearchRequest request = new SearchRequest("hotel");
    request.source().size(0);
    request.source().aggregation(AggregationBuilders.terms("brandAgg").field("brand").size(20));
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    System.out.println("response = " + response);
    // parse the response
    Aggregations aggregations = response.getAggregations();
    Terms brandAgg = aggregations.get("brandAgg");
    List<? extends Terms.Bucket> buckets = brandAgg.getBuckets();
    for (Terms.Bucket bucket : buckets) {
        String key = bucket.getKeyAsString();
        System.out.println("key = " + key);
    }
}

Autocomplete

The Pinyin Analyzer

Download the pinyin analyzer plugin

Custom analyzer

analyzer: the analyzer used when building the inverted index, i.e. at indexing time.

search_analyzer: the analyzer applied to the search terms before matching them against the inverted index.

// Custom pinyin analyzer
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "ik_max_word"
      },
      "id": {
        "type": "keyword"
      }
    }
  }
}

This filter block is a custom configuration of the Elasticsearch pinyin token filter, which converts Chinese text into pinyin; that is useful for fuzzy search and pinyin-based search over Chinese content.

The fields in the filter configuration:

  1. type
    • Value: "pinyin"
    • Sets the filter type to the pinyin filter.
  2. keep_full_pinyin
    • Value: false
    • Whether to keep each character's full pinyin as separate tokens; false means individual full-pinyin syllables are not kept.
  3. keep_joined_full_pinyin
    • Value: true
    • Whether to keep the joined full pinyin of the whole term (e.g. xiaomi).
  4. keep_original
    • Value: true
    • Whether to keep the original Chinese term in the token stream.
  5. limit_first_letter_length
    • Value: 16
    • Caps the length of the first-letter abbreviation token; here at most 16 first letters are kept.
  6. remove_duplicated_term
    • Value: true
    • Removes duplicate tokens from the result.
  7. none_chinese_pinyin_tokenize
    • Value: false
    • Whether non-Chinese text should be split by the pinyin tokenizer; false leaves non-Chinese text untouched.

Tokenization result:

POST /test/_analyze
{
  "text": ["小米手机"],
  "analyzer": "my_analyzer"
}

// resulting tokens
小米, xiaomi, xm, 手机, shouji, sj

Autocomplete with the RestClient

The completion field type

// Index for autocompletion
PUT test
{
  "mappings": {
    "properties": {
      "title": {
        "type": "completion"
      }
    }
  }
}

// sample data
POST test/_doc
{
  "title": ["Sony", "WH-1000XM3"]
}
POST test/_doc
{
  "title": ["SK-II", "PITERA"]
}
POST test/_doc
{
  "title": ["Nintendo", "switch"]
}

Sending the request

Redesigned index:

// Hotel index with autocomplete support
PUT /hotel
{
  "settings": {
    "analysis": {
      "analyzer": {
        "text_anlyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        },
        "completion_analyzer": {
          "tokenizer": "keyword",
          "filter": "py"
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "name": {
        "type": "text",
        "analyzer": "text_anlyzer",
        "search_analyzer": "ik_smart",
        "copy_to": "all"
      },
      "address": {
        "type": "keyword",
        "index": false
      },
      "price": {
        "type": "integer"
      },
      "score": {
        "type": "integer"
      },
      "brand": {
        "type": "keyword",
        "copy_to": "all"
      },
      "city": {
        "type": "keyword"
      },
      "starName": {
        "type": "keyword"
      },
      "business": {
        "type": "keyword",
        "copy_to": "all"
      },
      "location": {
        "type": "geo_point"
      },
      "pic": {
        "type": "keyword",
        "index": false
      },
      "all": {
        "type": "text",
        "analyzer": "text_anlyzer",
        "search_analyzer": "ik_smart"
      },
      "suggestion": {
        "type": "completion",
        "analyzer": "completion_analyzer"
      }
    }
  }
}

Processing the result

Data Synchronization

Method 1: synchronous call

Method 2: asynchronous notification


Method 3: listen to the MySQL binlog


Data synchronization with a message queue


Building an ES Cluster

Cluster split-brain

Shard storage