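The histogram aggregation is a multi-bucket aggregation that works on numeric values or numeric range values extracted from documents. It dynamically builds fixed-size buckets (intervals) over the values being aggregated.
Suppose a value is 32 and the bucket size is 5. Then 32 is rounded down (not rounded to the nearest multiple) to 30, so the document falls into the bucket with key 30. The following formula determines exactly which bucket each document belongs to: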
bucket_key = Math.floor((value - offset) / interval) * interval + offset
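The formula can be checked without a cluster. Below is a minimal sketch in plain Java; the class name `BucketKey` and the method `bucketKey` are made up for illustration and are not part of any Elasticsearch API.

```java
public class BucketKey {

    // Mirrors the formula: floor((value - offset) / interval) * interval + offset
    static double bucketKey(double value, double interval, double offset) {
        return Math.floor((value - offset) / interval) * interval + offset;
    }

    public static void main(String[] args) {
        double[] values = {3, 8, 15, 32};
        for (double value : values) {
            // With interval = 5 and offset = 0 the keys are 0.0, 5.0, 15.0 and 30.0
            System.out.println(value + " -> " + bucketKey(value, 5, 0));
        }
        // A non-zero offset shifts every boundary: with offset = 2 the buckets
        // become [2,7), [7,12), ... so 8 lands in the bucket with key 7.0
        System.out.println(bucketKey(8, 5, 2));
    }
}
```

Because the formula uses `Math.floor`, 32 maps to 30 and never to 35: a value always belongs to the bucket whose key is at or below it.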
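offset: defaults to 0 and must lie in the range [0, interval), so it can never be negative.
value: the value that takes part in the calculation, for example the price field of a document.
The worked example that follows is my own understanding; corrections are welcome.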
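Existing data: [3, 8, 15]
offset = 0
interval = 5
The values then map onto the buckets [0, 5), [5, 10), [10, 15), [15, 20):
Math.floor((3 - 0) / 5) * 5 + 0 = 0, so 3 falls into [0, 5)
Math.floor((8 - 0) / 5) * 5 + 0 = 5, so 8 falls into [5, 10)
Math.floor((15 - 0) / 5) * 5 + 0 = 15, so 15 falls into [15, 20)
Every bucket spans exactly one interval, so the last bucket is [15, 20) rather than an open-ended [15, +∞), and [10, 15) stays empty.
We have a set of API response times and run histogram aggregations over this data.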
PUT /index_api_response_time
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "id": { "type": "long" },
      "api": { "type": "keyword" },
      "response_time": { "type": "integer" }
    }
  }
}
The mapping is simple, with just three fields: id, api and response_time.
PUT /index_api_response_time/_bulk
{"index":{"_id":1}}
{"api":"/user/infos","response_time": 3}
{"index":{"_id":2}}
{"api":"/user/add"}
{"index":{"_id":3}}
{"api":"/user/update","response_time": 8}
{"index":{"_id":4}}
{"api":"/user/list","response_time": 15}
{"index":{"_id":5}}
{"api":"/user/export","response_time": 30}
{"index":{"_id":6}}
{"api":"/user/detail","response_time": 32}
Note the document with id=2: it has no response_time field and will be handled separately when aggregating.
GET /index_api_response_time/_search
{
  "size": 0,
  "aggs": {
    "agg_01": {
      "histogram": {
        "field": "response_time",
        "interval": 5
      }
    }
  }
}
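import co.elastic.clients.elasticsearch.core.SearchRequest;
import co.elastic.clients.elasticsearch.core.SearchResponse;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;
import java.io.IOException;

// client is the ElasticsearchClient field of the test class
@Test
@DisplayName("Aggregate on response_time with an interval of 5")
public void test01() throws IOException {
    SearchRequest request = SearchRequest.of(search -> search
            .index("index_api_response_time")
            .size(0) // only the aggregation result is needed, no hits
            .aggregations("agg_01", agg -> agg
                    .histogram(histogram -> histogram.field("response_time").interval(5D))));
    System.out.println("request: " + request);
    SearchResponse<String> response = client.search(request, String.class);
    System.out.println("response: " + response);
}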
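The sum sub-aggregation is added here so that, together with the known data, we can check that each value landed in the expected bucket.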
GET /index_api_response_time/_search
{
  "size": 0,
  "aggs": {
    "agg_01": {
      "histogram": {
        "field": "response_time",
        "interval": 5
      },
      "aggs": {
        "agg_sum": {
          "sum": {
            "field": "response_time"
          }
        }
      }
    }
  }
}
@Test
@DisplayName("On top of test01, sum the response time of each bucket")
public void test02() throws IOException {
    SearchRequest request = SearchRequest.of(search -> search
            .index("index_api_response_time")
            .size(0)
            .aggregations("agg_01", agg -> agg
                    .histogram(histogram -> histogram.field("response_time").interval(5D))
                    // sub-aggregation: total response_time per bucket
                    .aggregations("agg_sum", aggSum -> aggSum.sum(sum -> sum.field("response_time")))));
    System.out.println("request: " + request);
    SearchResponse<String> response = client.search(request, String.class);
    System.out.println("response: " + response);
}
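From the result in 5.1 we can see that every bucket is returned whether or not it contains documents, so the response includes many empty buckets, that is, buckets with doc_count=0. Here we want to leave those buckets out of the response, which is what min_doc_count is for.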
GET /index_api_response_time/_search
{
  "size": 0,
  "aggs": {
    "agg_01": {
      "histogram": {
        "field": "response_time",
        "interval": 5,
        "min_doc_count": 1
      }
    }
  }
}
@Test
@DisplayName("Only return buckets that contain at least 1 document - min_doc_count")
public void test03() throws IOException {
    SearchRequest request = SearchRequest.of(search -> search
            .index("index_api_response_time")
            .size(0)
            .aggregations("agg_01", agg -> agg
                    .histogram(histogram -> histogram
                            .field("response_time")
                            .interval(5D)
                            .minDocCount(1)))); // drop buckets with doc_count=0
    System.out.println("request: " + request);
    SearchResponse<String> response = client.search(request, String.class);
    System.out.println("response: " + response);
}
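To see what min_doc_count changes without running a query, the bucketing can be imitated in plain Java. This is only a sketch of the behaviour described above, not how Elasticsearch implements it; `HistogramSketch` and its `histogram` method are hypothetical names, and the offset is assumed to be 0.

```java
import java.util.TreeMap;

public class HistogramSketch {

    // Counts values per bucket, fills the gaps between the first and last
    // non-empty bucket with zero counts, then keeps only the buckets whose
    // count reaches minDocCount.
    static TreeMap<Double, Long> histogram(double[] values, double interval, long minDocCount) {
        TreeMap<Long, Long> countsByIndex = new TreeMap<>();
        for (double value : values) {
            long index = (long) Math.floor(value / interval); // offset assumed to be 0
            countsByIndex.merge(index, 1L, Long::sum);
        }
        TreeMap<Double, Long> buckets = new TreeMap<>();
        if (countsByIndex.isEmpty()) {
            return buckets;
        }
        for (long i = countsByIndex.firstKey(); i <= countsByIndex.lastKey(); i++) {
            long count = countsByIndex.getOrDefault(i, 0L);
            if (count >= minDocCount) {
                buckets.put(i * interval, count);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        // response_time values from the bulk request; id=2 has no value and is skipped
        double[] responseTimes = {3, 8, 15, 30, 32};
        // prints {0.0=1, 5.0=1, 10.0=0, 15.0=1, 20.0=0, 25.0=0, 30.0=2}
        System.out.println(histogram(responseTimes, 5, 0));
        // prints {0.0=1, 5.0=1, 15.0=1, 30.0=2}
        System.out.println(histogram(responseTimes, 5, 1));
    }
}
```

With a minimum of 0 the empty buckets 10, 20 and 25 appear; with a minimum of 1 they are dropped, which matches the effect of `"min_doc_count": 1` in the request.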
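What does this mean? Suppose we filter with response_time >= 10 and use interval=5. By default ES then does not return the buckets with bucket_key = 0, 5, 10. How can we still get them back? With extended_bounds.
extended_bounds only makes sense when min_doc_count=0. extended_bounds does not filter buckets: it can only add empty buckets outside the range covered by the data, and it never removes a bucket that contains documents.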
GET /index_api_response_time/_search
{
  "size": 0,
  "query": {
    "range": {
      "response_time": {
        "gte": 10
      }
    }
  },
  "aggs": {
    "agg_01": {
      "histogram": {
        "field": "response_time",
        "interval": 5,
        "min_doc_count": 0,
        "extended_bounds": {
          "min": 0,
          "max": 50
        }
      }
    }
  }
}
// JsonData: co.elastic.clients.json.JsonData
@Test
@DisplayName("Fill in empty buckets - extended_bounds")
public void test04() throws IOException {
    SearchRequest request = SearchRequest.of(search -> search
            .index("index_api_response_time")
            .size(0)
            .query(query -> query.range(range -> range.field("response_time").gte(JsonData.of(10))))
            .aggregations("agg_01", agg -> agg
                    .histogram(histogram -> histogram
                            .field("response_time")
                            .interval(5D)
                            .minDocCount(0)
                            // min is 0 (the original code used 1) so that it matches the DSL
                            .extendedBounds(bounds -> bounds.min(0D).max(50D)))));
    System.out.println("request: " + request);
    SearchResponse<String> response = client.search(request, String.class);
    System.out.println("response: " + response);
}
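The effect of extended_bounds can be imitated the same way. The sketch below rests on my own assumptions: the offset is 0, and the bounds simply widen the range of bucket keys up to and including the bucket that contains `max`. `ExtendedBoundsSketch` is a hypothetical helper, not Elasticsearch code.

```java
import java.util.TreeMap;

public class ExtendedBoundsSketch {

    // boundMin / boundMax may be null, meaning "no extended bound on that side"
    static TreeMap<Double, Long> histogram(double[] values, double interval,
                                           Double boundMin, Double boundMax) {
        TreeMap<Long, Long> countsByIndex = new TreeMap<>();
        for (double value : values) {
            countsByIndex.merge((long) Math.floor(value / interval), 1L, Long::sum);
        }
        // The bounds only add empty buckets; they never remove or overwrite one
        if (boundMin != null) {
            countsByIndex.putIfAbsent((long) Math.floor(boundMin / interval), 0L);
        }
        if (boundMax != null) {
            countsByIndex.putIfAbsent((long) Math.floor(boundMax / interval), 0L);
        }
        TreeMap<Double, Long> buckets = new TreeMap<>();
        if (countsByIndex.isEmpty()) {
            return buckets;
        }
        for (long i = countsByIndex.firstKey(); i <= countsByIndex.lastKey(); i++) {
            buckets.put(i * interval, countsByIndex.getOrDefault(i, 0L));
        }
        return buckets;
    }

    public static void main(String[] args) {
        // values that survive the response_time >= 10 filter
        double[] filtered = {15, 30, 32};
        // prints {15.0=1, 20.0=0, 25.0=0, 30.0=2}
        System.out.println(histogram(filtered, 5, null, null));
        // with bounds 0..50 the keys run from 0.0 to 50.0 in steps of 5
        System.out.println(histogram(filtered, 5, 0D, 50D));
        // bounds inside the data range change nothing: no bucket is filtered out
        System.out.println(histogram(filtered, 5, 20D, 25D));
    }
}
```

The last call illustrates the point that extended_bounds does not filter buckets: bounds of 20..25 still produce the buckets 15 through 30.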
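The data used for the hard_bounds example, which only shows the buckets between min and max (note that response_time of id=5 is now 25):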
PUT /index_api_response_time/_bulk
{"index":{"_id":1}}
{"api":"/user/infos","response_time": 3}
{"index":{"_id":2}}
{"api":"/user/add"}
{"index":{"_id":3}}
{"api":"/user/update","response_time": 8}
{"index":{"_id":4}}
{"api":"/user/list","response_time": 15}
{"index":{"_id":5}}
{"api":"/user/export","response_time": 25}
{"index":{"_id":6}}
{"api":"/user/detail","response_time": 32}
GET /index_api_response_time/_search
{
  "size": 0,
  "query": {
    "range": {
      "response_time": {
        "gte": 10
      }
    }
  },
  "aggs": {
    "agg_01": {
      "histogram": {
        "field": "response_time",
        "interval": 5,
        "min_doc_count": 0,
        "hard_bounds": {
          "min": 15,
          "max": 25
        }
      },
      "aggs": {
        "a_s": {
          "sum": {
            "field": "response_time"
          }
        }
      }
    }
  }
}
// JsonData: co.elastic.clients.json.JsonData
@Test
@DisplayName("Only show the buckets between min and max - hard_bounds")
public void test05() throws IOException {
    SearchRequest request = SearchRequest.of(search -> search
            .index("index_api_response_time")
            .size(0)
            .query(query -> query.range(range -> range.field("response_time").gte(JsonData.of(10))))
            .aggregations("agg_01", agg -> agg
                    .histogram(histogram -> histogram
                            .field("response_time")
                            .interval(5D)
                            .minDocCount(0)
                            // 15..25 to match the DSL (the original code used 1..50)
                            .hardBounds(bounds -> bounds.min(15D).max(25D)))
                    .aggregations("a_s", sumAgg -> sumAgg.sum(sum -> sum.field("response_time")))));
    System.out.println("request: " + request);
    SearchResponse<String> response = client.search(request, String.class);
    System.out.println("response: " + response);
}
By default the returned buckets are sorted by their key ascending, though the order behaviour can be controlled using the order setting. Supports the same order functionality as the Terms Aggregation.
GET /index_api_response_time/_search
{
  "size": 0,
  "query": {
    "range": {
      "response_time": {
        "gte": 10
      }
    }
  },
  "aggs": {
    "agg_01": {
      "histogram": {
        "field": "response_time",
        "interval": 5,
        "order": {
          "_count": "desc"
        }
      }
    }
  }
}
// JsonData: co.elastic.clients.json.JsonData
// NamedValue: co.elastic.clients.util.NamedValue
// SortOrder: co.elastic.clients.elasticsearch._types.SortOrder
@Test
@DisplayName("Sorting - order")
public void test06() throws IOException {
    SearchRequest request = SearchRequest.of(search -> search
            .index("index_api_response_time")
            .size(0)
            .query(query -> query.range(range -> range.field("response_time").gte(JsonData.of(10))))
            .aggregations("agg_01", agg -> agg
                    .histogram(histogram -> histogram
                            .field("response_time")
                            .interval(5D)
                            // buckets with the most documents come first
                            .order(NamedValue.of("_count", SortOrder.Desc)))));
    System.out.println("request: " + request);
    SearchResponse<String> response = client.search(request, String.class);
    System.out.println("response: " + response);
}
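What ordering by `_count` descending does to a bucket list can be sketched in plain Java. The bucket counts below are illustrative, and breaking ties by ascending key is my assumption for this sketch; `OrderSketch` is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OrderSketch {

    // Returns the bucket keys sorted by doc count descending;
    // buckets with equal counts keep ascending key order (assumed tie-break).
    static List<Double> keysByCountDesc(Map<Double, Long> buckets) {
        List<Map.Entry<Double, Long>> entries = new ArrayList<>(buckets.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : Double.compare(a.getKey(), b.getKey());
        });
        List<Double> keys = new ArrayList<>();
        for (Map.Entry<Double, Long> entry : entries) {
            keys.add(entry.getKey());
        }
        return keys;
    }

    public static void main(String[] args) {
        // illustrative buckets, keyed by bucket_key with doc_count as the value
        TreeMap<Double, Long> buckets = new TreeMap<>();
        buckets.put(15.0, 1L);
        buckets.put(20.0, 0L);
        buckets.put(25.0, 0L);
        buckets.put(30.0, 2L);
        // default order (key ascending): [15.0, 20.0, 25.0, 30.0]
        System.out.println(new ArrayList<>(buckets.keySet()));
        // count descending: [30.0, 15.0, 20.0, 25.0]
        System.out.println(keysByCountDesc(buckets));
    }
}
```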
GET /index_api_response_time/_search
{
  "size": 0,
  "aggs": {
    "agg_01": {
      "histogram": {
        "field": "response_time",
        "interval": 5,
        "missing": 0
      }
    }
  }
}
@Test
@DisplayName("How to handle documents that lack the aggregated field - missing")
public void test07() throws IOException {
    // No range query here, matching the DSL: a response_time >= 10 filter would
    // exclude the document without response_time, so missing would never apply.
    SearchRequest request = SearchRequest.of(search -> search
            .index("index_api_response_time")
            .size(0)
            .aggregations("agg_01", agg -> agg
                    .histogram(histogram -> histogram
                            .field("response_time")
                            .interval(5D)
                            .missing(0D)))); // treat a missing response_time as 0
    System.out.println("request: " + request);
    SearchResponse<String> response = client.search(request, String.class);
    System.out.println("response: " + response);
}
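The effect of missing on the document with id=2 can be sketched as follows. This is a simplified model with a hypothetical `MissingSketch` helper: a `null` entry stands for a document without response_time, the offset is assumed to be 0, and empty buckets are left out for brevity.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class MissingSketch {

    // missing == null means "skip documents without a value" (the default);
    // otherwise such documents are counted as if they had that value.
    static TreeMap<Double, Long> histogram(List<Integer> values, double interval, Double missing) {
        TreeMap<Double, Long> buckets = new TreeMap<>();
        for (Integer value : values) {
            Double effective = missing;
            if (value != null) {
                effective = value.doubleValue();
            }
            if (effective == null) {
                continue; // no value and no substitute: the document is ignored
            }
            double key = Math.floor(effective / interval) * interval;
            buckets.merge(key, 1L, Long::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // ids 1..6 from the first bulk request; id=2 has no response_time
        List<Integer> responseTimes = Arrays.asList(3, null, 8, 15, 30, 32);
        // prints {0.0=1, 5.0=1, 15.0=1, 30.0=2}
        System.out.println(histogram(responseTimes, 5, null));
        // prints {0.0=2, 5.0=1, 15.0=1, 30.0=2}
        System.out.println(histogram(responseTimes, 5, 0D));
    }
}
```

With `missing` set to 0, the document without response_time is counted in the bucket with key 0, so that bucket's count goes from 1 to 2.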
https://gitee.com/huan1993/spring-cloud-parent/blob/master/es/es8-api/src/main/java/com/huan/es8/aggregations/bucket/HistogramAggs.java