Partial-word tokenizers and word-based tokenizers

Text analysis in Elasticsearch breaks input text into discrete terms (tokens). The examples below define a partial-word (n-gram) analyzer and a word-based (standard) analyzer:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_partial_word_analyzer": {
          "tokenizer": "my_ngram_tokenizer"
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_partial_word_analyzer",
  "text": "Elasticsearch"
}
The request above uses the ngram tokenizer to break "Elasticsearch" into partial words. It emits every substring of 2 to 5 characters, so the full response is long; the first few tokens look like this (truncated):
{
  "tokens": [
    {
      "token": "El",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "Ela",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "Elas",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 2
    },
    {
      "token": "Elast",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 3
    },
    {
      "token": "la",
      "start_offset": 1,
      "end_offset": 3,
      "type": "word",
      "position": 4
    },
    {
      "token": "las",
      "start_offset": 1,
      "end_offset": 4,
      "type": "word",
      "position": 5
    },
    ...
  ]
}
(the remaining 2- to 5-character n-grams, down to "ch" at offsets 11-13, are omitted here)
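To make these partial-word tokens usable for matching, the analyzer has to be attached to a field in the index mapping. A minimal sketch follows; the title field name is only an illustration, and search_analyzer is set to standard so that query strings are not themselves split into n-grams:

PUT my_index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_partial_word_analyzer",
      "search_analyzer": "standard"
    }
  }
}

With such a mapping, a document whose title contains "Elasticsearch" can be found by querying for a fragment such as "last" or "sear", because those n-grams were indexed.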
For comparison, a second index uses a plain word-based tokenizer:

PUT my_index2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_word_based_analyzer": {
          "tokenizer": "my_standard_tokenizer"
        }
      },
      "tokenizer": {
        "my_standard_tokenizer": {
          "type": "standard"
        }
      }
    }
  }
}
GET my_index2/_analyze
{
  "analyzer": "my_word_based_analyzer",
  "text": "Elasticsearch"
}
The request above uses the word-based (standard) tokenizer, which keeps "Elasticsearch" as a single term:
{
  "tokens": [
    {
      "token": "Elasticsearch",
      "start_offset": 0,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
This is a simple example; you can adjust and extend the configuration to fit your own requirements.
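For example, one common extension is to add a lowercase token filter to the word-based analyzer so that matching becomes case-insensitive ("Elasticsearch" is then indexed as "elasticsearch"). A minimal sketch, with an arbitrary index name my_index3:

PUT my_index3
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_word_based_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}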