700字范文,内容丰富有趣,生活中的好帮手!
700字范文 > ElasticSerach7.15.2插件中文分词器(IK+pinyin)

ElasticSerach7.15.2插件中文分词器(IK+pinyin)

时间:2019-09-01 16:33:43

相关推荐

ElasticSerach7.15.2插件中文分词器(IK+pinyin)

前言

如果直接使用Elasticsearch的朋友在处理中文内容的搜索时,肯定会遇到很尴尬的问题——中文词语被分成了一个一个的汉字,当用Kibana作图的时候,按照term来分组,结果一个汉字被分成了一组。

这是因为使用了Elasticsearch中默认的标准分词器,这个分词器在处理中文的时候会把中文单词切分成一个一个的汉字,因此引入中文的分词器就能解决这个问题。

默认已安装:ElasticSerach7.15.2 和 Kibana-7.15.2

插件下载

IK分词器:/medcl/elasticsearch-analysis-ik/releases/download/v7.15.2/elasticsearch-analysis-ik-7.15.2.zip

pinyin分词器:/medcl/elasticsearch-analysis-pinyin/releases/download/v7.15.2/elasticsearch-analysis-pinyin-7.15.2.zip

下载编译压缩包方法:

pingying 下载地址: /medcl/elasticsearch-analysis-pinyin中文 分词器地址:/medcl/elasticsearch-analysis-ik

将下载且编译好的zip包进行解压,进行拷贝,上传到ElasticSerach的 plugins的文件夹下(/usr/local/elasticsearch-7.15.2/plugins)

重启ElasticSearch:systemctl restart elasticsearch.service

IK分词器测试

#创建索引put index#创建映射post index/_mapping{"properties": {"content": {"type": "text","analyzer": "ik_max_word","search_analyzer": "ik_smart"}}}#添加内容post /index/_create/1{"content":"美国留给伊拉克的是个烂摊子吗"}post /index/_create/2{"content":"公安部:各地校车将享最高路权"}post /index/_create/3{"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}post /index/_create/4{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}#根据条件查询post /index/_search{"query" : { "match" : { "content" : "中国" }},"highlight" : {"pre_tags" : ["<tag1>", "<tag2>"],"post_tags" : ["</tag1>", "</tag2>"],"fields" : {"content" : {}}}}

返回结果:

{"took" : 4,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 0.642793,"hits" : [{"_index" : "index","_type" : "_doc","_id" : "3","_score" : 0.642793,"_source" : {"content" : "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"},"highlight" : {"content" : ["中韩渔警冲突调查:韩警平均每天扣1艘<tag1>中国</tag1>渔船"]}},{"_index" : "index","_type" : "_doc","_id" : "4","_score" : 0.642793,"_source" : {"content" : "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"},"highlight" : {"content" : ["<tag1>中国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"]}}]}}

拼音分词器测试

1、使用自定义拼音分析器创建索引

PUT /medcl/{"settings" : {"analysis" : {"analyzer" : {"pinyin_analyzer" : {"tokenizer" : "my_pinyin"}},"tokenizer" : {"my_pinyin" : {"type" : "pinyin","keep_separate_first_letter" : false,"keep_full_pinyin" : true,"keep_original" : true,"limit_first_letter_length" : 16,"lowercase" : true,"remove_duplicated_term" : true}}}}}

2、测试分析器,分析一个中文名称,如:刘德华

GET /medcl/_analyze{"text": ["刘德华"],"analyzer": "pinyin_analyzer"}

{"tokens" : [{"token" : "liu","start_offset" : 0,"end_offset" : 1,"type" : "word","position" : 0},{"token" : "de","start_offset" : 1,"end_offset" : 2,"type" : "word","position" : 1},{"token" : "hua","start_offset" : 2,"end_offset" : 3,"type" : "word","position" : 2},{"token" : "刘德华","start_offset" : 0,"end_offset" : 3,"type" : "word","position" : 3},{"token" : "ldh","start_offset" : 0,"end_offset" : 3,"type" : "word","position" : 4}]}

3、创建映射

POST /medcl/_mapping{"properties": {"name": {"type": "keyword","fields": {"pinyin": {"type": "text","store": false,"term_vector": "with_offsets","analyzer": "pinyin_analyzer","boost": 10}}}}}

4、添加内容

POST /medcl/_create/andy{"name":"刘德华"}

5、测试搜索

#根据条件查询GET /medcl/_search{"query": {"match": {"name.pinyin": "liu"}}}

6、使用过滤

PUT /medcl1/{"settings" : {"analysis" : {"analyzer" : {"user_name_analyzer" : {"tokenizer" : "whitespace","filter" : "pinyin_first_letter_and_full_pinyin_filter"}},"filter" : {"pinyin_first_letter_and_full_pinyin_filter" : {"type" : "pinyin","keep_first_letter" : true,"keep_full_pinyin" : false,"keep_none_chinese" : true,"keep_original" : false,"limit_first_letter_length" : 16,"lowercase" : true,"trim_whitespace" : true,"keep_none_chinese_in_first_letter" : true}}}}}

Token Test:刘德华 张学友 郭富城 黎明 四大天王

GET /medcl1/_analyze{"text": ["刘德华 张学友 郭富城 黎明 四大天王"],"analyzer": "user_name_analyzer"}

{"tokens" : [{"token" : "ldh","start_offset" : 0,"end_offset" : 3,"type" : "word","position" : 0},{"token" : "zxy","start_offset" : 4,"end_offset" : 7,"type" : "word","position" : 1},{"token" : "gfc","start_offset" : 8,"end_offset" : 11,"type" : "word","position" : 2},{"token" : "lm","start_offset" : 12,"end_offset" : 14,"type" : "word","position" : 3},{"token" : "sdtw","start_offset" : 15,"end_offset" : 19,"type" : "word","position" : 4}]}

7、用于短语查询

PUT /medcl2/{"settings" : {"analysis" : {"analyzer" : {"pinyin_analyzer" : {"tokenizer" : "my_pinyin"}},"tokenizer" : {"my_pinyin" : {"type" : "pinyin","keep_first_letter":false,"keep_separate_first_letter" : false,"keep_full_pinyin" : true,"keep_original" : false,"limit_first_letter_length" : 16,"lowercase" : true}}}}}GET /medcl2/_search{"query": {"match_phrase": {"name.pinyin": "刘德华"}}}

PUT /medcl3/{"settings" : {"analysis" : {"analyzer" : {"pinyin_analyzer" : {"tokenizer" : "my_pinyin"}},"tokenizer" : {"my_pinyin" : {"type" : "pinyin","keep_first_letter":true,"keep_separate_first_letter" : true,"keep_full_pinyin" : true,"keep_original" : false,"limit_first_letter_length" : 16,"lowercase" : true}}}}}POST /medcl3/_mapping{"properties": {"name": {"type": "keyword","fields": {"pinyin": {"type": "text","store": false,"term_vector": "with_offsets","analyzer": "pinyin_analyzer","boost": 10}}}}}GET /medcl3/_analyze{"text": ["刘德华"],"analyzer": "pinyin_analyzer"}POST /medcl3/_create/andy{"name":"刘德华"}GET /medcl3/_search{"query": {"match_phrase": {"name.pinyin": "刘德h"}}}GET /medcl3/_search{"query": {"match_phrase": {"name.pinyin": "刘dh"}}}GET /medcl3/_search{"query": {"match_phrase": {"name.pinyin": "liudh"}}}GET /medcl3/_search{"query": {"match_phrase": {"name.pinyin": "liudeh"}}}GET /medcl3/_search{"query": {"match_phrase": {"name.pinyin": "liude华"}}}

本内容不代表本网观点和政治立场,如有侵犯你的权益请联系我们处理。
网友评论
网友评论仅供其表达个人看法,并不表明网站立场。