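1. Applications of Pinyin Analysis
Pinyin analysis is common in everyday life, and you may well use it every day. Open Taobao and type the pinyin "zhonghua": the suggestions under the search box list products containing "中华", the Chinese text that "zhonghua" corresponds to.
Pinyin analysis suggests the matching Chinese text for pinyin input, which improves the search experience and speeds up searching. The rest of this post describes how to configure and use pinyin + IK analysis in Elasticsearch 5.1.1.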
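2. Downloading and Installing the IK Analyzer
I won't say much by way of introduction. In short, IK is a very widely used Chinese analyzer with good segmentation quality, and ES projects that need Chinese analysis use IK nine times out of ten.
Download: /medcl/elasticsearch-analysis-ik
Stop Elasticsearch before configuring, and restart it once configuration is complete.
The IK version must match your ES version; the README explains this. I am using ES 5.1.1 with IK 5.1.1. (You may wonder why IK jumped from 1.x straight to 5.x. Elastic unified its version numbers: previously ES was at 2.x, Logstash at 2.x, Kibana at 4.x, and IK at 1.x, which was confusing. Since 5.0 the version numbers are aligned, so with ES 5.1.1 you simply use version 5.1.1 of the other components.)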
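The version match can be sanity-checked before restarting by reading the plugin-descriptor.properties file that ships inside the release zip. The following is a minimal sketch, assuming the descriptor carries an `elasticsearch.version` key as 5.x plugin descriptors do; the function names and the sample descriptor text are illustrative, not taken from the IK project.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def plugin_matches_es(descriptor_text, es_version):
    """Return True if the descriptor targets exactly this ES version."""
    props = parse_properties(descriptor_text)
    return props.get("elasticsearch.version") == es_version


# Hypothetical descriptor content, for illustration only.
sample = """
# Elasticsearch plugin descriptor file
name=analysis-ik
version=5.1.1
elasticsearch.version=5.1.1
"""

print(plugin_matches_es(sample, "5.1.1"))
print(plugin_matches_es(sample, "5.2.0"))
```

In practice the text would come from the real file, for example `open("plugins/ik/plugin-descriptor.properties").read()`, with the path adjusted to your installation.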
After downloading, change into the elasticsearch-analysis-ik-master directory and build the package with Maven (install Maven first if you don't have it):
mvn package
When the build succeeds it creates a target folder. In elasticsearch-analysis-ik-master/target/releases, find elasticsearch-analysis-ik-5.1.1.zip; this is the installation file we need. Unzipping it yields the following:
commons-codec-1.9.jar
commons-logging-1.2.jar
config
elasticsearch-analysis-ik-5.1.1.jar
httpclient-4.5.2.jar
httpcore-4.4.4.jar
plugin-descriptor.properties
Then create a folder named ik under elasticsearch-5.1.1/plugins and copy the unzipped contents of elasticsearch-analysis-ik-5.1.1.zip into elasticsearch-5.1.1/plugins/ik.
[Screenshot: contents of elasticsearch-5.1.1/plugins/ik]
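The unzip-and-copy step can also be scripted. This is only a sketch of the manual procedure described above, using Python's standard `zipfile` module; `install_plugin` is a hypothetical helper name, and the paths in the usage note are the ones from this post.

```python
import zipfile
from pathlib import Path


def install_plugin(zip_path, es_home, plugin_name):
    """Extract a plugin release zip into <es_home>/plugins/<plugin_name>.

    Returns the sorted names of the top-level entries in the plugin folder.
    """
    target = Path(es_home) / "plugins" / plugin_name
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return sorted(p.name for p in target.iterdir())
```

For the IK plugin this would be called as `install_plugin("elasticsearch-analysis-ik-5.1.1.zip", "elasticsearch-5.1.1", "ik")`, with Elasticsearch stopped while the files are copied.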
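3. Downloading and Installing the pinyin Analyzer
Download address for the pinyin analyzer:
/medcl/elasticsearch-analysis-pinyin
Installation is the same as for IK: download, build, and add to ES. I won't repeat the steps here.
[Screenshot: final plugin configuration]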
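4. Analyzer Tests
Once IK and pinyin are configured, restart ES. If ES reports errors during the restart, something is wrong with the installation; if it starts without errors, the plugins were loaded successfully.
4.1 IK analyzer test
Create an index:
curl -XPUT "http://localhost:9200/index"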
Test the analyzer:
curl -XPOST "http://localhost:9200/index/_analyze?analyzer=ik_max_word&text=中华人民共和国国歌"
Result:
{
  "tokens": [
    {"token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0},
    {"token": "中华人民", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 1},
    {"token": "中华", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 2},
    {"token": "华人", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 3},
    {"token": "人民共和国", "start_offset": 2, "end_offset": 7, "type": "CN_WORD", "position": 4},
    {"token": "人民", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 5},
    {"token": "共和国", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 6},
    {"token": "共和", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 7},
    {"token": "国", "start_offset": 6, "end_offset": 7, "type": "CN_CHAR", "position": 8},
    {"token": "国歌", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 9}
  ]
}
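Because the analyzed text travels in the URL query string, some shells and HTTP clients need the Chinese characters percent-encoded. The sketch below builds such a URL with the standard library; `analyze_url` is a hypothetical helper, and the `analyzer` and `text` parameters are the ones used in the curl command above. It only constructs the URL and does not contact a server.

```python
from urllib.parse import urlencode


def analyze_url(base, index, analyzer, text):
    """Build an _analyze URL with the text percent-encoded as UTF-8."""
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"{base}/{index}/_analyze?{query}"


url = analyze_url("http://localhost:9200", "index", "ik_max_word", "中华")
print(url)
# http://localhost:9200/index/_analyze?analyzer=ik_max_word&text=%E4%B8%AD%E5%8D%8E
```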
Using the ik_smart analyzer:
curl -XPOST "http://localhost:9200/index/_analyze?analyzer=ik_smart&text=中华人民共和国国歌"
Result:
{
  "tokens": [
    {"token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0},
    {"token": "国歌", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 1}
  ]
}
[Screenshot: IK analysis results]
4.2 pinyin analyzer test
Test the pinyin analyzer:
curl -XPOST "http://localhost:9200/index/_analyze?analyzer=pinyin&text=张学友"
Result:
{
  "tokens": [
    {"token": "zhang", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0},
    {"token": "xue", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1},
    {"token": "you", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2},
    {"token": "zxy", "start_offset": 0, "end_offset": 3, "type": "word", "position": 3}
  ]
}
5. IK + pinyin Analyzer Configuration
5.1 Creating the index and configuring the analyzer
Create an index and define its analysis settings:
curl -XPUT "http://localhost:9200/medcl/" -d'
{
  "index": {
    "analysis": {
      "analyzer": {
        "ik_pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["my_pinyin", "word_delimiter"]
        }
      },
      "filter": {
        "my_pinyin": {
          "type": "pinyin",
          "first_letter": "prefix",
          "padding_char": " "
        }
      }
    }
  }
}'
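Deeply nested settings like these are easy to mistype inside a shell string. As a rough sketch, the same body can be built as a Python dictionary and checked before it is sent; the values mirror the curl command above, while `build_index_settings` and `undefined_filters` are hypothetical helper names and the check covers only filter references.

```python
import json


def build_index_settings():
    """Return the same settings body as the curl command in section 5.1."""
    return {
        "index": {
            "analysis": {
                "analyzer": {
                    "ik_pinyin_analyzer": {
                        "type": "custom",
                        "tokenizer": "ik_smart",
                        "filter": ["my_pinyin", "word_delimiter"],
                    }
                },
                "filter": {
                    "my_pinyin": {
                        "type": "pinyin",
                        "first_letter": "prefix",
                        "padding_char": " ",
                    }
                },
            }
        }
    }


def undefined_filters(settings, builtin=("word_delimiter",)):
    """List (analyzer, filter) pairs whose filter is neither defined
    in the settings nor in the given tuple of built-in filter names."""
    analysis = settings["index"]["analysis"]
    defined = set(analysis.get("filter", {})) | set(builtin)
    missing = []
    for name, analyzer in analysis["analyzer"].items():
        for filter_name in analyzer.get("filter", []):
            if filter_name not in defined:
                missing.append((name, filter_name))
    return missing


settings = build_index_settings()
body = json.dumps(settings, ensure_ascii=False)  # request body for the PUT
print(undefined_filters(settings))  # [] means every filter is accounted for
```

The custom analyzer chains the ik_smart tokenizer with the my_pinyin filter, so a typo in either name would only surface when the index is created; a local check like this catches the filter case earlier.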
Create a type and set its mapping:
curl -XPOST http://localhost:9200/medcl/folks/_mapping -d'
{
  "folks": {
    "properties": {
      "name": {
        "type": "keyword",
        "fields": {
          "pinyin": {
            "type": "text",
            "store": "no",
            "term_vector": "with_positions_offsets",
            "analyzer": "ik_pinyin_analyzer",
            "boost": 10
          }
        }
      }
    }
  }
}'
5.2 Indexing test documents
Index two test documents.
Document 1:
curl -XPOST http://localhost:9200/medcl/folks/andy -d'{"name":"刘德华"}'
Document 2:
curl -XPOST http://localhost:9200/medcl/folks/tina -d'{"name":"中华人民共和国国歌"}'
5.3 Test (1): pinyin analysis
All four of the following commands match "刘德华":
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:liu"
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:de"
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:hua"
curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:ldh"
5.4 Test (2): IK analysis
curl -XPOST "http://localhost:9200/medcl/_search?pretty" -d'
{
  "query": {
    "match": {"name.pinyin": "国歌"}
  },
  "highlight": {
    "fields": {"name.pinyin": {}}
  }
}'
Response:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {"total" : 5, "successful" : 5, "failed" : 0},
  "hits" : {
    "total" : 1,
    "max_score" : 16.698704,
    "hits" : [
      {
        "_index" : "medcl",
        "_type" : "folks",
        "_id" : "tina",
        "_score" : 16.698704,
        "_source" : {"name" : "中华人民共和国国歌"},
        "highlight" : {
          "name.pinyin" : ["<em>中华人民共和国</em><em>国歌</em>"]
        }
      }
    ]
  }
}
The match on "国歌" shows that the IK tokenizer is taking effect.
5.5 Test (3): pinyin + IK analysis
curl -XPOST "http://localhost:9200/medcl/_search?pretty" -d'
{
  "query": {
    "match": {"name.pinyin": "zhonghua"}
  },
  "highlight": {
    "fields": {"name.pinyin": {}}
  }
}'
Response:
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {"total" : 5, "successful" : 5, "failed" : 0},
  "hits" : {
    "total" : 2,
    "max_score" : 5.9814634,
    "hits" : [
      {
        "_index" : "medcl",
        "_type" : "folks",
        "_id" : "tina",
        "_score" : 5.9814634,
        "_source" : {"name" : "中华人民共和国国歌"},
        "highlight" : {
          "name.pinyin" : ["<em>中华人民共和国</em>国歌"]
        }
      },
      {
        "_index" : "medcl",
        "_type" : "folks",
        "_id" : "andy",
        "_score" : 2.2534127,
        "_source" : {"name" : "刘德华"},
        "highlight" : {
          "name.pinyin" : ["<em>刘德华</em>"]
        }
      }
    ]
  }
}
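When checking responses like the one above, it helps to pull out just the highlighted text per document. The following sketch assumes the response has already been parsed into a dictionary (for example with `json.loads`); `highlighted_terms` is a hypothetical helper, and the sample data is a trimmed copy of the response shown above.

```python
import re


def highlighted_terms(response, field):
    """Map each hit's _id to the text wrapped in <em> tags for one field."""
    result = {}
    for hit in response["hits"]["hits"]:
        fragments = hit.get("highlight", {}).get(field, [])
        terms = []
        for fragment in fragments:
            terms.extend(re.findall(r"<em>(.*?)</em>", fragment))
        result[hit["_id"]] = terms
    return result


# Trimmed copy of the "zhonghua" search response.
response = {
    "hits": {
        "total": 2,
        "hits": [
            {"_id": "tina",
             "highlight": {"name.pinyin": ["<em>中华人民共和国</em>国歌"]}},
            {"_id": "andy",
             "highlight": {"name.pinyin": ["<em>刘德华</em>"]}},
        ],
    }
}

print(highlighted_terms(response, "name.pinyin"))
# {'tina': ['中华人民共和国'], 'andy': ['刘德华']}
```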
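[Screenshot: search results]
With the pinyin analyzer set up this way, searches have to target the field with the .pinyin suffix (name.pinyin); searching the original field returns no results: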