Elasticsearch (24): IK Analyzer Configuration Files and Custom Dictionaries

Date: 2018-06-25 18:07:49

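1. IK configuration files

IK configuration directory: es/plugins/ik/config

IKAnalyzer.cfg.xml: configures the custom dictionaries

main.dic: IK's built-in Chinese dictionary, with more than 270,000 entries; any term listed here is kept together as a single token

quantifier.dic: measure words and units

suffix.dic: common suffixes

surname.dic: Chinese surnames

stopword.dic: English stop words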

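The two most important built-in IK files

main.dic: the built-in Chinese vocabulary; IK segments text according to the terms in this file

stopword.dic: the English stop words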
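
To make the idea of dictionary-driven segmentation concrete, here is a toy sketch in Python. It is not IK's actual algorithm, only an illustration under simple assumptions: the hypothetical `segment` function emits every substring found in the dictionary and falls back to single characters when nothing longer matches.

```python
def segment(text, dictionary):
    """Toy matcher: at each offset emit dictionary words (longest first),
    always falling back to the single character at that offset."""
    tokens = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            piece = text[start:end]
            # Keep the piece if the dictionary knows it, or if it is one character.
            if piece in dictionary or end == start + 1:
                tokens.append(piece)
    return tokens


without_entry = segment("喊麦", set())       # dictionary does not know the word
with_entry = segment("喊麦", {"喊麦"})        # dictionary contains the word
print(without_entry)  # ['喊', '麦']
print(with_entry)     # ['喊麦', '喊', '麦']
```

The only difference between the two calls is the content of the dictionary, which is why editing the dictionary files changes how text is tokenized.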

Stop words (stopword)

Examples: a, the, and, at, but, ... Stop words are normally discarded during analysis and are never written to the inverted index.
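
The effect of stop words on the inverted index can be sketched in a few lines of Python. This is a simplified illustration, not how Elasticsearch is implemented: the `STOPWORDS` set and the `build_inverted_index` function are hypothetical names, and the text is split on whitespace only.

```python
from collections import defaultdict

# Hypothetical stop-word list containing the English examples named above.
STOPWORDS = {"a", "the", "and", "at", "but"}


def build_inverted_index(docs):
    """Map each term that is not a stop word to the sorted IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if term in STOPWORDS:
                continue  # stop words are dropped before indexing
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {1: "the cat and the dog", 2: "a dog at home"}
index = build_inverted_index(docs)
print(index)  # {'cat': [1], 'dog': [1, 2], 'home': [2]}
```

Because "the", "and", "a", and "at" never reach the index, a search for any of them alone would match nothing.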

2. Custom dictionaries

IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
</properties>
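
As a quick way to check which dictionary files a configuration refers to, the following Python sketch parses the XML with the standard library and splits each entry on semicolons. The `read_dictionary_paths` helper is a hypothetical name, and the configuration is embedded as a string here; in practice you would read it from the IK config directory.

```python
import xml.etree.ElementTree as ET

# Same structure as the IKAnalyzer.cfg.xml shown above.
CONFIG = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
</properties>
"""


def read_dictionary_paths(xml_text):
    """Return {entry key: [dictionary paths]} from an IK-style properties file."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    paths = {}
    for entry in root.iter("entry"):
        value = (entry.text or "").strip()
        # Several dictionary files are separated by semicolons.
        paths[entry.get("key")] = [p.strip() for p in value.split(";") if p.strip()]
    return paths


paths = read_dictionary_paths(CONFIG)
print(paths["ext_dict"])       # ['custom/mydict.dic', 'custom/single_word_low_freq.dic']
print(paths["ext_stopwords"])  # ['custom/ext_stopword.dic']
```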

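(1) Build your own dictionary: new buzzwords appear every year, such as 网红, 蓝瘦香菇, 喊麦, and 鬼畜, and they are usually missing from IK's built-in dictionary.

Add such new terms to the IK dictionary file custom/mydict.dic.

(2) Build your own stop word dictionary: for words such as 了, 的, 啥, and 么, we may not want to build index entries or let users search on them.

custom/ext_stopword.dic already contains the common Chinese stop words; you can add your own to it.

After adding terms, you must restart es for the changes to take effect.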
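
Maintaining the dictionary by hand invites duplicates, so a small helper can be useful. The sketch below assumes the usual IK dictionary layout of one term per line in UTF-8; the `add_terms` function is a hypothetical helper, and a temporary directory stands in for the real custom/ folder so the example can run anywhere.

```python
import tempfile
from pathlib import Path


def add_terms(dict_path, new_terms):
    """Append terms that are not yet in the dictionary file; return the terms added."""
    path = Path(dict_path)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    existing = {line.strip() for line in text.splitlines() if line.strip()}
    added = []
    for term in new_terms:
        term = term.strip()
        if term and term not in existing:
            existing.add(term)
            added.append(term)
    if added:
        path.parent.mkdir(parents=True, exist_ok=True)
        # Start on a fresh line if the file does not end with a newline.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        with path.open("a", encoding="utf-8") as f:
            f.write(prefix + "\n".join(added) + "\n")
    return added


# Demo against a temporary stand-in for custom/mydict.dic.
with tempfile.TemporaryDirectory() as tmp:
    dict_path = Path(tmp) / "custom" / "mydict.dic"
    first = add_terms(dict_path, ["喊麦", "鬼畜", "喊麦"])
    second = add_terms(dict_path, ["喊麦", "网红"])
    content = dict_path.read_text(encoding="utf-8")

print(first)   # ['喊麦', '鬼畜']
print(second)  # ['网红']
```

The script only edits the file; the restart described above is still needed before the new terms are picked up.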

3. Analyzing text with the custom dictionary

Before adding “喊麦” to ik\config\custom\mydict.dic, run the analyzer:

GET /my_index/_analyze
{
  "text": "喊麦",
  "analyzer": "ik_max_word"
}

Response:

{
  "tokens": [
    {"token": "喊", "start_offset": 0, "end_offset": 1, "type": "CN_WORD", "position": 0},
    {"token": "麦", "start_offset": 1, "end_offset": 2, "type": "CN_WORD", "position": 1}
  ]
}

After adding “喊麦” to mydict.dic and restarting es, test again:

GET /my_index/_analyze
{
  "text": "喊麦",
  "analyzer": "ik_max_word"
}

Response:

{
  "tokens": [
    {"token": "喊麦", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0},
    {"token": "喊", "start_offset": 0, "end_offset": 1, "type": "CN_WORD", "position": 1},
    {"token": "麦", "start_offset": 1, "end_offset": 2, "type": "CN_WORD", "position": 2}
  ]
}
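
The same check can be scripted. The Python sketch below is one possible approach: `analyze` posts to the `_analyze` endpoint and assumes an unsecured node at http://localhost:9200 (an assumption, adjust it for your cluster), while `token_texts` pulls the token strings out of a response. The demo at the bottom does not contact a cluster; it reuses the two responses shown above, trimmed to the fields it needs.

```python
import json
import urllib.request


def analyze(text, analyzer="ik_max_word",
            base_url="http://localhost:9200", index="my_index"):
    """Call the _analyze API and return the parsed JSON response.

    base_url is an assumed local, unsecured node; this function needs a live cluster.
    """
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def token_texts(response):
    """Return the token strings of an _analyze response, ordered by position."""
    ordered = sorted(response["tokens"], key=lambda t: t["position"])
    return [t["token"] for t in ordered]


# Responses copied from the two runs above.
before = {"tokens": [
    {"token": "喊", "start_offset": 0, "end_offset": 1, "position": 0},
    {"token": "麦", "start_offset": 1, "end_offset": 2, "position": 1},
]}
after = {"tokens": [
    {"token": "喊麦", "start_offset": 0, "end_offset": 2, "position": 0},
    {"token": "喊", "start_offset": 0, "end_offset": 1, "position": 1},
    {"token": "麦", "start_offset": 1, "end_offset": 2, "position": 2},
]}

gained = [t for t in token_texts(after) if t not in token_texts(before)]
print(gained)  # ['喊麦']
```

With a running cluster, `token_texts(analyze("喊麦"))` would replace the hard-coded responses.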
