Natural Language Processing: word2vec Project in Practice, NLP Theoretical Foundations

Date: 2020-06-21 17:59:48

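NLP Theoretical Foundations

Corpus

NLTK: pip install nltk

Installation guide: "NLTK安装方法" (NLTK installation method), blog post by 一脑子RMC136 on CSDN

Text processing pipeline

Sentence → preprocessing → tokenization (tokenize) → feature engineering (make features) → machine learning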
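As a rough illustration of the pipeline above, the sketch below chains the first three stages using only the Python standard library. The function names and the bag-of-words feature are assumptions made for this example; a real project would substitute a proper tokenizer and richer features before the machine learning step.

```python
import re
from collections import Counter


def preprocess(sentence):
    """Lowercase the text and replace anything that is not a letter, digit, or space."""
    return re.sub(r"[^a-z0-9\s]", " ", sentence.lower())


def tokenize(text):
    """Split on whitespace (a stand-in for a real tokenizer)."""
    return text.split()


def make_features(tokens):
    """Bag-of-words counts, one possible feature representation."""
    return Counter(tokens)


features = make_features(tokenize(preprocess("Hello, world! Hello NLP.")))
print(features)
```

The resulting counts are what a downstream model would consume as input features.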

Tokenization (Tokenize)

Split a long sentence into small "meaningful" pieces.

English

from nltk.tokenize import word_tokenize

sentences = 'hello world'
token = word_tokenize(sentences)
print(token)

['hello', 'world']

Chinese

import jieba

seg_list = jieba.cut('我到北京清华大学', cut_all=True)
print('full mode:', '/'.join(seg_list))  # full mode
seg_list = jieba.cut('我到北京清华大学', cut_all=False)
print('default mode:', '/'.join(seg_list))  # precise mode
seg_list = jieba.cut('我到北京清华大学')
print('/'.join(seg_list))  # precise mode is the default
seg_list = jieba.cut_for_search('我到北京清华大学')
print('/'.join(seg_list))  # search engine mode

full mode: 我/到/北京/清华/清华大学/华大/大学
default mode: 我/到/北京/清华大学
我/到/北京/清华大学
我/到/北京/清华/华大/大学/清华大学

Preprocessing

Social media language

Example:

from nltk.tokenize import word_tokenize

tweet = 'RT @angelababy : love you baby! :D http://ah.love #168cm'
print(word_tokenize(tweet))

['RT', '@', 'angelababy', ':', 'love', 'you', 'baby', '!', ':', 'D', 'http', ':', '//ah.love', '#', '168cm']

How do we handle this?

import re: regular expressions, commonly used for string processing

Reference table: /zh/regref.htm

# re.compile turns a pattern into a reusable regex object; text that matches is picked out
import re

# [] matches any one of the enclosed characters; a trailing ? makes the group optional
emotions_str = r"""
    (?:
        [:=;]                # eyes
        [oO\-]?              # nose (optional)
        [D\)\]\<\]/\\OpP]    # mouth
    )"""

regex_str = [
    emotions_str,
    r'<[^>]+>',  # HTML tags
    r'(?:@[\w_]+)',  # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",  # hashtags
    r'http[s]?://(?:[a-z][0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',  # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])",  # words containing - and '
    r'(?:[\w_]+)',  # other words
    r'(?:\s)',  # whitespace (kept as tokens; unmatched punctuation is dropped)
]

emotion_re = re.compile(r'^' + emotions_str + '$', re.VERBOSE | re.IGNORECASE)
tokens_re = re.compile(r'(' + '|'.join(regex_str) + ')', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        # Lowercase everything except emoticons such as :D
        tokens = [token if emotion_re.search(token) else token.lower() for token in tokens]
    return tokens

tweet = 'RT @angelababy: love you baby! :D http://ah.love #168cm'
print(preprocess(tweet))

['RT', ' ', '@angelababy', ' ', 'love', ' ', 'you', ' ', 'baby', ' ', ':D', ' ', 'http://ah.love', ' ', '#168cm']

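The many forms of a word

Inflection (does not change the part of speech): walk→walking→walked
Derivation (changes the part of speech): nation (noun)→national (adjective)→nationalize (verb)

Stemming

Generally speaking, stemming chops off the inflectional endings that do not change the part of speech, leaving the stem.

walking→walk, walked→walk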
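To make the idea of chopping off endings concrete, here is a deliberately naive suffix stripper. It is only a sketch: the suffix list and the minimum stem length of three characters are assumptions chosen for this example, and real stemmers such as the ones below apply far more rules.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(naive_stem("walking"))  # walk
print(naive_stem("walked"))   # walk
print(naive_stem("sing"))     # sing (too short to strip "ing")
```

The length guard is what stops "sing" from being reduced to "s", which hints at why practical stemmers need many special cases.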

PorterStemmer

from nltk.stem.porter import PorterStemmer

P = PorterStemmer()  # instantiate the class before using it
print(P.stem('maximum'))
print(P.stem('walking'))

maximum
walk

SnowballStemmer

from nltk.stem import SnowballStemmer

S = SnowballStemmer('english')
print(S.stem('maximum'))
print(S.stem('walking'))

maxim
walk

LancasterStemmer

from nltk.stem.lancaster import LancasterStemmer

L = LancasterStemmer()
print(L.stem('maximum'))
print(L.stem('walking'))

maxim
walk

Lemmatization

Map every inflected variant of a word to a single form, by looking it up in a corpus.

went→go, are→be

The corpus needs to be updated regularly.
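The lookup-based nature of lemmatization, and the reason the underlying corpus needs updating, can be sketched with a tiny hand-written table. The table contents and the function name here are hypothetical and exist only for this illustration; WordNet plays this role in the NLTK example.

```python
# Hypothetical miniature lemma table keyed by (word, part of speech).
LEMMA_TABLE = {
    ("went", "v"): "go",
    ("are", "v"): "be",
    ("is", "v"): "be",
    ("mice", "n"): "mouse",
}


def lookup_lemma(word, pos):
    """Return the lemma if the table knows the word, else the word unchanged."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(lookup_lemma("went", "v"))    # go
print(lookup_lemma("are", "v"))     # be
print(lookup_lemma("yeeted", "v"))  # yeeted (not in the table)
```

A word missing from the table passes through untouched, which is why coverage depends on how current the corpus is.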

from nltk.stem import WordNetLemmatizer

W = WordNetLemmatizer()
print(W.lemmatize('went', pos='v'))
print(W.lemmatize('are', pos='v'))

go
be

Part-of-speech tagging

import nltk

text = nltk.word_tokenize('what does the fox say')
print(text)
tag = nltk.pos_tag(text)
print(tag)

['what', 'does', 'the', 'fox', 'say']
[('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]

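Stopwords (words that are too ambiguous)

A single "he" can refer to a thousand different people and a single "the" to a thousand different things, so these words are deleted.

If only word meaning matters, stopwords can be removed. For plagiarism detection or for checking whether a sentence reads fluently, they must be kept.

Full stopword list (English): http://www.ranks.nl/stopwords

import nltk
from nltk.corpus import stopwords

# Tokenize first to get a word_list, then filter it
word_list = nltk.word_tokenize('what does the fox say')
print(word_list)
stop_words = set(stopwords.words('english'))  # build the set once instead of per word
filtered_words = [word for word in word_list if word not in stop_words]
print(filtered_words)

['what', 'does', 'the', 'fox', 'say']
['fox', 'say']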

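Classic NLP applications of NLTK

Sentiment analysis, text similarity, text classification

Sentiment analysis

Keyword scoring

Keyword scoring table (AFINN-111): http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010

import nltk

words = nltk.word_tokenize('bad')
sentiment_dictionary = {}
# Each line of AFINN-111.txt is "word<TAB>score"
with open('AFINN/AFINN-111.txt') as f:
    for line in f:
        word, score = line.split('\t')
        sentiment_dictionary[word] = int(score)
# Words missing from the table contribute 0
total_score = sum(sentiment_dictionary.get(word, 0) for word in words)
print(total_score)

-3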

Sentiment analysis with ML

What about new words? What about special vocabulary? What about deeper meaning?

from nltk.classify import NaiveBayesClassifier

# A small made-up training set
s1 = 'this is a good book'
s2 = 'this is a awesome book'
s3 = 'this is a bad book'
s4 = 'this is a terrible book'

def preprocess(s):
    # Split the sentence into words with split(); tokenize is not needed for such a simple example.
    # The result is a {fname: fval} dict. True is the simplest value to store;
    # a more advanced approach could use word2vec to produce fval.
    return {word: True for word in s.lower().split()}

# Conceptually, training sees one long word list (this, is, terrible, good, awesome, bad, book),
# and s1 corresponds to the vector (1, 1, 0, 1, 0, 0, 1).
training_data = [[preprocess(s1), 'pos'],
                 [preprocess(s2), 'pos'],
                 [preprocess(s3), 'neg'],
                 [preprocess(s4), 'neg']]
model = NaiveBayesClassifier.train(training_data)
print(model.classify(preprocess('this is a good book')))

pos

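Text similarity

Represent text features by element frequency

Convert each text into a vector of the same length, then use the cosine of the angle between the vectors to judge similarity (the higher the similarity, the smaller the angle): take the dot product and divide it by the product of the vector norms.

$$similarity = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}$$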

import nltk
from nltk import FreqDist

# Build a small corpus first.
# Note the trailing spaces: without them the adjacent string literals would be
# glued together into tokens such as 'sentencethis'.
corpus = 'this is my sentence ' \
         'this is my life ' \
         'this is my day'

# Any preprocessing could be applied here
tokens = nltk.word_tokenize(corpus)
print(tokens)

# Use NLTK's FreqDist to count how often each word occurs
fdist = FreqDist(tokens)
print(fdist['is'])

# Take the 50 most common words to get a reference frequency table
standard_freq_vector = fdist.most_common(50)
size = len(standard_freq_vector)
print(standard_freq_vector)

# Record the position of each word, ordered by frequency
def position_lookup(v):
    res = {}
    counter = 0
    for word in v:
        res[word[0]] = counter
        counter += 1
    return res

# Build the word-to-position lookup table
standard_position_dict = position_lookup(standard_freq_vector)
print(standard_position_dict)

# A new sentence
sentence = 'this is cool'
# Create a vector of the same size as the reference vector
freq_vector = [0] * size
tokens = nltk.word_tokenize(sentence)
for word in tokens:
    try:
        # If the word is in our vocabulary, add 1 at its position
        freq_vector[standard_position_dict[word]] += 1
    except KeyError:
        # If it is a new word, skip it
        continue
print(freq_vector)

['this', 'is', 'my', 'sentence', 'this', 'is', 'my', 'life', 'this', 'is', 'my', 'day']
3
[('this', 3), ('is', 3), ('my', 3), ('sentence', 1), ('life', 1), ('day', 1)]
{'this': 0, 'is': 1, 'my': 2, 'sentence': 3, 'life': 4, 'day': 5}
[1, 1, 0, 0, 0, 0]
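Once two sentences have been turned into frequency vectors of the same length, as above, the cosine formula can be applied to them directly. The following is a minimal standard-library sketch; the function name and the two sample vectors are assumptions for illustration, and returning 0.0 for an all-zero vector is a convention chosen here rather than part of the formula.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


v1 = [1, 1, 0, 0]  # hypothetical frequency vector of one sentence
v2 = [1, 1, 1, 0]  # hypothetical frequency vector of another sentence
print(cosine_similarity(v1, v2))
```

Vectors pointing in the same direction score close to 1, and vectors with no words in common score 0.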

Text classification

TF-IDF

TF (Term Frequency): how frequently a word appears in a document.

$$TF(t) = \frac{\text{number of times } t \text{ appears in the document}}{\text{total number of words in the document}}$$

IDF (Inverse Document Frequency): measures how important a term is. Words such as 'is' and 'the' carry little importance, so rare words are given a higher weight and common words a lower one.

$$IDF(t) = \ln\left(\frac{\text{total number of documents}}{\text{number of documents containing } t}\right)$$

$$TF\text{-}IDF(t) = TF(t) \times IDF(t)$$

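An example

A document contains 100 words, and the word baby appears 3 times.

TF(baby) = 3/100 = 0.03

Suppose there are 10M documents and baby appears in 1,000 of them.

IDF(baby) = ln(10,000,000/1,000) = ln(10,000) ≈ 9.21 (a base-10 logarithm would give exactly 4)

TF-IDF(baby) = TF(baby) × IDF(baby) ≈ 0.03 × 9.21 ≈ 0.28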
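The arithmetic of the baby example can be checked with a few lines of standard-library Python. This is only a sketch of the formulas as written in this section; the function names are assumptions. It evaluates IDF with the natural logarithm, as the IDF definition states, and also with a base-10 logarithm, which is the base that yields exactly 4.

```python
import math


def tf(term_count, total_words):
    """Term frequency: occurrences of the term divided by document length."""
    return term_count / total_words


def idf(total_docs, docs_with_term, log=math.log):
    """Inverse document frequency; natural log by default."""
    return log(total_docs / docs_with_term)


tf_baby = tf(3, 100)
idf_ln = idf(10_000_000, 1_000)                      # natural log
idf_log10 = idf(10_000_000, 1_000, log=math.log10)   # base-10 log

print(tf_baby)            # 0.03
print(round(idf_ln, 2))   # 9.21
print(idf_log10)          # 4.0
print(round(tf_baby * idf_ln, 4))
```

The choice of base only rescales every IDF value by a constant, so it does not change how terms rank against each other.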

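from nltk.text import TextCollection

# First, put all documents into a TextCollection.
# The class takes care of the statistics and calculations for you.
# Each document is passed as a list of tokens so that terms match whole words.
sentences = ['this is sentence one', 'this is sentence two', 'this is sentence three']
corpus = TextCollection([s.split() for s in sentences])

# tf-idf can then be computed directly (term: a term in a sentence, text: that sentence)
print(corpus.tf_idf('one', 'this is sentence one'.split()))
# A result of 0 means the word is too common: it appears in every sentence

# Likewise, how do we get a fixed-size vector that can represent any sentence?
# For each new sentence
new_sentence = 'this is sentence five'.split()
# standard_vocab was not defined in the original; it is assumed here to be the sorted corpus vocabulary
standard_vocab = sorted(set(word for s in sentences for word in s.split()))
# Iterate over every word in the vocabulary
for word in standard_vocab:
    print(corpus.tf_idf(word, new_sentence))
# This yields one very long vector (length = size of the vocabulary)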

If memory is insufficient, read the data in with an iterator.
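One way to read a corpus lazily is a generator that yields one tokenized line at a time, so the whole file never has to sit in memory. The sketch below uses only the standard library; the function name, the whitespace tokenization, and the temporary demo file are assumptions made for this example.

```python
import os
import tempfile


def iter_sentences(path, encoding="utf-8"):
    """Yield one tokenized line at a time instead of loading the whole file."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            tokens = line.split()
            if tokens:  # skip blank lines
                yield tokens


# Demo on a small temporary file (hypothetical data).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("this is sentence one\n\nthis is sentence two\n")
    demo_path = tmp.name

sentences = list(iter_sentences(demo_path))
os.remove(demo_path)
print(sentences)
```

In real use the generator would be consumed in a loop rather than wrapped in list(), which is done here only to show the result.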

Kaggle competition

/c/home-depot-product-search-relevance

Home Depot Product Search Relevance

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from nltk.stem.snowball import SnowballStemmer

# Read the training set, the test set, and the product descriptions
df_train = pd.read_csv('./home-depot-product-search-relevance/train.csv', encoding="ISO-8859-1")
df_test = pd.read_csv('./home-depot-product-search-relevance/test.csv', encoding="ISO-8859-1")
df_desc = pd.read_csv('./home-depot-product-search-relevance/product_descriptions.csv')

# No complex handling seems necessary, so combine the training and test sets
# to preprocess all text in one pass; combined shape: (240760, 5)
n_train = len(df_train)  # remember where the training rows end
df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)  # stack vertically
df_all = pd.merge(df_all, df_desc, how='left', on='product_uid')

# Text preprocessing could include removing stopwords, fixing spelling, and removing
# digits and emoticons; only lowercasing and stemming are applied here
stemmer = SnowballStemmer('english')

def str_stemmer(s):
    return " ".join([stemmer.stem(word) for word in s.lower().split()])

# To measure how useful the search keywords are, naively count how many of them appear
def str_common_word(str1, str2):
    return sum(int(str2.find(word) >= 0) for word in str1.split())

# Run every text column through the stemmer
df_all['search_term'] = df_all['search_term'].map(str_stemmer)
df_all['product_title'] = df_all['product_title'].map(str_stemmer)
df_all['product_description'] = df_all['product_description'].map(str_stemmer)

# Hand-made text features: be creative and add whatever comes to mind
# Length of the query
df_all['len_of_query'] = df_all['search_term'].map(lambda x: len(x.split())).astype(np.int64)
# How many query keywords appear in the product title
df_all['commons_in_title'] = df_all.apply(lambda x: str_common_word(x['search_term'], x['product_title']), axis=1)
# How many query keywords appear in the product description
df_all['commons_in_desc'] = df_all.apply(lambda x: str_common_word(x['search_term'], x['product_description']), axis=1)

# Drop the columns that a machine learning model cannot consume
df_all = df_all.drop(['search_term', 'product_title', 'product_description'], axis=1)

# Split back into training and test sets by position. The original used
# df_all.loc[df_test.index], which selects the first rows (training data)
# because ignore_index=True reset the index.
df_train = df_all.iloc[:n_train]
df_test = df_all.iloc[n_train:]
# Keep the test ids
test_idx = df_test['id']
# Separate out y_train
y_train = df_train['relevance'].values
# Remove the label from the features, otherwise it would be cheating
X_train = df_train.drop(['id', 'relevance'], axis=1).values
X_test = df_test.drop(['id', 'relevance'], axis=1).values

# Build the model, using a simple one: random forest regression.
# Use cross-validation for a fair estimate and try different max_depth values.
params = [1, 3, 5, 6, 7, 8, 9, 10]
test_scores = []
for param in params:
    clf = RandomForestRegressor(n_estimators=30, max_depth=param)
    # 5-fold cross-validation; the square root of the negated score is the RMSE
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=5, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))

# Visualize
plt.plot(params, test_scores)
plt.title("Param vs CV Error")
plt.show()

# Build the model with the best value found and run it on the test set
rf = RandomForestRegressor(n_estimators=30, max_depth=6)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Put the predictions into a DataFrame and write the CSV for upload
pd.DataFrame({"id": test_idx, "relevance": y_pred}).to_csv('submission.csv', index=False)
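The cross-validation loop above takes the square root of the negated neg_mean_squared_error score, which is the root mean squared error (RMSE). For readers who want to see that quantity spelled out, here is a small standard-library sketch; the function name and the sample numbers are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical relevance labels and predictions.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

A lower value means the predicted relevance scores sit closer to the labels, which is why the plot of max_depth against this error is used to pick the depth.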
