
Spam Email Classification with Naive Bayes in Python, with Result Analysis

Date: 2022-05-13 20:15:30


Naive Bayes Principle

See: 贝叶斯推断及其互联网应用(二):过滤垃圾邮件 (Bayesian Inference and Its Internet Applications, Part 2: Filtering Spam).
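The heart of that approach is Paul Graham's rule for fusing the per-word probabilities P(spam|word_i) into one score for the whole email, which is exactly what the script at the end of this post does. A minimal sketch (the function name combine_probs is mine, not from the original code):

def combine_probs(probs):
    # P1*P2*...*Pn / (P1*P2*...*Pn + (1-P1)*(1-P2)*...*(1-Pn))
    numerator = 1.0
    denominator_h = 1.0
    for p in probs:
        numerator *= p          # product of P(spam|word_i)
        denominator_h *= 1 - p  # product of (1 - P(spam|word_i))
    return numerator / (numerator + denominator_h)

print(combine_probs([0.9, 0.9, 0.4]))  # two strong spam words outvote one weak one: ~0.982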

Python Implementation

The backbone of the source code comes from: python实现贝叶斯推断——垃圾邮件分类 (Bayesian inference in Python: spam classification).

I only added comments plus extra output for the statistical analysis of the results.

Source download: GitHub: NaiveBayesEmail.py

This article was originally published as: 基于朴素贝叶斯+Python实现垃圾邮件分类.

Result Analysis

For a word that appears only in spam (or only in ham), its probability of appearing in ham (or spam) is set to a small constant P(not_appear).
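In the full script this is what count_word_prob does: a word's class-conditional probability is the fraction of that class's emails containing it, with P(not_appear) as the fallback when the count is zero. A minimal sketch (the constant name NOT_APPEAR is mine):

NOT_APPEAR = 0.05  # the P(not_appear) fallback being varied in these experiments

def word_prob_in_class(word, emails):
    # emails: list of word lists, one per email in the class
    count = sum(1 for email in emails if word in email)
    return count / len(emails) if count else NOT_APPEAR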

1) Results when P(not_appear) = 0.01:

With stopword removal: (output screenshot omitted)

Without stopword removal: (output screenshot omitted)

2) Results when P(not_appear) = 0.05:

With stopword removal: (output screenshot omitted)

Without stopword removal: (output screenshot omitted)

As the results show, stopword removal makes little difference, while a larger P(not_appear) makes the filter more likely to misjudge spam as ham; a numeric illustration follows below.
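To see why, consider a word that occurs only in spam. With the equal priors used in the code, P(spam|word) = Pws / (Pws + Pwh), and for a spam-only word Pwh is exactly the P(not_appear) fallback, so raising P(not_appear) dilutes that word's spam evidence. A quick illustration (the numbers are made up for the example):

pws = 0.4  # a word seen in 40% of spam emails and never in ham
for not_appear in (0.01, 0.05, 0.2):
    psw = pws / (pws + not_appear)  # priors cancel since P(spam) = P(ham) = 0.5
    print(not_appear, round(psw, 3))
# 0.01 -> 0.976, 0.05 -> 0.889, 0.2 -> 0.667: weaker spam evidence as P(not_appear) grows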

3) Inspecting the pairs [spam-misjudged-as-ham count, total misjudgement count]

Rate of mistaking spam for ham across 100 runs when P(not_appear) = 0.05 without stopword removal. Each run is recorded as [wrong_spamToham, wrong]; a run with no misjudgements at all is recorded as [-1].

Run 1:

[[1, 1], [1, 1], [2, 2], [2, 2], [2, 2], [3, 3], [1, 1], [2, 2], [2, 2], [4, 4], [3, 3], [1, 1], [1, 1], [2, 2], [5, 5], [1, 1], [1, 1], [1, 1], [1, 1], [2, 2], [2, 2], [-1], [2, 2], [1, 1], [2, 2], [-1], [3, 3], [2, 2], [1, 1], [1, 1], [2, 2], [-1], [4, 4], [1, 1], [3, 3], [2, 2], [2, 2], [3, 3], [2, 2], [3, 3], [2, 2], [2, 2], [1, 1], [1, 1], [-1], [1, 1], [1, 1], [2, 2], [-1], [2, 2], [1, 1], [2, 2], [1, 1], [-1], [2, 2], [2, 2], [2, 2], [3, 3], [4, 4], [1, 1], [2, 2], [1, 1], [2, 2], [3, 3], [3, 3], [-1], [3, 3], [2, 2], [2, 2], [2, 2], [2, 2], [3, 3], [3, 3], [2, 2], [5, 5], [2, 2], [-1], [4, 4], [3, 3], [4, 4], [1, 1], [3, 3], [1, 1], [1, 1], [-1], [1, 1], [1, 1], [1, 1], [3, 3], [2, 2], [1, 1], [2, 2], [4, 4], [2, 2], [3, 3], [3, 3], [2, 2], [1, 1], [2, 2], [1, 1]]

Run 2:

[[1, 1], [1, 1], [4, 4], [1, 1], [2, 2], [1, 1], [3, 3], [-1], [-1], [4, 4], [1, 1], [2, 2], [-1], [3, 3], [5, 5], [2, 2], [1, 1], [1, 1], [4, 4], [2, 2], [2, 2], [3, 3], [-1], [1, 1], [2, 2], [3, 3], [4, 4], [1, 1], [3, 3], [2, 2], [2, 2], [-1], [1, 1], [3, 3], [1, 1], [2, 2], [1, 1], [3, 3], [1, 1], [2, 2], [1, 1], [2, 2], [2, 2], [3, 3], [3, 3], [1, 1], [4, 4], [2, 2], [-1], [3, 3], [-1], [-1], [1, 1], [1, 1], [-1], [3, 3], [1, 1], [1, 1], [3, 3], [2, 2], [4, 4], [1, 1], [-1], [-1], [6, 6], [1, 1], [3, 3], [3, 3], [1, 1], [2, 2], [2, 2], [1, 1], [2, 2], [3, 3], [2, 2], [1, 1], [2, 2], [1, 1], [2, 2], [-1], [1, 1], [3, 3], [2, 2], [2, 2], [1, 1], [1, 1], [-1], [1, 1], [1, 1], [-1], [1, 1], [2, 2], [2, 2], [1, 1], [1, 1], [2, 2], [1, 1], [3, 3], [1, 1], [2, 2]]

As the lists show, in every misjudged case the error is spam taken for ham: wrong_spamToham equals wrong in every entry other than [-1]. The snippet below double-checks this.
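A quick sanity check over either printed list (the variable names here are mine, not from the original script):

results = [[1, 1], [2, 2], [-1], [3, 3]]  # paste one of the full lists above here
non_trivial = [r for r in results if r != [-1]]  # drop the runs with no misjudgements
assert all(spam_to_ham == total for spam_to_ham, total in non_trivial)
print('every misjudgement was spam -> ham')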

4) A look inside spam_25.txt, the email misjudged most often

With P(not_appear) = 0.05 and no stopword removal, here are the per-word P(spam|word) values (the prob_dict the script prints) for spam_25.txt in two randomly drawn runs where it was misjudged:

spam_25.txt ham 0.014

{'bettererections': 0.4,
 'penisen1argement': 0.4,
 'supplement': 0.4,
 'with': 0.13043478260869568,
 'safest': 0.4,
 'most': 0.5128205128205128,
 'effective': 0.4,
 'methods': 0.4,
 'today': 0.34426229508196726,
 'and': 0.46478873239436624,
 'trusted': 0.4,
 'inches': 0.8633093525179857,
 'buy': 0.8633093525179857,
 'millions': 0.4,
 'more\n': 0.4,
 'time': 0.34426229508196726,
 'products': 0.5121951219512195,
 'save': 0.6779661016949153,
 'money': 0.4,
 'biggerpenis': 0.4,
 'the': 0.1914893617021277,
 'experience': 0.8633093525179857,
 'your': 0.3559322033898305,
 'ma1eenhancement': 0.4,
 'grow': 0.4}

spam_25.txt ham 0.0166

{'bettererections': 0.4,
 'penisen1argement': 0.4,
 'supplement': 0.4,
 'with': 0.15492957746478875,
 'safest': 0.4,
 'most': 0.5263157894736842,
 'effective': 0.4,
 'methods': 0.4,
 'today': 0.3793103448275862,
 'and': 0.5301204819277108,
 'trusted': 0.4,
 'inches': 0.8860759493670887,
 'buy': 0.8695652173913043,
 'millions': 0.4,
 'more\n': 0.4,
 'time': 0.26829268292682934,
 'products': 0.4,
 'save': 0.6896551724137931,
 'money': 0.4,
 'biggerpenis': 0.4,
 'the': 0.20754716981132076,
 'experience': 0.8860759493670887,
 'your': 0.2894736842105263,
 'ma1eenhancement': 0.4,
 'grow': 0.4}

Clearly, the root cause is that spam_25.txt is unlike most of the emails: a large share of its words never occur in the training set, so each of them gets the assumed new-word probability P(spam|new word) = 0.4. Once enough words sit at 0.4, the combined score drops below the 0.9 spam threshold and the email is judged ham; the arithmetic below makes this concrete.
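Under the combination formula, n words that all score 0.4 yield 0.4^n / (0.4^n + 0.6^n), which shrinks toward 0 as n grows. A quick check:

for n in (1, 3, 5, 10):
    score = 0.4**n / (0.4**n + 0.6**n)
    print(n, round(score, 4))
# 1 -> 0.4, 3 -> 0.2286, 5 -> 0.1164, 10 -> 0.017: far below the 0.9 spam threshold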

The complete code follows:

# !/usr/bin/python
# -*- coding: utf-8 -*-
# @Date     : -05-09 09:29:13
# @Author   : Alan Lau (rlalan@)
# @Language : Python3.5
# @EditTime : -04-09 13:04:13
# @Editor   : Galo
# @Function : 1. Walk the email folder (50 messages) and collect the file list
#             2. Shuffle the list with random.shuffle()
#             3. Move the first 10 files of the shuffled list into the test folder as the test set
#             4. Naive Bayes spam classification, 100 runs, plus result statistics and charts
# from fwalker import fun
# from reader import readtxt
import os
import shutil                      # moving files
import random                      # random test-set sampling
import numpy as np                 # (imported in the original; not actually used)
import matplotlib.pyplot as plt    # plotting
from nltk.corpus import stopwords  # stopword removal

cachedStopWords = stopwords.words("english")  # English stopword list


def fileWalker(path):
    # Walk the corpus directory and collect the absolute path of every file
    fileArray = []
    for root, dirs, files in os.walk(path):
        for fn in files:
            eachpath = str(root + '\\' + fn)
            fileArray.append(eachpath)
    return fileArray


def test_set_select():
    # Randomly pick 10 emails from spam and ham and move them into test as the test set
    filepath = r'..\email'
    testpath = r'..\email\test'
    files = fileWalker(filepath)
    random.shuffle(files)
    top10 = files[:10]
    for ech in top10:
        # join the last two path components with '_' to build the test-set file name
        ech_name = testpath + '\\' + ('_'.join(ech.split('\\')[-2:]))
        shutil.move(ech, testpath)  # move ech into the test folder
        os.rename(testpath + '\\' + ech.split('\\')[-1], ech_name)  # rename ech to ech_name (could be merged with the move)
        # print('%s moved' % ech_name)
    return


def test_set_clear():
    # Move the files in test back into spam/ham, ready for the next sampling
    filepath = r'..\email'
    testpath = r'..\email\test'
    files = fileWalker(testpath)
    for ech in files:
        # recover the original directory and file name from before the move
        ech_initial = filepath + '\\' + '\\'.join(' '.join(ech.split('\\')[-1:]).split('_'))
        # recover the original directory from before the move
        ech_move = filepath + '\\' + (' '.join(ech.split('\\')[-1:]).split('_'))[0]
        shutil.move(ech, ech_move)  # move ech back into its class folder
        os.rename(ech_move + '\\' + ' '.join(ech.split('\\')[-1:]), ech_initial)  # restore the original name
        # print('%s moved' % ech)
    return


def readtxt(path, encoding):
    # Read all lines of the file at path with the given encoding; return the list of lines
    with open(path, 'r', encoding=encoding) as f:
        lines = f.readlines()
    return lines


def email_parser(email_path):
    # Strip special characters and punctuation; return the clean word list clean_word
    punctuations = """,.<>()*&^%$#@!'";~`[]{}|、\\/~+_-=?"""
    content_list = readtxt(email_path, 'gbk')
    content = (' '.join(content_list)).replace('\r\n', ' ').replace('\t', ' ')
    clean_word = []
    for punctuation in punctuations:
        content = (' '.join(content.split(punctuation))).replace('  ', ' ')
    clean_word = [word.lower() for word in content.split(' ')
                  if word.lower() not in cachedStopWords and len(word) > 2]
    # stopwords are removed here; keeping them makes little difference
    return clean_word


def get_word(email_file):
    # Collect the words of every file under email_file:
    # append per-email word lists to word_list, extend word_set and deduplicate via set
    word_list = []
    word_set = []
    email_paths = fileWalker(email_file)
    for email_path in email_paths:
        clean_word = email_parser(email_path)
        word_list.append(clean_word)
        word_set.extend(clean_word)
    # print(set(word_set))
    return word_list, set(word_set)


def count_word_prob(email_list, union_set):
    # Return the training-set word-frequency dictionary word_prob
    word_prob = {}
    for word in union_set:
        counter = 0
        for email in email_list:
            if word in email:
                counter += 1
            else:
                continue
        prob = 0.0
        if counter != 0:
            prob = counter / len(email_list)
        else:
            # if the word never appears in this class, fall back to P(not_appear) = 0.01, 0.05, ...
            # the larger this value, the more spam gets misjudged as ham
            prob = 0.05
        word_prob[word] = prob
    return word_prob


def filter(ham_word_pro, spam_word_pro, test_file):
    # Run one pass over the test set (10 emails) and print the verdict for each;
    # return the accuracy right_rate and the pair [spam-misjudged-as-ham count, total misjudgement count]
    right = 0
    wrong = 0
    wrong_spam = 0
    test_paths = fileWalker(test_file)
    for test_path in test_paths:
        # Bayesian inference and classification
        email_spam_prob = 0.0
        spam_prob = 0.5  # assume P(spam) = 0.5
        ham_prob = 0.5   # P(ham) = 0.5
        file_name = test_path.split('\\')[-1]
        prob_dict = {}
        words = set(email_parser(test_path))
        for word in words:  # compute P(spam|word) for every word in the test email
            Psw = 0.0
            if word not in spam_word_pro:
                Psw = 0.4  # a brand-new word gets P(spam|new word) = 0.4, per Paul Graham
            else:
                Pws = spam_word_pro[word]  # P(word|spam)
                Pwh = ham_word_pro[word]   # P(word|ham)
                Psw = spam_prob * (Pws / (Pwh * ham_prob + Pws * spam_prob))
                # P(spam|word) = P(spam)*P(word|spam)/P(word)
                #              = P(spam)*P(word|spam)/(P(word|ham)*P(ham)+P(word|spam)*P(spam))
            prob_dict[word] = Psw
        numerator = 1
        denominator_h = 1
        for k, v in prob_dict.items():
            numerator *= v            # P1*P2*...*Pn = P(spam|word1)*P(spam|word2)*...*P(spam|wordn)
            denominator_h *= (1 - v)  # (1-P1)*(1-P2)*...*(1-Pn)
        email_spam_prob = round(numerator / (numerator + denominator_h), 4)
        # P(spam|word1 word2 ... wordn) = P1P2...Pn / (P1P2...Pn + (1-P1)(1-P2)...(1-Pn))
        if email_spam_prob > 0.9:  # score > 0.9 means spam
            print(file_name, 'spam', email_spam_prob)
            if file_name.split('_')[1] == '25.txt':
                print(prob_dict)
            if file_name.split('_')[0] == 'spam':  # record whether the verdict is correct
                right += 1
            else:
                wrong += 1
                print('***********************Wrong Prediction***********************')
        else:
            print(file_name, 'ham', email_spam_prob)
            if file_name.split('_')[1] == '25.txt':
                print(prob_dict)
            if file_name.split('_')[0] == 'ham':  # record whether the verdict is correct
                right += 1
            else:
                wrong += 1
                wrong_spam += 1  # count spam misjudged as ham
                print('***********************Wrong Prediction***********************')
    # print(prob_dict)
    right_rate = right / (right + wrong)  # accuracy over one test set
    if wrong != 0:
        wrong_spam_rate = [wrong_spam, wrong]  # [spam misjudged as ham, total misjudged]
    else:
        wrong_spam_rate = [-1]  # no misjudgements at all
    return right_rate, wrong_spam_rate


def main():
    right_rate_list = []
    wrong_spam_rate_list = []
    ham_file = r'..\email\ham'
    spam_file = r'..\email\spam'
    test_file = r'..\email\test'
    for i in range(100):
        # sample a test set, test, and record accuracy 100 times;
        # note the training set must not contain the test set
        test_set_select()  # build the test set
        ham_list, ham_set = get_word(ham_file)
        spam_list, spam_set = get_word(spam_file)
        union_set = ham_set | spam_set  # union of the two word sets
        ham_word_pro = count_word_prob(ham_list, union_set)    # word frequencies in ham
        spam_word_pro = count_word_prob(spam_list, union_set)  # word frequencies in spam
        rig, wrg = filter(ham_word_pro, spam_word_pro, test_file)
        right_rate_list.append(rig)       # record the accuracy
        wrong_spam_rate_list.append(wrg)  # record the spam->ham misjudgement pair
        test_set_clear()  # restore the test set
    # scatter plot of the accuracy over the 100 runs
    x = range(100)
    y = right_rate_list
    plt.scatter(x, y)
    plt.title('Correct Rate of 100 Times')
    plt.show()
    # print the 100 spam->ham misjudgement pairs
    print(wrong_spam_rate_list)
    return


if __name__ == '__main__':
    main()
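Note the layout the script assumes: ham emails in ..\email\ham, spam in ..\email\spam, and an initially empty ..\email\test folder, all relative to the script, with Windows-style path separators. NLTK's English stopword corpus must also be downloaded beforehand (nltk.download('stopwords')).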
