
Identifying Spam Email with the Naive Bayes Algorithm

Time: 2024-05-02 09:00:56


Reposted from: /wowcplusplus/article/details/25190809

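Naive Bayes is a machine learning algorithm widely used in industry. It rests on a solid mathematical foundation and performs remarkably well in some typical applications. The algorithm is based on Bayes' theorem from probability theory, whose core formula is:

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$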

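Here $c$ denotes a class, and $P(c \mid x)$ is the conditional probability that the class is $c$ given the observation $x$. We compute $P(c \mid x)$ for every class, compare the values, and take the class with the largest probability as the final label. That is how classification is done. Let us now see how to compute these conditional probabilities.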
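To make the decision rule concrete, here is a minimal sketch with two classes. The probabilities are made-up numbers chosen only for illustration, and the names `priors`, `likelihoods`, and `posteriors` are hypothetical; they do not come from the original post.

```python
# Hypothetical numbers, chosen only to illustrate the decision rule.
priors = {"ham": 0.5, "spam": 0.5}          # P(c)
likelihoods = {"ham": 0.02, "spam": 0.08}   # P(x | c) for one observed x

# Evidence P(x) = sum over classes of P(x | c) * P(c)
evidence = sum(likelihoods[c] * priors[c] for c in priors)

# Posterior P(c | x) from Bayes' formula
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}

# Pick the class with the largest posterior
predicted = max(posteriors, key=posteriors.get)
print(posteriors, predicted)
```

With these numbers the posterior for spam is four times that for ham, so the observation is labelled spam.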

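Let $x = (x_1, x_2, \dots, x_n)$, so that $P(x \mid c) = P(x_1, x_2, \dots, x_n \mid c)$. Naive Bayes assumes that the $x_i$ are mutually independent (given the class), so $P(x \mid c) = \prod_{i=1}^{n} P(x_i \mid c)$. Each $P(x_i \mid c)$ is estimated by counting with an indicator function $I$ (1 if present, 0 if absent), so both $P(x_i \mid c)$ and $P(c)$ can be obtained directly from statistics on the training data. From this analysis we get the value of $P(x \mid c)\,P(c)$. Because $P(x)$ is the same for every class, comparing the conditional probabilities $P(c \mid x)$ is equivalent to comparing $P(x \mid c)\,P(c)$. That is the mathematical principle behind naive Bayes.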
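The classifier later in this article sums logarithms rather than multiplying probabilities. The short derivation below shows why that gives the same decision. It uses the same symbols as the text; $\hat{c}$ for the predicted class is notation introduced here.

```latex
\begin{aligned}
\hat{c} &= \arg\max_{c} P(c \mid x)
         = \arg\max_{c} \frac{P(x \mid c)\,P(c)}{P(x)} \\
        &= \arg\max_{c} P(c) \prod_{i=1}^{n} P(x_i \mid c) \\
        &= \arg\max_{c} \Bigl[ \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c) \Bigr]
\end{aligned}
```

The second line drops $P(x)$ because it does not depend on $c$. The third line holds because the logarithm is strictly increasing. Summing logs also avoids the numerical underflow that a product of many small probabilities would cause.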

Below, we use a typical application of naive Bayes, spam filtering, to demonstrate a Python implementation of the algorithm.

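Suppose we have 25 spam emails and 25 normal emails. How do we use them as training data to obtain a naive Bayes model that filters spam? First, we build a word vector $x = (x_1, x_2, \dots, x_n)$ from each email, where $x_i$ denotes a word that occurs in the email. Then we compare $P(x \mid c_0)\,P(c_0)$ with $P(x \mid c_1)\,P(c_1)$, where $c_0$ is normal mail (ham) and $c_1$ is spam. Whichever is larger decides the label.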
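One common way to turn an email into such a vector is to mark, for each vocabulary word, whether it occurs in the email. The sketch below assumes that encoding. The helper names `build_vocabulary` and `to_presence_vector` and the toy emails are hypothetical, and the article's own code works with word lists and counts instead.

```python
def build_vocabulary(emails):
    """Collect every distinct word across tokenized emails, in sorted order."""
    return sorted({word for email in emails for word in email})


def to_presence_vector(email, vocabulary):
    """x_i = 1 if the i-th vocabulary word occurs in the email, else 0."""
    present = set(email)
    return [1 if word in present else 0 for word in vocabulary]


# Toy, already-tokenized emails (made up for illustration).
emails = [["cheap", "pills", "cheap"], ["meeting", "tomorrow"]]
vocabulary = build_vocabulary(emails)

# "hello" is not in the vocabulary, so it contributes nothing to the vector.
vector = to_presence_vector(["cheap", "meeting", "hello"], vocabulary)
print(vocabulary)
print(vector)
```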

Step one: we convert each email into a list of words, using the following function:

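    import re

    def file2array(filename):
        # Read the whole email; ignore stray bytes that are not valid UTF-8.
        with open(filename, 'r', encoding='utf-8', errors='ignore') as f:
            fileReader = f.read()
        # Split on runs of non-word characters ('\W+' rather than '\W*',
        # which also matches the empty string).
        listOfWord = re.split(r'\W+', fileReader)
        # Keep lowercase words longer than 3 characters.
        fileArray = [word.lower() for word in listOfWord if len(word) > 3]
        return fileArray

Next, we gather all the emails into one list and randomly pick 5 of them as the test set:

    import random

    def getAllInfo():
        allTextMat = []
        allTypeArray = []
        testMat = []
        testTypeArray = []
        # Load the 25 spam (label 1) and 25 ham (label 0) emails.
        for i in range(1, 26):
            allTextMat.append(file2array('email/spam/%d.txt' % i))
            allTypeArray.append(1)
            allTextMat.append(file2array('email/ham/%d.txt' % i))
            allTypeArray.append(0)
        # Move 5 randomly chosen emails from the training set to the test set.
        for i in range(5):
            randIndex = random.randrange(len(allTextMat))
            testMat.append(allTextMat[randIndex])
            testTypeArray.append(allTypeArray[randIndex])
            del allTextMat[randIndex]
            del allTypeArray[randIndex]
        return allTextMat, allTypeArray, testMat, testTypeArray

Then we count how often each word occurs overall, and how often it occurs in spam and in normal mail:

    import numpy as np

    def getWordList(allTextMat):
        # Vocabulary: every distinct word seen in the training emails.
        wordSet = set()
        for textVec in allTextMat:
            wordSet |= set(textVec)
        return list(wordSet)

    def getCountList(wordList, allTextMat, allTypeArray):
        wordListLen = len(wordList)
        # Counts start at 1 per class (2 in total) so that no ratio is ever 0.
        totalCntList = np.ones(wordListLen) * 2
        p0CntList = np.ones(wordListLen)
        p1CntList = np.ones(wordListLen)
        p0Cnt = 0  # number of words counted in ham
        p1Cnt = 0  # number of words counted in spam
        for textVec, docType in zip(allTextMat, allTypeArray):
            for word in textVec:
                wordPos = wordList.index(word)
                totalCntList[wordPos] += 1
                if docType == 1:
                    p1CntList[wordPos] += 1
                    p1Cnt += 1
                elif docType == 0:
                    p0CntList[wordPos] += 1
                    p0Cnt += 1
        # Class priors, taken here as each class's share of all counted words.
        p0 = float(p0Cnt) / (p0Cnt + p1Cnt)
        p1 = 1 - p0
        return totalCntList, p0CntList, p1CntList, p0, p1

Finally, we perform the Bayes classification:

    from math import log

    def bayesClassify(testMat, testTypeArray, totalCntList, p0CntList, p1CntList, p0, p1, wordList):
        errorCnt = 0
        for testVec, trueType in zip(testMat, testTypeArray):
            sum0 = 0.0
            sum1 = 0.0
            for word in testVec:
                # Skip words that never appeared in the training set.
                if word not in wordList:
                    continue
                wordPos = wordList.index(word)
                # Per-word score: the smoothed share of this word's
                # occurrences that fall in each class.
                sum0 += log(p0CntList[wordPos] / totalCntList[wordPos])
                sum1 += log(p1CntList[wordPos] / totalCntList[wordPos])
            sum0 += log(p0)
            sum1 += log(p1)
            decType = 0 if sum0 > sum1 else 1
            if decType != trueType:
                errorCnt += 1
                # Show both scores for the misclassified email.
                print(sum0, sum1)
        return errorCnt

The complete code is as follows: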

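    import random
    import re
    from math import log

    import numpy as np

    def file2array(filename):
        # Read the whole email; ignore stray bytes that are not valid UTF-8.
        with open(filename, 'r', encoding='utf-8', errors='ignore') as f:
            fileReader = f.read()
        # Split on runs of non-word characters and keep lowercase words
        # longer than 3 characters.
        listOfWord = re.split(r'\W+', fileReader)
        fileArray = [word.lower() for word in listOfWord if len(word) > 3]
        return fileArray

    def getAllInfo():
        allTextMat = []
        allTypeArray = []
        testMat = []
        testTypeArray = []
        # Load the 25 spam (label 1) and 25 ham (label 0) emails.
        for i in range(1, 26):
            allTextMat.append(file2array('email/spam/%d.txt' % i))
            allTypeArray.append(1)
            allTextMat.append(file2array('email/ham/%d.txt' % i))
            allTypeArray.append(0)
        # Move 5 randomly chosen emails from the training set to the test set.
        for i in range(5):
            randIndex = random.randrange(len(allTextMat))
            testMat.append(allTextMat[randIndex])
            testTypeArray.append(allTypeArray[randIndex])
            del allTextMat[randIndex]
            del allTypeArray[randIndex]
        return allTextMat, allTypeArray, testMat, testTypeArray

    def getWordList(allTextMat):
        # Vocabulary: every distinct word seen in the training emails.
        wordSet = set()
        for textVec in allTextMat:
            wordSet |= set(textVec)
        return list(wordSet)

    def getCountList(wordList, allTextMat, allTypeArray):
        wordListLen = len(wordList)
        # Counts start at 1 per class (2 in total) so that no ratio is ever 0.
        totalCntList = np.ones(wordListLen) * 2
        p0CntList = np.ones(wordListLen)
        p1CntList = np.ones(wordListLen)
        p0Cnt = 0  # number of words counted in ham
        p1Cnt = 0  # number of words counted in spam
        for textVec, docType in zip(allTextMat, allTypeArray):
            for word in textVec:
                wordPos = wordList.index(word)
                totalCntList[wordPos] += 1
                if docType == 1:
                    p1CntList[wordPos] += 1
                    p1Cnt += 1
                elif docType == 0:
                    p0CntList[wordPos] += 1
                    p0Cnt += 1
        # Class priors, taken here as each class's share of all counted words.
        p0 = float(p0Cnt) / (p0Cnt + p1Cnt)
        p1 = 1 - p0
        return totalCntList, p0CntList, p1CntList, p0, p1

    def bayesClassify(testMat, testTypeArray, totalCntList, p0CntList, p1CntList, p0, p1, wordList):
        errorCnt = 0
        for testVec, trueType in zip(testMat, testTypeArray):
            sum0 = 0.0
            sum1 = 0.0
            for word in testVec:
                # Skip words that never appeared in the training set.
                if word not in wordList:
                    continue
                wordPos = wordList.index(word)
                # Per-word score: the smoothed share of this word's
                # occurrences that fall in each class.
                sum0 += log(p0CntList[wordPos] / totalCntList[wordPos])
                sum1 += log(p1CntList[wordPos] / totalCntList[wordPos])
            sum0 += log(p0)
            sum1 += log(p1)
            decType = 0 if sum0 > sum1 else 1
            if decType != trueType:
                errorCnt += 1
                # Show both scores for the misclassified email.
                print(sum0, sum1)
        return errorCnt

    def main():
        allTextMat, allTypeArray, testMat, testTypeArray = getAllInfo()
        wordList = getWordList(allTextMat)
        totalCntList, p0CntList, p1CntList, p0, p1 = getCountList(wordList, allTextMat, allTypeArray)
        # Print the number of misclassified test emails.
        print(bayesClassify(testMat, testTypeArray, totalCntList, p0CntList, p1CntList, p0, p1, wordList))

    if __name__ == '__main__':
        main()

The email data I used comes from Chapter 4 of Machine Learning in Action (《机器学习实战》). Interested readers can download the dataset from its official site, /pharrington/.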
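The listing above scores each word by its class count divided by its total count, that is, by the share of the word's occurrences that fall in each class. For comparison, the textbook estimate of $P(x_i \mid c)$ divides a word's count in class $c$ by the total number of words in that class, with add-one smoothing, and takes $P(c)$ from the share of documents per class. The sketch below is a minimal version of that variant on made-up data. It is not the original author's method, and the names `train_nb` and `predict_nb` are hypothetical.

```python
from collections import Counter
from math import exp, log


def train_nb(docs, labels):
    """Estimate log P(c) and add-one-smoothed log P(word | c) from tokenized docs."""
    vocab = {w for doc in docs for w in doc}
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    # Prior: share of training documents in each class.
    log_prior = {c: log(doc_counts[c] / len(docs)) for c in classes}
    log_like = {}
    for c in classes:
        total = sum(word_counts[c].values())
        # Add-one smoothing: (count + 1) / (class word total + vocabulary size).
        log_like[c] = {w: log((word_counts[c][w] + 1) / (total + len(vocab)))
                       for w in vocab}
    return log_prior, log_like


def predict_nb(doc, log_prior, log_like):
    """Return the class with the largest log P(c) + sum_i log P(x_i | c)."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped, as in the listing above.
        scores[c] = log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
    return max(scores, key=scores.get)


# Toy training data (made up): label 1 is spam, label 0 is ham.
docs = [["cheap", "pills", "offer"], ["cheap", "offer", "click"],
        ["meeting", "tomorrow", "agenda"], ["project", "meeting", "notes"]]
labels = [1, 1, 0, 0]
log_prior, log_like = train_nb(docs, labels)
print(predict_nb(["cheap", "offer", "today"], log_prior, log_like))
print(predict_nb(["meeting", "notes"], log_prior, log_like))
```

When the two classes are about the same size, as with 25 spam and 25 normal emails, both scoring schemes tend to rank the classes the same way. They can diverge when the classes are unbalanced.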

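That concludes this basic introduction to the Bayes algorithm. As this is the opening piece of the series, my wording may be off in places; please point out any mistakes in the comments.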
