Naive Bayes (Naive Bayesian)

Date: 2019-08-19 19:48:37

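Introduction

The Naive Bayesian algorithm is usually called Naive Bayes (and sometimes, jokingly, "idiot's Bayes").

Naive: it assumes that the features are conditionally independent given the class.

Bayes: it is based on Bayes' theorem.

The algorithm really is very simple. It is a supervised learning method and is commonly used to find a decision surface.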

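Basic idea

(1) Patient classification example

Six patients have the following records:

Given this table, suppose a seventh patient arrives who is a sneezing construction worker.

What is the probability that this patient has a cold?

By Bayes' theorem: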

P(A|B) = P(B|A) P(A) / P(B)

we get:

P(cold|sneezing x construction worker) = P(sneezing x construction worker|cold) x P(cold) / P(sneezing x construction worker)

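Assume the two features, sneezing and construction worker, are independent of each other (both overall and among patients with a cold). The equation above then becomes:

P(cold|sneezing x construction worker) = P(sneezing|cold) x P(construction worker|cold) x P(cold) / ( P(sneezing) x P(construction worker) )

Reading the five probabilities off the table gives P(sneezing|cold) = 2/3, P(construction worker|cold) = 1/3, P(cold) = 1/2, P(sneezing) = 1/2, and P(construction worker) = 1/3, so:

P(cold|sneezing x construction worker) = 2/3 x 1/3 x 1/2 / ( 1/2 x 1/3 ) = 2/3

So the probability that this sneezing construction worker has a cold is about 67%.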
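The arithmetic can be checked with exact fractions. This is only a small sketch: the variable names are made up for illustration, and the five probabilities are the ones used in the calculation above.

```python
from fractions import Fraction

# Probabilities taken from the worked example above (names are illustrative).
p_sneeze_given_cold = Fraction(2, 3)
p_builder_given_cold = Fraction(1, 3)
p_cold = Fraction(1, 2)
p_sneeze = Fraction(1, 2)
p_builder = Fraction(1, 3)

# Bayes' theorem with the two features treated as independent.
posterior = (p_sneeze_given_cold * p_builder_given_cold * p_cold) / (p_sneeze * p_builder)
print(posterior)  # 2/3
```

Using Fraction instead of floats keeps the result exact, which makes it easy to compare against the hand calculation.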

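(2) The naive Bayes classifier formula

Suppose an individual has n features, F1, F2, ..., Fn, and there are m classes, C1, C2, ..., Cm. A Bayes classifier picks the most probable class, that is, the class C that maximizes the following expression: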

P(C|F1 x F2 ...Fn) = P(F1 x F2 ... Fn|C) x P(C) / P(F1 x F2 ... Fn)

Because P(F1 x F2 ... Fn) is the same for every class, it can be dropped, and the problem reduces to maximizing

P(F1 x F2 ... Fn|C)P(C)

over the classes C.

By the "naive" assumption (the features are conditionally independent given the class), we have:

P(F1 x F2 ... Fn|C)P(C) = P(F1|C) x P(F2|C) ... P(Fn|C)P(C)

Every term on the right-hand side can be estimated from the data, so we can compute the score of each class and pick the class with the largest one.
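As a rough illustration of the formula, the sketch below estimates P(C) and each P(Fi|C) by counting and then picks the class with the largest product. The function names and the six records are hypothetical: the records were chosen only so that they reproduce the probabilities used in the patient example, and the sketch applies no smoothing, so a feature value never seen with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def train(samples):
    """Count classes and per-class feature values from (features, label) pairs."""
    class_counts = Counter(label for _, label in samples)
    feature_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    for features, label in samples:
        for i, value in enumerate(features):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts


def score(features, label, class_counts, feature_counts):
    """P(F1|C) x P(F2|C) ... P(Fn|C) x P(C), estimated from the counts."""
    total = sum(class_counts.values())
    result = Fraction(class_counts[label], total)  # P(C)
    for i, value in enumerate(features):
        result *= Fraction(feature_counts[(label, i)][value], class_counts[label])
    return result


def most_probable_class(features, class_counts, feature_counts):
    return max(class_counts,
               key=lambda c: score(features, c, class_counts, feature_counts))


# Hypothetical records, chosen to match the probabilities in the example.
records = [
    (("sneezing", "nurse"), "cold"),
    (("sneezing", "farmer"), "allergy"),
    (("headache", "construction worker"), "concussion"),
    (("headache", "construction worker"), "cold"),
    (("sneezing", "teacher"), "cold"),
    (("headache", "teacher"), "concussion"),
]
class_counts, feature_counts = train(records)
patient = ("sneezing", "construction worker")
print(most_probable_class(patient, class_counts, feature_counts))
```

Note that the score is not a probability, because the common denominator P(F1 x F2 ... Fn) has been dropped; only the ranking of the classes matters.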

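Implementation

Environment: macOS Mojave 10.14.3

Python 3.7.0

Library: scikit-learn 0.19.2

Run the following in a terminal to install scikit-learn (it is imported as sklearn, but the package to install is named scikit-learn):

pip install scikit-learn

Official scikit-learn documentation: http://scikit-/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html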

>>> import numpy as np
>>> # Six training points: the first three belong to label (class) 1, the last three to label 2
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> Y = np.array([1, 1, 1, 2, 2, 2])
>>> # Import the classifier
>>> from sklearn.naive_bayes import GaussianNB
>>> # Create a Gaussian naive Bayes classifier and assign it to clf
>>> clf = GaussianNB()
>>> # Train it: fit takes two arguments, the features X and the labels Y,
>>> # and learns the patterns that relate them
>>> clf.fit(X, Y)
>>> # Ask the trained classifier to predict the label of a new point, [-0.8, -1]
>>> print(clf.predict([[-0.8, -1]]))
[1]

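The workflow above is: create the training points -> create the classifier -> train it -> classify new data.

The new point above is assigned to label (class) 1.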
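To make it clearer what a Gaussian naive Bayes classifier does with these points, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: the function names are invented, and the small variance floor EPS is an assumption added only to avoid division by zero. For each class it estimates a prior plus a mean and variance per feature, then picks the class with the largest log prior plus summed Gaussian log-likelihoods.

```python
import math
from collections import defaultdict

EPS = 1e-9  # assumed variance floor, not taken from the article


def fit_gaussian_nb(X, y):
    """Return {label: (log prior, per-feature means, per-feature variances)}."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n + EPS
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_one(model, point):
    """Pick the label with the largest log P(C) + sum of log P(Fi|C)."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(point, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=log_posterior)


X = [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]
Y = [1, 1, 1, 2, 2, 2]
model = fit_gaussian_nb(X, Y)
print(predict_one(model, [-0.8, -1]))  # 1
```

Working with logarithms turns the product of probabilities into a sum, which avoids numerical underflow when there are many features.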

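Plotting the decision surface

Given a scatter plot in which the blue points are the "fast" terrain and the red points are the "slow" terrain (the colors and labels used in class_vis.py), how do we draw a boundary that separates the two groups?

prep_terrain_data.py

Generates the training and test points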

import random


def makeTerrainData(n_points=1000):
    ### make the toy dataset
    random.seed(42)
    grade = [random.random() for ii in range(0, n_points)]
    bumpy = [random.random() for ii in range(0, n_points)]
    error = [random.random() for ii in range(0, n_points)]
    y = [round(grade[ii]*bumpy[ii]+0.3+0.1*error[ii]) for ii in range(0, n_points)]
    for ii in range(0, len(y)):
        if grade[ii] > 0.8 or bumpy[ii] > 0.8:
            y[ii] = 1.0

    ### split into train/test sets
    X = [[gg, ss] for gg, ss in zip(grade, bumpy)]
    split = int(0.75*n_points)
    X_train = X[0:split]
    X_test = X[split:]
    y_train = y[0:split]
    y_test = y[split:]

    grade_sig = [X_train[ii][0] for ii in range(0, len(X_train)) if y_train[ii] == 0]
    bumpy_sig = [X_train[ii][1] for ii in range(0, len(X_train)) if y_train[ii] == 0]
    grade_bkg = [X_train[ii][0] for ii in range(0, len(X_train)) if y_train[ii] == 1]
    bumpy_bkg = [X_train[ii][1] for ii in range(0, len(X_train)) if y_train[ii] == 1]

    # training_data = {"fast": {"grade": grade_sig, "bumpiness": bumpy_sig},
    #                  "slow": {"grade": grade_bkg, "bumpiness": bumpy_bkg}}

    grade_sig = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii] == 0]
    bumpy_sig = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii] == 0]
    grade_bkg = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii] == 1]
    bumpy_bkg = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii] == 1]

    test_data = {"fast": {"grade": grade_sig, "bumpiness": bumpy_sig},
                 "slow": {"grade": grade_bkg, "bumpiness": bumpy_bkg}}

    return X_train, y_train, X_test, y_test
    # return training_data, test_data

ClassifyNB.py

Gaussian naive Bayes classifier

from sklearn.naive_bayes import GaussianNB


def classify(features_train, labels_train):
    ### create the classifier
    clf = GaussianNB()
    ### fit the classifier on the training features and labels
    clf.fit(features_train, labels_train)
    ### return the fit classifier; the caller makes predictions with it,
    ### e.g. pred = clf.predict(features_test)
    return clf

class_vis.py

Plots the decision surface and saves the figure

import warnings
warnings.filterwarnings("ignore")

import matplotlib
matplotlib.use('agg')  # non-interactive backend: render straight to a file

import matplotlib.pyplot as plt
import pylab as pl
import numpy as np


def prettyPicture(clf, X_test, y_test):
    x_min = 0.0; x_max = 1.0
    y_min = 0.0; y_max = 1.0

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    h = .01  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.pcolormesh(xx, yy, Z, cmap=pl.cm.seismic)

    # Plot also the test points
    grade_sig = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii] == 0]
    bumpy_sig = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii] == 0]
    grade_bkg = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii] == 1]
    bumpy_bkg = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii] == 1]

    plt.scatter(grade_sig, bumpy_sig, color="b", label="fast")
    plt.scatter(grade_bkg, bumpy_bkg, color="r", label="slow")
    plt.legend()
    # Feature 0 (grade) is on the x-axis and feature 1 (bumpiness) on the y-axis
    plt.xlabel("grade")
    plt.ylabel("bumpiness")

    plt.savefig("test.png")

Main.py

Main program

from prep_terrain_data import makeTerrainData
from class_vis import prettyPicture
from ClassifyNB import classify

import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

### the training data (features_train, labels_train) have both "fast" and "slow" points mixed
### in together--separate them so we can give them different colors in the scatterplot,
### and visually identify them
grade_fast = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii] == 0]
bumpy_fast = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii] == 0]
grade_slow = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii] == 1]
bumpy_slow = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii] == 1]

clf = classify(features_train, labels_train)

### draw the decision boundary with the test points overlaid
prettyPicture(clf, features_test, labels_test)

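Running the program produces the classification figure:

As the figure shows, not every point is classified correctly; a small fraction of the points end up on the wrong side of the boundary.

Computing the classification accuracy: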

accuracy.py

from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData
# NBAccuracy is defined in classify.py, which is not listed in this article
from classify import NBAccuracy

import matplotlib.pyplot as plt
import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()


def submitAccuracy():
    accuracy = NBAccuracy(features_train, labels_train, features_test, labels_test)
    return accuracy
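The article does not list classify.py, so the body of NBAccuracy is unknown. One plausible sketch, assuming it trains a GaussianNB on the training set and returns the fraction of test points labeled correctly, is shown below. The helper fraction_correct is a hypothetical name introduced here for illustration.

```python
def fraction_correct(predictions, labels):
    """Share of positions where the predicted label equals the true label."""
    matches = sum(1 for p, t in zip(predictions, labels) if p == t)
    return matches / len(labels)


def NBAccuracy(features_train, labels_train, features_test, labels_test):
    """Assumed behavior: train GaussianNB, then score it on the test set."""
    # Imported inside the function so the helper above has no dependencies.
    from sklearn.naive_bayes import GaussianNB

    clf = GaussianNB()
    clf.fit(features_train, labels_train)
    predictions = clf.predict(features_test)
    return fraction_correct(predictions, labels_test)
```

scikit-learn classifiers also provide clf.score(features_test, labels_test), which returns the same mean accuracy without a separate helper.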

Append the following to the end of Main.py:

from accuracy import submitAccuracy
print(submitAccuracy())

The resulting accuracy is 0.884.

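Strengths and weaknesses of Naive Bayes

Advantages: 1. It is very easy to implement. 2. It copes well with very large feature spaces. 3. It is simple to run and efficient.

Disadvantage: because it treats every word independently, it handles multi-word phrases poorly when the phrase means something clearly different from its individual words (e.g., "Chicago Bulls").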
