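Naive Bayes is a supervised learning algorithm that directly models the probabilistic relationship between labels and features.
Classification principle: starting from the prior probability of each class, Bayes' theorem gives the posterior probability that an object belongs to that class; the object is assigned to the class with the largest posterior probability.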
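As a minimal sketch of the rule above, the snippet below applies Bayes' theorem to a made-up two-class problem. The class names, priors, and likelihood values are illustrative assumptions, not figures from any dataset.

```python
# Hypothetical priors P(class) and likelihoods P(observation | class);
# all numbers are invented for illustration.
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {"spam": 0.8, "ham": 0.1}


def posterior(priors, likelihoods):
    """Return P(class | observation) for every class via Bayes' theorem."""
    # Unnormalised posterior: prior * likelihood
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())  # P(observation)
    return {c: joint[c] / evidence for c in joint}


post = posterior(priors, likelihoods)
predicted = max(post, key=post.get)  # class with the largest posterior
print(post, predicted)
```

Even though "ham" has the larger prior here, the likelihood pulls the posterior towards "spam", which is why the decision is made on the posterior rather than on the prior alone.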
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed by the plotting helpers further down
from sklearn.datasets import load_breast_cancer, load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import ComplementNB

# Load the breast cancer dataset and hold out 20% of the samples for testing
cancer = load_breast_cancer()
data_train, data_test, target_train, target_test = train_test_split(cancer.data, cancer.target, test_size=0.2, random_state=0)
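Naive Bayes
The naive Bayes assumption: given the class, the features used for classification are conditionally independent of one another.
The parameters learned by naive Bayes are the prior probabilities and the conditional probabilities; both are usually estimated by maximum likelihood.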
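The estimation step described above can be sketched in plain Python. For categorical features, the maximum likelihood estimates are just relative frequencies. The toy dataset and the helper names `fit_counts` and `predict` are hypothetical and exist only for this illustration.

```python
from collections import Counter, defaultdict

# Toy categorical data (invented): each sample is (features, label)
samples = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "hot"), "yes"),
    (("sunny", "mild"), "yes"),
]


def fit_counts(samples):
    """Maximum likelihood estimates of P(y) and P(x_j = v | y) by relative frequency."""
    class_counts = Counter(label for _, label in samples)
    n = len(samples)
    priors = {c: k / n for c, k in class_counts.items()}
    # cond_counts[(j, v, c)] = number of class-c samples whose j-th feature equals v
    cond_counts = defaultdict(int)
    for features, label in samples:
        for j, v in enumerate(features):
            cond_counts[(j, v, label)] += 1
    cond = {key: k / class_counts[key[2]] for key, k in cond_counts.items()}
    return priors, cond


def predict(x, priors, cond):
    """Pick the class maximising P(y) * prod_j P(x_j | y)."""
    scores = {}
    for c, p in priors.items():
        score = p
        for j, v in enumerate(x):
            # A value never seen with class c gets probability 0 under plain MLE
            score *= cond.get((j, v, c), 0.0)
        scores[c] = score
    return max(scores, key=scores.get), scores


priors, cond = fit_counts(samples)
label, scores = predict(("rainy", "hot"), priors, cond)
print(priors, label, scores)
```

In this toy data "rainy" never occurs with class "no", so that class receives a score of exactly zero. This zero-count problem is what the smoothing parameter alpha of the scikit-learn classifiers is meant to address.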
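Gaussian Naive Bayes
Assumes that the conditional distribution of each feature given the class is Gaussian.
sklearn.naive_bayes.GaussianNB
class sklearn.naive_bayes.GaussianNB(*, priors=None, var_smoothing=1e-09)
priors: prior probabilities of the classes; if not specified, they are estimated from the data
var_smoothing: float, default 1e-9; the fraction of the largest feature variance that is added to all variances for numerical stability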
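To make the two parameters above concrete, here is a small sketch on an invented six-sample dataset. It inspects the fitted attributes `class_prior_` and `epsilon_`; the data values and the custom priors are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny synthetic dataset (invented values): two features, two classes
X = np.array([[1.0, 20.0], [1.2, 22.0], [0.8, 19.0],
              [3.0, 35.0], [3.3, 30.0], [2.9, 33.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Default: priors are estimated from the class frequencies in y
default_model = GaussianNB().fit(X, y)
print(default_model.class_prior_)

# Explicit priors override the empirical frequencies (they must sum to 1)
custom_model = GaussianNB(priors=[0.9, 0.1]).fit(X, y)
print(custom_model.class_prior_)

# epsilon_ is the absolute amount added to every variance:
# var_smoothing times the largest per-feature variance of X
expected_eps = 1e-9 * X.var(axis=0).max()
print(default_model.epsilon_, expected_eps)
```

Because the added amount scales with the largest feature variance, raising var_smoothing has a stronger effect on features whose own variance is small.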
GaussianNB_model = GaussianNB()
GaussianNB_model.fit(data_train, target_train)
train_score = GaussianNB_model.score(data_train, target_train)
print('train score:', train_score)
test_score = GaussianNB_model.score(data_test, target_test)
print('test score:', test_score)
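Multinomial Naive Bayes
Assumes that the conditional distribution of the features given the class is multinomial.
sklearn.naive_bayes.MultinomialNB
class sklearn.naive_bayes.MultinomialNB(*, alpha=1.0, fit_prior=True, class_prior=None)
alpha: float, the value of the additive smoothing parameter α
fit_prior: bool; if True, the class prior probabilities are learned from the data; if False, a uniform prior is used instead
class_prior: array, specifies the prior probability of each class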
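The effect of these three parameters can be checked on a tiny count matrix. This is a sketch with invented data; it reads the fitted attributes `class_log_prior_` and `feature_log_prob_`, which hold log probabilities.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Invented count data: 3 samples of class 0, 1 sample of class 1
X = np.array([[2, 1, 0], [3, 0, 1], [1, 2, 0], [0, 1, 4]])
y = np.array([0, 0, 0, 1])

learned = MultinomialNB(fit_prior=True).fit(X, y)
uniform = MultinomialNB(fit_prior=False).fit(X, y)
fixed = MultinomialNB(class_prior=[0.2, 0.8]).fit(X, y)

# class_log_prior_ stores log P(y); exponentiate to read it as a probability
print(np.exp(learned.class_log_prior_))   # empirical class frequencies
print(np.exp(uniform.class_log_prior_))   # same prior for every class
print(np.exp(fixed.class_log_prior_))     # the values passed in class_prior

# With alpha=1.0, a feature never seen in a class still gets a non-zero
# probability: (0 + alpha) / (5 + alpha * 3) for feature 0 in class 1
print(np.exp(learned.feature_log_prob_[1, 0]))
```

Setting alpha very close to zero would bring that last probability close to zero as well, which is the situation smoothing is intended to avoid.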
MultinomialNB_model = MultinomialNB()
MultinomialNB_model.fit(data_train, target_train)
train_score = MultinomialNB_model.score(data_train, target_train)
print('train score:', train_score)
test_score = MultinomialNB_model.score(data_test, target_test)
print('test score:', test_score)
Plot accuracy against the smoothing parameter alpha
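def test_MultinomialNB_alpha(*data):
    '''
    Test how the predictive performance of MultinomialNB varies with the alpha parameter.

    :param data: variable arguments; a tuple whose elements are, in order, the training samples, the test samples, the training labels, and the test labels
    :return: None
    '''
    X_train, X_test, y_train, y_test = data
    alphas = np.logspace(-2, 5, num=200)
    train_scores = []
    test_scores = []
    for alpha in alphas:
        cls = MultinomialNB(alpha=alpha)
        cls.fit(X_train, y_train)
        train_scores.append(cls.score(X_train, y_train))
        test_scores.append(cls.score(X_test, y_test))

    ## Plot
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(alphas, train_scores, label="Training Score")
    ax.plot(alphas, test_scores, label="Testing Score")
    ax.set_xlabel(r"$\alpha$")
    ax.set_ylabel("score")
    ax.set_ylim(0, 1.0)
    ax.set_title("MultinomialNB")
    ax.set_xscale("log")
    ax.legend(loc="best")  # show the labels passed to ax.plot
    plt.show()

test_MultinomialNB_alpha(data_train, data_test, target_train, target_test)  # call test_MultinomialNB_alpha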
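Bernoulli Naive Bayes Classifier
Assumes that each feature, given the class, follows a Bernoulli distribution (a binomial distribution with a single trial), that is, the features are binary-valued.
sklearn.naive_bayes.BernoulliNB
class sklearn.naive_bayes.BernoulliNB(*, alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)
alpha=1.0: float, the value of the smoothing parameter α
binarize: threshold used to binarize the features; if None, the input is assumed to consist of binary vectors already
Because this variant models binary features, the data must be binarized first.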
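The sketch below illustrates what the binarize threshold does: fitting with `binarize=threshold` should behave the same as thresholding the data by hand and fitting with `binarize=None`. The synthetic data, the labels, and the threshold of 0.5 are arbitrary assumptions.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))              # synthetic continuous features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic labels

threshold = 0.5  # arbitrary illustrative threshold

# Option 1: let BernoulliNB binarize internally
internal = BernoulliNB(binarize=threshold).fit(X, y)

# Option 2: binarize by hand (values strictly greater than the threshold become 1)
X_bin = (X > threshold).astype(float)
manual = BernoulliNB(binarize=None).fit(X_bin, y)

same = np.array_equal(internal.predict(X), manual.predict(X_bin))
print(same)
```

Handling the thresholding inside the estimator keeps the preprocessing tied to the model, so the same threshold is applied automatically at prediction time.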
BernoulliNB_model = BernoulliNB()
BernoulliNB_model.fit(data_train, target_train)
train_score = BernoulliNB_model.score(data_train, target_train)
print('train score:', train_score)
test_score = BernoulliNB_model.score(data_test, target_test)
print('test score:', test_score)
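Test how the predictive performance of BernoulliNB varies with the binarize parameter
As a rule of thumb, binarize can be set to (minimum over all features + maximum over all features) / 2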
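The rule of thumb above amounts to a one-line computation. A minimal sketch on an invented feature matrix follows; the variable name `binarize_guess` is hypothetical.

```python
import numpy as np

# Invented feature matrix purely for illustration
X = np.array([[0.2, 1.5, 3.0],
              [0.9, 2.5, 0.1],
              [1.4, 0.7, 2.2]])

# Midpoint between the global minimum and maximum over all features
binarize_guess = (X.min() + X.max()) / 2
print(binarize_guess)
```

The result could then be passed as BernoulliNB(binarize=binarize_guess). Since this is a single global threshold, it is likely to be more meaningful when the features share a similar scale.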
def test_BernoulliNB_binarize(*data):
    '''
    Test how the predictive performance of BernoulliNB varies with the binarize parameter.

    :param data: variable arguments; a tuple whose elements are, in order, the training samples, the test samples, the training labels, and the test labels
    :return: None
    '''
    X_train, X_test, y_train, y_test = data
    # Scan thresholds slightly beyond the observed range of feature values
    min_x = min(np.min(X_train.ravel()), np.min(X_test.ravel())) - 0.1
    max_x = max(np.max(X_train.ravel()), np.max(X_test.ravel())) + 0.1
    binarizes = np.linspace(min_x, max_x, endpoint=True, num=100)
    train_scores = []
    test_scores = []
    for binarize in binarizes:
        cls = BernoulliNB(binarize=binarize)
        cls.fit(X_train, y_train)
        train_scores.append(cls.score(X_train, y_train))
        test_scores.append(cls.score(X_test, y_test))

    ## Plot
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.plot(binarizes, train_scores, label="Training Score")
    ax.plot(binarizes, test_scores, label="Testing Score")
    ax.set_xlabel("binarize")
    ax.set_ylabel("score")
    ax.set_ylim(0, 1.0)
    ax.set_xlim(min_x - 1, max_x + 1)
    ax.set_title("BernoulliNB")
    ax.legend(loc="best")
    plt.show()

test_BernoulliNB_binarize(data_train, data_test, target_train, target_test)  # call test_BernoulliNB_binarize
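Complement Naive Bayes
sklearn.naive_bayes.ComplementNB
class sklearn.naive_bayes.ComplementNB(*, alpha=1.0, fit_prior=True, class_prior=None, norm=False)
An improvement on the multinomial naive Bayes algorithm that is better at capturing minority classes, which makes it suitable for imbalanced datasets.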
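One way to explore this claim is to compare minority-class recall of the two classifiers on imbalanced count data. The sketch below uses synthetic Poisson counts with invented rates and evaluates on the training data. It is only a starting point for experimentation, and the outcome will depend on the data.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.naive_bayes import ComplementNB, MultinomialNB

rng = np.random.default_rng(0)
# Synthetic imbalanced count data: 190 majority samples, 10 minority samples
X_major = rng.poisson(lam=[5, 3, 1, 1], size=(190, 4))
X_minor = rng.poisson(lam=[3, 3, 2, 2], size=(10, 4))
X = np.vstack([X_major, X_minor])
y = np.array([0] * 190 + [1] * 10)

recalls = {}
for cls in (MultinomialNB, ComplementNB):
    model = cls().fit(X, y)
    # Recall on the minority class (label 1), measured on the training data
    recalls[cls.__name__] = recall_score(y, model.predict(X), pos_label=1)
print(recalls)
```

Recall on the minority class is used instead of overall accuracy because, with 95% of the samples in one class, accuracy alone can look high even when the minority class is never predicted.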
ComplementNB_model = ComplementNB()
ComplementNB_model.fit(data_train, target_train)
train_score = ComplementNB_model.score(data_train, target_train)
print('train score:', train_score)
test_score = ComplementNB_model.score(data_test, target_test)
print('test score:', test_score)