700字范文 > 【Python机器学习】决策树逻辑回归神经网络等模型对电信用户流失分类实战（附源码

【Python机器学习】决策树逻辑回归神经网络等模型对电信用户流失分类实战（附源码

时间：2019-10-05 16:35:36

需要源码和数据集请点赞关注收藏后评论区留言私信~~~

电信用户流失分类

该实例数据来自kaggle，它的每一条数据为一个用户的信息，共有21个有效字段，其中最后一个字段Churn标志该用户是否流失

1：数据初步分析

可用pandas的read_csv()函数来读取数据，用DataFrame的head()、shape、info()、duplicated()、nunique()等来初步观察数据。

用户信息可分为个人信息、服务订阅信息和帐单信息三类。

1）个人信息包括gender（性别）、SeniorCitizen（是否老年用户）、Partner（是否伴侣用户）和Dependents（是否亲属用户）。

2）服务订阅信息包括tenure（在网时长）、PhoneService（是否开通电话服务业务）、MultipleLines（多线业务服务：Yes，No或No phoneservice）、InternetService（互联网服务：No、DSL数字网络或光纤网络）、OnlineSecurity（网络安全服务：Yes、No或No internetserive）、OnlineBackup（在线备份业务服务：Yes、No或No internetserive）、DeviceProtection（设备保护业务服务：Yes、No或No internetserive）、TechSupport（技术支持服务：Yes、No或No internetserive）、StreamingTV（网络电视服务：Yes、No或No internetserive）、StreamingMovies（网络电影服务：Yes、No或No internetserive）。

3）帐单信息包括Contract（签订合同方式：月、一年或两年）、PaperlessBilling（是否开通电子账单）、PaymentMethod（付款方式：bank transfer、credit card、electronic check或mailed check）、MonthlyCharges（月费用）、TotalCharges（总费用）。

2.流失用户与非流失用户特征分析

1）对于用来描述分类的对象型特征的分布，可用统计图来直观显示。

部分代码如下

import matplotlib.pyplot as pltimport seaborn as snsfig, axes = plt.subplots(2, 2, figsize=(10, 8))sns.countplot(x='gender', data=df, hue='Churn', ax=axes[0][0])sns.countplot(x='SeniorCitizen', data=df, hue='Churn', ax=axes[0][1])sns.countplot(x='Partner', data=df, hue='Churn', ax=axes[1][0])sns.countplot(x='Dependents', data=df, hue='Churn', ax=axes[1][1])

2）对于数值型特征的分布，可用密度图来直观显示

实线表示流失用户，虚线表示非流失用户，可见新用户流失率要高一些

3：分类预测

数据的类型分为对象型和数值型两类。对象型是离散的类别数据，需要对它们进行编码才能形成训练模型的特征。

如果是二值的对象型数据，可以直接用0和1来对它们进行编码。如果取值类别个数多于2，一般可用独热编码。

对于需要进行距离计算的模型，一般还需要对数值型特征进行归一化处理或标准化处理。

经过上述处理后，采用保持法将训练样本切分为训练集和验证集，用来建模并验证模型。

各种方法的准确度及AUC如下

部分代码如下

#!/usr/bin/env python# coding: utf-8# ## 1.加载数据，初步观察# In[1]:import pandas as pdimport numpy as np# 观察各特征的类型，是否有缺失值df.info()# 各特征含义为：customerID：用户ID；# gender：性别（Female & Male）；# SeniorCitizen：老年用户（1表示是，0表示不是）；# Partner：伴侣用户（Yes or No）；# Dependents：亲属用户（Yes or No）；# tenure：在网时长（0-72月）；# PhoneService：是否开通电话服务业务（Yes or No）；# MultipleLines：是否开通了多线业务（Yes 、No or No phoneservice 三种）；# InternetService：是否开通互联网服务（No, DSL数字网络，fiber optic光纤网络三种）；# OnlineSecurity：是否开通网络安全服务（Yes，No，No internetserive 三种）；# OnlineBackup：是否开通在线备份业务（Yes，No，No internetserive 三种）；# DeviceProtection：是否开通了设备保护业务（Yes，No，No internetserive 三种）；# TechSupport：是否开通了技术支持服务（Yes，No，No internetserive 三种）；# StreamingTV：是否开通网络电视（Yes，No，No internetserive 三种）；# StreamingMovies：是否开通网络电影（Yes，No，No internetserive 三种）；# Contract：签订合同方式（按月，一年，两年）；# PaperlessBilling：是否开通电子账单（Yes or No）；# PaymentMethod：付款方式（bank transfer，credit card，electronic check，mailed check）；# MonthlyCharges：月费用；# TotalCharges：总费用；# Churn：该用户是否流失（Yes or No）。# In[6]:# 观察是否有重复值df.customerID.duplicated().sum()# In[7]:# 观察特征的取值情况df.nunique()# In[8]:# 观察下各对象型特征的取值print('gender : ', set(df['gender']))print('Partner : ', set(df['Partner']))print('Dependents : ', set(df['Dependents']))print('PhoneService : ', set(df['PhoneService']))print('MultipleLines : ', set(df['MultipleLines']))print('InternetService : ', set(df['InternetService']))print('OnlineSecurity : ', set(df['OnlineSecurity']))print('OnlineBackup : ', set(df['OnlineBackup']))print('DeviceProtection : ', set(df['DeviceProtection']))print('TechSupport : ', set(df['TechSupport']))print('StreamingTV : ', set(df['StreamingTV']))print('StreamingMovies : ', set(df['StreamingMovies']))print('Contract : ', set(df['Contract']))print('PaperlessBilling : ', set(df['PaperlessBilling']))print('PaymentMethod : ', set(df['PaymentMethod']))print('Churn : ', set(df['Churn']))# ## 2.流失用户与非流失用户特征分析# ### 2.1流失用户与非流失用户的个人信息对比# In[9]:import matplotlib.pyplot as pltimport seaborn as snsfig, axes = plt.subplots(2, 2, figsize=(10, 8))sns.countplot(x='gender', data=df, hue='Churn', ax=axes[0][0])sns.countplot(x='SeniorCitizen', data=df, hue='Churn', ax=axes[0][1])sns.countplot(x='Partner', data=df, hue='Churn', ax=axes[1][0])sns.countplot(x='Dependents', data=df, hue='Churn', ax=axes[1][1])# ### 2.2流失用户与非流失用户的服务订阅信息对比# In[10]:plt.rc('font', family='SimHei')plt.title("在网时长密度图")ax1 = sns.kdeplot(df[df['Churn'] == 'Yes']['tenure'], color='r', linestyle='-', label='Churn:Yes')ax1 = sns.kdeplot(df[df['Churn'] == 'No']['tenure'], color='b', linestyle='--', label='Churn:No')# In[11]:fig, axes = plt.subplots(3, 3, figsize=(10, 8))sns.countplot(x='PhoneService', data=df, hue='Churn', ax=axes[0][0])sns.countplot(x='MultipleLines', data=df, hue='Churn', ax=axes[0][1])sns.countplot(x='InternetService', data=df, hue='Churn', ax=axes[0][2])sns.countplot(x='OnlineSecurity', data=df, hue='Churn', ax=axes[1][0])sns.countplot(x='OnlineBackup', data=df, hue='Churn', ax=axes[1][1])sns.countplot(x='DeviceProtection', data=df, hue='Churn', ax=axes[1][2])sns.countplot(x='TechSupport', data=df, hue='Churn', ax=axes[2][0])sns.countplot(x='StreamingTV', data=df, hue='Churn', ax=axes[2][1])sns.countplot(x='StreamingMovies', data=df, hue='Churn', ax=axes[2][2])# ### 2.3流失用户与非流失用户帐单信息对比# In[12]:fig, axes = plt.subplots(1, 2, figsize=(10, 4))sns.countplot(x='Contract', data=df, hue='Churn', ax=axes[0])sns.countplot(x='PaperlessBilling', data=df, hue='Churn', ax=axes[1])# In[13]:sns.countplot(x='PaymentMethod', data=df, hue='Churn')# In[14]:plt.title("月费用密度图")ax1 = sns.kdeplot(df[df['Churn'] == 'Yes']['MonthlyCharges'], color='r', linestyle='-', label='Churn:Yes')ax1 = sns.kdeplot(df[df['Churn'] == 'No']['MonthlyCharges'], color='b', linestyle='--', label='Churn:No')# In[15]:# 尝试转化为数值df.TotalCharges = pd.to_numeric(df.TotalCharges, errors="raise")# In[16]:# 根据报错提示，观察下空格字符串的分布df[df.TotalCharges == " "]# In[17]:# 用0代替空字符串，重新尝试df['TotalCharges'] = df['TotalCharges'].replace(" ", 0)df.TotalCharges = pd.to_numeric(df.TotalCharges, errors="raise")# In[18]:plt.title("总费用密度图")ax1 = sns.kdeplot(df[df['Churn'] == 'Yes']['TotalCharges'], color='r', linestyle='-', label='Churn:Yes')ax1 = sns.kdeplot(df[df['Churn'] == 'No']['TotalCharges'], color='b', linestyle='--', label='Churn:No')# ## 3.分类预测# ### 3.1 编码，提取特征# In[19]:df_clu = df.drop(['Unnamed: 0', 'customerID', 'Churn'], axis=1)labels = df['Churn']# In[20]:# 二值对象型特征转换成数值型df_clu['gender'] = df_clu['gender'].replace('Male', 1).replace('Female', 0)df_clu['Partner'] = df_clu['Partner'].replace('Yes', 1).replace('No', 0)df_clu['Dependents'] = df_clu['Dependents'].replace('Yes', 1).replace('No', 0)df_clu['PhoneService'] = df_clu['PhoneService'].replace('Yes', 1).replace('No', 0)df_clu['PaperlessBilling'] = df_clu['PaperlessBilling'].replace('Yes', 1).replace('No', 0)labels = labels.replace('Yes', 1).replace('No', 0)# In[21]:# 离散的，可用距离度量的对象型特征转化为数值型df_clu['Contract'] = df_clu['Contract'].replace("Month-to-month", 1).replace("One year", 12).replace("Two year", 24)# In[22]:# 离散的，不宜用距离度量的特征用one-hot编码df_clu = pd.get_dummies(df_clu)df_clu.info()# In[23]:df_clu.max()# In[24]:# 数据归一化df_clu['tenure'] = ( df_clu['tenure'] - df_clu['tenure'].min() )/( df_clu['tenure'].max() - df_clu['tenure'].min() )df_clu['Contract'] = ( df_clu['Contract'] - df_clu['Contract'].min() )/( df_clu['Contract'].max() - df_clu['Contract'].min() )df_clu['MonthlyCharges'] = ( df_clu['MonthlyCharges'] - df_clu['MonthlyCharges'].min() )/( df_clu['MonthlyCharges'].max() - df_clu['MonthlyCharges'].min() )df_clu['TotalCharges'] = ( df_clu['TotalCharges'] - df_clu['TotalCharges'].min() )/( df_clu['TotalCharges'].max() - df_clu['TotalCharges'].min() )# ### 3.2 保持法切分训练集和验证集 # In[25]:from sklearn.model_selection import train_test_splitfrom sklearn.metrics import accuracy_score, classification_report, roc_auc_score# 将数据集分成训练集和验证集X_train, X_test, y_train, y_test = train_test_split(df_clu, labels, test_size=0.3, random_state = 1026)# ### 3.3 建模并验证 # In[26]:# 多项式朴素贝叶斯模型from sklearn.naive_bayes import MultinomialNBmodel = MultinomialNB()model.fit(X_train, y_train)predictions = model.predict(X_test)print('Test set accuracy score: ', accuracy_score(y_test, predictions))print('Area under the ROC curve: ', roc_auc_score(y_test, predictions))print(classification_report(y_test, predictions))# In[27]:# 高期朴素贝叶斯模型from sklearn.naive_bayes import GaussianNBmodel = GaussianNB()model.fit(X_train, y_train)predictions = model.predict(X_test)print('Test set accuracy score: ', accuracy_score(y_test, predictions))print('Area under the ROC curve: ', roc_auc_score(y_test, predictions))print(classification_report(y_test, predictions))# In[28]:# 逻辑回归模型from sklearn.linear_model import LogisticRegressionmodel = LogisticRegression(solver='liblinear', penalty='l1')'''solver：一个字符串，指定了求解最优化问题的算法，可以为如下的值:'newton-cg'：使用牛顿法。'lbfgs'：使用L-BFGS拟牛顿法。'liblinear' ：使用 liblinear。'sag'：使用 Stochastic Average Gradient descent 算法。'''model.fit(X_train, y_train)predictions = model.predict(X_test)print('Test set accuracy score: ', accuracy_score(y_test, predictions))print('Area under the ROC curve: ', roc_auc_score(y_test, predictions))print(classification_report(y_test, predictions))# In[29]:# 决策树模型from sklearn.tree import DecisionTreeClassifiermodel = DecisionTreeClassifier(random_state=1026)model.fit(X_train, y_train)predictions = model.predict(X_test)print('Test set accuracy score: ', accuracy_score(y_test, predictions))print('Area under the ROC curve: ', roc_auc_score(y_test, predictions))print(classification_report(y_test, predictions))# In[30]:# 决策树模型给出的特征重要性model.feature_importances_# In[31]:# 随机森林模型from sklearn.ensemble import RandomForestClassifiermodel = RandomForestClassifier(n_estimators=10, random_state=1026)model.fit(X_train, y_train)predictions = model.predict(X_test)print('Test set accuracy score: ', accuracy_score(y_test, predictions))print('Area under the ROC curve: ', roc_auc_score(y_test, predictions))print(classification_report(y_test, predictions))# In[32]:# 随机森林模型给出的特征重要性model.feature_importances_# In[33]:# 装袋决策树模型from sklearn.ensemble import BaggingClassifiermodel = BaggingClassifier(n_estimators=10, random_state=1026)model.fit(X_train, y_train)predictions = model.predict(X_test)print('Test set accuracy score: ', accuracy_score(y_test, predictions))print('Area under the ROC curve: ', roc_auc_score(y_test, predictions))print(classification_report(y_test, predictions))# In[34]:# 极端随机树模型from sklearn.ensemble import ExtraTreesClassifiermodel = ExtraTreesClassifier(n_estimators=10, random_state=1026)model.fit(X_train, y_train)predictions = model.predict(X_test)print('Test set accuracy score: ', accuracy_score(y_test, predictions))print('Area under the ROC curve: ', roc_auc_score(y_test, predictions))print(classification_report(y_test, predictions))# In[35]:# 梯度提升树模型from sklearn.ensemble import GradientBoostingClassifiermodel = GradientBoostingClassifier(n_estimators=10, random_state=1026)model.fit(X_train, y_train)predictions = model.predict(X_test)print('Test set accuracy score: ', accuracy_score(y_test, predictions))print('Area under the ROC curve: ', roc_auc_score(y_test, predictions))print(classification_report(y_test, predictions))# In[162]:# 多层全连接层神经网络模型-TensorFlow2框架下实现import tensorflow as tftf_model = tf.keras.Sequential([tf.keras.layers.Dense(100, activation='relu', input_shape=(38,), kernel_initializer='random_uniform'),tf.keras.layers.Dense(100, activation='relu', kernel_initializer='random_uniform'),tf.keras.layers.Dense(1, activation='sigmoid', kernel_initializer='random_uniform')])# In[163]:batch_size = 100 # 每批训练样本数（批梯度下降法）tf_epoch = pile(optimizer='adam', loss='mse', metrics=['accuracy'])tf_model.summary()tf_model.fit(np.array(X_train), np.array(y_train), validation_data=(np.array(X_test), np.array(y_test)), batch_size=batch_size, epochs=tf_epoch, verbose=1)# In[164]:predictions = tf_model.predict(X_test)predictions = list(np.round(predictions).reshape(-1,1)) # 四舍五入得到预测值print('Test set accuracy score: ', accuracy_score(y_test, predictions))print('Area under the ROC curve: ', roc_auc_score(y_test, predictions))print(classification_report(y_test, predictions))# In[173]:# 多层全连接层神经网络模型-MindSpore框架下实现import mindspore as mscldef construct(self, x):x = self.relu(self.fc1(x))x = self.relu(self.fc2(x))x = self.sigmoid(self.fc3(x))return xnet = ms_mode() # 实例化net_loss = ms.nn.loss.MSELoss() # 定义损失函数opt = ms.nn.Adam(params=net.trainable_params(), learning_rate=0.00005) # 定义优化方法ms_model = ms.Model(net, net_loss, opt) # 将网络结构、损失函数和优化方法进行关联# In[167]:yy.append([0, 1])else:yy.append([1, 0])yy = np.array(yy).astype(np.float32)# In[176]:class DatasetGenerator:def __init__(self, X, y):self.data = Xself.label = ydef __getitem__(self, index):return self.data[index], self.label[index]def __len__(self):return len(self.data)batch_size = 100 # 每批训练样本数rain.batch(batch_size)ds_train = ds_train.repeat(repeat_size)from mindspore.train.callback import LossMonitor, TimeMonitorloss_cb = LossMonitor(per_print_times=1)time_cb = TimeMonitor(data_size=ds_train.get_dataset_size())ms_epoch = 30ms_model.train(ms_epoch, ds_train, dataset_sink_mode=False, callbacks=[loss_cb, time_cb])# In[177]:# 预测predictions = []#xx = X_test.valuesfor i in range(len(X_test)):y_p = ms_model.predict(ms.Tensor([X_test[i]], ms.float32))predictions.append(y_p.asnumpy()[0][0]) # 直接取独热编码的第一个值predictions = roc_auc_score(y_test, predictions))print(classification_report(y_test, predictions))# In[ ]:

创作不易觉得有帮助请点赞关注收藏~~~

【Python机器学习】决策树逻辑回归神经网络等模型对电信用户流失分类实战（附源码和数据集）

本内容不代表本网观点和政治立场，如有侵犯你的权益请联系我们处理。

网友评论

网友评论仅供其表达个人看法，并不表明网站立场。