
How to Build and Evaluate a Multiclass Prediction Model Based on Logistic Regression

2024-04-26 18:52 · Category: 《随便一记》


In this exercise we use the Human Activity Recognition Using Smartphones dataset. It was collected from sensors in the participants' smartphones while they carried out their activities of daily living (ADL). The goal is to classify each activity into one of six classes (walking, walking upstairs, walking downstairs, sitting, standing, and laying).

The dataset is also available on Kaggle: https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones/downloads/human-activity-recognition-with-smartphones.zip

Rename the training file to Human_Activity_Recognition_Using_Smartphones_Data.csv

Step 1: Import the data

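  • Check the data types; with this many columns it is best to use value_counts
  • Determine whether the floating-point values need to be scaled
  • Examine the breakdown of activity types in the data
  • Encode the activity labels as integers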
import pandas as pd
import numpy as np
filepath = 'Human_Activity_Recognition_Using_Smartphones_Data.csv'
data = pd.read_csv(filepath)

All columns are floating point except the integer subject column and the activity label column.

data.dtypes.value_counts()
float64    561
int64        1
object       1
dtype: int64
data.dtypes.tail()
angle(X,gravityMean)    float64
angle(Y,gravityMean)    float64
angle(Z,gravityMean)    float64
subject                   int64
Activity                 object
dtype: object

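The feature values are already scaled to the range -1 to 1. (The lone 1.000000 minimum and 30.000000 maximum in the output below come from the integer subject column, which iloc[:, :-1] does not exclude.)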

data.iloc[:, :-1].min().value_counts()
-1.000000    466
-0.995377      2
-0.999996      2
-0.999893      2
-1.000000      2
            ...
-0.999983      1
-0.943439      1
-0.998014      1
-0.999915      1
 1.000000      1
Length: 93, dtype: int64
data.iloc[:, :-1].max().value_counts()
1.000000     452
0.994731       2
0.805064       1
0.908361       1
0.891736       1
            ...
0.990935       1
0.979031       1
0.928416       1
0.848031       1
30.000000      1
Length: 110, dtype: int64
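To check the scaling of the sensor features alone, one option is to select only the float columns before taking the minimum and maximum. The sketch below uses a small made-up table (the `toy` frame and its `feat_a`/`feat_b` columns are hypothetical stand-ins, not the real dataset):

```python
import pandas as pd

# Hypothetical stand-in for the real table: two scaled float features,
# an integer subject id, and a string activity label.
toy = pd.DataFrame({
    "feat_a": [-1.0, 0.25, 1.0],
    "feat_b": [-0.5, 0.0, 0.75],
    "subject": [1, 15, 30],
    "Activity": ["LAYING", "SITTING", "WALKING"],
})

# Keep only the float columns so the subject id and the label are ignored.
float_cols = toy.select_dtypes(include="float64").columns

overall_min = toy[float_cols].min().min()
overall_max = toy[float_cols].max().max()
in_range = bool(overall_min >= -1.0 and overall_max <= 1.0)

print(list(float_cols), overall_min, overall_max, in_range)
```

Applied to the real data, the same select_dtypes call would leave out subject and Activity, so the range check would cover only the 561 float features.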

Examine the breakdown of activity types in the data; the classes are already fairly balanced.

data.Activity.value_counts()
LAYING                1407
STANDING              1374
SITTING               1286
WALKING               1226
WALKING_UPSTAIRS      1073
WALKING_DOWNSTAIRS     986
Name: Activity, dtype: int64

Scikit-learn classifiers do not accept a sparse matrix as the prediction column, so LabelEncoder can be used to encode the activity labels as integers.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['Activity'] = le.fit_transform(data.Activity)
data['Activity'].sample(5)
2098    1
5517    1
6369    0
2210    5
4029    1
Name: Activity, dtype: int32
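The integer codes are easier to read alongside the label they stand for. As a minimal sketch (the `le_demo` encoder and the short `activities` list are illustrative, separate from the `le` fitted above), LabelEncoder assigns codes in sorted label order and exposes the mapping through `classes_`:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative list containing the six activity names (one repeated).
activities = [
    "WALKING", "LAYING", "SITTING", "STANDING",
    "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS", "LAYING",
]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(activities)

# classes_ is sorted, and the position of each label is its integer code.
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform turns integer codes back into the original labels.
decoded = list(le_demo.inverse_transform(codes))
print(decoded == activities)
```

With this ordering, code 0 is LAYING and code 5 is WALKING_UPSTAIRS, which helps when reading the confusion matrices later on.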

Step 2: Split the data into training and test sets

Consider using Scikit-learn's StratifiedShuffleSplit so that the proportion of each class in the resulting splits matches that of the full dataset.
# All columns except the Activity label (note that this also keeps the subject column)
feature_cols = data.columns[:-1]
from sklearn.model_selection import StratifiedShuffleSplit
# Get the split indexes
strat_shuf_split = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(strat_shuf_split.split(data[feature_cols], data.Activity))
# Create the dataframes
X_train = data.loc[train_idx, feature_cols]
y_train = data.loc[train_idx, 'Activity']
X_test  = data.loc[test_idx, feature_cols]
y_test  = data.loc[test_idx, 'Activity']
y_train.value_counts(normalize=True)
0    0.191411
2    0.186941
1    0.174893
3    0.166731
5    0.145939
4    0.134085
Name: Activity, dtype: float64
y_test.value_counts(normalize=True)
0    0.191296
2    0.186763
1    0.174977
3    0.166818
5    0.145966
4    0.134180
Name: Activity, dtype: float64
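The near-identical proportions above are what stratification is meant to produce. A small self-contained sketch on made-up labels (the array `y` with a 60/30/10 class mix and the placeholder `X` are assumptions for illustration) shows the same effect and confirms that the two index sets do not overlap:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up labels with a deliberately uneven 60/30/10 class mix.
y = np.array([0] * 60 + [1] * 30 + [2] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(X, y))

# Share of each class within the train and test parts.
train_share = np.bincount(y[train_idx]) / len(train_idx)
test_share = np.bincount(y[test_idx]) / len(test_idx)
print(train_share, test_share)

# The two index sets are disjoint and together cover every row.
overlap = set(train_idx) & set(test_idx)
print(len(overlap), len(train_idx) + len(test_idx))
```

The split returns positional indices. Using them with data.loc, as above, works when the DataFrame still has its default 0..n-1 index; data.iloc would be the safer choice otherwise.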

Step 3: Train the models

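Train a basic logistic regression model with default parameters on all features. Then train one model each with L1 and L2 regularization, using cross-validation to choose the hyperparameter value. Note that the regularized models, the L1 model in particular, can take some time to train.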
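# Fill in your code here (train a basic logistic regression model)
from sklearn.linear_model import LogisticRegression
# Note: C=1e9 is not the default (C=1.0); a value this large makes the penalty negligible,
# so this model is effectively unregularized.
lr = LogisticRegression(C=1e9)
lr.fit(X_train, y_train)
y1_pred_class = lr.predict(X_test)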
from sklearn.linear_model import LogisticRegressionCV
# Logistic regression with L1 regularization
lr_l1 = LogisticRegressionCV(Cs=10, cv=4, penalty='l1', solver='liblinear').fit(X_train, y_train)
# Fill in your code here (logistic regression with L2 regularization)
lr_l2 = LogisticRegressionCV(Cs=10, cv=4, penalty='l2', solver='liblinear').fit(X_train, y_train)
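After fitting, it can be useful to see which regularization strength the cross-validation selected. The following is a rough sketch on synthetic data rather than the activity dataset (`X_demo`, `y_demo` and `cv_model` are made-up names, and a binary problem is used only to keep the example small). With `Cs=10`, LogisticRegressionCV builds a grid of ten candidate values, stored in `Cs_`, and keeps the chosen one in `C_`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Small synthetic binary problem, used purely for illustration.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, n_classes=2, random_state=0
)

cv_model = LogisticRegressionCV(
    Cs=10, cv=4, penalty="l1", solver="liblinear"
).fit(X_demo, y_demo)

# Cs_ holds the candidate grid; C_ holds the value chosen by cross-validation.
chosen_C = float(np.ravel(cv_model.C_)[0])
print(cv_model.Cs_)
print(chosen_C)
```

C is the inverse of the regularization strength, so a small selected value means the cross-validation preferred a heavily penalized model.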

Step 4: Compare the model coefficients

Print the coefficient of each feature for the three models trained above, and plot them to compare the differences (one plot per class).
# Fill in your code here (print the feature coefficients learned by each model)
print("lr_coef_:", lr.coef_)
print("lr_l1_coef_:", lr_l1.coef_)
print("lr_l2_coef_:", lr_l2.coef_)
lr_coef_: [[-4.12196415e-01  5.06623963e-02  2.03181674e-01 ... -2.54567346e+00
  -1.26940147e+00 -1.09109777e+00]
 [-1.97604207e-01  4.95823843e-02  1.21677586e-01 ... -1.58853770e+00
  -5.62850705e-01 -7.64842812e-01]
 [-6.81924154e-02 -1.37914996e-03 -2.55900298e-04 ...  2.33229351e+00
   4.40336018e-01 -8.25917629e-01]
 [ 1.07887742e-01 -5.50986235e-02 -7.41960666e-02 ...  4.70983186e-01
   4.47067484e-01  9.09309679e-01]
 [ 5.21966134e-01  8.24998941e-02 -7.66338173e-04 ... -8.27678238e-03
   1.19138181e-01  9.06582221e-01]
 [ 4.81391614e-02 -1.26266901e-01 -2.49640955e-01 ...  1.33921125e+00
   8.25710489e-01  8.65966313e-01]]
lr_l1_coef_: [[ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
   0.00000000e+00 -1.37423471e-02]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
   0.00000000e+00  4.38176896e-02]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
   0.00000000e+00 -5.13700279e-02]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
   0.00000000e+00 -1.77831740e-03]
 [ 7.03980012e-01  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
   0.00000000e+00  2.01422949e-04]
 [-2.02008109e+00 -8.34909177e+00 -3.29578262e+00 ...  4.02040351e+00
   3.20405969e+00  1.82634113e-02]]
lr_l2_coef_: [[-1.03725936e-01  5.50632453e-03  6.33482873e-02 ... -3.88543365e-01
  -1.92209170e-01 -5.84902049e-02]
 [ 1.05758725e-01 -1.17717948e-01 -3.03955359e-01 ... -2.31378407e+00
  -2.32401778e-01  3.43900793e-02]
 [ 8.61767183e-03  2.40843471e-01  1.25792887e-01 ...  2.54368464e+00
   3.31422840e-01 -4.26759836e-02]
 [-3.09551146e-01 -1.46430474e-01  2.88589498e-01 ... -4.62980616e-01
   1.34359794e-01  1.33067417e-03]
 [ 6.71720746e-01  1.43032935e-01  2.24426200e-01 ... -4.58979021e-01
  -1.02600733e-01  8.06089631e-03]
 [-4.68800629e-01 -7.34170993e-01 -8.65869923e-01 ...  1.86878293e+00
   7.99875615e-01  1.58405200e-02]]
# Fill in your code here (draw 6 plots)
import matplotlib.pyplot as plt
plt.figure(1, figsize=(30, 10))
# One subplot per class; each overlays the coefficients of the three models
for i in range(6):
    plt.subplot(2, 3, i + 1)
    plt.scatter(feature_cols, lr.coef_[i])
    plt.scatter(feature_cols, lr_l1.coef_[i])
    plt.scatter(feature_cols, lr_l2.coef_[i])
[Figure: six scatter plots, one per class, overlaying the coefficients of the three models]
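A numeric complement to the plots is to count how many coefficients each model leaves non-zero for each class, since L1 regularization tends to drive many of them exactly to zero. The helper below is a hypothetical sketch (`nonzero_per_class` is not part of scikit-learn) and is demonstrated on small made-up matrices rather than the fitted models:

```python
import numpy as np

def nonzero_per_class(coef, tol=0.0):
    """Count coefficients with magnitude above tol in each row (one row per class)."""
    return (np.abs(coef) > tol).sum(axis=1)

# Made-up coefficient matrices: 2 classes x 4 features.
dense = np.array([[0.4, -0.1, 0.2, 0.9],
                  [-0.3, 0.5, -0.7, 0.1]])
sparse = np.array([[0.0, 0.0, 0.7, 0.0],
                   [-2.0, 0.0, 0.0, 0.01]])

print(nonzero_per_class(dense))   # [4 4]
print(nonzero_per_class(sparse))  # [1 2]
```

Calling the same helper on lr.coef_, lr_l1.coef_ and lr_l2.coef_ would give one count per class for each model.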

Step 5: Make predictions

Save the predicted class and the predicted probabilities from each model.
# Fill in your code here
lr_pred_prob = lr.predict_proba(X_test)
lr_l1_pred_prob = lr_l1.predict_proba(X_test)
lr_l2_pred_prob = lr_l2.predict_proba(X_test)
array([[9.99999993e-001, 7.42909015e-009, 4.11242095e-022,
        1.72199687e-110, 7.92318349e-117, 1.25352049e-118],
       [2.28654820e-039, 1.12754892e-032, 1.49849650e-030,
        3.04091495e-009, 7.21093140e-006, 9.99992786e-001],
       [1.01631928e-023, 9.99517697e-001, 4.82303397e-004,
        1.14668778e-090, 5.97293981e-097, 7.29740533e-101],
       ...,
       [5.82445006e-025, 1.57488491e-004, 9.99842512e-001,
        2.47445986e-094, 1.43492169e-100, 5.22894472e-103],
       [5.07457042e-028, 9.99375236e-001, 6.24763971e-004,
        9.56127218e-090, 4.24679878e-094, 1.04128353e-097],
       [1.07819125e-062, 3.31123144e-052, 1.86904909e-049,
        9.99999819e-001, 1.81046086e-007, 1.23086753e-011]])
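One possible way to keep the predicted class and its probability side by side is a small DataFrame per model. The sketch below uses synthetic data and a plain LogisticRegression (`X_demo`, `y_demo`, `model` and the `results` column names are all assumptions for illustration). It also shows that each row of predict_proba sums to 1 and that the predicted class is the one with the highest probability:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class problem, for illustration only.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, n_classes=3, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

proba = model.predict_proba(X_demo)  # shape (n_samples, n_classes)
pred = model.predict(X_demo)         # shape (n_samples,)

# Store the predicted class next to the probability assigned to it.
results = pd.DataFrame({"pred_class": pred, "max_prob": proba.max(axis=1)})
print(results.head())

# The predicted class is the column of predict_proba with the largest value.
same = bool((model.classes_[proba.argmax(axis=1)] == pred).all())
print(same)
```

The same pattern could be repeated for lr, lr_l1 and lr_l2, giving one results table per model.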

Step 6: Evaluate the models

For each model, compute the following evaluation metrics:

  • accuracy
  • precision
  • recall
  • fscore
  • confusion matrix
# Fill in your code here
from sklearn import metrics
# Basic logistic regression model
# Note: with average="micro" in a single-label multiclass problem,
# precision, recall and F-score all equal the accuracy.
print("confusion matrix:", metrics.confusion_matrix(y_test, y1_pred_class))
print("accuracy:", metrics.accuracy_score(y_test, y1_pred_class))
print("precision:", metrics.precision_score(y_test, y1_pred_class, average="micro"))
print("recall:", metrics.recall_score(y_test, y1_pred_class, average="micro"))
print("fscore:", metrics.f1_score(y_test, y1_pred_class, average="micro"))
confusion matrix: [[422   0   0   0   0   0]
 [  0 366  20   0   0   0]
 [  0  15 397   0   0   0]
 [  0   0   0 368   0   0]
 [  0   0   0   0 295   1]
 [  0   0   0   2   0 320]]
accuracy: 0.9827742520398912
precision: 0.9827742520398912
recall: 0.9827742520398912
fscore: 0.9827742520398912
# Logistic regression model with L1 regularization
y2_pred_class = lr_l1.predict(X_test)
print("confusion matrix:", metrics.confusion_matrix(y_test, y2_pred_class))
print("accuracy:", metrics.accuracy_score(y_test, y2_pred_class))
print("precision:", metrics.precision_score(y_test, y2_pred_class, average="micro"))
print("recall:", metrics.recall_score(y_test, y2_pred_class, average="micro"))
print("fscore:", metrics.f1_score(y_test, y2_pred_class, average="micro"))
confusion matrix: [[422   0   0   0   0   0]
 [  0 370  16   0   0   0]
 [  0  15 397   0   0   0]
 [  0   0   0 368   0   0]
 [  0   0   0   0 296   0]
 [  0   0   0   0   1 321]]
accuracy: 0.985494106980961
precision: 0.985494106980961
recall: 0.985494106980961
fscore: 0.985494106980961
# Logistic regression model with L2 regularization
y3_pred_class = lr_l2.predict(X_test)
print("confusion matrix:", metrics.confusion_matrix(y_test, y3_pred_class))
print("accuracy:", metrics.accuracy_score(y_test, y3_pred_class))
print("precision:", metrics.precision_score(y_test, y3_pred_class, average="micro"))
print("recall:", metrics.recall_score(y_test, y3_pred_class, average="micro"))
print("fscore:", metrics.f1_score(y_test, y3_pred_class, average="micro"))
confusion matrix: [[422   0   0   0   0   0]
 [  0 368  18   0   0   0]
 [  0  13 399   0   0   0]
 [  0   0   0 368   0   0]
 [  0   0   0   0 296   0]
 [  0   0   0   1   0 321]]
accuracy: 0.985494106980961
precision: 0.985494106980961
recall: 0.985494106980961
fscore: 0.985494106980961
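The micro-averaged scores above are identical to the accuracy because micro averaging pools every prediction across classes. Per-class and macro-averaged values can be read straight off a confusion matrix. The sketch below does this with NumPy for the confusion matrix printed for the basic model, relying on scikit-learn's convention that rows are true classes and columns are predicted classes:

```python
import numpy as np

# Confusion matrix of the basic model, copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([
    [422,   0,   0,   0,   0,   0],
    [  0, 366,  20,   0,   0,   0],
    [  0,  15, 397,   0,   0,   0],
    [  0,   0,   0, 368,   0,   0],
    [  0,   0,   0,   0, 295,   1],
    [  0,   0,   0,   2,   0, 320],
])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # column sums: how often each class was predicted
recall = true_pos / cm.sum(axis=1)     # row sums: how often each class actually occurred

micro = true_pos.sum() / cm.sum()      # pooled over all classes, equal to accuracy
macro_precision = precision.mean()     # unweighted mean over classes
macro_recall = recall.mean()

print(np.round(precision, 4), np.round(recall, 4))
print(micro, macro_precision, macro_recall)
```

The per-class values show that nearly all the errors fall between classes 1 and 2 (SITTING and STANDING under the alphabetical encoding), which the single micro-averaged number hides. Passing average="macro" or average=None to the scikit-learn metric functions would report the same kind of breakdown.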

Article link: http://zhangshiyu.com/post/100802.html
