Teaching My Girlfriend Data Mining: Titanic Survival Prediction (Part 9)


Interpreting the confusion matrix (looking at the first plot):
1) The correct predictions are 491 (died) + 247 (survived), giving an average CV accuracy of (491 + 247) / 891 = 82.8%.
2) The 58 and 95 off-diagonal counts are the cases the model got wrong (a sketch reproducing these numbers follows).
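To make these numbers concrete, here is a minimal sketch of how such a confusion matrix can be reproduced with cross-validated predictions; it assumes X and Y are the feature matrix and labels built in the earlier parts of this series:

from sklearn import svm
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions for the RBF SVM (10-fold CV, as an illustration)
y_pred = cross_val_predict(svm.SVC(kernel='rbf'), X, Y, cv=10)
cm = confusion_matrix(Y, y_pred)       # rows: true class, columns: predicted class
print(cm)
print(cm.trace() / float(cm.sum()))    # diagonal / total, e.g. (491+247)/891 = 0.828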
6.3 Hyperparameter Tuning
A machine learning model is like a black box. The box ships with default parameter values, which we can adjust or change to get a better model. Take C and gamma in the support vector machine, for example: these are called hyperparameters, and they can have a very large impact on the results.
from sklearn import svm                  # SVC lives here
from sklearn.model_selection import GridSearchCV

# X, Y: feature matrix and labels prepared in the earlier parts of this series
C = [0.05, 0.1, 0.2, 0.3, 0.25, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
gamma = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
kernel = ['rbf', 'linear']
hyper = {'kernel': kernel, 'C': C, 'gamma': gamma}
gd = GridSearchCV(estimator=svm.SVC(), param_grid=hyper, verbose=True)
gd.fit(X, Y)
print(gd.best_score_)
print(gd.best_estimator_)
Fitting 3 folds for each of 240 candidates, totalling 720 fits
[Parallel(n_jobs=1)]: Done 720 out of 720 | elapsed: 16.2s finished
0.828282828283
SVC(C=0.5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
from sklearn.ensemble import RandomForestClassifier

# Tune the number of trees in the Random Forest
n_estimators = range(100, 1000, 100)
hyper = {'n_estimators': n_estimators}
gd = GridSearchCV(estimator=RandomForestClassifier(random_state=0),
                  param_grid=hyper, verbose=True)
gd.fit(X, Y)
print(gd.best_score_)
print(gd.best_estimator_)
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 29.8s finished
0.817059483726
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
  max_depth=None, max_features='auto', max_leaf_nodes=None,
  min_impurity_split=1e-07, min_samples_leaf=1,
  min_samples_split=2, min_weight_fraction_leaf=0.0,
  n_estimators=900, n_jobs=1, oob_score=False, random_state=0,
  verbose=0, warm_start=False)
The best score for the RBF SVM is 82.82%, achieved with C=0.5 and gamma=0.1; for the Random Forest, the best score is about 81.7%, with n_estimators=900.
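Once the search has finished, the tuned model can be inspected and reused directly. A minimal sketch, where gd is the fitted GridSearchCV object from the SVM search above and test_X is the hold-out split used in earlier parts:

print(gd.best_params_)        # e.g. {'C': 0.5, 'gamma': 0.1, 'kernel': 'rbf'}

# By default GridSearchCV refits the best model on all of X, Y,
# so best_estimator_ can be used for prediction right away
best_svm = gd.best_estimator_
predictions = best_svm.predict(test_X)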
6.4 Ensembling
Ensembling is a good way to improve the accuracy and performance of a model. Simply put, it combines various simple models into one powerful model. The common families are listed below (a brief sketch of all three follows the list):
1) Random-Forest-style, parallel ensembles (bagging)
2) Boosting
3) Stacking
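For orientation, here is a small illustrative sketch of what each family looks like in scikit-learn; the class choices are just examples, and StackingClassifier requires scikit-learn >= 0.22, which is newer than the version the outputs in this series were produced with:

from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# 1) Parallel / bagging style: many trees fit independently, then averaged
bag = RandomForestClassifier(n_estimators=100, random_state=0)

# 2) Boosting: weak learners fit sequentially, each focusing on earlier errors
boost = AdaBoostClassifier(n_estimators=100, random_state=0)

# 3) Stacking: a meta-model learns how to combine the base models' predictions
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=0)),
                ('dt', DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression())

# All three expose the usual fit/predict interface, e.g. bag.fit(X, Y)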
1) Voting Classifier
This is the simplest way to combine the predictions of many different simple machine learning models. It gives an averaged prediction based on the predictions of all the sub-models.
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Combine the tuned models from above into one soft-voting ensemble
ensemble_lin_rbf = VotingClassifier(estimators=[
    ('KNN', KNeighborsClassifier(n_neighbors=10)),
    ('RBF', svm.SVC(probability=True, kernel='rbf', C=0.5, gamma=0.1)),
    ('RFor', RandomForestClassifier(n_estimators=500, random_state=0)),
    ('LR', LogisticRegression(C=0.05)),
    ('DT', DecisionTreeClassifier(random_state=0)),
    ('NB', GaussianNB()),
    ('svm', svm.SVC(kernel='linear', probability=True))
], voting='soft').fit(train_X, train_Y)
print('The accuracy for ensembled model is:', ensemble_lin_rbf.score(test_X, test_Y))
cross = cross_val_score(ensemble_lin_rbf, X, Y, cv=10, scoring="accuracy")
print('The cross validated score is', cross.mean())
The accuracy for ensembled model is: 0.824626865672
The cross validated score is 0.823766031097
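A note on the voting mode: with voting='soft' the ensemble averages the predicted class probabilities of the sub-models (which is why probability=True is set on the SVMs); voting='hard' would instead take a majority vote over the predicted labels.

2) Bagging

Bagging is a general ensemble method: it fits copies of the same base classifier on bootstrap samples of the training data and averages their predictions, which reduces variance. It works best with unstable, high-variance base models, so here it is applied to a KNN with a small n_neighbors.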
from sklearn import metrics
from sklearn.ensemble import BaggingClassifier

# Bagged KNN: 700 KNN models, each fit on a bootstrap sample of the training data
model = BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors=3),
                          random_state=0, n_estimators=700)
model.fit(train_X, train_Y)
prediction = model.predict(test_X)
print('The accuracy for bagged KNN is:', metrics.accuracy_score(prediction, test_Y))
result = cross_val_score(model, X, Y, cv=10, scoring='accuracy')
print('The cross validated score for bagged KNN is:', result.mean())
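Pairing bagging with n_neighbors=3 is a deliberate choice: a small-k KNN is flexible but high-variance, which is exactly the kind of base model that averaging 700 bootstrapped copies stabilises.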