Teaching My Girlfriend Data Mining: Titanic Survival Prediction (Part 10)


The accuracy for bagged KNN is: 0.835820895522
The cross validated score for bagged KNN is: 0.814889342867
model=BaggingClassifier(base_estimator=DecisionTreeClassifier(),random_state=0,n_estimators=100)
model.fit(train_X,train_Y)
prediction=model.predict(test_X)
print('The accuracy for bagged Decision Tree is:',metrics.accuracy_score(prediction,test_Y))
result=cross_val_score(model,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for bagged Decision Tree is:',result.mean())
The accuracy for bagged Decision Tree is: 0.824626865672
The cross validated score for bagged Decision Tree is: 0.820482635342
Boosting is an ensembling technique that strengthens a weak model step by step:
A model is first trained on the complete dataset. It will get some instances right and some wrong. In the next iteration, the learner focuses more on the wrongly predicted instances, giving them more weight so that it tries to get them right. This process repeats, adding a new weak learner each round.
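To make the reweighting idea concrete, here is a minimal sketch (my addition, not from the original notebook) that trains two decision stumps by hand on toy data: after the first round, the misclassified samples get larger sample weights, so the second stump pays more attention to them. The doubling rule used here is a simplification; real AdaBoost derives the update factor from the weighted error.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# toy data, only for illustration (not the Titanic features)
X_toy,y_toy=make_classification(n_samples=200,n_features=5,random_state=0)
weights=np.full(len(X_toy),1.0/len(X_toy))   # start with uniform weights

# round 1: fit a weak learner (a decision stump) on the full dataset
stump1=DecisionTreeClassifier(max_depth=1,random_state=0)
stump1.fit(X_toy,y_toy,sample_weight=weights)
wrong=stump1.predict(X_toy)!=y_toy           # instances it got wrong

# reweight: give the wrongly predicted instances more weight (simplified update)
weights[wrong]*=2.0
weights/=weights.sum()

# round 2: the next weak learner focuses more on the previously wrong instances
stump2=DecisionTreeClassifier(max_depth=1,random_state=0)
stump2.fit(X_toy,y_toy,sample_weight=weights)
print('round 1 errors:',wrong.sum(),'round 2 errors:',(stump2.predict(X_toy)!=y_toy).sum())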
2) AdaBoost (Adaptive Boosting)
In this case, the weak learner or estimator is a decision tree, but we can change this default to any algorithm of our choice (see the sketch after the output below).
from sklearn.ensemble import AdaBoostClassifier
ada=AdaBoostClassifier(n_estimators=200,random_state=0,learning_rate=0.1)
result=cross_val_score(ada,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for AdaBoost is:',result.mean())
The cross validated score for AdaBoost is: 0.824952616048
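As noted above, the default weak learner is a decision tree, but AdaBoost accepts any base classifier that supports sample weights. A small sketch of swapping it for logistic regression follows; this is my addition (the resulting score is not reproduced here), and on recent scikit-learn versions the base_estimator argument has been renamed to estimator.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

# swap the default decision-tree weak learner for logistic regression
# (use estimator= instead of base_estimator= on scikit-learn >= 1.2)
ada_lr=AdaBoostClassifier(base_estimator=LogisticRegression(max_iter=1000),
                          n_estimators=200,random_state=0,learning_rate=0.1)
result=cross_val_score(ada_lr,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for AdaBoost with Logistic Regression is:',result.mean())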
from sklearn.ensemble import GradientBoostingClassifier
grad=GradientBoostingClassifier(n_estimators=500,random_state=0,learning_rate=0.1)
result=cross_val_score(grad,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for Gradient Boosting is:',result.mean())
The cross validated score for Gradient Boosting is: 0.818286233118
We got the highest cross-validated score with AdaBoost. We will try to increase it with hyper-parameter tuning.
from sklearn.model_selection import GridSearchCV

# grid search over the number of estimators and the learning rate
n_estimators=list(range(100,1100,100))
learn_rate=[0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1]
hyper={'n_estimators':n_estimators,'learning_rate':learn_rate}
gd=GridSearchCV(estimator=AdaBoostClassifier(),param_grid=hyper,verbose=True)
gd.fit(X,Y)
print(gd.best_score_)
print(gd.best_estimator_)
Fitting 3 folds for each of 120 candidates, totalling 360 fits
[Parallel(n_jobs=1)]: Done 360 out of 360 | elapsed: 6.0min finished
0.83164983165
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.05, n_estimators=200, random_state=None)
So the highest accuracy we can get with AdaBoost is 83.16%, with n_estimators = 200 and learning_rate = 0.05.
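Since GridSearchCV refits the best parameter combination on all of X and Y by default (refit=True), the tuned model can also be reused directly from gd instead of being declared again; a small sketch under that assumption:

best_ada=gd.best_estimator_                 # AdaBoost refit with the best parameters
print(gd.best_params_)                      # {'learning_rate': 0.05, 'n_estimators': 200}
# note: test_X is part of X here, so this is only a sanity check, not an unbiased estimate
print('Accuracy on the test split:',metrics.accuracy_score(best_ada.predict(test_X),test_Y))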
3) Confusion Matrix for the Best Model
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# refit AdaBoost with the tuned parameters and plot the cross-validated confusion matrix
ada=AdaBoostClassifier(n_estimators=200,random_state=0,learning_rate=0.05)
result=cross_val_predict(ada,X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,result),cmap='winter',annot=True,fmt='2.0f')
plt.show()
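The heatmap shows raw counts; to also read precision and recall for both classes from the same cross-validated predictions, a short follow-up (my addition) can be printed:

from sklearn.metrics import classification_report

# `result` holds the out-of-fold predictions from cross_val_predict above
print(classification_report(Y,result,target_names=['Not Survived','Survived']))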
4) Feature Importance
from sklearn.ensemble import RandomForestClassifier
import xgboost as xg

f,ax=plt.subplots(2,2,figsize=(15,12))
model=RandomForestClassifier(n_estimators=500,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,0])
ax[0,0].set_title('Feature Importance in Random Forests')
model=AdaBoostClassifier(n_estimators=200,learning_rate=0.05,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,1],color='#ddff11')
ax[0,1].set_title('Feature Importance in AdaBoost')
model=GradientBoostingClassifier(n_estimators=500,learning_rate=0.1,random_state=0)
model.fit(X,Y)
# note: on recent pandas/matplotlib, cmap= may need to be replaced with a plain color=
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[1,0],cmap='RdYlGn_r')
ax[1,0].set_title('Feature Importance in Gradient Boosting')
model=xg.XGBClassifier(n_estimators=900,learning_rate=0.1)
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[1,1],color='#FD0F00')
ax[1,1].set_title('Feature Importance in XgBoost')
plt.show()
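To read exact numbers instead of bar lengths, the importances of the last fitted model (the XGBClassifier from the bottom-right panel) can also be printed; a small addition on top of the figure code above:

# `model` still refers to the fitted XGBClassifier from the last panel
importances=pd.Series(model.feature_importances_,X.columns).sort_values(ascending=False)
print(importances.round(3))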