ccc-sklearn-1-Decision Trees (Part 3)
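Steps 1 and 2 (importing the libraries and building the noisy sine training set) appeared in the previous part. For self-containedness, a minimal sketch of that setup, assuming the standard sklearn regression-tree example data:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

# Build a noisy sine dataset: 80 sorted points on [0, 5), with noise
# added to every fifth target value (as in the classic sklearn example).
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))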


Step 3: Instantiate models with max depths of 2 and 5
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_1.fit(X, y)
regr_2.fit(X, y)
Step 4: Create the test set and predict
# 500 evenly spaced test points on [0, 5), reshaped to a column vector
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)
Step 5: Plot the results
plt.figure()
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
As the plot shows, the regression tree learns local linear regressions that approximate the sine curve. If the maximum depth of the tree is set too high, the decision tree learns the training data too finely: it picks up many details, including the noise, which pulls the model away from the true sine curve and causes overfitting.
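One way to make the overfitting concrete is to compare training error at the two depths; a quick sketch, reusing the fitted regr_1 and regr_2 from above:

from sklearn.metrics import mean_squared_error

# Deeper trees drive the training MSE toward zero, but much of that
# gain comes from memorizing noise rather than fitting sin(x) better.
for name, model in [("max_depth=2", regr_1), ("max_depth=5", regr_2)]:
    print(name, "training MSE:", mean_squared_error(y, model.predict(X)))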
10. Predicting Titanic Survivors
Step 1: Import libraries and data
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

data = pd.read_csv("./data/data.csv")
data.info()
Step 2: Data preprocessing
Cabin, Name, and Ticket carry no useful signal here, so delete them!
data.drop(['Cabin', 'Name', 'Ticket'], inplace=True, axis=1)  # remember: axis=1 means columns
Note: understanding the axis parameter
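The original illustrates axis with a figure; a tiny toy example (the DataFrame df here is hypothetical) makes the same point:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(df.drop("a", axis=1))  # axis=1: drop the column named "a"
print(df.drop(0, axis=0))    # axis=0: drop the row with index 0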
Age is missing around 200 values, and it is a continuous feature, so it needs to be filled!
Embarked is missing only two values, so just drop those rows (see the sketch below)!
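A minimal sketch of the standard treatment for these two columns on this dataset (mean imputation is an assumption here; the median works just as well):

# Fill the continuous Age column with its mean so those rows are kept
data["Age"] = data["Age"].fillna(data["Age"].mean())

# Embarked is missing only two values: dropping those rows loses little
data = data.dropna()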
Convert the string-valued columns to numbers
# Encode each embarkation port as its position in the list of unique values
labels = data["Embarked"].unique().tolist()
data["Embarked"] = data["Embarked"].apply(lambda x: labels.index(x))
# Encode Sex as 1 for male, 0 for female
data["Sex"] = (data["Sex"] == "male").astype("int")
data.head()
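A note on this encoding choice: integer codes like these are fine for trees, which split on thresholds rather than assuming the codes carry any metric meaning; for models that do assume one, one-hot encoding (e.g. pd.get_dummies) is the safer default.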
Step 3: Split the data and run the model
x = data.iloc[:, data.columns != "Survived"]
y = data.iloc[:, data.columns == "Survived"]

from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(x, y, test_size=0.3)

# train_test_split shuffles the rows, so reset each split's index to 0..n-1
for i in [Xtrain, Xtest, Ytrain, Ytest]:
    i.index = range(i.shape[0])

clf = DecisionTreeClassifier(random_state=25)
clf = clf.fit(Xtrain, Ytrain)
score = clf.score(Xtest, Ytest)
score
Step 4: Plot the fit at different depths
tr = []
te = []
for i in range(10):
    clf = DecisionTreeClassifier(random_state=25, max_depth=i + 1)
    clf = clf.fit(Xtrain, Ytrain)
    score_tr = clf.score(Xtrain, Ytrain)
    score_te = cross_val_score(clf, x, y, cv=10).mean()
    tr.append(score_tr)
    te.append(score_te)
print(max(te))
plt.plot(range(1, 11), tr, color="red", label="train")
plt.plot(range(1, 11), te, color="blue", label="test")
plt.xticks(range(1, 11))
plt.legend()
plt.show()
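Reading the plot: the training curve (red) typically keeps rising as depth grows, while the cross-validated curve (blue) peaks at a shallow depth and then falls away. The widening gap between the two is the overfitting signature, and print(max(te)) reports the best cross-validated accuracy reached.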
Step 5: Grid search for the best parameters
import numpy as np

gini_thresholds = np.linspace(0, 0.5, 50)
parameters = {
    "criterion": ("gini", "entropy"),
    "splitter": ("best", "random"),
    "max_depth": [*range(1, 10)],
    "min_samples_leaf": [*range(1, 50, 5)],
    "min_impurity_decrease": [*gini_thresholds],
}
clf = DecisionTreeClassifier(random_state=25)
GS = GridSearchCV(clf, parameters, cv=10)
GS = GS.fit(Xtrain, Ytrain)
GS.best_params_
GS.best_score_
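Bear in mind the scale of this search: GridSearchCV is exhaustive, and the grid above has 2 × 2 × 9 × 10 × 50 = 18,000 parameter combinations, each fitted 10 times for cross-validation (180,000 fits in total), so it can take a while. Grid search must also pick a value from every list it is given; it cannot decide to drop a constraint entirely, so GS.best_score_ can come out below a hand-tuned model that simply leaves some parameters at their defaults.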
Summary of decision trees:
Advantages:
- Easy to understand and interpret: a decision tree is a white-box model.
- Requires very little data preparation. Many other algorithms need the data normalized, dummy variables created, missing values removed, and so on.
- The cost of using the tree (say, at prediction time) is logarithmic in the number of training data points, which is cheap compared with many other algorithms.
- Handles both numerical and categorical data, and can do both regression and classification.
- Handles multi-output problems (ones with several labels); note the difference from multi-class problems, where a single label takes one of several values.