California Housing Price Prediction [Hands-On ML] 2. An End-to-End Machine Learning Project (Part 6)


14. Cross-Validation
Use K-fold cross-validation to evaluate the decision tree model:
```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_label,
                         scoring='neg_mean_squared_error', cv=10)
tree_rmse_scores = np.sqrt(-scores)  # sklearn's scoring returns negative MSE
print(tree_rmse_scores)
print(tree_rmse_scores.mean())
print(tree_rmse_scores.std())
```

Output:

```
[71214.4929498  71929.79930468 70914.76077221 69550.72566912
 71042.25558966 67279.14165025 73061.35854347 71568.85256242
 71380.99149371 69504.81637098]
70744.71949063087
1522.0580698181013
```
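What `cv=10` does under the hood is split the training set into 10 folds, then train and evaluate the model 10 times, each time validating on a different fold. A minimal sketch of that loop using `KFold` on a small synthetic dataset (the data here is a stand-in for `housing_prepared` / `housing_label`):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# small synthetic regression data standing in for the housing set
rng = np.random.RandomState(42)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + rng.randn(100) * 0.1

kf = KFold(n_splits=10, shuffle=True, random_state=42)
rmse_scores = []
for train_idx, val_idx in kf.split(X):
    # retrain from scratch on each fold's training portion
    model = DecisionTreeRegressor(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    rmse_scores.append(np.sqrt(mean_squared_error(y[val_idx], pred)))

print(len(rmse_scores))                        # one RMSE per fold
print(np.mean(rmse_scores), np.std(rmse_scores))
```

`cross_val_score` does exactly this kind of loop for you (with cloning and scoring handled internally), which is why it returns an array of 10 scores above.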
Now try cross-validation with the linear regression model:
```python
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_label,
                             scoring='neg_mean_squared_error', cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
print(lin_rmse_scores)
print(lin_rmse_scores.mean())
print(lin_rmse_scores.std())
```

Output:

```
[70987.24786319 66375.29508519 73837.53789445 69493.59584642
 69821.05544742 69047.06162451 65908.72602507 66979.33032669
 73036.00622233 67077.50225384]
69256.33585891136
2610.121268165482
```
Next, try the random forest model:
```python
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_label,
                                scoring='neg_mean_squared_error', cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
print(forest_rmse_scores)
print(forest_rmse_scores.mean())
print(forest_rmse_scores.std())
```

Output:

```
[51968.86058788 47122.75805482 48941.3492676  50877.99429489
 51200.95320051 49198.87467112 49401.27484477 48418.53618115
 53788.16232918 49438.88539583]
50035.76488277516
1834.1856707471993
```
The random forest's error is smaller than that of the previous two models, so it looks like a promising choice.
Before diving deeper into random forests, you should try out other types of machine learning models (support vector machines with different kernels, neural networks, and so on), without spending too much time tweaking hyperparameters. The goal is to shortlist a few (two to five) promising models.
Tip: save every model you experiment with, so you can come back to it later.
Make sure to save both the hyperparameters and the trained parameters, as well as the cross-validation scores and the actual predictions.
This lets you compare scores across model types, and compare the kinds of errors they make.
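One simple way to keep those comparisons organized is to collect each model's cross-validation RMSEs into a dictionary keyed by model name. A minimal sketch on synthetic stand-in data (the variable names and the two-model shortlist here are illustrative, not from the original notebook):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in for housing_prepared / housing_label
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.randn(200) * 0.1

results = {}
for name, model in [("lin_reg", LinearRegression()),
                    ("tree_reg", DecisionTreeRegressor(random_state=0))]:
    scores = cross_val_score(model, X, y,
                             scoring='neg_mean_squared_error', cv=10)
    rmse = np.sqrt(-scores)
    # keep the per-fold scores too, not just the summary statistics
    results[name] = {"mean": rmse.mean(), "std": rmse.std(), "scores": rmse}

for name, r in results.items():
    print(f"{name}: RMSE {r['mean']:.3f} ± {r['std']:.3f}")
```

Keeping the per-fold arrays (rather than only the means) is what lets you later ask whether one model's apparent advantage is larger than the fold-to-fold noise.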
You can save Scikit-Learn models very conveniently with Python's pickle module, or with sklearn.externals.joblib, which is more efficient at serializing large NumPy arrays:
```python
from sklearn.externals import joblib  # in scikit-learn >= 0.23, use `import joblib` instead

joblib.dump(forest_reg, "my_forest.pkl")
my_forest_model = joblib.load("my_forest.pkl")
```
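The pickle alternative mentioned above works the same way: it serializes the entire fitted estimator, and the restored object predicts identically. A minimal sketch (the tiny linear model and file name here are illustrative):

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# fit a small model to round-trip through pickle
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1.0
model = LinearRegression().fit(X, y)

# serialize the whole fitted estimator, then load it back
with open("my_model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("my_model.pkl", "rb") as f:
    restored = pickle.load(f)

print(np.allclose(model.predict(X), restored.predict(X)))  # True
```

joblib is preferred for models like random forests whose internals are large NumPy arrays; for small models the two are interchangeable.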
15. Fine-Tuning the Model
Suppose you now have a shortlist of promising models. The next step is to fine-tune them.
15.1 Grid Search
You can have Scikit-Learn's GridSearchCV do this search for you:
```python
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_label)
```

Output:

```
GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                             criterion='mse', max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             max_samples=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=100, n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid=[{'max_features': [2, 4, 6, 8],
                          'n_estimators': [3, 10, 30]},
                         {'bootstrap': [False], 'max_features': [2, 3, 4],
                          'n_estimators': [3, 10]}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='neg_mean_squared_error', verbose=0)
```
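Once fitting finishes, the best hyperparameter combination and the score for every combination tried can be read off the fitted `GridSearchCV` object via `best_params_` and `cv_results_`. A minimal sketch on synthetic stand-in data (a smaller grid than above, to keep it fast):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for housing_prepared / housing_label
rng = np.random.RandomState(42)
X = rng.rand(100, 4)
y = X @ np.array([1.0, 2.0, 0.5, -1.0]) + rng.randn(100) * 0.1

param_grid = [{'n_estimators': [3, 10], 'max_features': [2, 4]}]
grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                           param_grid, cv=3,
                           scoring='neg_mean_squared_error')
grid_search.fit(X, y)

print(grid_search.best_params_)  # the winning combination
# RMSE for each of the 2 x 2 = 4 combinations evaluated
for mean_score, params in zip(grid_search.cv_results_["mean_test_score"],
                              grid_search.cv_results_["params"]):
    print(np.sqrt(-mean_score), params)
```

Because `refit=True` by default, `grid_search.best_estimator_` is already retrained on the full training set with the winning hyperparameters, ready to use directly.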