California Housing Price Prediction [Hands On ML] 2. An End-to-End Machine Learning Project (Part 8)


RandomizedSearchCV(cv=5, error_score=nan,
    estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
        criterion='mse', max_depth=None, max_features='auto',
        max_leaf_nodes=None, max_samples=None,
        min_impurity_decrease=0.0, min_impurity_split=None,
        min_samples_leaf=1, min_samples_split=2,
        min_weight_fraction_leaf=0.0, n_estimators=100,
        n_jobs=None, oob_score=Fals...
    iid='deprecated', n_iter=10, n_jobs=None,
    param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000017B8CEFFF98>,
                         'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000017B8C711BE0>},
    pre_dispatch='2*n_jobs', random_state=1, refit=True,
    return_train_score=False, scoring='neg_mean_squared_error',
    verbose=0)
search_result = rand_search.cv_results_
# Scores are negated MSE values, so flip the sign before taking the square root to get the RMSE
for mean_score, params in zip(search_result['mean_test_score'], search_result['params']):
    print(np.sqrt(-mean_score), params)

feature_importance = rand_search.best_estimator_.feature_importances_
sorted(zip(feature_importance, attributes), reverse=True)
[(0.3266343000845554, 'median_income'),
 (0.14655427173815663, 'INLAND'),
 (0.10393984611581725, 'pop_per_hhold'),
 (0.07978993056196858, 'bedrooms_per_room'),
 (0.07850738357218873, 'longitude'),
 (0.06969306682249354, 'latitude'),
 (0.05962901176048446, 'rooms_per_hhold'),
 (0.04273624791392784, 'housing_median_age'),
 (0.018322580791300728, 'total_rooms'),
 (0.017916659917498783, 'population'),
 (0.017041691466405204, 'total_bedrooms'),
 (0.015954810875574967, 'households'),
 (0.013358851743037617, '<1H OCEAN'),
 (0.0059438389065000225, 'NEAR OCEAN'),
 (0.003697096190985686, 'NEAR BAY'),
 (0.0002804115391046209, 'ISLAND')]
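The loop over cv_results_ takes np.sqrt(-mean_score) because the search uses scoring='neg_mean_squared_error': scikit-learn scorers follow a greater-is-better convention, so the MSE is reported with its sign flipped. The following is a minimal standard-library sketch of that conversion. The scores and parameter dictionaries are made-up illustrative values, not results from this search, and rmse_from_neg_mse is a hypothetical helper name.

```python
import math

# Hypothetical cv_results_-style data (illustrative values only, not real search output).
search_result = {
    "mean_test_score": [-2.5e9, -2.401e9, -3.6e9],  # negated MSE per candidate
    "params": [
        {"max_features": 7, "n_estimators": 180},
        {"max_features": 5, "n_estimators": 15},
        {"max_features": 3, "n_estimators": 72},
    ],
}


def rmse_from_neg_mse(score):
    """Convert a negated-MSE score back to an RMSE."""
    return math.sqrt(-score)


# Pair each candidate's RMSE with its hyperparameters.
rmse_by_candidate = [
    (rmse_from_neg_mse(score), params)
    for score, params in zip(search_result["mean_test_score"], search_result["params"])
]

# The best candidate has the highest (least negative) score, i.e. the lowest RMSE.
best_rmse, best_params = min(rmse_by_candidate, key=lambda pair: pair[0])

for rmse, params in rmse_by_candidate:
    print(rmse, params)
print("best:", best_rmse, best_params)
```

Selecting the minimum RMSE here picks the same candidate that the search itself would rank first, since maximizing the negated MSE and minimizing the RMSE are equivalent.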
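15.3 Ensemble Methods
Another way to fine-tune the system is to combine the best-performing models.
The combined model (the ensemble) usually performs better than the best individual model, just as a random forest performs better than the individual decision trees it is built from, especially when the individual models make different types of errors.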
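To illustrate why differing error types matter, here is a minimal sketch using only the standard library. It assumes two hypothetical models whose prediction errors are independent (simulated here as Gaussian noise around synthetic targets) and averages their predictions. None of the data comes from the housing dataset; it is purely illustrative.

```python
import math
import random

random.seed(42)

# Synthetic targets and two hypothetical models with independent errors (assumption).
y_true = [random.uniform(100_000, 500_000) for _ in range(2_000)]
pred_a = [y + random.gauss(0, 50_000) for y in y_true]
pred_b = [y + random.gauss(0, 50_000) for y in y_true]


def rmse(y, y_hat):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


# The simplest possible ensemble: average the two models' predictions.
pred_ensemble = [(a + b) / 2 for a, b in zip(pred_a, pred_b)]

rmse_a = rmse(y_true, pred_a)
rmse_b = rmse(y_true, pred_b)
rmse_ensemble = rmse(y_true, pred_ensemble)

print("model A RMSE: ", round(rmse_a))
print("model B RMSE: ", round(rmse_b))
print("ensemble RMSE:", round(rmse_ensemble))
```

Because the two error sources are independent, they partly cancel when averaged, so the ensemble RMSE comes out lower than either model's. If both models made the same mistakes, averaging would bring no gain. In scikit-learn, VotingRegressor applies the same averaging idea to fitted estimators.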
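16. Analyze the Best Model and Its Errors
Analyzing the best model often yields deeper insight into the problem.
For example, the RandomForestRegressor can indicate the relative importance of each attribute for making accurate predictions: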
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

array([6.92844433e-02, 6.58717797e-02, 4.39855401e-02, 1.58155272e-02,
       1.54980143e-02, 1.58108677e-02, 1.41614038e-02, 3.52323623e-01,
       4.41751467e-02, 1.10901558e-01, 8.01958553e-02, 3.59928828e-03,
       1.63144911e-01, 2.10262727e-04, 1.92134774e-03, 3.10043063e-03])
Display the importance scores next to their corresponding feature names:
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_one_hot_attribs = list(encoder.classes_)
cat_one_hot_attribs
# ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']

# The order must match the columns produced by the preparation pipeline
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
attributes

['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income',
 'rooms_per_hhold',
 'pop_per_hhold',
 'bedrooms_per_room',
 '<1H OCEAN',
 'INLAND',
 'ISLAND',
 'NEAR BAY',
 'NEAR OCEAN']
sorted(zip(feature_importances, attributes), reverse=True)

[(0.3523236234176724, 'median_income'),
 (0.16314491099438777, 'INLAND'),
 (0.11090155811701467, 'pop_per_hhold'),
 (0.08019585526690289, 'bedrooms_per_room'),
 (0.06928444332660065, 'longitude'),
 (0.06587177968295425, 'latitude'),
 (0.04417514666849209, 'rooms_per_hhold'),
 (0.04398554014357369, 'housing_median_age'),
 (0.015815527234205994, 'total_rooms'),
 (0.015810867735057542, 'population'),
 (0.015498014277829743, 'total_bedrooms'),
 (0.014161403758405241, 'households'),
 (0.0035992882775714432, '