California Housing Price Prediction [Hands On ML] 2. An End-to-End Machine Learning Project (Part 9)

(tail of the sorted feature-importance list, continued from the previous part)

```
 ...'<1H OCEAN'),
 (0.0031004306281128724, 'NEAR OCEAN'),
 (0.0019213477443202828, 'NEAR BAY'),
 (0.00021026272689858585, 'ISLAND')]
```
With this information you can drop some of the less important features (for example, the importances above show that only one ocean_proximity category, INLAND, carries much signal, so the other categories could be dropped).

You should also look at the specific errors your system makes, try to understand why it makes them, and figure out how to fix the problem (adding extra features, removing uninformative ones, cleaning up outliers, and so on).
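A minimal sketch of this kind of error analysis, assuming housing, housing_label, full_pipeline, and grid_search from the earlier parts of this series are still in scope:

```python
import numpy as np
import pandas as pd

# Predict on the training data with the best model found by the grid search
best_model = grid_search.best_estimator_
pred = best_model.predict(full_pipeline.transform(housing))

# Rank districts by absolute error; the worst cases often hint at missing
# features, uninformative features, or outliers worth cleaning
errors = pd.DataFrame({
    "label": housing_label,
    "prediction": pred,
})
errors["abs_error"] = (errors["prediction"] - errors["label"]).abs()
errors.sort_values("abs_error", ascending=False).head(10)
```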
17. Evaluate the Model on the Test Set
```python
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predict = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predict)
final_rmse = np.sqrt(final_mse)
final_rmse  # 47818.484839863646
```
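The RMSE on the test set is a single point estimate. To get a sense of how precise it is, a rough 95% confidence interval for the generalization error can be computed, for example with scipy.stats (a sketch, not part of the original post):

```python
from scipy import stats

confidence = 0.95
squared_errors = (final_predict - y_test) ** 2
# t-interval around the mean squared error, converted back to the RMSE scale
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))
```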
18. Launch, Monitor, and Maintain the System

19. Exercises

19.1 Add feature selection and a full prediction pipeline
```python
# Select the k most important features
from sklearn.base import BaseEstimator, TransformerMixin

def indices_of_topK_feature(arr, k):
    # Indices of the k largest values in arr, returned in ascending order
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importance, k):
        self.feature_importance = feature_importance
        self.k = k

    def fit(self, X, y=None):
        self.feature_indices_ = indices_of_topK_feature(self.feature_importance, self.k)
        return self

    def transform(self, X):
        return X[:, self.feature_indices_]
```
```python
k = 5
topK_features = indices_of_topK_feature(feature_importance, k)
topK_features
# array([ 0,  7,  9, 10, 12], dtype=int64)

np.array(attributes)[topK_features]
# array(['longitude', 'median_income', 'pop_per_hhold', 'bedrooms_per_room',
#        'INLAND'], dtype='
```
```python
preparation_and_feature_selection_pipeline = Pipeline([
    ('preparation', full_pipeline),
    ('feature_selection', TopFeatureSelector(feature_importance, k))
])
housing_prepared_top_k_features = preparation_and_feature_selection_pipeline.fit_transform(housing)
```
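As a quick sanity check (assuming housing_prepared from the earlier sections is still available), the pipeline output should match indexing the top-k columns directly:

```python
# Should print True if the selector picks exactly the same columns
np.allclose(housing_prepared_top_k_features, housing_prepared[:, topK_features])
```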
```python
prepare_select_and_predict_pipeline = Pipeline([
    ('preparation', full_pipeline),
    ('feature_selection', TopFeatureSelector(feature_importance, k)),
    ('forst_reg', RandomForestRegressor())
])
prepare_select_and_predict_pipeline.fit(housing, housing_label)

some_data = housing.iloc[:4]
some_label = housing_label.iloc[:4]
print("Predictions:\t", prepare_select_and_predict_pipeline.predict(some_data))
print("Labels:\t\t", list(some_label))
```
```
Predictions:	 [183259.   209045.04 375686.02 296260.05]
Labels:		 [184000.0, 172200.0, 359900.0, 258200.0]
```
```python
param_grid = [{
    'preparation__num__imputer__strategy': ['mean', 'median', 'most_frequent'],
    'feature_selection__k': list(range(1, len(feature_importance) + 1))
}]
grid_search_prep = GridSearchCV(prepare_select_and_predict_pipeline, param_grid, cv=5,
                                scoring='neg_mean_squared_error', verbose=2, n_jobs=4)
grid_search_prep.fit(housing, housing_label)
```
```python
grid_search_prep.best_params_
# {'feature_selection__k': 15, 'preparation__num__imputer__strategy': 'most_frequent'}
```
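The best k turns out to be larger than the 5 used above. To see how the score varies with k and the imputer strategy, the cross-validation results can be inspected (a sketch using the standard cv_results_ attribute of GridSearchCV):

```python
import pandas as pd

cv_res = pd.DataFrame(grid_search_prep.cv_results_)
cv_res[["param_feature_selection__k",
        "param_preparation__num__imputer__strategy",
        "mean_test_score"]].sort_values("mean_test_score", ascending=False).head()
```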