California Housing Price Prediction [Hands On ML] 2. An End-to-End Machine Learning Project (Part 5)


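The Pipeline constructor takes a list of name/estimator pairs that defines the sequence of steps.
Every estimator except the last must be a transformer (that is, it must have a fit_transform() method). The names can be anything you like.
Calling the pipeline's fit() method calls fit_transform() on each transformer in turn, passing the output of each call as the input to the next,
until it reaches the final estimator, for which it only calls fit().
The pipeline exposes the same methods as its final estimator. In this example the last estimator is a StandardScaler, which is a transformer, so the pipeline has a transform() method that applies all the transformations to the data in sequence (it also has a fit_transform() method, so there is no need to call fit() and then transform()).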
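The behaviour described above can be sketched on a small scale. This is a minimal illustration, not the housing pipeline itself: the toy array `X` and the name `demo_pipeline` are assumptions made up for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric data with one missing value (illustrative only).
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [4.0, 40.0]])

# Each step is a (name, estimator) pair; the names are arbitrary labels.
demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# fit_transform() fits each step in order and feeds its output to the next step.
X_prepared = demo_pipeline.fit_transform(X)

# The last step is a transformer, so the fitted pipeline also exposes transform().
X_again = demo_pipeline.transform(X)

print(X_prepared.mean(axis=0))           # each column is centered near 0
print(np.allclose(X_prepared, X_again))  # same result either way
```

Calling fit_transform() once is equivalent here to calling fit() followed by transform() on the same data.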
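It would be nice if we could feed a Pandas DataFrame directly into the pipeline, instead of first having to extract the numerical columns into a NumPy array by hand.
Older versions of Scikit-Learn had no tool for handling Pandas DataFrames (ColumnTransformer was only added in version 0.20), so here we write a simple custom transformer to do the job: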
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # Select the named columns and return them as a NumPy array
        return X[self.attribute_names].values
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler

# For the error this wrapper works around, see https://blog.csdn.net/jasonzhoujx/article/details/82025571
class MyLabelBinarizer(TransformerMixin):
    def __init__(self, *args, **kwargs):
        self.encoder = LabelBinarizer(*args, **kwargs)
    def fit(self, x, y=0):
        self.encoder.fit(x)
        return self
    def transform(self, x, y=0):
        return self.encoder.transform(x)

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('Simpleimputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', MyLabelBinarizer())
])
# FeatureUnion runs both pipelines and concatenates their outputs column-wise
full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])
# help(FeatureUnion)
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared

array([[-1.27826235,  0.95445204,  0.89646428, ...,  0.,  1.,  0.],
       [-0.70432019,  0.94509343,  0.89646428, ...,  0.,  0.,  0.],
       [ 0.59827896, -0.82368426, -0.45394013, ...,  0.,  0.,  0.],
       ...,
       [ 0.93765346, -0.95938413, -1.88378009, ...,  0.,  0.,  0.],
       [ 1.13229471, -0.71606022,  1.61138426, ...,  0.,  0.,  0.],
       [-1.0985935 ,  1.10418984, -1.96321564, ...,  0.,  0.,  0.]])
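For comparison, recent Scikit-Learn versions (0.20 and later) offer ColumnTransformer, which selects DataFrame columns by name and so could replace both DataFrameSelector and FeatureUnion. The sketch below is not the code used in this post: the `toy` frame and all `toy_*` names are hypothetical stand-ins, it uses OneHotEncoder instead of the LabelBinarizer wrapper, and it leaves out the CombinedAttributesAdder step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical frame standing in for the housing data.
toy = pd.DataFrame({
    "median_income": [8.3, 7.2, np.nan, 5.6],
    "housing_median_age": [41.0, 21.0, 52.0, 52.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
})
toy_num_attribs = ["median_income", "housing_median_age"]
toy_cat_attribs = ["ocean_proximity"]

toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# ColumnTransformer picks columns by name, so no custom selector is needed.
# sparse_threshold=0 forces a dense NumPy array as the combined output.
toy_full_pipeline = ColumnTransformer([
    ("num", toy_num_pipeline, toy_num_attribs),
    ("cat", OneHotEncoder(), toy_cat_attribs),
], sparse_threshold=0)

toy_prepared = toy_full_pipeline.fit_transform(toy)
print(toy_prepared.shape)  # 2 scaled numeric columns + 2 one-hot columns
```

The custom-transformer approach in this post still works; ColumnTransformer simply removes the need to write the column selection by hand.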
13. Training the Model
As a first attempt, we choose a linear regression model.
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_label)

# Try the fitted model on the first five training instances
somedata = housing.iloc[:5]
somelabel = housing_label.iloc[:5]
somedata_prepared = full_pipeline.transform(somedata)
print('predict:\t', lin_reg.predict(somedata_prepared))
print('Labels:\t', list(somelabel))

predict: [234956.84260842 303073.513104327746.46204573 355932.30741583 210220.50294171]
Labels: [184000.0, 172200.0, 359900.0, 258200.0, 239100.0]
import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE of the linear model on the full training set
housing_predict = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_label, housing_predict)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

68860.85279166883
An RMSE of about 68,861 is a large error, so the result is not very satisfactory.
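For reference, the RMSE is the square root of the mean squared difference between predictions and labels. The sketch below checks a manual computation against mean_squared_error. The labels are the five shown earlier; the predictions in `y_pred` are made-up values for illustration, not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([184000.0, 172200.0, 359900.0, 258200.0, 239100.0])
# Hypothetical predictions, chosen only to illustrate the arithmetic.
y_pred = np.array([200000.0, 180000.0, 340000.0, 250000.0, 260000.0])

# Manual RMSE: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_pred - y_true) ** 2))

# Same quantity via Scikit-Learn's MSE helper.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, the RMSE is in the same units as the label (dollars here) and weights large errors more heavily than small ones.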
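The model is underfitting the training data: either the features do not carry enough information to make good predictions, or the model is not powerful enough.
The main ways to fix underfitting are to choose a more powerful model, to feed the algorithm better features, or to reduce the constraints on the model.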
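Let's first try a more complex model and see how it does.
We train a DecisionTreeRegressor. This is a powerful model, capable of finding complex nonlinear relationships in the data.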
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_label)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_label, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

The error is 0.0. Is the model really that good? No. The model was trained on the entire training set and then evaluated on that same training set, so this score only shows that it has overfit the data.
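One way to see why a training error of 0.0 is misleading is to hold out part of the data and compare the two scores. The sketch below uses synthetic data rather than the housing set; the data-generating formula, the 80/20 split, and all variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with noise (illustrative only).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 3))
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + rng.normal(scale=1.0, size=500)

# Keep 20% of the data aside; the tree never sees it during training.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# An unconstrained tree can memorize every training point.
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
val_rmse = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))

print("train RMSE:", train_rmse)
print("validation RMSE:", val_rmse)
```

The training RMSE is essentially zero because the tree reproduces the training labels exactly, while the held-out RMSE is clearly larger. That gap is the sign of overfitting.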