California Housing Price Prediction [Hands-On ML] 2. An End-to-End Machine Learning Project (Part 4)


The output is a SciPy sparse matrix, not a NumPy array:

```python
>>> housing_cat_1hot
<16512x5 sparse matrix of type ''
	with 16512 stored elements in Compressed Sparse Row format>
>>> housing_cat_1hot.toarray()
array([[0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])
```
You can also use LabelBinarizer. By default it returns a dense array; pass sparse_output=True to the constructor to get a sparse matrix instead:

```python
from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)
housing_cat_1hot
```

```
array([[0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       ...,
       [1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0]])
```
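As a side note, in Scikit-Learn 0.20 and later, OneHotEncoder accepts string categories directly, so a separate label-encoding step is no longer needed. A minimal sketch with a hypothetical sample of the ocean_proximity column:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of the ocean_proximity column
housing_cat = np.array([['NEAR BAY'], ['INLAND'], ['<1H OCEAN'], ['INLAND']])

encoder = OneHotEncoder()                       # returns a sparse matrix by default
housing_cat_1hot = encoder.fit_transform(housing_cat)

print(encoder.categories_)                      # learned categories, sorted
print(housing_cat_1hot.toarray())               # dense view of the one-hot encoding
```

Call `.toarray()` only for inspection; keeping the sparse representation saves memory when there are many categories.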
10. Custom Transformers
Create a class that implements three methods: fit() (returning self), transform(), and fit_transform() (which you get for free by adding TransformerMixin as a base class).
```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
housing_extra_attribs
```

```
array([[-122.13, 37.67, 40.0, ..., 'NEAR BAY', 5.514195583596215,
        2.8832807570977916],
       [-120.98, 37.65, 40.0, ..., 'INLAND', 6.698412698412699,
        2.507936507936508],
       [-118.37, 33.87, 23.0, ..., '<1H OCEAN', 5.137640449438202,
        2.502808988764045],
       ...,
       [-117.69, 33.58, 5.0, ..., '<1H OCEAN', 6.80040733197556,
        2.9297352342158858],
       [-117.3, 34.1, 49.0, ..., 'INLAND', 4.615384615384615,
        5.846153846153846],
       [-121.77, 37.99, 4.0, ..., 'INLAND', 7.853351955307263,
        3.392458100558659]], dtype=object)
```
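Inheriting from BaseEstimator gives the transformer get_params() and set_params() (useful later for hyperparameter search), and TransformerMixin supplies fit_transform() automatically. A minimal toy transformer (the name and behavior are hypothetical, just to illustrate the mechanics) shows both:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class AddOffset(BaseEstimator, TransformerMixin):
    """Toy transformer: adds a constant offset to every feature."""
    def __init__(self, offset=1.0):    # keep explicit keyword args, no *args/**kwargs
        self.offset = offset

    def fit(self, X, y=None):
        return self                    # stateless: nothing to learn

    def transform(self, X):
        return np.asarray(X) + self.offset

t = AddOffset(offset=2.0)
print(t.get_params())                  # provided by BaseEstimator
print(t.fit_transform([[1.0, 2.0]]))   # provided by TransformerMixin
```

Because get_params() exposes the constructor arguments, hyperparameters such as add_bedrooms_per_room can later be tuned with a grid search.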
11. Feature Scaling
Different features come on different scales. In distance-based machine learning algorithms, a feature with a larger range dominates the distance computation, effectively giving the features unequal weights and hurting accuracy.
Min-max scaling (often called normalization) is simple: subtract the minimum value, then divide by the difference between the maximum and the minimum, so values end up in the range 0 to 1.
Standardization: first subtract the mean (so standardized values always have zero mean), then divide by the standard deviation, so the resulting distribution has unit variance.
Warning: as with all transformations, fit the scalers to the training set only, never to the full dataset (including the test set). Only then should you use them to transform the training set and the test set (and new data).
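The two scaling strategies above can be sketched on a tiny hypothetical training set with Scikit-Learn's MinMaxScaler and StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy one-feature training set (hypothetical values)
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])

# Min-max scaling: (x - min) / (max - min), fit on the training data only
minmax_scaled = MinMaxScaler().fit_transform(X_train)
print(minmax_scaled.ravel())            # values mapped into [0, 1]

# Standardization: (x - mean) / std, giving zero mean and unit variance
std_scaled = StandardScaler().fit_transform(X_train)
print(std_scaled.mean(), std_scaled.std())
```

To follow the warning above, fit each scaler on the training set alone, then reuse the fitted scaler's transform() on the test set and on new data.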
12. Transformation Pipelines
There are many data transformation steps that need to be executed in a specific order. Scikit-Learn provides the Pipeline class to run such sequences of transformations.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
housing_num_tr = num_pipeline.fit_transform(housing_num)
housing_num_tr
```

```
array([[-1.27826235,  0.95445204,  0.89646428, ...,  0.04435599,
        -0.01693693, -0.49175254],
       [-0.70432019,  0.94509343,  0.89646428, ...,  0.56563549,
        -0.05135459, -0.99646009],
       [ 0.59827896, -0.82368426, -0.45394013, ..., -0.12139949,
        -0.05182477, -0.5064297 ],
       ...,
       [ 0.93765346, -0.95938413, -1.88378009, ...,  0.61053242,
        -0.01267723, -0.96392659],
       [ 1.13229471, -0.71606022,  1.61138426, ..., -0.35129083,
         0.25474742, -0.46992773],
       [-1.0985935 ,  1.10418984, -1.96321564, ...,  1.0740272 ,
         0.02975272, -1.15998515]])
```
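A pipeline calls fit_transform() on each step in order, feeding each step's output to the next. A self-contained sketch on toy data (values are hypothetical) shows imputation followed by standardization:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy numeric data with one missing value (hypothetical)
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, 6.0]])

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # fill NaN with column median
    ('std_scaler', StandardScaler()),               # then standardize each column
])
X_tr = pipe.fit_transform(X)
print(X_tr.mean(axis=0))   # each column now has mean ~0
print(X_tr.std(axis=0))    # and unit standard deviation
```

Only the last step of a Pipeline may be an estimator without a transform() method; all earlier steps must be transformers, which is why the order of the step list matters.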