Machine Learning: DecisionTree Decision Trees (Part 5)

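Random forest
A random forest is made up of multiple decision trees.
Classification: the trees' predictions are combined by voting.
Regression: the trees' predictions are averaged.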
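The two aggregation rules can be sketched in a few lines of plain Python. This is only an illustration: the helper names vote and average, and the per-tree outputs, are made up for the example and are not part of any library.

```python
from collections import Counter
from statistics import mean


def vote(predictions):
    """Return the most common class label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Return the mean of the trees' numeric predictions."""
    return mean(predictions)


# Hypothetical outputs from three trees for a single input
class_votes = ["A", "A", "B"]
regression_outputs = [2.0, 3.0, 4.0]

print(vote(class_votes))            # A
print(average(regression_outputs))  # 3.0
```

If two classes tie, this sketch returns whichever label appeared first in the list; a real implementation may break ties differently.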
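What "random" means:
1. The samples are randomized: every tree is trained on a random draw of samples, and every draw contains the same number of samples.
2. The features are randomized: every split considers a random subset of features, and every subset contains the same number of features.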
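(1) Random selection of data:
First, sub-datasets are built by sampling the original dataset with replacement. Each sub-dataset contains the same number of samples as the original dataset. Elements may repeat across different sub-datasets, and they may also repeat within a single sub-dataset. Second, each sub-dataset is used to build one sub-tree. Finally, when a new data point needs to be classified, it is passed to every sub-tree, each sub-tree outputs its own result, and the forest's output is decided by a vote over those results. Suppose the forest has 3 sub-trees: if 2 of them predict class A and 1 predicts class B, the random forest predicts class A.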
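Sampling with replacement can be sketched as follows. This is a minimal illustration using only the standard library, not how any particular library implements it; bootstrap_sample and the ten-row stand-in dataset are hypothetical.

```python
import random


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items from dataset with replacement."""
    return [rng.choice(dataset) for _ in range(len(dataset))]


rng = random.Random(0)      # fixed seed so the sketch is repeatable
original = list(range(10))  # stand-in for ten training rows

# One sub-dataset per tree; three trees here
subsets = [bootstrap_sample(original, rng) for _ in range(3)]
for i, subset in enumerate(subsets):
    print(i, sorted(subset), "unique rows:", len(set(subset)))
```

Every sub-dataset has the same length as the original, but because rows are drawn with replacement some appear more than once and others not at all. For large datasets, a bootstrap sample contains on average about 63% of the distinct original rows.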
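(2) Random selection of candidate features
As with the random selection of data, a split in a sub-tree does not consider every candidate feature. Instead, a fixed number of features is drawn at random from all candidates, and the best feature is then chosen from that random subset. This makes the trees in the forest differ from one another, which increases the diversity of the ensemble and generally improves classification performance.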
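The per-split feature selection can be sketched like this. It is a simplified illustration under stated assumptions: choose_split_feature is a hypothetical helper, the split-quality scores are invented numbers, and the subset size is taken as the square root of the feature count. In a real tree the scores would be recomputed from the data at every node.

```python
import math
import random


def choose_split_feature(gains, rng, k=None):
    """Pick the best feature among a random subset of k candidates.

    gains maps feature name -> split quality (higher is better).
    """
    features = list(gains)
    if k is None:
        k = max(1, int(math.sqrt(len(features))))
    candidates = rng.sample(features, k)   # draw k distinct features
    best = max(candidates, key=gains.get)  # best one within the subset only
    return best, candidates


# Hypothetical split-quality scores for four features at one node
gains = {"f0": 0.10, "f1": 0.02, "f2": 0.45, "f3": 0.43}
rng = random.Random(0)
for _ in range(3):
    best, candidates = choose_split_feature(gains, rng)
    print("candidates:", candidates, "-> split on", best)
```

Because the globally best feature is not always among the candidates, different nodes and different trees end up splitting on different features.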
The random forest algorithm
# ensemble: ensemble algorithms
from sklearn.ensemble import RandomForestClassifier

# n_estimators=100: number of decision trees, default 100
#
# max_features='auto': maximum number of features
# max_features : {"auto", "sqrt", "log2"}, int or float, default="auto"
#   The number of features to consider when looking for the best split:
#   - If int, then consider `max_features` features at each split.
#   - If float, then `max_features` is a fraction and
#     `round(max_features * n_features)` features are considered at each split.
#   - If "auto", then `max_features=sqrt(n_features)`.
#   - If "sqrt", then `max_features=sqrt(n_features)` (same as "auto").
#   - If "log2", then `max_features=log2(n_features)`.
#   - If None, then `max_features=n_features`.
#   Note: the text above is from an older scikit-learn docstring. "auto" was
#   deprecated in scikit-learn 1.1 and removed in 1.3; newer releases default
#   to "sqrt" for the classifier.
#
# bootstrap=True: sample with replacement
#
# max_samples : int or float, default=None (maximum number of samples)
#   If bootstrap is True, the number of samples to draw from X
#   to train each base estimator.
#   - If None (default), then draw `X.shape[0]` samples.
#   - If int, then draw `max_samples` samples.
#   - If float, then draw `max_samples * X.shape[0]` samples. Thus,
#     `max_samples` should be in the interval `(0, 1)`.

# data and target are assumed to be the feature matrix and the labels
# prepared earlier; they are not defined in this snippet.
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(data, target)
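For a version that runs on its own, here is a minimal sketch that sets the parameters discussed above explicitly. The iris dataset, the train/test split, and the value max_samples=0.8 are assumptions chosen for illustration; the original data and target are not shown in this post.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: iris stands in for the post's own data and target
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees
    max_features="sqrt",  # features considered at each split
    bootstrap=True,       # sample rows with replacement
    max_samples=0.8,      # each tree draws 80% of the training rows
    random_state=0,       # repeatable results
)
forest.fit(X_train, y_train)

print("trees in the forest:", len(forest.estimators_))
print("test accuracy:", forest.score(X_test, y_test))
```

The fitted trees are available in forest.estimators_, so the first line printed should match n_estimators.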
Feature importances
rfc.feature_importances_
# array([0.0938488 , 0.01775377, 0.44296768, 0.44542975])
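The raw array is easier to read when each value is paired with its feature name and sorted. The sketch below assumes the iris dataset, which has four features like the output above, although the post does not say which dataset produced those numbers. The exact values will differ from run to run unless random_state is fixed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Assumption: iris is used only because it also has four features
iris = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(iris.data, iris.target)

importances = model.feature_importances_

# Print features from most to least important
for idx in np.argsort(importances)[::-1]:
    print(f"{iris.feature_names[idx]:<20} {importances[idx]:.4f}")
```

The importances are normalized, so they sum to 1 and can be read as each feature's relative share of the total.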