Bias and Variance Theory (Part 2)


$$CV_{(K)} = \frac{1}{K} \sum_{i=1}^{K} MSE_{i}$$
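As a rough illustration of the formula, the sketch below computes the per-fold MSE_i and their average CV_(K) for an ordinary least squares fit. The helper name `k_fold_cv_mse`, the synthetic data, and the use of plain NumPy instead of a library cross-validation routine are all assumptions made for this example, not part of the original post.

```python
import numpy as np

def k_fold_cv_mse(X, y, k=5, seed=0):
    """Return (CV_(K), list of per-fold MSE_i) for an OLS fit with intercept."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))      # shuffle before splitting
    folds = np.array_split(indices, k)     # K roughly equal folds
    fold_mse = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Prepend a column of ones so the fit includes an intercept
        X_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        X_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(X_train, y[train_idx], rcond=None)
        residuals = y[test_idx] - X_test @ coef
        fold_mse.append(float(np.mean(residuals ** 2)))   # MSE_i
    return float(np.mean(fold_mse)), fold_mse             # CV_(K)

# Synthetic data (hypothetical): linear signal plus noise with variance 0.25
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.5, size=200)

cv_5, mse_5 = k_fold_cv_mse(X, y, k=5)
cv_10, mse_10 = k_fold_cv_mse(X, y, k=10)
print("CV_(5)  =", cv_5)
print("CV_(10) =", cv_10)
```

Both estimates should land near the noise variance of the synthetic data, since the model class matches the way the data was generated.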
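The larger K is, the more data goes into each training set, so the model's bias is smaller. However, a larger K also means that the training sets used in different rounds are more strongly correlated with one another (in the most extreme case, K = N, i.e. LOOCV, the training data is almost identical in every round). This strong correlation gives the final test error estimate a larger variance.
As a rule of thumb, K = 5 or K = 10 is the usual choice.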
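The correlation argument can be made concrete by counting how many samples two training sets have in common. The sketch below assumes N is divisible by K so that all folds have the same size; the function name `training_overlap` is hypothetical.

```python
def training_overlap(n, k):
    """Fraction of one round's training set that also appears in another
    round's training set, assuming n is divisible by k (equal fold sizes)."""
    fold_size = n // k
    train_size = n - fold_size       # each round trains on K-1 folds
    shared = n - 2 * fold_size       # samples in neither held-out fold
    return shared / train_size       # equals (K-2)/(K-1)

n = 100
overlaps = {k: training_overlap(n, k) for k in (2, 5, 10, n)}
for k, frac in overlaps.items():
    print(f"K={k:>3}: two training sets share {frac:.1%} of their samples")
```

With K = 2 the two training sets are disjoint, while with K = N (LOOCV) they differ in a single sample, which is why the fold-level errors become so strongly correlated.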
3. Example
Install the required packages (statsmodels, scikit-learn, pandas) with pip, then run the following:
    import statsmodels.api as sm
    from statsmodels.formula.api import ols
    from sklearn import datasets
    from sklearn import linear_model
    import pandas as pd

    # Note: load_boston was removed in scikit-learn 1.2, so this needs an older version
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    features = boston.feature_names
    boston_data = pd.DataFrame(X, columns=features)
    boston_data["Price"] = y

    # Forward stepwise regression, using AIC as the selection criterion
    def forward_select(data, target):
        # Candidate variables: the unique column names, minus the dependent variable
        variate = set(data.columns)
        variate.remove(target)
        selected = []
        # Start both scores at infinity, because a smaller AIC is better
        current_score, best_new_score = float('inf'), float('inf')
        while variate:
            aic_with_variate = []
            for candidate in variate:
                formula = "{}~{}".format(target, "+".join(selected + [candidate]))
                # Fit OLS with this candidate added and record the resulting AIC
                aic = ols(formula=formula, data=data).fit().aic
                aic_with_variate.append((aic, candidate))
            # Sort in descending order so that pop() returns the smallest AIC
            aic_with_variate.sort(reverse=True)
            best_new_score, best_candidate = aic_with_variate.pop()
            # Accept the candidate only if it improves on the current AIC
            if current_score > best_new_score:
                variate.remove(best_candidate)
                selected.append(best_candidate)
                current_score = best_new_score
                print("aic is {},continuing!".format(current_score))
            else:
                print("for selection over!")
                break
        formula = "{}~{}".format(target, "+".join(selected))
        print("final formula is {}".format(formula))
        model = ols(formula=formula, data=data).fit()
        return(model)

In a separate Jupyter cell:

    %%timeit
    forward_select(data=boston_data, target="Price")

Output (the same block is printed on every %%timeit run):

    aic is 3286.974956900157,continuing!
    aic is 3171.5423142992013,continuing!
    aic is 3114.0972674193326,continuing!
    aic is 3097.359044862759,continuing!
    aic is 3069.438633167217,continuing!
    aic is 3057.9390497191152,continuing!
    aic is 3048.438382711162,continuing!
    aic is 3042.274993098419,continuing!
    aic is 3040.154562175143,continuing!
    aic is 3032.0687017003256,continuing!
    aic is 3021.726387825062,continuing!
    for selection over!
    final formula is Price~LSTAT+RM+PTRATIO+DIS+NOX+CHAS+B+ZN+CRIM+RAD+TAX
    585 ms ± 36.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
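To see what the AIC score being minimized actually measures, here is a minimal sketch that computes it by hand for a Gaussian linear model as AIC = 2k - 2 ln L. It assumes k counts only the regression coefficients (intercept included), which to my understanding matches the convention behind the `.aic` attribute used above. The function name `gaussian_aic` and the synthetic data are hypothetical.

```python
import numpy as np

def gaussian_aic(columns, y):
    """AIC = 2k - 2 ln L for an OLS fit with intercept under Gaussian errors.
    `columns` is a list of 1-D predictor arrays (it may be empty)."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + list(columns))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    # Maximized Gaussian log-likelihood, with sigma^2 estimated as RSS / n
    log_likelihood = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    k = design.shape[1]              # number of fitted coefficients (assumption)
    return 2 * k - 2 * log_likelihood

# Synthetic data (hypothetical): y depends on x1 only
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
y = 2.0 + 3.0 * x1 + rng.normal(size=200)

aic_null = gaussian_aic([], y)       # intercept-only model
aic_x1 = gaussian_aic([x1], y)       # model with the informative predictor
print("intercept only:", aic_null)
print("with x1:       ", aic_x1)
```

Adding the informative predictor lowers the AIC sharply, which is the same improvement test `forward_select` applies at each step before accepting a new variable.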