Notes on GBDT
@(Machine Learning, NLP and IR)[GBDT]
Basic Principles
DT
GB
First of all, Gradient Boosting is a boosting method: it improves learning performance by combining the abilities of many weak learners.
Its basic approach is:
- Basic idea: gradient descent in function space. We keep learning the residual: at each round we fit a weak learner $h_m$ to the residual $y - F_{m-1}(x)$, then update $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$.
- Concretely: since the fitting target is the residual $y - F_{m-1}(x)$, and the fit itself also needs an evaluation function, the standard method uses MSE as the fitting objective (see the short derivation after this list), i.e.:
- For the $m$-th round, we fit $h_m$ with MSE as its objective: $h_m = \arg\min_{h} \sum_i \big(y_i - F_{m-1}(x_i) - h(x_i)\big)^2$
- Meanwhile, $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$
- where $h_m$ is a tree trained on the residuals left by the previous rounds, with MSE as its training objective.
- $\gamma_m$ is obtained by a one-dimensional line search: $\gamma_m = \arg\min_{\gamma} \sum_i L\big(y_i, F_{m-1}(x_i) + \gamma\, h_m(x_i)\big)$.
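Why is MSE on the residual the standard choice? For squared loss the residual view and the gradient view coincide; a one-line check (a standard fact, not spelled out in the original notes):

$$L(y, F) = \tfrac{1}{2}\,(y - F)^2 \;\Longrightarrow\; -\frac{\partial L}{\partial F} = y - F$$

so the negative gradient of the loss at $F_{m-1}$ is exactly the residual, and fitting $h_m$ to the residual under MSE is fitting the negative functional gradient.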
This is analogous to a Taylor expansion: the Taylor series uses polynomials as the "weak learners" to approximate the behavior of an arbitrary function near a point.
Toy implementation
Following the sketch above, pseudocode is straightforward to write:
```python
for round in range(MAX_ROUNDS):
    # residuals: the fitting target for this round's weak learner
    target_Y = [y - F(x) for x, y in XY]
    # learn a regression tree from the residuals
    new_tree = Tree.learn(X, target_Y)
    # line search: fit the best theta for F' = F + theta * new_tree
    theta = _1d_search(F, new_tree, XY)
    # update the ensemble: F'(x) = F(x) + theta * new_tree(x)
    F.update(theta, new_tree)
```
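A minimal runnable version of the same loop, as a sketch: it assumes squared loss and uses sklearn's `DecisionTreeRegressor` as the weak learner; a fixed shrinkage rate `lr` stands in for the `_1d_search` line search (for squared loss the tree's leaf means already solve the per-round fit, so a constant step is a common simplification). All function names here are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_rounds=100, max_depth=3, lr=0.1):
    """Toy GBDT for squared loss: each round fits a tree to the residuals."""
    f0 = y.mean()                         # F_0: best constant under MSE
    F = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residual = y - F                  # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        F += lr * tree.predict(X)         # F <- F + lr * h_m
        trees.append(tree)
    return f0, trees, lr

def gbdt_predict(model, X):
    f0, trees, lr = model
    return f0 + lr * sum(tree.predict(X) for tree in trees)

# usage: fit noisy sine data and check the training error is small
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.randn(200)
model = gbdt_fit(X, y)
print(np.mean((gbdt_predict(model, X) - y) ** 2))
```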
Extensions
For GBDT, Friedman's original implementation makes one modification: the single step size $\gamma_m$ is decomposed into one coefficient per leaf.
First, introduce a position indicator: write the tree as $h_m(x) = \sum_{j=1}^{J_m} b_{jm}\,\mathbf{1}_{R_{jm}}(x)$, where the $R_{jm}$ are its leaf regions. Then what each round optimizes is no longer the one-dimensional $\gamma_m$ but a separate $\gamma_{jm}$ for each leaf: $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i, F_{m-1}(x_i) + \gamma\big)$.
(how to?)
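One concrete answer to the "how to?": since each $\gamma_{jm}$ is a one-dimensional problem over a single leaf, it often has a closed form. A sketch for absolute loss (my choice of loss for illustration; the function name is hypothetical), where the per-leaf optimum is the median residual in that leaf:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def treeboost_round_l1(X, y, F_prev, max_depth=3):
    """One TreeBoost round for absolute loss L(y, F) = |y - F|."""
    residual = y - F_prev
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(X, np.sign(residual))        # pseudo-response = negative gradient of L1
    leaf_id = tree.apply(X)               # which region R_jm each sample falls in
    # per-leaf gamma_jm: argmin_g sum |residual - g| over a leaf = median residual
    gamma = {j: np.median(residual[leaf_id == j]) for j in np.unique(leaf_id)}
    return F_prev + np.array([gamma[j] for j in leaf_id])
```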
Further topics
Comparison
Performance comparison with AdaBoost and RF (random forest).
Naming
According to Wikipedia:
> A popular open-source implementation[9] for R calls it "Generalized Boosting Model". Sometimes the method is referred to as "functional gradient boosting" (this term was introduced in,[3][4]), "Gradient Boosted Models" and its tree version is also called "Gradient Boosted Decision Trees" (GBDT) or "Gradient Boosted Regression Trees" (GBRT). Commercial implementations from Salford Systems use the names "Multiple Additive Regression Trees" (MART) and TreeNet, both trademarked. Commercial implementations from FICO call the Boosted Tree Ensemble (BTE) method. The FICO implementation is available in FICO® Model Builder on the desktop and FICO® Analytic Modeler Scorecard in the FICO® Analytic Cloud.
Tree size (number of terminal nodes)
$J$, the number of terminal nodes in trees, is the method's parameter which can be adjusted for a data set at hand. It controls the maximum allowed level of interaction between variables in the model. With $J = 2$ (decision stumps), no interaction between variables is allowed. With $J = 3$ the model may include effects of the interaction between up to two variables, and so on.
Hastie et al.[6] comment that typically $4 \le J \le 8$ works well for boosting and results are fairly insensitive to the choice of $J$ in this range; $J = 2$ is insufficient for many applications, and $J > 10$ is unlikely to be required.
Ways of finding the target function
The earliest paper introducing GBDT mainly discusses two ways of finding the target function. The problem setup: we want an $F$ that makes the loss function $\sum_i L\big(y_i, F(x_i)\big)$ as small as possible.
The first method is optimization over parameters.
That is, we assume $F$ is a function of some special form, built from a finite-dimensional parameter vector $\theta$ (e.g., an $n$-th order polynomial), and for the training set all the $(x_i, y_i)$ are known values. The problem then naturally becomes optimizing the function $\Phi(\theta) = \sum_i L\big(y_i, F(x_i; \theta)\big)$ over the parameters $\theta$.
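In this parametric view, the whole problem reduces to ordinary gradient descent; as a sketch (the symbols $\Phi$ and $\eta$ are my notation, not the paper's):

$$\theta^{*} = \arg\min_{\theta}\; \Phi(\theta), \qquad \theta_{t+1} = \theta_{t} - \eta\, \nabla_{\theta} \Phi(\theta_{t})$$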
The second method is optimization over the function itself.
This method assumes no parametric form for $F$; instead, $F$ is improved step by step using the residuals (negative gradients) estimated under the previous $F$. The basic idea is:
Given a current estimate $F_{m-1}$, we can compute the gradient of the loss with respect to the function's values, $g_m(x_i) = \left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$ (i.e., the direction in which $F$ should move to get closest to the minimum of the loss).
That is, $F_m(x) = F_{m-1}(x) - \rho_m\, g_m(x)$, with the step $\rho_m$ again found by a one-dimensional search.
In this way we approach the optimal $F$ step by step. GBDT is also built on this second method.
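As a sketch of how a concrete loss plugs into this functional view (the log-loss case is my added example, not from the original notes): the weak learner at each round is fit to the negative gradient, which for squared loss is the plain residual and for binary log-loss is $y - \sigma(F)$.

```python
import numpy as np

# Pseudo-residuals (negative functional gradients) for two common losses.
def pseudo_residual_squared(y, F):
    # L(y, F) = (y - F)^2 / 2  ->  -dL/dF = y - F (the plain residual)
    return y - F

def pseudo_residual_logloss(y, F):
    # L(y, F) = log(1 + exp(-F)) - (y - 1) * F with labels y in {0, 1}
    # and F the log-odds  ->  -dL/dF = y - sigmoid(F)
    return y - 1.0 / (1.0 + np.exp(-F))
```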