Recursively search, at every node, for the best splitting feature $k$ and the best threshold $t_k$ for that feature. The CART cost function for classification is

$$J(k, t_k)=\frac{m_{left}}{m}G_{left} + \frac{m_{right}}{m}G_{right}$$

where $G$ measures the impurity of the subset. Besides the Gini index, Shannon information entropy can also serve as $G$:

$$H_i=-\sum_{k=1,\, p_{i,k}\neq 0}^{n}p_{i,k}\log(p_{i,k})$$

The Gini index is the default because it is slightly cheaper to compute, and the two measures usually produce similar trees. Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.
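As a quick check of the two measures, here is a minimal NumPy sketch (the class counts are made-up; log base 2 is used here, and the base only rescales the entropy value):

```python
import numpy as np

def gini(counts):
    # Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    # Shannon entropy: -sum_k p_k * log2(p_k), skipping classes with p_k = 0
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# hypothetical node holding 49, 5, and 1 samples of three classes
print(gini([49, 5, 1]), entropy([49, 5, 1]))
```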
CART stands for Classification And Regression Tree.
CART is a greedy algorithm.
A greedy algorithm often produces a reasonably good solution,
but it is not guaranteed to be the optimal solution.
In fact, finding the optimal tree is known to be an NP-Complete problem: it requires O(exp(m)) time.
Some mathematical background:
P is the set of problems that can be solved in polynomial time
NP is the set of problems whose solutions can be verified in polynomial time
An NP-Hard problem is a problem to which any NP problem can be reduced in polynomial time.
An NP-Complete problem is both NP and NP-Hard
4. Regularization Hyperparameters
A Decision Tree is a nonparametric model: the number of parameters is not determined prior to training, so left unconstrained the tree adapts itself closely to the training data, most likely overfitting it.
A few hyperparameters restrict the shape of the Decision Tree:
min_samples_split, the minimum number of samples a node must have before it can be split
min_samples_leaf, the minimum number of samples a leaf node must have
min_weight_fraction_leaf, same as min_samples_leaf but expressed as a fraction of the total number of weighted instances
max_leaf_nodes, the maximum number of leaf nodes
max_features, maximum number of features that are evaluated for splitting at each node
increasing min_* hyperparameters or reducing max_* hyperparameters will regularize the model
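For instance (a minimal sketch; the dataset and the hyperparameter values are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=42)

# unconstrained tree: adapts very closely to the training set (likely overfits)
deep_tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# regularized tree: every leaf must contain at least 5 samples
reg_tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=42).fit(X, y)
```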
Alternatively, you can first train a tree without any constraints, then regularize it by pruning the tree after training.
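In scikit-learn, one way to get this train-then-prune effect is cost-complexity pruning through the ccp_alpha parameter (available since scikit-learn 0.22; the alpha value below is arbitrary):

```python
from sklearn.tree import DecisionTreeClassifier

# grow the tree freely, then prune it back: larger ccp_alpha prunes more aggressively
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42)
pruned_tree.fit(X, y)  # reusing X, y from the sketch above
```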
The computational complexity of training a Decision Tree is O(n × m log(m))
5. Regression
For regression, the impurity measure is replaced with the mean squared error (MSE).
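The cost function keeps the same form as in classification, with the MSE of each subset playing the role of the impurity measure:

$$J(k, t_k)=\frac{m_{left}}{m}\mathrm{MSE}_{left} + \frac{m_{right}}{m}\mathrm{MSE}_{right}, \qquad \mathrm{MSE}_{node}=\frac{1}{m_{node}}\sum_{i \in node}\left(\hat{y}_{node}-y^{(i)}\right)^2$$

where $\hat{y}_{node}$ is the mean target value of the samples in that node.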
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(100, 1)  # illustrative noisy quadratic data (original X, y not shown)
y = (X - 0.5) ** 2 + np.random.rand(100, 1) / 10

# an unconstrained regressor obviously overfits; max_depth=2 (or min_samples_leaf=10) regularizes it
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)
```
The value returned at each leaf is the average target value of all the training samples in that region.
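For example (a minimal usage sketch; the query point 0.6 is arbitrary):

```python
tree_reg.predict([[0.6]])  # returns the mean y of the training samples in the matching leaf
```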
6. Instability
Advantages: Decision Trees have a lot going for them:
simple to understand and interpret
easy to use
versatile and powerful
Limitations: Decision Trees do have a few limitations:
orthogonal decision boundaries: every split is perpendicular to an axis, so data that is not axis-aligned (e.g. rotated samples) is handled poorly
very sensitive to small variations in the training data (see the demo below)
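A quick way to observe this sensitivity (illustrative; whether the learned splits actually change depends on which samples are dropped):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target  # petal length and width

tree_a = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# drop a handful of training instances and retrain
mask = np.ones(len(X), dtype=bool)
mask[::25] = False
tree_b = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X[mask], y[mask])

# compare the root split thresholds: a small perturbation can change them
print(tree_a.tree_.threshold[0], tree_b.tree_.threshold[0])
```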