Adaboost
Difference between Bagging and Boosting
- In a random forest (a bagging model), each tree is built independently. In boosting, each tree depends on the one built before it.
- In a random forest, each tree is grown all the way to the leaf nodes (though this can be controlled via a hyperparameter). In boosting, each tree is usually a stump (a single-split tree).
- In a random forest, each tree gets an equal say in the prediction. In boosting, each tree gets a weighted say based on how well that individual tree predicts the output (see the short comparison sketch below).
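As a concrete, hedged illustration of this contrast, the sketch below trains both kinds of ensemble with scikit-learn. The dataset and hyperparameters are illustrative only, and it assumes scikit-learn >= 1.2 (in older versions the `estimator` argument of `AdaBoostClassifier` was called `base_estimator`).

```python
# Minimal bagging-vs-boosting comparison sketch (assumes scikit-learn >= 1.2).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: independent, fully grown trees, each with an equal vote.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Boosting: sequential stumps (max_depth=1), each with a weighted say.
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=100,
    random_state=0,
)
ada.fit(X_train, y_train)

print("Random forest accuracy:", rf.score(X_test, y_test))
print("AdaBoost accuracy:", ada.score(X_test, y_test))
```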
How does AdaBoost work (Step by Step)
- Start with a sample weight for each sample (data point): $s_i = \frac{1}{N} \quad \forall\ 1 \le i \le N$, where $N$ is the number of samples
    - The sample weight signifies the importance of each sample
    - All of the sample weights always add up to 1
- Train a stump (a single-split decision tree) that has the lowest [[Weighted gini index]]. Call this tree $t_0$
- Find out the weighted say $A$ that $t_0$ has on the overall classification
    - $error = \sum_{i=1}^{k} s_i$, where $k$ is the number of misclassified samples
    - $A = \frac{1}{2} \log\frac{1 - error}{error}$ ($A$ = weighted say)
- We now update the sample weights, penalizing the incorrectly classified samples by increasing their sample weights and decreasing the sample weights of the correctly classified samples
    - For incorrectly classified samples, the new sample weight is: $s_i' = s_i \cdot e^{A}$
    - For correctly classified samples, the new sample weight is: $s_i' = s_i \cdot e^{-A}$
    - Normalize the new sample weights so that $\sum_{i=1}^{N} s_i' = 1$
- Repeat steps 2 to 4 (train a stump, compute its say, update the sample weights) by:
    - Either using the [[Weighted gini index]] with the updated sample weights at each iteration. The weighted gini index computation changes as the sample weights change
    - Or creating a new dataset by sampling data points according to their updated sample weights. The higher the sample weight, the higher the probability of that sample appearing in the new dataset (a from-scratch sketch of all the steps above follows this list)
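To tie the steps together, here is a minimal from-scratch sketch of the procedure described above. It assumes NumPy, binary labels encoded as -1/+1, and the weighted-gini variant (no resampling); names such as `fit_stump`, `adaboost_fit`, and the `n_rounds` parameter are illustrative, not from any library.

```python
import numpy as np

def weighted_gini(y, w):
    """Weighted Gini impurity of labels y under sample weights w."""
    total = w.sum()
    if total == 0:
        return 0.0
    return 1.0 - sum((w[y == c].sum() / total) ** 2 for c in np.unique(y))

def fit_stump(X, y, w):
    """Pick the single split (feature, threshold) with the lowest weighted Gini index."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            left = X[:, j] <= thr
            if left.all():          # split puts every sample on one side: skip
                continue
            right = ~left
            score = (w[left].sum() * weighted_gini(y[left], w[left]) +
                     w[right].sum() * weighted_gini(y[right], w[right]))
            if best is None or score < best["score"]:
                # Each leaf predicts the weighted-majority class on its side.
                left_pred = 1 if w[left & (y == 1)].sum() >= w[left & (y == -1)].sum() else -1
                right_pred = 1 if w[right & (y == 1)].sum() >= w[right & (y == -1)].sum() else -1
                best = {"score": score, "feature": j, "threshold": thr,
                        "left": left_pred, "right": right_pred}
    return best

def stump_predict(stump, X):
    return np.where(X[:, stump["feature"]] <= stump["threshold"],
                    stump["left"], stump["right"])

def adaboost_fit(X, y, n_rounds=10):
    """AdaBoost over decision stumps, following the steps listed above."""
    n = len(y)
    s = np.full(n, 1.0 / n)                     # s_i = 1/N; weights sum to 1
    stumps, says = [], []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, s)              # stump with lowest weighted Gini
        pred = stump_predict(stump, X)
        miss = pred != y
        error = np.clip(s[miss].sum(), 1e-10, 1 - 1e-10)  # sum of misclassified weights
        A = 0.5 * np.log((1 - error) / error)   # weighted say of this stump
        # Increase weights of misclassified samples, decrease the rest, renormalize.
        s = s * np.exp(np.where(miss, A, -A))
        s = s / s.sum()
        stumps.append(stump)
        says.append(A)
    return stumps, says

def adaboost_predict(stumps, says, X):
    """Each stump votes with its weighted say; the sign of the total is the class."""
    total = sum(A * stump_predict(stump, X) for stump, A in zip(stumps, says))
    return np.where(total >= 0, 1, -1)

# Toy usage: a synthetic problem with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
stumps, says = adaboost_fit(X, y, n_rounds=20)
print("training accuracy:", (adaboost_predict(stumps, says, X) == y).mean())
```

The weight update $s_i' = s_i \cdot e^{\pm A}$ and the final weighted vote mirror the formulas above; the clip on the error only guards against division by zero when a stump classifies every sample correctly or incorrectly.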