Monday, October 8, 2018

Fast Threshold Estimation In Decision Trees

In decision trees and related methods such as random forests and gradient boosting, we need
to find thresholds in order to split continuous attributes. Finding the optimal threshold on a feature using information gain can be computationally expensive, but there is a way to perform this operation quickly: the histogram method described in [1].

Algorithm Sketch

Basically, we build a table in which we bin each value of a certain attribute and record the number of occurrences of each label in that bin. Each bin boundary represents a threshold we could apply. Once everything is indexed, we compute the total number of entries for each class; for example, class one might occur 33 times in total and class two 96 times. We then scan from the lowest threshold to the highest and keep a running count of all values below the current threshold. From these running counts we can easily compute the number of samples of each class below and above the threshold, and use those numbers to compute the information gain of the split.
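
To make the scan concrete, here is a small sketch with made-up numbers (the bin matrix below is hypothetical): the cumulative sums over the bins directly give the class counts below and above every candidate threshold.

import numpy as np

# Hypothetical histogram: bins[i, c] counts how often class c falls into bin i.
bins = np.array([[10,  5],
                 [ 3, 40],
                 [15, 11],
                 [ 5, 40]])

total = bins.sum(axis=0)         # total count per class
left  = np.cumsum(bins, axis=0)  # class counts below each candidate threshold
right = total - left             # class counts above each candidate threshold

for i in range(len(bins)):
    print(i, left[i], right[i])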

Implementation

First, we need to be able to find the bin for a given attribute value in a sorted list of thresholds. One way to do this is an adjusted binary search: we check the center bin of the current range; if this is our bin, we are done, otherwise we continue searching to the left or to the right.
def bisection(val, th, start, stop):
    """Find the bin index for val given the sorted threshold list th,
    searching the bin range [start, stop) by binary search."""
    center = (start + stop) // 2
    n = len(th) - 1
    if center <= 0:                                   # ran off the lower end
        return 0
    elif center >= n:                                 # ran off the upper end
        return n
    elif val > th[center - 1] and val <= th[center]:  # val falls into this bin
        return center
    elif val < th[center]:                            # continue in the left half
        return bisection(val, th, start, center)
    else:                                             # continue in the right half
        return bisection(val, th, center, stop)
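As a quick sanity check (with hypothetical thresholds), the search maps values to the expected bins:

thresholds = [1.0, 2.0, 3.0, 4.0]

print(bisection(0.5, thresholds, 0, len(thresholds)))  # 0: below the first threshold
print(bisection(2.5, thresholds, 0, len(thresholds)))  # 2: between 2.0 and 3.0
print(bisection(9.0, thresholds, 0, len(thresholds)))  # 3: above the last threshold
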
The other thing we need is an entropy calculation. Since we are dealing with a classification problem, we implement the discrete entropy H(x) = -sum_i p_i * log2(p_i), where p_i is the fraction of samples belonging to class i.
import numpy as np

def entropy(x):
    """Discrete entropy (in bits) of a vector of class counts x."""
    scaler = x.sum()
    if scaler == 0:
        return 0.0                             # an empty histogram is treated as pure
    logScaler = np.log2(scaler)
    h = 0.0
    for i in range(len(x)):
        if x[i] > 0:
            p     = x[i] / scaler              # class probability
            log_p = np.log2(x[i]) - logScaler  # log2(p)
            h += p * log_p
    return -h
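For example, a pure count vector has zero entropy, while a uniform one over two classes has one bit:

print(entropy(np.array([10, 0])))   # pure bin: zero entropy
print(entropy(np.array([20, 20])))  # uniform over two classes: 1.0 bit
print(entropy(np.array([0, 0])))    # empty bin: treated as zero entropy
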
Finally, we can implement the algorithm sketch from above.
class LabelHistogram:

    def __init__(self, thresholds, n_classes):
        self.thresholds = thresholds           # sorted candidate thresholds
        self.n_bins     = len(thresholds)
        self.n_classes  = n_classes
        # bins[i, c]: number of samples of class c whose attribute value falls into bin i
        self.bins       = np.zeros((self.n_bins, n_classes))

    def insert(self, attr, label):
        """Add one sample with attribute value attr and class label."""
        i = bisection(attr, self.thresholds, 0, self.n_bins)
        self.bins[i, label] += 1

    def gain(self, parent_gain):
        """Return the best information gain and the threshold that achieves it.
        parent_gain is the entropy of the label distribution at the node."""
        total    = np.sum(self.bins, axis=0)   # per-class counts over all bins
        cur      = np.zeros(self.n_classes)    # per-class counts below the current threshold
        max_gain = float('-inf')
        max_th   = None
        for i in range(self.n_bins):
            cur    += self.bins[i]
            left    = cur                      # class counts below the threshold
            right   = total - cur              # class counts above the threshold
            cleft   = left.sum()
            cright  = right.sum()
            n       = cleft + cright           # total number of samples
            gain    = parent_gain - ((cleft / n) * entropy(left) + (cright / n) * entropy(right))
            if gain > max_gain:
                max_gain = gain
                max_th   = self.thresholds[i]
        return (max_gain, max_th)
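Below is a minimal end-to-end sketch of how the class could be used. The data, the choice of percentile-based thresholds, and the binary labels are all hypothetical; parent_gain is simply the entropy of the label distribution at the node.

# Hypothetical data: one continuous attribute with noisy binary labels.
np.random.seed(0)
attr   = np.random.randn(200)
labels = (attr + 0.3 * np.random.randn(200) > 0).astype(int)

# Candidate thresholds, here taken from percentiles of the attribute.
thresholds = list(np.percentile(attr, np.linspace(10, 90, 9)))

hist = LabelHistogram(thresholds, n_classes=2)
for a, l in zip(attr, labels):
    hist.insert(a, l)

parent_gain = entropy(np.bincount(labels, minlength=2))
best_gain, best_th = hist.gain(parent_gain)
print(best_gain, best_th)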


References

[1] Criminisi, Shotton: "Decision Forests for Computer Vision and Medical Image Analysis", Springer, 2013