to find thresholds in order to split the continuous attributes. Finding the optimal threshold on a feature using information gain can be computationally expensive. However, there is a way to perform this operation quickly: the histogram method described in [1].
Algorithm Sketch
Basically, we build a table in which we bin each value of a given attribute and, per bin, count the occurrences of each label. Each bin represents a threshold we could split at. Once all values are indexed, we compute the total number of entries for each class; in the example below, class one occurs 33 times in total and class two 96 times. We then scan from the lowest threshold upwards, keeping a running count of all values at or below the current threshold. From these running counts we can cheaply derive the number of samples of each class below and above the threshold, and from those compute the information gain of the split.
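To make the scan concrete, here is a minimal numpy sketch of the counting step. The bin counts are made up for illustration, chosen so that the class totals come out to 33 and 96 as in the example above.

import numpy as np

# Hypothetical histogram: one row per bin/threshold, one column per class.
bins = np.array([[10,  4],
                 [20, 30],
                 [ 3, 62]])

total = bins.sum(axis=0)        # per-class totals: [33, 96]
left = np.cumsum(bins, axis=0)  # per-class counts at or below each threshold
right = total - left            # per-class counts above each threshold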
Implementation
First we need to be able to find a bin, given a sorted list of thresholds and an attribute value. One way to do this is an adjusted binary search: we check the center bin of the current range; if it is our bin, we are done, otherwise we continue searching to the left or to the right.
def bisection(val, th, start, stop):
    # Find the bin index i such that th[i - 1] < val <= th[i],
    # clamping to the first and last bin at the boundaries.
    center = int((stop - start) / 2 + start)
    n = len(th) - 1
    if center <= 0:
        return 0
    elif center >= n:
        return n
    elif val > th[center - 1] and val <= th[center]:
        return center
    elif val < th[center]:
        return bisection(val, th, start, center)
    else:
        return bisection(val, th, center, stop)
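As a quick sanity check, here is how the search behaves on a small, assumed threshold list (the values are arbitrary):

th = [1.0, 2.0, 3.0]
bisection(0.5, th, 0, len(th))  # -> 0 (below the first threshold)
bisection(1.5, th, 0, len(th))  # -> 1 (th[0] < 1.5 <= th[1])
bisection(5.0, th, 0, len(th))  # -> 2 (clamped to the last bin)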
The other thing we need is an entropy calculation. Since we are dealing with a classification problem, we implement the discrete entropy.
import numpy as np

def entropy(x):
    # Discrete entropy, in bits, of a vector of class counts.
    scaler = x.sum()
    if scaler == 0:
        return 0.0
    logScaler = np.log2(scaler)
    h = 0.0
    for i in range(len(x)):
        if x[i] > 0:
            p = x[i] / scaler                   # class probability
            log_p = np.log2(x[i]) - logScaler   # equals log2(p)
            h += p * log_p
    return -h
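A quick check on assumed count vectors: a uniform two-class node should yield one bit, a pure node zero.

entropy(np.array([8.0, 8.0]))   # -> 1.0 (uniform over two classes)
entropy(np.array([16.0, 0.0]))  # -> 0.0 (pure node)
entropy(np.array([0.0, 0.0]))   # -> 0.0 (empty bin, by convention)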
Finally, we can implement the algorithm sketch from above.
class LabelHistogram:
    def __init__(self, thresholds, n_classes):
        self.thresholds = thresholds
        self.n_bins = len(thresholds)
        self.n_classes = n_classes
        self.bins = np.zeros((self.n_bins, n_classes))

    def insert(self, attr, label):
        # Find the bin for this attribute value and count its label.
        i = bisection(attr, self.thresholds, 0, self.n_bins)
        self.bins[i, label] += 1

    def gain(self, parent_entropy):
        # parent_entropy is the entropy of the unsplit (parent) node.
        # Scan the thresholds from lowest to highest, keeping a running
        # per-class count of the samples at or below the current threshold.
        total = np.sum(self.bins, axis=0)
        cur = np.zeros(self.n_classes)
        max_gain = float('-inf')
        max_th = None
        for i in range(self.n_bins):
            cur += self.bins[i]
            left = cur
            right = total - cur
            cleft = left.sum()
            cright = right.sum()
            # Use a separate scalar here; overwriting `total` would corrupt
            # the per-class counts needed on the next iteration.
            n = cleft + cright
            gain = parent_entropy - ((cleft / n) * entropy(left)
                                     + (cright / n) * entropy(right))
            if gain > max_gain:
                max_gain = gain
                max_th = self.thresholds[i]
        return (max_gain, max_th)
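Putting everything together, here is a minimal end-to-end sketch; the data and thresholds are assumptions for illustration:

import numpy as np

# Assumed toy data: one continuous attribute with binary labels.
attrs = np.array([0.2, 0.8, 1.4, 1.9, 2.5, 3.1])
labels = np.array([0, 0, 0, 1, 1, 1])

thresholds = [1.0, 2.0, 3.0]
hist = LabelHistogram(thresholds, n_classes=2)
for a, l in zip(attrs, labels):
    hist.insert(a, l)

# The gain is computed relative to the entropy of the unsplit node.
parent_entropy = entropy(np.bincount(labels).astype(float))
best_gain, best_th = hist.gain(parent_entropy)
print(best_gain, best_th)  # ~0.459 bits at threshold 1.0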
References
[1] A. Criminisi and J. Shotton (eds.): Decision Forests for Computer Vision and Medical Image Analysis, Springer, 2013.