Saturday, April 15, 2017

Three Algorithm Tricks in Data Mining

I really enjoy it when concepts can be reused or applied in several contexts. Three algorithms
that seem to be everywhere are quick sort, merge sort and binary search. In this post I will
show some tricks I learned while coding: using an algorithm similar to the merge step
of merge sort for fast sparse vector multiplication, using binary search to insert into a histogram in O(log(n)) time, and using the partition method of quick sort during C4.5 decision tree training.


Fast Sparse Vector Multiplication

A sparse vector is a vector in which most of the entries are zeros. A common data structure for it is an array holding only the non-zero entries. Each entry is a tuple of a position and the corresponding value, and the array is sorted in increasing order of position. Given two such vectors x and y, we can compute their dot product very efficiently. We use two indices i and j, one into each vector's array. As long as the positions of x(i) and y(j) are equal, both indices move at the same speed, and we accumulate the product of the two values. If x(i) points to a larger position than y(j), we only move j forward until the positions match again, and vice versa. The skipped entries need no computation, since each of them is paired with a zero entry in the other vector. The movement of i and j is the same as in the merge step of merge sort.



object DotProduct {

  // one non-zero entry: its position in the dense vector and its value
  case class SparseNode(pos: Int, value: Double)

  // a sparse vector is an array of non-zero entries, sorted by position
  type SVector = Array[SparseNode]

  def dot(x: SVector, y: SVector): Double = {
    var i = 0
    var j = 0
    var product = 0.0
    while(i < x.length && j < y.length) {
      if(x(i).pos == y(j).pos) {
        // positions match: multiply the values and advance both indices
        product += x(i).value * y(j).value
        i += 1
        j += 1
      } else if(x(i).pos < y(j).pos) i += 1 // x lags behind: skip its entry
      else j += 1                           // y lags behind: skip its entry
    }
    product
  }

}
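
A quick usage sketch with made-up numbers: the arrays below encode the dense vectors x = (1, 0, 2, 0, 3) and y = (0, 4, 5, 0, 6), so the only overlapping positions are 2 and 4 and the dot product is 2 * 5 + 3 * 6 = 28.


object DotProductExample {

  import DotProduct._

  def main(args: Array[String]): Unit = {
    // x = (1, 0, 2, 0, 3) and y = (0, 4, 5, 0, 6) in dense notation
    val x: SVector = Array(SparseNode(0, 1.0), SparseNode(2, 2.0), SparseNode(4, 3.0))
    val y: SVector = Array(SparseNode(1, 4.0), SparseNode(2, 5.0), SparseNode(4, 6.0))
    println(dot(x, y)) // 28.0
  }

}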

Histogram Insertion

A histogram can be represented as an array that holds the counts of values falling into
certain regions. The regions are defined by another array containing the bin borders, sorted
in increasing order. When we estimate the histogram from data, we need to find, for each value, the region it falls into. Naively, that can be done by iterating over the border array and checking whether the value lies in the corresponding range, which takes O(n) time. However, since the borders are sorted, we can perform this search with binary search in O(log(n)) time.
Instead of searching for an exact element, we check whether the value is larger than boundary[middle - 1] and at most boundary[middle]. If so, we increase the count of the middle bin; otherwise we recursively search the left or the right half, depending on whether the value is smaller or larger than the middle border.


class BinarySearchHistogram(val binBorders: Array[Double]) {

  // one count per border; bin i collects values in (binBorders(i - 1), binBorders(i)]
  final val bins = Array.fill(binBorders.length)(0.0)

  def findBin(x: Double, start: Int, stop: Int): Int = {
    val center = ((stop - start) / 2) + start
    // the outermost bins also catch values outside the border range
    if (center == 0 || center == binBorders.length - 1) center
    else if (x > binBorders(center - 1) && x <= binBorders(center)) center // x falls into the center bin
    else if (x > binBorders(center - 1)) findBin(x, center, stop) // search the right half
    else findBin(x, start, center) // search the left half
  }

  def insert(x: Double): Unit = bins(findBin(x, 0, binBorders.length)) += 1.0

}
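
A small usage sketch with made-up borders: given the borders (1, 2, 3, 4), the value 0.5 lands in bin 0, 1.5 in the bin for (1, 2], 2.5 and 2.7 in the bin for (2, 3], and 5.0 is caught by the last bin.


object HistogramExample {

  def main(args: Array[String]): Unit = {
    val hist = new BinarySearchHistogram(Array(1.0, 2.0, 3.0, 4.0))
    Seq(0.5, 1.5, 2.5, 2.7, 5.0).foreach(hist.insert)
    println(hist.bins.mkString(", ")) // 1.0, 1.0, 2.0, 1.0
  }

}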

Decision Tree Learning

A dataset for decision tree learning can be represented as a 2d array X[N][D] of N vectors, each with D
dimensions. Each node in the decision tree partitions the data points into two subsets: one contains all vectors whose value in the split dimension is at most a threshold, the other contains the vectors where it is larger. Instead of copying the data, we can introduce an ordering array of row indices. Given a pivot element, the left side of the array contains all instances for which the split decision is true and the right side contains all elements for which it is false. We swap elements from the right to the left whenever
a split decision is violated (the element should be on the left but is on the right). We can then call the partition
function recursively on the left and the right side, which partitions the dataset according to the split decisions, just as a decision tree does. Of course, we still have to select the best
split decision at every level, but that is material for another post. In the end, this function is very similar to quick sort's partition method. One more note: the method only works for binary decision trees, which is not a huge problem in practice.


class DataSet(val ordering: Array[Int], val data: Array[Array[Double]]) {

  // Partitions ordering(lo until hi) so that all rows satisfying the decision
  // come first; returns the pivot index separating the two sides.
  def partition(dec: Decision, hi: Int, lo: Int): Int = {
    var i = lo // boundary: everything left of i satisfies the decision
    var j = lo // scan pointer
    while(j < hi) {
      if( dec.f( data(ordering(j)) ) ) {
        // ordering(j) belongs on the left: swap it to the boundary
        val tmp = ordering(j)
        ordering(j) = ordering(i)
        ordering(i) = tmp
        i += 1
      }
      j += 1
    }
    i
  }

}

// an axis-aligned split: true if attribute attr is at most the threshold th
case class Decision(attr: Int, th: Double) {

  def f(x: Array[Double]): Boolean = x(attr) <= th

}
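
To see the partition method in action, here is a made-up example: four points in two dimensions, split on attribute 0 with threshold 3.0. Rows 0 and 2 satisfy the decision and end up left of the returned pivot; a tree learner would then recurse on (lo, pivot) and (pivot, hi).


object PartitionExample {

  def main(args: Array[String]): Unit = {
    val data = Array(
      Array(2.0, 7.0),
      Array(5.0, 1.0),
      Array(1.0, 9.0),
      Array(4.0, 3.0)
    )
    val ds = new DataSet(Array(0, 1, 2, 3), data)
    val pivot = ds.partition(Decision(attr = 0, th = 3.0), hi = data.length, lo = 0)
    println(pivot)                      // 2
    println(ds.ordering.mkString(", ")) // 0, 2, 1, 3
  }

}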