Thursday, October 10, 2013

My thoughts on the value of data processing in practical applications.

Just a quick post with a few thoughts on data engineering vs. machine learning.
While I definitely like complex algorithms for Machine Learning, my recent results on applications such as Dolphin Communications or Typing Error Detection show me that most of the time the data preprocessing does a lot of the work, while I get away with more or less simple models such as Gaussian Mixture Models, Hidden Markov Models, Decision Trees or Random Forests (okay, the last one just works great in general). Now the question is where the tradeoff lies between data engineering and actual learning. Deep Learning promises to extract features on its own and sits on the learning-heavy side of things. Speech Recognition, I think, is an example of the other side: it is traditionally approached with carefully crafted features such as Mel-frequency cepstral coefficients (MFCCs). I observe the same difference between carefully structured models such as Hidden Markov Models and more or less structureless models.
So to me it seems that fitting an existing algorithm to a new domain with heavy preprocessing can lead to good results faster than searching for new models. Maybe I should call it Machine Learning Prototyping. While developing learning methods to solve new problems is the higher goal, whenever one needs a learning "prototype" for a specific domain, hacking the data and some existing model might be sufficient.
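To make the prototyping idea concrete, here is a minimal toy sketch in plain Python. It is not taken from any of the projects mentioned above: the signals, the two frequencies, the noise level and all function names are assumptions made up for illustration. The same deliberately simple model, a one-level decision tree, is trained once on a raw sample value and once on a hand-crafted feature (the mean absolute difference between neighbouring samples).

```python
import math
import random

rng = random.Random(0)  # fixed seed so the toy experiment is repeatable

def make_signal(freq, n=200, noise=0.05):
    """Hypothetical data: a noisy sine wave with a random phase."""
    phase = rng.uniform(0, 2 * math.pi)
    return [math.sin(2 * math.pi * freq * t / n + phase) + rng.gauss(0, noise)
            for t in range(n)]

def make_dataset(n_per_class):
    """Class 0 is a slow wave (3 cycles), class 1 a fast wave (8 cycles)."""
    data = [(make_signal(freq), label)
            for label, freq in ((0, 3), (1, 8))
            for _ in range(n_per_class)]
    rng.shuffle(data)
    return data

# Two candidate features: one raw sample vs. a hand-crafted summary.
def first_sample(signal):
    return signal[0]

def mean_abs_diff(signal):
    """Average absolute change between neighbouring samples."""
    return sum(abs(b - a) for a, b in zip(signal, signal[1:])) / (len(signal) - 1)

def fit_stump(xs, ys):
    """One-level decision tree: threshold and polarity with fewest training errors."""
    values = sorted(set(xs))
    best = None
    for thr in ((a + b) / 2 for a, b in zip(values, values[1:])):
        for high_label in (0, 1):
            errors = sum((high_label if x > thr else 1 - high_label) != y
                         for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, thr, high_label)
    return best[1], best[2]

def predict(stump, x):
    thr, high_label = stump
    return high_label if x > thr else 1 - high_label

def accuracy(feature, train, test):
    """Train the stump on one feature and report held-out accuracy."""
    stump = fit_stump([feature(s) for s, _ in train], [y for _, y in train])
    return sum(predict(stump, feature(s)) == y for s, y in test) / len(test)

train, test = make_dataset(100), make_dataset(50)
raw_acc = accuracy(first_sample, train, test)
engineered_acc = accuracy(mean_abs_diff, train, test)
print("raw sample feature:  ", raw_acc)
print("hand-crafted feature:", engineered_acc)
```

The model is identical in both runs, so any gap in accuracy comes from the feature alone. On real data the useful feature is of course much harder to find than in this constructed example.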


  1. "most of the time the data preprocessing is doing a lot of the work"

    This is the ugly secret about pattern rec. 90% of the work is in finding and generating the features.

  2. Did you use the same features for HMM and Decision Trees? That doesn't make sense to me.

    1. No, those were different projects. I am not comparing methods here :)