The Care and Feeding of Machine Learning Algorithms

Machine Learning was an intimidating phrase to me for a long time. I always assumed it was a dark art accessible only to those with computer science degrees. 

Fortunately, in the past couple of years a growing number of tutorials have appeared that describe the practical implementation of such algorithms. An excellent example is Joel Grus’ Hacking Hacker News article, which describes how he set up and trained a Bayesian algorithm to recommend upcoming articles that he would like, regardless of their popularity on HN itself.

One newbie mistake I made was focusing too much attention on which algorithm to use for a particular task. It’s true that algorithms like SVMs, naive Bayes, and neural networks each work best for certain types of tasks. However, it’s like building a house and obsessing over which type of hammer (ball peen, rubber mallet) to use: if the choice of hammer becomes more important than whether you’re building on a solid foundation, you’re doomed.

Machine learning is all about the care and feeding of the algorithm. The most important questions are (1) what’s your training set, and (2) how are you validating the trained model? If you focus on those two questions, the rest falls into place easily.
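
To make those two questions concrete, here is a minimal Python sketch using only the standard library. Everything in it is an assumption for illustration: the toy article titles and labels are made up, the function names are hypothetical, and the "model" is just a majority-vote baseline rather than a real Bayesian classifier. The point is the shape of the workflow: hold out part of your labeled data, train only on the rest, and measure accuracy on the part the model never saw.

```python
import random
from collections import Counter


def train_validation_split(examples, validation_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    # Return (training_set, validation_set); the two never overlap.
    return shuffled[n_validation:], shuffled[:n_validation]


def train_majority_baseline(training_set):
    """Return a 'model' that always predicts the most common training label."""
    label_counts = Counter(label for _, label in training_set)
    most_common_label = label_counts.most_common(1)[0][0]
    return lambda features: most_common_label


def accuracy(model, labeled_examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for features, label in labeled_examples
                  if model(features) == label)
    return correct / len(labeled_examples)


# Hypothetical toy data: (article title, did I like it?) pairs.
examples = [
    ("Intro to Bayesian spam filtering", True),
    ("Startup raises another funding round", False),
    ("Writing a parser in Haskell", True),
    ("Celebrity joins social network", False),
    ("Cross-validation explained", True),
    ("Top ten productivity apps", False),
    ("Why your training set matters", True),
    ("Phone rumor roundup", False),
]

training_set, validation_set = train_validation_split(examples)
model = train_majority_baseline(training_set)
validation_accuracy = accuracy(model, validation_set)
print(f"Validation accuracy: {validation_accuracy:.2f}")
```

A baseline this dumb is still useful: whatever algorithm you eventually pick has to beat it on the same held-out set, and if it can't, the problem is more likely the training data than the choice of hammer.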