Support Vector Machines in Data Science

January 30, 2015

Support Vector Machines (SVMs) may not be as popular as neural networks within data science, but they are powerful, useful algorithms. One of the historical difficulties of SVMs has been the computational effort required to train them. However, LIBSVM, which has been in use for over a decade, can fairly easily handle the 30,000 training points in the National Data Science Bowl competition’s data set. That makes SVMs a viable tool both in general and for competing in the Data Science Bowl.

At their heart, SVMs find a linear solution to the problem presented by the data set, much like the long-used Logistic Regression and Linear Regression. What the SVM does differently is try to maximize the distance between the classes involved, which is known as margin maximization. The intuition is that if you find the line that maximizes the distance between two classes, it is more likely to generalize well to unseen data.
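To make that concrete, here is a minimal sketch of fitting a linear, maximum-margin classifier with scikit-learn. The synthetic data and the specific parameter values are placeholders for illustration only:

```python
# A minimal sketch of a linear, maximum-margin SVM using scikit-learn.
# The data here is synthetic; swap in your own feature matrix and labels.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C controls the trade-off between a wide margin and misclassified points.
clf = LinearSVC(C=1.0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```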

What makes SVMs in data science truly interesting is what is known as the “kernel trick.” It lets the algorithm transform your input features into a different feature space and find a linear solution there, which may be non-linear in the original dimensions. A kernel has the form K(x,y), where x and y are two data points; it outputs a single real value, with larger values indicating more similarity between x and y and smaller values indicating less. For example, if you wanted to include the interactions of every pair of features, you might create extra features representing the products of those pairs. This is equivalent to creating the degree-2 polynomial expansion of the features, which the SVM can do as a kernel trick. Because the SVM can do this without actually forming the features, you can raise the degree as high as you like and it will still work (if you tried to build the expansion explicitly, you would very quickly find yourself unable to learn the model, or even to fit the data in memory).
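As a rough illustration of the trick, the sketch below compares explicitly building the degree-2 interaction features against letting the SVM use a degree-2 polynomial kernel. The two routes correspond to the same feature space (up to a rescaling of terms), but only the explicit route ever materializes it. The toy data and parameter values are assumptions for demonstration only:

```python
# A sketch of the kernel trick: explicit degree-2 feature expansion versus
# a degree-2 polynomial kernel that never forms the expanded features.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # a target that needs interaction terms

# Explicit route: expand to all degree-2 terms, then fit a linear SVM.
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
explicit = SVC(kernel="linear").fit(X_poly, y)

# Kernel route: let the SVM compute K(x, y) = (gamma * <x, y> + coef0)^2 itself.
implicit = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0).fit(X, y)

print(explicit.score(X_poly, y), implicit.score(X, y))
```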

Another common choice, and probably the most widely used kernel of all, is the RBF kernel. It has the nice property that using it with an SVM can be interpreted as a type of smarter nearest-neighbor search.
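Here is a small sketch of the RBF kernel, which scores similarity by distance as K(x, y) = exp(-gamma * ||x - y||^2): nearby points score close to 1 and far-apart points close to 0, which is where the nearest-neighbor intuition comes from. The data set and parameter values below are arbitrary choices for illustration:

```python
# A minimal sketch of an SVM with the RBF kernel on a toy non-linear problem.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# gamma sets how quickly similarity decays with distance; C is the usual
# margin/error trade-off. Both are typically chosen by grid search.
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```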

Additionally, SVMs combined with kernels have some very nice mathematical properties, one of which is that you can add or multiply valid kernels together to get a new, valid kernel. We can use this to incorporate different types of features more elegantly into a single model.

In the case of the National Data Science Bowl data set, we could use a combination of three kernels, one for each feature type that Aaron Sander (one of my fellow data scientists) suggested in his post; call them k1, k2, and k3. If I noticed that the spatial features (k1) tend to give low scores only when two inputs are definitely different classes, I could define K(x,y) = k1(k2 + k3). That way, when the spatial features indicate a low match, the combined kernel strongly discourages the algorithm, even if the other sets of features thought it was a viable match. I could even add some extra knobs to tune, making K(x,y) = k1(c2 k2 + c3 k3), which lets me favor one set of features over the others.
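In scikit-learn you can experiment with combinations like this by passing a custom kernel function. The sketch below is only a toy: the split of the columns into three "feature types" and the use of RBF kernels for k1, k2, and k3 are my own assumptions, but the combination K = k1(c2 k2 + c3 k3) mirrors the formula above:

```python
# A sketch of combining per-feature-type kernels into one custom SVM kernel.
# The column ranges and the choice of RBF for each piece are hypothetical.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def combined_kernel(A, B, c2=1.0, c3=1.0):
    # Pretend columns 0-4 are "spatial" features, 5-9 and 10-14 the other two types.
    k1 = rbf_kernel(A[:, 0:5], B[:, 0:5])
    k2 = rbf_kernel(A[:, 5:10], B[:, 5:10])
    k3 = rbf_kernel(A[:, 10:15], B[:, 10:15])
    # Element-wise product and weighted sum of the Gram matrices: K = k1 * (c2*k2 + c3*k3).
    return k1 * (c2 * k2 + c3 * k3)

rng = np.random.RandomState(0)
X = rng.randn(200, 15)
y = (X[:, 0] + X[:, 6] > 0).astype(int)

clf = SVC(kernel=combined_kernel).fit(X, y)
print("training accuracy:", clf.score(X, y))
```

Because products and sums of valid kernels are themselves valid kernels, this construction stays within the mathematical guarantees the SVM relies on.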

This type of feature combination is especially useful for data that cannot be represented as a fixed-length feature vector. Kernels can be defined directly for text data, graphs (like the connections in a social network), and many other structured problems. By using a different kernel for each feature type, we can apply SVMs to more kinds of information than many other algorithms can handle, and we can use them all simultaneously.

Hopefully, this has motivated you to explore the SVM as a potential option for solving this problem, and future ones, involving complex data sets. For more information about LIBSVM, the authors have a short guide on how to use their software, which includes good advice on using SVMs in general. It is readily available in Python (scikit-learn), R (the e1071 package), and Java (Weka), and it has been ported and wrapped into numerous other programming languages and libraries as well.

Feel free to talk or ask me questions @EdwardRaffML. Good luck!

—Written by Edward Raff