I recently learned about SVM, support vector machines...for a tutoring lesson. This was for a graduate course in statistics. The math is quite complex, it involves (possibly) infinite dimensional spaces called Hilbert Spaces, where we can extend the concepts of Euclidean Geometry and Linear Algebra!

The problem involves a collection of sample or training (meant for machine learning) DATA that is classified in 1 of 2 possible ways. We represent that classification as a boolean variable, with +1 or -1. The data are some points x_i for some finite natural index I. The data points with their classifier are given. The aim is to segregate the entire spectrum of potential data (the domain or field of DATA space) using a pair of 'hyperplanes', so that if we acquire future data pts that are missing the classifiers, we can classify them easily! We also sometimes wish to segregate the data as much as possible, so as to easily identify future DATA anomalies. To do this we try to MAXIMIZE the distance between the hyperplanes.

Let's simplify the problem! For ease of understanding, lets limit the dimensions in the problem...as much as possible...assume all the x_i variables are simply REAL numbers. This puts our data space into 2 dimensions, one for the x_i value and one for the classifier, either +1 or -1. Graphically we see that this looks like 2 real number lines separated by a distance of 2, one is the equation y=+1 and the other y=-1. Can anyone guess the ideal choice for the hyperplanes??

A slightly more interesting example allows for the Data x_i to lie in a 2-dimensional plane. That is for each i in our Index I we have a data point x_i=(x_i1,x_i2) a coordinate pair, a point in the plane. Further, we are given a classifier variable to which we could color the points in the plane we are given, say red and blue!

Imagine if you will that all the blue points lie below the line y=x and all the red points lie above the line y=x AND the line y=-x. We can see graphically that there are no points to the left of the criss-cross 'X' the two lines make through the origin. All red points are in the Northern quadrant of the X and all blue are either to the East or South! A possible hyperplane for the blue data boundary is the line y=x, a good boundary for the red data boundary is the upper 'V' of the 'X'. Note that since all the points cannot lie on the lines y=+/-x, we can increase the thickness of the red boundary upward/downward and the blue boundary downward/upward, at least a little bit, without adding any training data points into this separation band. The boundaries of this gap or margin are two hyperplanes that separate the training data.

Formally, we would like equations for the best possible boundaries of this nature. The magic of linear algebra is that we can find these explicitly. Recall we have data triplets of the form (x_i1,x_12,+/-1)=(x_i,y_i) in R^3 whenever our training data is boolean class identified points in R^2.

Hyperplanes can be denoted by an equation w 'dot' x =b or w\dot x-b=0, in 'w' X 'x' Space, for some appropriate dot-product. For this example we know x is R^2 because of the data...lets assume w is R^2 also, and note that this dot product takes two length two vectors and outputs a real number. It is a map from a 4-dimensional space to R, its actually a map from R^2 X R^2 -> R. One such dot-product could be thought of as w=(w_1,w_2) as w^T*x = w*x^T = x*w^T = x^T *w, as the traditional dot product of vectors in R^2. This natural choice tells us that w 'dot' x_i = w_1*x_i1 + w_2*x_i2.

so the hyperplane would then be given by w_1*x_1 + w_2*x_2 - b.

Recall how we pushed out the boundary before in the red/blue example, here we do a similar trick using two different hyperplanes, w dot x - b =+1 and w dot x - b =-1. The distance between these planes is 2/(w_1^2+w_2^2)=2/||w||. In order to maximize the separation between hyperplanes, we must minimize ||w|| = 2norm of w= square root of the sum of the squares. So our maximization problem turned into a minimization problem...no worries its still a question of finding an extrema...something calculus tells us all about!

Note if we choose to prevent points from falling in between these two hyperplanes (the margin) we should put boundary conditions that prevent +1 x_i's going below w dot x_i -b=+1 and -1 x_i's from being above w dot x_i -b=-1...

ie we need to enforce the conditions w \dot x_i -b >= 1 and w \dot x_i -b <= -1. Using the fact that each x_i comes with a class identifier, that we called y_i above, we can rewrite (and by division and flipping the inequality when dividing by negative y_i's) the above 2 inequalities for each class of data points (x_i,y_i) as one single inequality:

y_i(w \dot x_i -b) >= 1 for all i in the Index. And we seek only to minimize the 2norm of w subject to this single constraint!

This is the SVM problem!

The methods of solution all require the use of calculus...a technique for solving optimization problems called Lagrange Multipliers!

Lets change gears for a moment and examine this technique in detail.

Suppose we have two continuous and partially differentiable functions f,g: R^k -> R, that is two real maps that take in vectors x=(x_1,...,x_k) and output real numbers.

We wish to find a point in R^k that is an extrema (WLOG assume maxima) of f, subject to a constraint that the value of g is some constant, c at that point. WLOG assume c=0, for now.

This problem amounts to finding the x s.t. f(x) is maximal over all the x's with g(x)=c.

The trick here is to consider 'level sets' for f, that is contours of constant value. Solutions to the equation f(x)=d form such contours, which are continuous curves bc f is continuous (and differentiable). Note how because f is constant (has value d) along these curves, EVERY single partial derivative of f along those curves is 0.

Consider a new function L(x,a) = f(x) - a*g(x), for some constant real number a.

Recall that a gradient is a vector consisting of the partial derivatives. If we apply a gradient to L() and set this equal to 0, we will find the point we seek!

\grad_x,a L(x,a) = 0 finds us the solution(s), for some value of a!

Consider the (\del)/(\del a) piece of the gradient, that is the partial derivative wrt a. Since f(x) doesn't depend on a it is 0, and we get that -g(x)=0, this enforces the condition we were subject to. When we look at derivatives wrt x, a is a constant and pulls through, thus the equation \grad L(x,a) = 0 it finds us solutions (x,a) such that g(x)=0. Good our, restriction is satisfied!

Why are these values maxima for f? Well that comes from the idea of a critical point in calculus. Functions are locally extreme when their derivatives are 0. Going back to the idea of contours, f(x)=d, along these curves the derivative of f=0 bc it is constant. A bit of geometry here is needed to understand what the gradient actually gives us. When we consider contours of a function and look at their gradient along the curve, it provides us a normal vector...that is a vector perpendicular to the tangent at each point. By using \grad L(x) = 0, this forces: \grad_x f(x) = a * \grad_x g(x).

This means that the normal vectors for f and g are in the same direction at these x's (along the contour), i.e. the tangent to the contour of f is tangent to g, Graphically this is what must happen as we blow out the contours from some starting point, say a point where f(x)=0.

There is a lot more to this method of Lagrange multipliers, see wikipedia for a good reference.

The important point here is that this provides a tool to solve any extrema problem that is subject to a constraint!

Going back to our SVM problem with training data:

We were given data (x_i,y_i) in R^3 where y_i=+/-1 for all i in the Index.

We seek maximally spaced hyperplanes described by w \dot x -b=+/-1,

this meant we need to find (w,b) s.t. y_i(w \dot x_i -b) >= 1 for all i, and ||w|| is minimized!

if we think of ||w|| as f, the function we want the extrema of, and the constraint above as our g...we can apply the technique of Lagrange multipliers...

we define L(w,b,\alpha) = ||w|| - \alpha \dot (y_i(w \dot x_i -b)-1). Here we need \alpha = vector of a_i's for each i in the Index, for this to work, because there a multiple constraints for each i. Thats it, the rest is just messy algebra using summations and writing out all the terms based on definitions of the \dot product. This is then a problem in many variables...w which is a vector with real valued components, b which is a real number and a vector indexed over I called our Lagrange multiplier \alpha. We force the \alpha_i components to be positive and seek to maximize over \alpha non-negative while simultaneously minimizing over (w,b).

Some of the real work for SVM comes from the realizations that only a handful of the training data points matter, in the sense that only some of them 'support' the hyperplanes. These correspond to the \alpha_i's which are non-zero, and those vectors x_i become support vectors for the hyperplane(s).

In the linear setting, like we were discussing above, the w is actually a linear combination of the training data...

thus, w=\sum_i \alpha_i y_i x_i (this result is due to KKT, Krush-Kuhn-Tucker).

Through this substitution, and the modification to examine ||w||^2 to eliminate the square root... we do some algebra and notice that the Lagrange function we defined above depends only on \alpha! (since w was a linear combination of y_i x_i terms with coefficients equal to \alpha_i, and b=w \dot x_i - y_i can be normalized to be Constant * \sum_i w \dot x_i -y_i).

The algebra works out to say that:

L(\alpha) = \sum_i \alpha_i - 1/2 \sum_i,j \alpha_i \alpha_j y_i y_j (x_i \dot x_j) subject to constraints that \alpha_i >=0 (and to minimize b: \sum_i \alpha_i y_i =0).

The recognition of the dot product above as some image of a kernel, is what allows us to begin to analyze non-linear spaces for the data! This is where the math gets hairy, but the general ideas are simply extensions, and the reductions in algebra are simply arithmetic.

The viewpoint that changing variables when we apply Lagrange multipliers is actually transforming space, allows for us to attack the problem in this dual setting, and eventually only have one variable \alpha to consider for the maximization!

This can be viewed as projecting into a higher dimensional space to solve this hyperplane location problem, where the kernel is some positive definite function on the DATA space, K(x,x_i)=\phi(x) \dot \phi(x_i)...where \phi is the transformation of complex data space into a higher dimensional (Hilbert) space which has a \dot product accessible to it that behaves linearly in some way! Note that the hyperplane w must now exist in the transformed space, because in the nonlinear setting we can only say that it is a linear combination of the images of the x_i data points...

w = \sum_i \alpha_i y_i \phi(x_i), and as such any dot products w \dot \phi(x) = \sum_i \alpha_i y_i K(x_i,x).

That is SVM in a nutshell!

Commonly used Kernels are the following:

K(x_i,x_j) = (x_i \dot x_j)^d or (1+x_i \dot x_j)^d

homogeneous and non-homogeneous polynomials of degree d.

K(x_i,x_j) = \exp{-\gamma ||x_i - x_j||^2} for \gamma>0, a nice choice is \gamme = 1/2*\sigma^2.

This is called a Gaussian radial Kernel, as it is reminiscent of a normal distribution PDF.

Another one commonly used involves a hyperbolic tangent function!