Support Vector Machine
In general, there are two common strategies for classifying non-linear data:
- Fit a non-linear classification algorithm to the data in its original feature space.
- Enlarge the feature space to a higher dimension where a linear decision boundary exists.
SVMs aim to find a linear decision boundary in a higher dimensional space, but they do this in a computationally efficient manner using Kernel functions, which allow them to find this decision boundary without having to apply the non-linear transformation to the observations.
There are many ways to enlarge the feature space via some non-linear transformation of the features (higher-order polynomial terms, interaction terms, etc.). Let’s look at an example where we expand the feature space by applying a quadratic polynomial expansion.
Suppose our original feature set consists of the p features $x_1, x_2, \dots, x_p$.
Our new feature set after applying the quadratic polynomial expansion consists of the 2p features $x_1, x_1^2, x_2, x_2^2, \dots, x_p, x_p^2$.
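As a minimal sketch (using NumPy with hypothetical toy data, not anything from this article), the expansion simply appends each feature’s square:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 observations, p = 3 original features

# Append each feature's square to get the 2p quadratic features described above
X_quad = np.hstack([X, X ** 2])
print(X_quad.shape)           # (5, 6)
```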
Now, we need to solve the following optimization problem.
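A sketch of that problem, assuming the standard soft-margin (maximal margin with slack variables) form of the SVC:

$$
\begin{aligned}
\underset{\beta_0,\,\beta_{11},\beta_{12},\dots,\beta_{p1},\beta_{p2},\,\epsilon_1,\dots,\epsilon_n,\,M}{\text{maximize}}\quad & M \\
\text{subject to}\quad & y_i\Big(\beta_0 + \sum_{j=1}^{p}\beta_{j1}x_{ij} + \sum_{j=1}^{p}\beta_{j2}x_{ij}^2\Big) \ge M(1-\epsilon_i), \\
& \epsilon_i \ge 0,\quad \sum_{i=1}^{n}\epsilon_i \le C,\quad \sum_{j=1}^{p}\sum_{k=1}^{2}\beta_{jk}^2 = 1
\end{aligned}
$$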
It’s the same as the SVC optimization problem we saw earlier, but now we have quadratic terms included in our feature space, so we have twice as many features. The solution to the above will be linear in the quadratic space, but non-linear when translated back to the original feature space.
However, solving the problem above requires applying the quadratic polynomial transformation to every observation the SVC is fit on, which can be computationally expensive with high dimensional data. Additionally, for more complex data, a linear decision boundary may not exist even after applying the quadratic expansion. In that case, we must explore other, even higher dimensional spaces before we can find a linear decision boundary, and the cost of applying those non-linear transformations to our data can grow prohibitively. Ideally, we would be able to find this decision boundary in the higher dimensional space without having to apply the required non-linear transformation to our data.
Luckily, it turns out that the solution to the SVC optimization problem above does not require explicit knowledge of the feature vectors for the observations in our dataset. We only need to know how the observations compare to each other in the higher dimensional space. In mathematical terms, this means we just need to compute the pairwise inner products (chap. 2 here explains this in detail), where the inner product can be thought of as some value that quantifies the similarity of two observations.
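As a rough illustration of this point (a sketch, not part of the original example): scikit-learn’s SVC can be fit directly on a Gram matrix of pairwise inner products via kernel="precomputed", so the classifier never needs the expanded feature vectors themselves, only how the observations compare to each other:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)

# Quadratic expansion from the example above (for illustration only)
Z = np.hstack([X, X ** 2])

# Gram matrix of pairwise inner products in the expanded space
gram = Z @ Z.T

clf = SVC(kernel="precomputed", C=1).fit(gram, y)

# Predicting new points only requires their inner products with the training points
print(clf.predict(Z[:5] @ Z.T))
```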
It turns out that for some feature spaces, there exist functions (i.e. Kernel functions) that allow us to compute the inner product of two observations without having to explicitly transform those observations into that feature space. More detail behind this Kernel magic, and when it is possible, can be found in chap. 3 & chap. 6 here.
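As a quick numerical check of this idea (a sketch using the well-known degree-2 polynomial kernel, which is not necessarily the expansion used above): the kernel value (1 + x·x′)² equals the inner product of two observations under an explicit degree-2 feature map, yet computing the kernel never forms that map:

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)   # two observations with p = 2 features

def phi(v):
    # Explicit feature map whose inner products the degree-2 polynomial kernel reproduces
    return np.array([
        1.0,
        np.sqrt(2) * v[0], np.sqrt(2) * v[1],
        v[0] ** 2, v[1] ** 2,
        np.sqrt(2) * v[0] * v[1],
    ])

explicit = phi(x) @ phi(z)      # transform both observations, then take the inner product
kernel = (1 + x @ z) ** 2       # same value, computed without any transformation
print(np.isclose(explicit, kernel))   # True
```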
Since these Kernel functions allow us to operate in a higher dimensional space, we have the freedom to define decision boundaries that are much more flexible than those produced by a typical SVC.
Let’s look at a popular Kernel function: the Radial Basis Function (RBF) Kernel.
For two observations x and x′, it is defined as $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$, where the parameter $\gamma > 0$ controls how quickly the similarity decays with distance. For the sake of basic intuition the details aren’t important: just think of it as something that quantifies how “similar” two observations are in a high (infinite!) dimensional space.
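As a small sanity check (a sketch, not part of the original example), the kernel value can be computed directly from the formula or with scikit-learn’s rbf_kernel helper:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
x, z = rng.normal(size=(1, 2)), rng.normal(size=(1, 2))   # two toy observations

gamma = 1.0
manual = np.exp(-gamma * np.sum((x - z) ** 2))    # K(x, x') = exp(-gamma * ||x - x'||^2)
library = rbf_kernel(x, z, gamma=gamma)[0, 0]     # same value via scikit-learn
print(np.isclose(manual, library))                # True
```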
Let’s revisit the data we saw at the end of the SVC section. When we apply the RBF kernel to an SVM classifier & fit it to that data, we can produce a decision boundary that does a much better job of distinguishing the observation classes than that of the SVC.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_circles
from sklearn import svm

# create circle within a circle
X, Y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)

kernel_list = ['linear', 'rbf']
fignum = 1

for k in kernel_list:
    # fit the model
    clf = svm.SVC(kernel=k, C=1)
    clf.fit(X, Y)

    # plot the line, the points, and the nearest vectors to the plane
    xx = np.linspace(-2, 2, 8)
    yy = np.linspace(-2, 2, 8)

    X1, X2 = np.meshgrid(xx, yy)
    Z = np.empty(X1.shape)
    for (i, j), val in np.ndenumerate(X1):
        x1 = val
        x2 = X2[i, j]
        p = clf.decision_function([[x1, x2]])
        Z[i, j] = p[0]

    levels = [-1.0, 0.0, 1.0]
    linestyles = ["dashed", "solid", "dashed"]
    colors = "k"

    plt.figure(fignum, figsize=(4, 3))
    plt.contour(X1, X2, Z, levels, colors=colors, linestyles=linestyles)
    plt.scatter(
        clf.support_vectors_[:, 0],
        clf.support_vectors_[:, 1],
        s=80,
        facecolors="none",
        zorder=10,
        edgecolors="k",
        cmap=plt.get_cmap("RdBu"),
    )
    plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired, edgecolor="black", s=20)

    # print kernel & corresponding accuracy score
    plt.title(f"Kernel = {k}: Accuracy = {clf.score(X, Y)}")
    plt.axis("tight")
    fignum = fignum + 1

plt.show()
Ultimately, there are many different choices of Kernel function, which gives us a lot of freedom in the kinds of decision boundaries we can produce. This can be very powerful, but it’s important to pair these Kernel functions with appropriate regularization to reduce the chance of overfitting.
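One common way to do this (a sketch with illustrative parameter values, not a recommendation from this article) is to tune the regularization parameter C together with the kernel’s parameters via cross-validation:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)

# Search over the regularization strength C and the RBF kernel width gamma
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```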