Supervised machine learning problems are typically of the form "minimize your
error while regularizing your parameters." The idea is that while many choices
of parameters may make your training error low, the goal isn't low training
error -- it's low test-time error. Thus, parameters should minimize training
error while remaining "simple," where the definition of "simple" is left up to
the regularization function. Typically, supervised learning can be phrased as
minimizing the following objective function,

$$
\min_w \; \sum_{i=1}^{n} L(y_i, f(x_i; w)) + \lambda \Omega(w)
$$

where $L(y_i, f(x_i; w))$ is the loss for predicting $f(x_i; w)$ when the true
label for sample $i$ is $y_i$, and $\Omega(w)$ is a regularization function.
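As a concrete instance (an example I'm adding for illustration): with squared
loss, a linear predictor $f(x_i; w) = w^T x_i$, and $\Omega(w) = \|w\|_1$, this
objective becomes the lasso,

$$
\min_w \; \frac{1}{2} \sum_{i=1}^{n} \left( y_i - w^T x_i \right)^2 + \lambda \|w\|_1
$$

while swapping in $\Omega(w) = \frac{1}{2}\|w\|_2^2$ gives ridge regression.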
There are many choices for $\Omega(w)$, but the ones I'm going to talk about
today are so-called "sparsifying regularizers" such as $\Omega(w) = \|w\|_1$.
These norms are most often employed because they produce "sparse" $w^*$ -- that
is, weight vectors with many zeros. This is in stark contrast to other
regularizers such as $\Omega(w) = \frac{1}{2}\|w\|_2^2$, which leads to lots of
small but nonzero entries in $w^*$.
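To see the contrast empirically, here's a small sketch (assuming scikit-learn
is available; the data, penalties, and sizes are arbitrary choices of mine,
not from this post) that fits the same synthetic regression problem with an
$L_1$ and an $L_2$ penalty and counts exact zeros in the learned weights:

```python
# Compare L1 (Lasso) and L2 (Ridge) regularization on the same synthetic data.
# Illustrative only -- the data and alphas below are arbitrary.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:5] = 3.0 * rng.standard_normal(5)   # only 5 informative features
y = X @ w_true + 0.5 * rng.standard_normal(n)

lasso = Lasso(alpha=0.5, fit_intercept=False).fit(X, y)
ridge = Ridge(alpha=0.5, fit_intercept=False).fit(X, y)

print("Lasso exact zeros:", int(np.sum(lasso.coef_ == 0.0)), "of", d)  # many
print("Ridge exact zeros:", int(np.sum(ridge.coef_ == 0.0)), "of", d)  # usually none
```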
Feature Selection

One of the key reasons people turn to sparsifying regularizers is that they
lead to automatic feature selection. Quite often, many of the entries of $x$
are irrelevant or uninformative for predicting the output $y$. Minimizing the
objective function using these extra features will lead to lower training
error, but when the learned $w$ is employed at test time it will rely on these
features being more informative than they actually are. By employing a
sparsifying regularizer, the hope is that these features will automatically be
eliminated.
Interpretability

A second reason for favoring sparse solutions is that the model is easier to
interpret. For example, a simple sentiment classifier might use a binary
feature vector where an entry is 1 if a word is present and 0 otherwise. If
the resulting learned weight vector $w^*$ has only a few non-zero entries, we
might believe that those entries correspond to the words most indicative of
sentiment.
We now come to the question of why these regularizers produce sparse
solutions. The usual answer is "because of the $L_1$ norm," but there's a
better reason than that. In fact, I claim that any regularizer that is
non-differentiable at $w_i = 0$ and that can be decomposed into a sum over
coordinates can lead to sparse solutions.
Intuition

The intuition lies in the idea of subgradients. Recall that a subgradient of a
(convex) function $f$ at $x_0$ is any vector $v$ such that,

$$
f(x) \ge f(x_0) + v^T (x - x_0) \quad \text{for all } x
$$

The set of all subgradients of $f$ at $x_0$ is called the subdifferential and
is denoted $\partial f(x_0)$. If $f$ is differentiable at $x_0$, then
$\partial f(x_0) = \{ \nabla f(x_0) \}$ -- in other words, the subdifferential
contains a single vector, the gradient. Where the subdifferential begins to
matter is when $f$ isn't differentiable at $x_0$. Then, it becomes something
more interesting.
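The canonical example (added here for concreteness) is the absolute value on
$\mathbb{R}$, which is exactly the kind of non-differentiability we'll exploit:

$$
\partial |x| =
\begin{cases}
\{ \operatorname{sign}(x) \} & x \ne 0 \\
[-1, 1] & x = 0
\end{cases}
$$

At $x = 0$ the subdifferential grows from a single point to an entire
interval, and that extra room is what the rest of the argument relies on.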
Suppose we want to minimize an unconstrained objective like the following
(with $f$ differentiable),

$$
\min_x f(x) + \lambda \Omega(x)
$$

By the KKT conditions, 0 must be in the subdifferential of this objective at
the minimizer $x^*$,

$$
0 \in \nabla f(x^*) + \lambda \, \partial \Omega(x^*)
\quad \Longleftrightarrow \quad
-\frac{1}{\lambda} \nabla f(x^*) \in \partial \Omega(x^*)
$$

Looking forward, we're particularly interested in when this condition holds at
$x^* = 0$. What conditions are necessary for that to be true?
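As a quick one-dimensional sanity check (my example, not part of the original
derivation): take $f(x) = \frac{1}{2}(x - a)^2$ and $\Omega(x) = |x|$, so that
$\nabla f(0) = -a$. The condition above at $x^* = 0$ reads

$$
-\frac{1}{\lambda} \nabla f(0) = \frac{a}{\lambda} \in \partial |x| \Big|_{x=0} = [-1, 1]
\quad \Longleftrightarrow \quad
|a| \le \lambda ,
$$

which agrees with the familiar soft-thresholding solution
$x^* = \operatorname{sign}(a) \max(|a| - \lambda, 0)$: the minimizer is exactly
zero whenever $|a| \le \lambda$.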
Dual Norms

Since we're primarily concerned with $\Omega(x) = \|x\|_1$, let's plug that
in. In the following, it'll actually be easier to prove things about any norm,
so we'll drop the 1 for the rest of this section.
Recall the definition of a dual norm. In particular, the dual norm of a norm
$\|\cdot\|$ is defined as,

$$
\|y\|_* = \sup_{\|x\| \le 1} y^T x
$$

A cool fact is that the dual of a dual norm is the original norm. In other
words,

$$
\|x\| = \sup_{\|y\|_* \le 1} y^T x
$$

Let's take the gradient of the previous expression on both sides. A nice fact
to keep in mind is that if we take the gradient of an expression of the form
$\sup_y g(x, y)$, then its gradient with respect to $x$ is $\nabla_x g(x, y^*)$,
where $y^*$ is any $y$ that achieves the $\sup$. Since $g(x, y) = y^T x$ here,
that means,

$$
\nabla_x \|x\| = y^* \quad \text{where} \quad y^* \in \arg\max_{\|y\|_* \le 1} y^T x
$$

Now let $x = 0$. Then $y^T x = 0$ for all $y$, so any $y$ with $\|y\|_* \le 1$
achieves the $\sup$ and is therefore in $\partial \|x\|$ at $x = 0$.
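Specializing to the norm we actually care about (an added aside): the dual of
the $L_1$ norm is the $L_\infty$ norm, so at zero the subdifferential is the
entire unit $L_\infty$ ball,

$$
\partial \|x\|_1 \Big|_{x=0} = \{ y : \|y\|_\infty \le 1 \} = [-1, 1]^d ,
$$

with $d$ the dimension of $x$ -- the coordinate-wise version of the
$\partial |x|$ picture from before.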
Back to our original goal, recall that at the minimizer we need

$$
-\frac{1}{\lambda} \nabla f(x^*) \in \partial \|x^*\|
$$

If $\left\| \frac{1}{\lambda} \nabla f(x) \right\|_* \le 1$ when $x = 0$, then
we've already established that $-\frac{1}{\lambda} \nabla f(0)$ is in
$\partial \|x\|$ at $x = 0$. In other words, $x^* = 0$ solves the original
problem!
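This condition is easy to check numerically for the lasso. Below is a
sanity-check sketch of mine (assuming scikit-learn; note that scikit-learn's
`Lasso` scales the squared loss by $1/(2n)$, so the smooth part's gradient at
$w = 0$ is $-\frac{1}{n} X^T y$ and the threshold becomes $\|X^T y\|_\infty / n$):

```python
# Sanity check (mine, not from the post): for scikit-learn's lasso objective,
#   (1 / (2 n)) * ||y - X w||^2 + alpha * ||w||_1,
# the smooth part's gradient at w = 0 is -(1/n) X^T y, so by the argument above
# alpha >= ||X^T y||_inf / n should force the all-zeros solution.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

alpha_max = np.max(np.abs(X.T @ y)) / n   # smallest alpha that zeros everything

for alpha in [1.1 * alpha_max, 0.5 * alpha_max]:
    w = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
    print(f"alpha = {alpha:.3f}, nonzero coordinates = {np.count_nonzero(w)}")
# Expect 0 nonzeros just above alpha_max and several nonzeros below it.
```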
We've just established that $\left\| \frac{1}{\lambda} \nabla f(0) \right\|_* \le 1$
implies $x^* = 0$, but we don't want all of $x$ to be 0, we want some
coordinates of $x$ to be 0. How can we take what we just concluded and apply
it to only a subvector of $x$?
Rather than a general norm, let's return once again to the $L_1$ norm. The
$L_1$ norm has a very special property that will be of use here:
separability. In words, this means that the norm can be expressed as a sum of
functions over $x$'s individual coordinates, each independent of every other.
In particular, $\|x\|_1 = \sum_i |x_i|$. It's easy to see that the function
$|x_i|$ is independent of the rest of $x$'s elements.
Let's take another look at our objective function,

$$
\min_x f(x) + \lambda \|x\|_1
= \min_x \; g(x_{\neg i}, x_i) + \lambda |x_i|
$$

where $x_{\neg i}$ is all coordinates of $x$ except $x_i$ and
$g(x_{\neg i}, x_i) = f(x) + \lambda \sum_{j \ne i} |x_j|$. Taking the
derivative of $g$ with respect to $x_i$, we again require that,

$$
-\frac{1}{\lambda} \frac{\partial g}{\partial x_i} \in \partial |x_i|
$$

Hmm, that looks familiar. And isn't $|x_i|$ just the $L_1$ norm of a
one-dimensional vector? That means that if
$\left| \frac{1}{\lambda} \frac{\partial g}{\partial x_i} \right| \le 1$ when
$x_i = 0$, then $x_i^* = 0$. In other words, given the optimal values for all
coordinates other than $x_i$, we can evaluate the derivative of
$\frac{1}{\lambda} g$ with respect to $x_i$ and check if the absolute value of
that is less than or equal to 1. If it is, then $x_i = 0$ is optimal!
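This per-coordinate check is exactly what coordinate descent for the lasso
does via soft-thresholding. Here's a minimal sketch (my own illustration; the
function names and data are made up) for
$\min_w \frac{1}{2}\|Xw - y\|^2 + \lambda \|w\|_1$: with every other
coordinate held fixed, the smooth part's derivative at $w_i = 0$ is $-c_i$,
and whenever $|c_i| \le \lambda$ the update sets that coordinate to exactly zero.

```python
# Minimal coordinate descent for  min_w 0.5 * ||X w - y||^2 + lam * ||w||_1.
# Illustrative sketch only; names like `soft_threshold` are mine, not from the post.
import numpy as np

def soft_threshold(c, lam):
    # Exactly zero whenever |c| <= lam -- the condition derived above.
    return np.sign(c) * max(abs(c) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)          # ||X_i||^2 for each column
    for _ in range(n_iters):
        for i in range(d):
            w[i] = 0.0                      # hold the other coordinates fixed
            c_i = X[:, i] @ (y - X @ w)     # minus the smooth part's derivative at w_i = 0
            w[i] = soft_threshold(c_i, lam) / col_sq[i]
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
w_true = np.array([2.0, -3.0, 0, 0, 0, 0, 0, 0, 0, 0])
y = X @ w_true + 0.1 * rng.standard_normal(200)
print(np.round(lasso_coordinate_descent(X, y, lam=20.0), 3))  # most entries exactly 0
```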
In the first section, we showed that in order to solve the problem
$\min_x f(x) + \lambda \Omega(x)$, it is necessary that
$-\frac{1}{\lambda} \nabla f(x^*) \in \partial \Omega(x^*)$. If $\Omega$ is
differentiable at $x^*$, then there is only one possible value this
subgradient can take, the gradient $\nabla \Omega(x^*)$, but in all other
cases there is a multitude of potential choices. When $\Omega$ isn't
differentiable at $x^*$, there is a non-singleton set of values that
$-\frac{1}{\lambda} \nabla f(x^*)$ can take such that $x^*$ is an optimal
solution. If $\Omega(x) = \|x\|$, then a sufficient condition for $x^* = 0$ to
be optimal is $\left\| \frac{1}{\lambda} \nabla f(x) \right\|_* \le 1$ at
$x = 0$.
In the next section, we showed that in the special case of the $L_1$ norm, we
can express the norm as a sum of absolute values applied to $x$'s individual
coordinates. Because of this, we can rewrite the original optimization problem
as $\min_x g(x_{\neg i}, x_i) + \lambda |x_i|$, where
$g(x_{\neg i}, x_i) = f(x) + \lambda \sum_{j \ne i} |x_j|$. Using the same
results from the previous section, we showed that as long as
$\left| \frac{1}{\lambda} \frac{\partial g}{\partial x_i} \right| \le 1$ when
$x_i = 0$, then $x_i^* = 0$ is an optimal choice. In other words, we
established conditions under which a coordinate will be 0. This is why the
$L_1$ norm causes sparsity.
Everything written here was explained to me by the ever-knowledgeable
MetaOptimize king, Alexandre Passos.