In this post, we'll take a look at the Frank-Wolfe Algorithm, also known as the Conditional Gradient Method, an algorithm particularly suited for solving problems with compact domains. Like the Proximal Gradient and Accelerated Proximal Gradient algorithms, Frank-Wolfe requires that we exploit problem structure to quickly solve a mini-optimization problem. Our reward for doing so is a convergence rate of $O(1/\epsilon)$ -- that is, $O(1/\epsilon)$ iterations are enough to reach an accuracy of $\epsilon$.
Returning to my valley-finding metaphor, Frank-Wolfe is a bit like this,
- Look around you and see which way points the most downwards
- Walk as far as possible in that direction until you hit a wall
- Go back in the direction you started, stop part way along the path, then repeat.
How does it work?
Frank-Wolfe is designed to solve problems of the form,

$$\min_{x \in D} \; f(x)$$

where $D$ is a compact, convex set and $f$ is convex and differentiable. The algorithm itself is remarkably short,

Input: Initial iterate $x^{(0)} \in D$

- For $t = 0, 1, 2, \ldots$
  - Let $s^{(t+1)} = \arg\min_{s \in D} \langle \nabla f(x^{(t)}), s \rangle$
  - If $g(x^{(t)}) = \langle \nabla f(x^{(t)}), x^{(t)} - s^{(t+1)} \rangle \le \epsilon$, break
  - Let $\alpha^{(t)} = \frac{2}{t+2}$
  - Let $x^{(t+1)} = (1 - \alpha^{(t)}) \, x^{(t)} + \alpha^{(t)} s^{(t+1)}$

The proof relies on a single observation: at each iteration we minimize a linear approximation to $f$ about $x^{(t)}$ over the domain,

$$s^{(t+1)} = \arg\min_{s \in D} \; f(x^{(t)}) + \langle \nabla f(x^{(t)}), s - x^{(t)} \rangle = \arg\min_{s \in D} \; \langle \nabla f(x^{(t)}), s \rangle,$$

then take a step in that direction. We can immediately see that if $D$ weren't compact, this linear mini-problem would be unbounded and $s^{(t+1)}$ would run off to infinity.
Upper Bound: One nice property of Frank-Wolfe is that it comes with its own upper bound on $f(x^{(t)}) - f(x^{*})$, computed for free at each iteration: the duality gap $g(x^{(t)}) = \langle \nabla f(x^{(t)}), x^{(t)} - s^{(t+1)} \rangle$ that already appears in the stopping condition. Since $f$ is convex,

$$f(x^{*}) \ge f(x^{(t)}) + \langle \nabla f(x^{(t)}), x^{*} - x^{(t)} \rangle,$$

we know that

$$f(x^{(t)}) - f(x^{*}) \le \langle \nabla f(x^{(t)}), x^{(t)} - x^{*} \rangle \le \max_{s \in D} \langle \nabla f(x^{(t)}), x^{(t)} - s \rangle = \langle \nabla f(x^{(t)}), x^{(t)} - s^{(t+1)} \rangle = g(x^{(t)}).$$
A Small Example
For this example, we'll minimize a simple univariate quadratic function constrained to lie in an interval,

$$\min_{x \in [-1, 2]} \; (x - 0.5)^2 + 2x.$$

Its derivative is given by $2(x - 0.5) + 2 = 2x + 1$, so the linear mini-problem is solved by $s^{(t+1)} = -1$ whenever the derivative at $x^{(t)}$ is positive and by $s^{(t+1)} = 2$ whenever it is negative. The minimizer $x^{*} = -0.5$ lies in the interior of the interval and attains $f(x^{*}) = 0$.
[Figures: the Frank-Wolfe iterates plotted on $f$ over $[-1, 2]$ and the objective value per iteration, produced by the reference implementation below as figures/iterates.png and figures/convergence.png.]
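To see these steps in action, here is a minimal standalone sketch (my own illustration, separate from the reference implementation at the end of the post) that runs a few Frank-Wolfe iterations on this exact problem with the step size $\alpha^{(t)} = 2/(t+2)$:

```python
# Minimal sketch: Frank-Wolfe on min_{x in [-1, 2]} (x - 0.5)^2 + 2x.
f = lambda x: (x - 0.5) ** 2 + 2 * x
grad = lambda x: 2 * (x - 0.5) + 2

x = 1.0                               # initial iterate
for t in range(10):
    g = grad(x)
    s = -1.0 if g > 0 else 2.0        # argmin_{s in [-1, 2]} g * s
    gap = g * (x - s)                 # duality gap; upper bounds f(x) - f(x*)
    alpha = 2.0 / (t + 2.0)
    print(f"t={t}  x={x:+.3f}  f(x)={f(x):.4f}  gap={gap:.4f}")
    x = (1 - alpha) * x + alpha * s
```

The iterates zig-zag between pulls toward the two endpoints of the interval while closing in on $x^{*} = -0.5$, and the printed duality gap always sits above $f(x^{(t)}) - f(x^{*}) = f(x^{(t)})$, as promised by the upper bound above.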
Why does it work?
We begin by making the two assumptions given earlier,

- $f$ is convex, differentiable, and finite for all $x \in D$
- $D$ is compact

Assumptions: First, notice that we never needed to assume that a solution $x^{*}$ exists or is unique -- the compactness of $D$ and the finiteness of $f$ on $D$ already guarantee that a minimizer exists.

Secondly, we never made a Lipschitz assumption on $f$ or its gradient. Instead, the analysis depends on the 'curvature constant' of $f$ over $D$,

$$C_f = \max_{\substack{x, s \in D, \; \gamma \in [0,1] \\ y = x + \gamma (s - x)}} \; \frac{2}{\gamma^2} \left( f(y) - f(x) - \langle \nabla f(x), y - x \rangle \right),$$

which measures how far $f$ can rise above its linear approximation along segments of $D$. If $\nabla f$ does happen to be $L$-Lipschitz on $D$, then $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \| y - x \|_2^2$ for all $x, y \in D$. This immediately implies the following upper bound on $C_f$,

$$C_f \le L \, \operatorname{diam}(D)^2,$$

where $\operatorname{diam}(D)$ is the diameter of $D$ in the Euclidean norm.
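As an aside (my own illustration, not part of the analysis), the curvature constant for the small example above can be brute-forced numerically; since $f''(x) = 2$ and $\operatorname{diam}([-1, 2]) = 3$, the bound predicts $C_f \le 2 \cdot 3^2 = 18$, and the grid search below comes out at essentially that value:

```python
import numpy as np

# Brute-force the curvature constant C_f for f(x) = (x - 0.5)^2 + 2x over D = [-1, 2].
f = lambda x: (x - 0.5) ** 2 + 2 * x
grad = lambda x: 2 * (x - 0.5) + 2

xs = np.linspace(-1, 2, 200)        # candidate x in D
ss = np.linspace(-1, 2, 200)        # candidate s in D
gammas = np.linspace(1e-3, 1, 200)  # step fractions gamma in (0, 1]

C_f = 0.0
for x in xs:
    for s in ss:
        y = x + gammas * (s - x)                       # points between x and s
        deviation = f(y) - f(x) - grad(x) * (y - x)    # gap above the linearization
        C_f = max(C_f, np.max(2.0 / gammas ** 2 * deviation))

print(C_f)  # ~18.0, matching L * diam(D)^2 = 2 * 3^2
```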
Proof Outline: The proof for Frank-Wolfe is surprisingly simple. The idea is to first upper bound the suboptimality $h(x^{(t+1)}) = f(x^{(t+1)}) - f(x^{*})$ after a single step in terms of $h(x^{(t)})$, the step size $\alpha^{(t)}$, and the curvature constant $C_f$, and then to use induction on $t$ to turn that one-step bound into the advertised $O(1/t)$ rate.

Step 1: Upper bound $h(x^{(t+1)})$ in terms of $h(x^{(t)})$. Writing the update as $x^{(t+1)} = x^{(t)} + \alpha^{(t)} (s^{(t+1)} - x^{(t)})$ and applying the definition of $C_f$ with $\gamma = \alpha^{(t)}$,

$$f(x^{(t+1)}) \le f(x^{(t)}) + \alpha^{(t)} \langle \nabla f(x^{(t)}), s^{(t+1)} - x^{(t)} \rangle + \frac{(\alpha^{(t)})^2}{2} C_f = f(x^{(t)}) - \alpha^{(t)} g(x^{(t)}) + \frac{(\alpha^{(t)})^2}{2} C_f.$$

Subtracting $f(x^{*})$ from both sides and using the duality gap bound $g(x^{(t)}) \ge h(x^{(t)})$ from earlier,

$$h(x^{(t+1)}) \le (1 - \alpha^{(t)}) \, h(x^{(t)}) + \frac{(\alpha^{(t)})^2}{2} C_f.$$

Step 2: Use induction on $t$ to conclude $h(x^{(t)}) \le \frac{2 C_f}{t + 2}$.

Now, we employ induction on $t$. We'll assume that the step size is $\alpha^{(t)} = \frac{2}{t+2}$. For the base case $t = 0$ we have $\alpha^{(0)} = 1$, so Step 1 gives $h(x^{(1)}) \le \frac{1}{2} C_f \le \frac{2 C_f}{1 + 2}$.

Next, for the recursive case, we use the inductive assumption on $h(x^{(t)})$, namely $h(x^{(t)}) \le \frac{2 C_f}{t + 2}$, together with Step 1,

$$h(x^{(t+1)}) \le \left( 1 - \frac{2}{t+2} \right) \frac{2 C_f}{t+2} + \frac{1}{2} \left( \frac{2}{t+2} \right)^2 C_f = \frac{2 C_f (t+1)}{(t+2)^2} \le \frac{2 C_f}{t+3},$$

where the last inequality uses $(t+1)(t+3) \le (t+2)^2$. This completes the induction. Thus, if we want an error tolerance of $\epsilon$, it suffices to run Frank-Wolfe for $t \ge \frac{2 C_f}{\epsilon} - 2$ iterations -- the $O(1/\epsilon)$ rate promised at the top of the post.
When should I use it?
Like Proximal Gradient, efficient use of Frank-Wolfe requires solving a
mini-optimization problem at each iteration. Unlike Proximal Gradient, however,
this mini-problem will lead to unbounded iterates if the input space is not
compact -- in other words, Frank-Wolfe cannot directly be applied when your
domain is all of $\mathbb{R}^{n}$.
Sparsity: The primary reason machine learning researchers have recently taken an interest in Frank-Wolfe is because in certain problems the iterates $x^{(t)}$ will be sparse. For example, when $D$ is the probability simplex or an $\ell_1$ ball, each $s^{(t+1)}$ is a vertex of $D$ -- a vector with only one non-zero entry -- so $x^{(t)}$, being a convex combination of $x^{(0)}$ and $s^{(1)}, \ldots, s^{(t)}$, has at most $t$ more non-zero entries than $x^{(0)}$ does. The iteration count thus directly controls the sparsity of the solution (see the sketch after the next paragraph).
Atomic Norms: One particular case where Frank-Wolfe shines is when minimizing $f(x)$ subject to a constraint of the form $\|x\| \le c$, where $\|\cdot\|$ is an 'atomic' norm -- a norm whose unit ball is the convex hull of a set of simple atoms. In that case the linear mini-problem is always solved by $c$ times a single atom, which is frequently far cheaper to find than a projection onto the same constraint set.

For example, when $\|\cdot\|$ is the $\ell_1$ norm, the atoms are the signed standard basis vectors and the mini-problem reduces to finding the largest-magnitude entry of the gradient; when $\|\cdot\|$ is the trace (nuclear) norm, the atoms are rank-one matrices and the mini-problem is solved by the top singular vector pair of the gradient, which power iteration can compute far more cheaply than the full SVD a projection would require.
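To make the $\ell_1$ case concrete, here is a sketch of such a linear mini-problem solver; the name lmo_l1 and the radius c are mine, not the post's. Note that it returns a 1-sparse atom, which is exactly why the iterates stay sparse:

```python
import numpy as np

def lmo_l1(grad, c=1.0):
    """Solve argmin_{||s||_1 <= c} <grad, s>.

    The minimizer is a vertex of the l1 ball: c times a signed
    standard basis vector, chosen at the largest-magnitude
    coordinate of the gradient.
    """
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[i] = -c * np.sign(grad[i])
    return s

# Example: the returned atom is 1-sparse, so Frank-Wolfe iterates
# built from t such atoms have at most t non-zero entries.
g = np.array([0.3, -2.0, 1.1])
print(lmo_l1(g, c=5.0))   # [ 0.  5.  0.]
```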
Extensions
Step Size: The proof above relied on a step size of $\alpha^{(t)} = \frac{2}{t+2}$, but the same $O(1/t)$ guarantee holds if $\alpha^{(t)}$ is instead chosen by line search, $\alpha^{(t)} = \arg\min_{\alpha \in [0,1]} f\big( (1 - \alpha) x^{(t)} + \alpha s^{(t+1)} \big)$, since at every iteration line search makes at least as much progress as the fixed schedule does.
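Here is a sketch of what such a line-search rule might look like, written to match the alpha(iteration=..., x=..., direction=...) signature expected by the reference implementation at the end of this post; it assumes access to the objective f and uses a simple grid search in place of an exact one-dimensional minimization:

```python
import numpy as np

def line_search_learning_rate(function, num_candidates=100):
    """Build an alpha(iteration, x, direction) rule that picks the step
    size in [0, 1] minimizing `function` along the Frank-Wolfe segment
    between x and the mini-problem solution `direction`.

    A coarse grid search stands in for exact minimization; `function`
    is the objective f and is an assumption of this sketch.
    """
    def alpha(iteration, x, direction, **kwargs):
        candidates = np.linspace(0.0, 1.0, num_candidates)
        values = [function((1 - a) * x + a * direction) for a in candidates]
        return candidates[int(np.argmin(values))]
    return alpha

# Usage with the reference implementation below (same objective as the example):
#   function = lambda x: (x - 0.5) ** 2 + 2 * x
#   iterates = frank_wolfe(minisolver, gradient, line_search_learning_rate(function), x0)
```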
Approximate Linear Solutions: Though not stated in the proof above, another cool point about Frank-Wolfe is that you don't actually need to solve the linear mini-problem exactly -- you will still converge to the optimal solution (albeit at a slightly slower rate). In particular, assume that for some fixed $\delta \ge 0$ each mini-problem is solved approximately with additive error $\frac{\delta \alpha^{(t)} C_f}{2}$,

$$\langle \nabla f(x^{(t)}), s^{(t+1)} \rangle \le \min_{s \in D} \langle \nabla f(x^{(t)}), s \rangle + \frac{\delta \alpha^{(t)} C_f}{2};$$

then Frank-Wolfe's rate of convergence is

$$f(x^{(t)}) - f(x^{*}) \le \frac{2 C_f (1 + \delta)}{t + 2}.$$

The proof for this can be found in the supplement to Jaggi's excellent survey on Frank-Wolfe for machine learning.
Linear Invariance
Another cool fact about Frank-Wolfe is that it's linearly invariant -- that is, if you rotate and scale the space, nothing changes about the convergence rate. This is in direct contrast to many other methods which depend on the condition number of a function (for functions with positive definite Hessians, this is the ratio between the Hessian's largest and smallest eigenvalues, $\kappa = \lambda_{\max} / \lambda_{\min}$).
Suppose we transform our input space with a surjective (that is, onto) linear transformation $M$, so that $x = M \hat{x}$, and consider the transformed problem $\min_{\hat{x} \in \hat{D}} \hat{f}(\hat{x})$ with $\hat{f}(\hat{x}) = f(M \hat{x})$ and $\hat{D} = \{ \hat{x} : M \hat{x} \in D \}$. The chain rule gives $\nabla \hat{f}(\hat{x}) = M^{T} \nabla f(M \hat{x})$, and the curvature constant is unchanged, $C_{\hat{f}} = C_f$, because $M$ maps segments of $\hat{D}$ onto segments of $D$ without changing any of the function values or linearizations that enter its definition.

Let's look at the solution to the per-iteration mini-problem we need to solve for Frank-Wolfe,

$$\min_{\hat{s} \in \hat{D}} \langle \nabla \hat{f}(\hat{x}), \hat{s} \rangle = \min_{\hat{s} \in \hat{D}} \langle M^{T} \nabla f(M \hat{x}), \hat{s} \rangle = \min_{\hat{s} \in \hat{D}} \langle \nabla f(M \hat{x}), M \hat{s} \rangle = \min_{s \in D} \langle \nabla f(x), s \rangle.$$

In other words, we will find the same $s^{(t+1)}$ (up to the mapping $M$) whether we run Frank-Wolfe in the original space or the transformed one, so the iterates, the duality gaps, and therefore the convergence rate are completely unaffected by $M$.
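As a quick numerical sanity check (my own illustration; the scaling M = 3 is an arbitrary choice of linear map), the sketch below runs the same Frank-Wolfe loop on the small example from earlier and on its rescaled version, and the two sequences of objective values agree up to floating-point rounding:

```python
import numpy as np

# Original problem: min_{x in [-1, 2]} f(x) with f(x) = (x - 0.5)^2 + 2x.
f = lambda x: (x - 0.5) ** 2 + 2 * x
grad = lambda x: 2 * (x - 0.5) + 2
lmo = lambda g: -1.0 if g > 0 else 2.0               # argmin over D = [-1, 2]

# Transformed problem: x = M * x_hat with M = 3, so D_hat = [-1/3, 2/3].
M = 3.0
f_hat = lambda x_hat: f(M * x_hat)
grad_hat = lambda x_hat: M * grad(M * x_hat)          # chain rule: M^T grad f(M x_hat)
lmo_hat = lambda g: -1.0 / M if g > 0 else 2.0 / M    # argmin over D_hat

def run(objective, gradient, minisolver, x, steps=8):
    values = []
    for t in range(steps):
        s = minisolver(gradient(x))
        a = 2.0 / (t + 2.0)
        values.append(objective(x))
        x = (1 - a) * x + a * s
    return values

# Same starting point in both coordinate systems: x0 = 1.0 = M * (1/3).
print(run(f, grad, lmo, 1.0))
print(run(f_hat, grad_hat, lmo_hat, 1.0 / M))   # matches the line above up to rounding
```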
References
Proof of Convergence, Linear Invariance: Pretty much everything in this article comes from Jaggi's fantastic article on Frank-Wolfe for machine learning: Martin Jaggi, "Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization", ICML 2013.
Reference Implementation
```python
def frank_wolfe(minisolver, gradient, alpha, x0, epsilon=1e-2):
    """Frank-Wolfe Algorithm

    Parameters
    ----------
    minisolver : function
        minisolver(x) = argmin_{s \in D} <x, s>
    gradient : function
        gradient(x) = gradient[f](x)
    alpha : function
        learning rate
    x0 : array
        initial value for x
    epsilon : float
        desired accuracy
    """
    xs = [x0]
    iteration = 0
    while True:
        x = xs[-1]
        g = gradient(x)
        s_next = minisolver(g)
        # duality gap g(x) = <grad f(x), x - s>; stop when it is small enough
        if g * (x - s_next) <= epsilon:
            break
        a = alpha(iteration=iteration, x=x, direction=s_next)
        x_next = (1 - a) * x + a * s_next
        xs.append(x_next)
        iteration += 1
    return xs


def default_learning_rate(iteration, **kwargs):
    return 2.0 / (iteration + 2.0)


if __name__ == '__main__':
    import os

    import numpy as np
    import pylab as pl
    import yannopt.plotting as plotting

    ### FRANK WOLFE ALGORITHM ###

    # problem definition
    function = lambda x: (x - 0.5) ** 2 + 2 * x
    gradient = lambda x: 2 * (x - 0.5) + 2
    minisolver = lambda y: -1 if y > 0 else 2  # D = [-1, 2]
    x0 = 1.0

    # run gradient descent
    iterates = frank_wolfe(minisolver, gradient, default_learning_rate, x0)

    ### PLOTTING ###

    plotting.plot_iterates_vs_function(iterates, function,
                                       path='figures/iterates.png', y_star=0.0)
    plotting.plot_iteration_vs_function(iterates, function,
                                        path='figures/convergence.png', y_star=0.0)

    # make animation
    iterates = np.asarray(iterates)
    try:
        os.makedirs('figures/animation')
    except OSError:
        pass

    for t in range(len(iterates) - 1):
        x = iterates[t]
        x_plus = iterates[t + 1]
        s_plus = minisolver(gradient(x))

        f = function
        g = gradient
        f_hat = lambda y: f(x) + g(x) * (y - x)

        xmin, xmax = plotting.limits([-1, 2])
        ymin, ymax = -4, 8

        pl.plot(np.linspace(xmin, xmax), function(np.linspace(xmin, xmax)), alpha=0.2)
        pl.xlim([xmin, xmax])
        pl.ylim([ymin, ymax])
        pl.xlabel('x')
        pl.ylabel('f(x)')

        pl.plot([xmin, xmax], [f_hat(xmin), f_hat(xmax)], '--', alpha=0.2)
        pl.vlines([-1, 2], ymin, ymax, color=np.ones((2, 3)) * 0.2, linestyle='solid')
        pl.scatter(x, f(x), c=[0.8, 0.0, 0.0], marker='o', alpha=0.8)
        pl.scatter(x_plus, f(x_plus), c=[0.0, 0.8, 0.0], marker='D', alpha=0.8)
        pl.vlines(x_plus, f_hat(x_plus), f(x_plus), color=[0.0, 0.8, 0.0], linestyle='dotted')
        pl.scatter(s_plus, f_hat(s_plus), c=0.35, marker='x', alpha=0.8)

        pl.savefig('figures/animation/%02d.png' % t)
        pl.close()
```