When I originally wrote about the Alternating Direction Method of Multipliers algorithm, the community's understanding of its convergence properties was light to say the least. While it has long been known (See Boyd's excellent article, Appendix A) that ADMM will converge, it is only recently that the community has begun to establish how fast it converges (e.g. Hong, Deng, Feng, He).
In this article, we'll explore one way to establish an $O(1/T)$ convergence rate for ADMM, where $T$ is the number of iterations.
How does it work?
Let's begin by introducing the optimization problem ADMM solves,

$$ \min_{x, z} \; f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c $$
This problem is characterized by 2 primal variables, $x$ and $z$, 2 objective components, $f$ and $g$, and a single linear constraint tying the two variables together.
The ADMM algorithm then finds the "saddle point" of the Augmented Lagrangian for the corresponding problem,

$$ L_{\rho}(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + \frac{\rho}{2} \| Ax + Bz - c \|_2^2 $$
Note that we say Augmented Lagrangian, as the typical Lagrangian does not include the final quadratic term. It's easy to see, however, that the quadratic does not affect the problem's optimal solution, as the constraint $Ax + Bz = c$ forces that term to be zero at any feasible point.
The ADMM algorithm iteratively minimizes $L_{\rho}$ over $x$ (holding $z$ and $y$ fixed), then over $z$ (holding $x$ and $y$ fixed), and finally takes a dual ascent step in $y$,
Input Step size $\rho$, initial primal iterates $x^{(0)}$ and $z^{(0)}$, initial dual iterate $y^{(0)}$

- For $t = 0, 1, \ldots$
  - (Step A) Let $x^{(t+1)} = \arg\min_{x} L_{\rho}(x, z^{(t)}, y^{(t)})$
  - (Step B) Let $z^{(t+1)} = \arg\min_{z} L_{\rho}(x^{(t+1)}, z, y^{(t)})$
  - (Step C) Let $y^{(t+1)} = y^{(t)} + \rho \big( A x^{(t+1)} + B z^{(t+1)} - c \big)$
Intuitively, the extra quadratic term prevents each iteration of the algorithm from stepping "too far" from the last iteration, an idea that's also at the core of Proximal Gradient Descent.
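To make the loop concrete, here is a bare-bones sketch of the iteration in Python. The callables `argmin_x` and `argmin_z` are placeholders for whatever routine solves Steps A and B exactly for a particular choice of $f$, $g$, $A$, $B$, and $c$; they are not part of the optim package used in the reference implementation below.

import numpy as np

def admm(argmin_x, argmin_z, A, B, c, rho=1.0, n_iters=100):
    """A bare-bones ADMM loop (illustrative sketch, not a library API).

    argmin_x(z, y) should return argmin_x L_rho(x, z, y)  (Step A)
    argmin_z(x, y) should return argmin_z L_rho(x, z, y)  (Step B)
    """
    x = np.zeros(A.shape[1])
    z = np.zeros(B.shape[1])
    y = np.zeros(c.shape[0])
    for _ in range(n_iters):
        x = argmin_x(z, y)                       # Step A
        z = argmin_z(x, y)                       # Step B
        y = y + rho * (A.dot(x) + B.dot(z) - c)  # Step C: dual ascent
    return x, z, y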
Animation of the $x^{(t)}$ and $z^{(t)}$ iterates converging, generated by the reference implementation below.
In the remainder of the article, we'll often write $r^{(t)} = A x^{(t)} + B z^{(t)} - c$ for the primal residual at iteration $t$, for conciseness.
Why does it work?
Unlike other convergence proofs presented on this website, we won't directly show that the objective converges to its minimum as $t \rightarrow \infty$. Instead, we'll show that the objective evaluated at the average of the first $T$ iterates approaches its optimal value at a rate of $O(1/T)$.
Geometric interpretation of the optimality condition for a variational inequality.
For the following proof, we'll replace
Assumptions
The assumptions on ADMM are almost as light as we can imagine. This is largely due to the fact that we needn't use gradients or subgradients of either $f$ or $g$ in the algorithm itself. We require only that,
- $f(x) + g(z)$ is convex.
- There exists a solution $[x^*; z^*]$ that minimizes $f(x) + g(z)$ while respecting the constraint $Ax + Bz = c$.
Proof Outline
The proof presented hereafter is a particularly simple, if unintuitive, one. Theoretically, the only tools necessary are the linear lower bound definition of a convex function, the subgradient condition for optimality in an unconstrained optimization problem, and Jensen's Inequality. Steps 1 and 2 below rely purely on the first 2 of these tools. Step 3 merely massages a preceding equation into a simpler form via completing squares. Step 4 closes by exploiting a telescoping sum and Jensen's Inequality to obtain the desired result: a bound on the suboptimality of the objective, evaluated at the average of the iterates, that decreases as $O(1/T)$ in the number of iterations $T$.
Step 1 Optimality conditions for Step A. In this portion of the proof, we'll use the fact that $x^{(t+1)}$ is the exact minimizer of $L_{\rho}(x, z^{(t)}, y^{(t)})$ to derive an upper bound on $f(x^{(t+1)}) - f(x^*)$.
We begin by recognizing that, since $x^{(t+1)}$ minimizes $L_{\rho}(x, z^{(t)}, y^{(t)})$, zero must be an element of that function's subdifferential with respect to $x$ at $x^{(t+1)}$.
As every term of $L_{\rho}$ except $f$ is differentiable in $x$, this identifies a particular subgradient of $f$ at $x^{(t+1)}$.
Substituting this subgradient into the linear lower bound definition of convexity for $f$, evaluated at $x^*$, yields an inequality relating $f(x^{(t+1)})$ and $f(x^*)$.
Now recall Step C of the algorithm: $y^{(t+1)} = y^{(t)} + \rho (A x^{(t+1)} + B z^{(t+1)} - c)$. This lets us rewrite $y^{(t)} + \rho (A x^{(t+1)} + B z^{(t)} - c)$ as $y^{(t+1)} + \rho B (z^{(t)} - z^{(t+1)})$.
We finish by rearranging so that every term sits on the left hand side of the inequality; this is the bound we'll carry into Step 3.
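Concretely, Step 1's chain of reasoning can be sketched as follows, where the middle line applies the linear lower bound of convexity with the subgradient identified in the first line, and the last line substitutes Step C's update (this is one standard arrangement of the terms; compare Appendix A of Boyd's article),

$$\begin{aligned}
0 &\in \partial f(x^{(t+1)}) + A^T y^{(t)} + \rho A^T \big( A x^{(t+1)} + B z^{(t)} - c \big) \\
\Rightarrow \quad & f(x^{(t+1)}) - f(x^*) + \big\langle y^{(t)} + \rho ( A x^{(t+1)} + B z^{(t)} - c ),\; A ( x^{(t+1)} - x^* ) \big\rangle \le 0 \\
\Rightarrow \quad & f(x^{(t+1)}) - f(x^*) + \big\langle y^{(t+1)} + \rho B ( z^{(t)} - z^{(t+1)} ),\; A ( x^{(t+1)} - x^* ) \big\rangle \le 0
\end{aligned}$$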
Step 2 Optimality conditions for Step B. Similar to Step 1, we'll use the fact that $z^{(t+1)}$ is the exact minimizer of $L_{\rho}(x^{(t+1)}, z, y^{(t)})$ to derive an upper bound on $g(z^{(t+1)}) - g(z^*)$.
As before, zero must be an element of the subdifferential of $L_{\rho}(x^{(t+1)}, z, y^{(t)})$ with respect to $z$ at $z^{(t+1)}$, which identifies a particular subgradient of $g$ at $z^{(t+1)}$.
Substituting in the previously derived subgradient and moving all terms to the left side, we obtain an inequality relating $g(z^{(t+1)})$ and $g(z^*)$.
Substituting in Step C's definition for $y^{(t+1)}$ then collapses the remaining terms, leaving the inequality we'll carry into Step 3.
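Similarly, a sketch of Step 2 under the same conventions,

$$\begin{aligned}
0 &\in \partial g(z^{(t+1)}) + B^T y^{(t)} + \rho B^T \big( A x^{(t+1)} + B z^{(t+1)} - c \big) \\
\Rightarrow \quad & g(z^{(t+1)}) - g(z^*) + \big\langle y^{(t)} + \rho ( A x^{(t+1)} + B z^{(t+1)} - c ),\; B ( z^{(t+1)} - z^* ) \big\rangle \le 0 \\
\Rightarrow \quad & g(z^{(t+1)}) - g(z^*) + \big\langle y^{(t+1)},\; B ( z^{(t+1)} - z^* ) \big\rangle \le 0
\end{aligned}$$

Note how Step C makes the $z$-update's inequality noticeably cleaner than the $x$-update's: the entire penalty term collapses into $y^{(t+1)}$.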
Step 3 We now sum the inequalities produced by Steps 1 and 2 and massage the result into a form that telescopes across iterations.
We begin by adding the two inequalities together.
Next, we use the fact that $[x^*; z^*]$ is feasible, that is $A x^* + B z^* = c$, to rewrite $A(x^{(t+1)} - x^*) + B(z^{(t+1)} - z^*)$ as the primal residual $r^{(t+1)}$.
Then, moving the last term on the left side of the inequality over and observing that Step C implies $y^{(t+1)} - y^{(t)} = \rho \, r^{(t+1)}$, we can express that residual entirely in terms of the dual iterates.
We will now tackle the two components on the right hand side of the inequality in isolation. Our goal is to rewrite these inner products in terms of sums of differences of squared norms, so that they telescope when summed across iterations.
We'll start with the inner product involving the dual iterates, expanding it with the identity $2 \langle a, a - b \rangle = \|a\|_2^2 - \|b\|_2^2 + \|a - b\|_2^2$.
We'll do the same for the inner product involving $B(z^{(t)} - z^{(t+1)})$, this time with the identity $2 \langle a - b, b - c \rangle = \|a - c\|_2^2 - \|a - b\|_2^2 - \|b - c\|_2^2$.
Finally, let's plug these two expansions back into the inequality.
Recall that Step C gives $y^{(t+1)} - y^{(t)} = \rho \, r^{(t+1)}$, so $\frac{1}{2 \rho} \|y^{(t+1)} - y^{(t)}\|_2^2 = \frac{\rho}{2} \|r^{(t+1)}\|_2^2$.
We can substitute that into the right hand side of the preceding inequality to cancel out a couple of terms, combining what remains into a single complete square.
Finally, dropping the portion of the inequality that's always non-positive (doing so doesn't affect the validity of the inequality), we obtain a concise per-iteration inequality in terms of differences of squared norms.
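Assembled, Step 3 yields a per-iteration bound of the following form (a sketch of the standard bound, with $r^{(t+1)}$ the primal residual as before),

$$\begin{aligned}
f(x^{(t+1)}) + g(z^{(t+1)}) - f(x^*) - g(z^*)
  &\le - \big\langle y^{(t+1)},\; r^{(t+1)} \big\rangle
       - \rho \big\langle B ( z^{(t)} - z^{(t+1)} ),\; A ( x^{(t+1)} - x^* ) \big\rangle \\
  &\le \frac{1}{2 \rho} \Big( \|y^{(t)}\|_2^2 - \|y^{(t+1)}\|_2^2 \Big)
     + \frac{\rho}{2} \Big( \|B ( z^{(t)} - z^* )\|_2^2 - \|B ( z^{(t+1)} - z^* )\|_2^2 \Big)
\end{aligned}$$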
Step 4 Averaging across iterations. We're now in the home stretch. In this step, we'll sum the previous inequality across iterations $t = 0 \ldots T-1$, exploit the resulting telescoping sums, and apply Jensen's Inequality to bound the suboptimality of the averaged iterates.
We begin by summing the previous inequality across iterations; the differences of squared norms telescope, leaving only the terms at $t = 0$ and $t = T$, and the latter are non-positive and can be dropped without invalidating the bound.
For convenience, we'll choose the initial iterates $y^{(0)} = 0$ and $z^{(0)} = 0$, which simplifies the remaining terms.
Finally, recall that for a convex function $h$, Jensen's Inequality tells us that $h \big( \frac{1}{T} \sum_{t=1}^{T} w^{(t)} \big) \le \frac{1}{T} \sum_{t=1}^{T} h(w^{(t)})$.
The same is true for each of $f$ and $g$, so the average objective value across iterations upper bounds the objective evaluated at the averaged iterates $\bar{x} = \frac{1}{T} \sum_{t=1}^{T} x^{(t)}$ and $\bar{z} = \frac{1}{T} \sum_{t=1}^{T} z^{(t)}$.
The right hand side decreases as $O(1/T)$, so the objective evaluated at $(\bar{x}, \bar{z})$ converges to the optimal value as the number of iterations grows.
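Chaining Step 4's manipulations together gives the final result, sketched here with the initialization $y^{(0)} = 0$, $z^{(0)} = 0$ chosen above,

$$
f(\bar{x}) + g(\bar{z}) - f(x^*) - g(z^*)
  \;\le\; \frac{1}{T} \sum_{t=1}^{T} \Big( f(x^{(t)}) + g(z^{(t)}) - f(x^*) - g(z^*) \Big)
  \;\le\; \frac{\rho}{2 T} \, \|B z^*\|_2^2
$$

which is the promised $O(1/T)$ rate for the averaged iterates.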
When should I use it?
Similar to the proximal methods presented on this website, ADMM is only efficient if we can perform each of its steps efficiently. Solving 2 optimization problems at each iteration may be very fast or very slow, depending on whether closed form solutions exist for Steps A and B.
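As a concrete illustration of the "fast" case (a sketch, separate from the reference implementation below), consider the lasso written in ADMM form: $f(x) = \frac{1}{2} \|Dx - b\|_2^2$, $g(z) = \lambda \|z\|_1$, and the constraint $x - z = 0$ (so $A = I$, $B = -I$, $c = 0$). Both steps then have closed form solutions: Step A is a linear solve and Step B is elementwise soft-thresholding.

import numpy as np

def lasso_admm(D, b, lam, rho=1.0, n_iters=200):
    """Minimize 0.5 * ||D x - b||^2 + lam * ||x||_1 via ADMM (sketch)."""
    n = D.shape[1]
    x, z, y = np.zeros(n), np.zeros(n), np.zeros(n)
    # Step A solves the same linear system every iteration, so the
    # matrix (or its factorization) can be formed once up front.
    M = D.T.dot(D) + rho * np.eye(n)
    Dtb = D.T.dot(b)
    for _ in range(n_iters):
        x = np.linalg.solve(M, Dtb - y + rho * z)              # Step A
        v = x + y / rho
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0)  # Step B
        y = y + rho * (x - z)                                  # Step C
    return z

When no such closed forms exist, each iteration requires an inner numerical solver, and ADMM's per-iteration cost grows accordingly.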
ADMM has been particularly useful in supervised machine learning, where $f$ typically encodes a loss over training examples and $g$ a (possibly non-smooth) regularizer, and where the loss can often be split into pieces and minimized in parallel.
All in all, ADMM is not a quick method, but it is a scalable one. ADMM is best suited when data is too large to fit on a single machine or when the objective naturally splits into pieces that are easy to minimize on their own but awkward to handle jointly.
Extensions
Accelerated As ADMM is so closely related to Proximal Gradient-based methods, one might ask if there exists an accelerated variant with a better convergence rate. The answer is a resounding yes, as shown by Goldstein et al., though care must be taken for non-strongly convex objectives. In their article, Goldstein et al. show that a convergence rate of $O(1/T^2)$ is attainable when the objective components are strongly convex, with a restart scheme proposed for the general case.
Online In online learning, one is interested in solving a series of supervised machine learning instances in sequence with minimal error. At each iteration, the algorithm is presented with an input, makes a prediction, and only then observes the loss it suffered; performance is measured by regret against the best fixed predictor chosen in hindsight.
In this setting, Wang has shown that an online variant of ADMM can achieve regret competitive with the best possible ($O(\sqrt{T})$ in general, and $O(\log T)$ when the loss functions are strongly convex).
Stochastic In a stochastic setting, one is interested in minimizing the average value of a loss function over a distribution of examples, given access only to samples drawn from that distribution. Stochastic variants of ADMM handle this case by replacing the exact minimization in Step A with an update based on a noisy estimate of the objective at each iteration.
Multi Component Traditional ADMM considers an objective with only 2 components, $f(x)$ and $g(z)$. The algorithm extends naturally to 3 or more components by minimizing over each block in turn, but the analysis of this multi-block variant is substantially more delicate and remains an active area of research.
References
ADMM While ADMM has existed for decades, it has only recently been brought to light by Boyd's article describing its applications for statistical machine learning. It is from this work that I initially learned of ADMM.
Proof of Convergence The proof of convergence presented here is a verbose expansion of that presented in Wang's paper on Online ADMM.
Reference Implementation
Using the optim Python package, we can generate the animation above,
"""
Example usage of ADMM solver.
A gif is generated showing the iterates as they converge.
"""
from matplotlib import animation
from optim.admm import *
from optim.tests.test_admm import quadratic1
import itertools as it
import numpy as np
import pylab as pl
import sys
if len(sys.argv) != 2:
    sys.stderr.write("Usage: %s OUTPUT\n" % (sys.argv[0],))
    sys.exit(1)
else:
    output = sys.argv[1]

# Build a small test problem and run ADMM for 50 iterations.
prob, state = quadratic1()
admm = ADMM(rho=0.1)
iterates = list(it.islice(admm.solve(prob, state), 0, 50))

pl.figure()

# Grid of points to plot f, g, and f+g over, plus the objective
# values attained by the x and z iterates.
_ = np.linspace(-2, 5)
xs = np.asarray([s.x for s in iterates])
zs = np.asarray([s.z for s in iterates])
xs2 = [prob.primal(State(x=v, z=v, y=0)) for v in xs]
zs2 = [prob.primal(State(x=v, z=v, y=0)) for v in zs]

def animate(i):
    print('iteration: %d' % i)
    pl.cla()
    pl.title("Iteration #%d" % i)
    pl.plot(_, [prob.f(v) + prob.g(v) for v in _], 'k-', label='f(x)+g(z)')
    pl.plot(_, [prob.f(v) for v in _], 'g--', label='f(x)')
    pl.plot(_, [prob.g(v) for v in _], 'b--', label='g(z)')
    pl.scatter(xs[i], xs2[i], c='g', label='x')
    pl.scatter(zs[i], zs2[i], c='b', label='z')
    pl.xlim(min(_), max(_))
    pl.legend()

anim = animation.FuncAnimation(pl.gcf(), animate, frames=len(iterates))
anim.save(output, writer='imagemagick', fps=4)