In the past 50+ years of convex optimization research, a great many algorithms have been developed, each with slight nuances to their assumptions, implementations, and guarantees. In this article, I'll give a shorthand comparison of these methods in terms of the number of iterations required to reach a desired accuracy \(\epsilon\) (that is, \(f(x) - f(x^{*}) \le \epsilon\)) for convex and strongly convex objective functions.
Below, methods are grouped according to what "order" of information they require about the objective function. In general, the more information you have, the faster you can converge; but beware, you will also need more memory and computation. Zeroth and first order methods are typically appropriate for large-scale problems, whereas second order methods are limited to small- to medium-scale problems that require a high degree of precision.
At the bottom, you will find algorithms aimed specifically at supervised learning problems, as well as meta-algorithms useful for distributing computation across multiple nodes.
Unless otherwise stated, all objectives are assumed to be Lipschitz continuous (though not necessarily differentiable) and the domain convex. The variable being optimized is \(x \in \mathbb{R}^n\).
## Zeroth Order Methods
Zeroth order methods are characterized by not requiring any gradients or subgradients of the objective function. In exchange, however, it is assumed that the objective is "simple" in the sense that a subset of variables (a "block") can be minimized exactly while holding all other variables fixed. A minimal sketch follows the table.
Algorithm | Problem Formulation | Convex | Strongly Convex | Per-Iteration Cost | Notes |
---|---|---|---|---|---|
Randomized Block Coordinate Descent | $\displaystyle \min_{x \in \mathbb{R}^{n}} f(x) + g(x)$ | $O(1 / \epsilon)$[^richtarik-2011] | $O(\log (1 / \epsilon))$[^richtarik-2011] | $O(1)$ | Applicable when $f(x)$ is differentiable and $g(x)$ is separable in each block. $g(x)$ may be a barrier function. |
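To make the "exact block minimization" idea concrete, here is a minimal sketch of randomized coordinate descent (blocks of size one) on a strongly convex quadratic. The quadratic objective, iteration count, and dense-matrix implementation are illustrative assumptions, not part of the table above.

```python
import numpy as np

def randomized_coordinate_descent(A, b, n_iters=1000, seed=0):
    """Minimize f(x) = 0.5 * x^T A x - b^T x, for symmetric positive
    definite A, by exactly minimizing over one randomly chosen
    coordinate per iteration (blocks of size one)."""
    rng = np.random.default_rng(seed)
    n = len(b)
    x = np.zeros(n)
    for _ in range(n_iters):
        i = rng.integers(n)
        # Setting df/dx_i = (A x)_i - b_i = 0 and solving for x_i gives
        # the exact minimizer over x_i with all other coordinates fixed.
        x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x

# Usage: the minimizer of f solves A x = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(randomized_coordinate_descent(A, b))  # close to np.linalg.solve(A, b)
```

Note that with a dense \(A\), each update here costs \(O(n)\); the \(O(1)\) per-iteration entry in the table assumes problem structure (e.g., sparsity or separability) that makes a single block update cheap.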
## First Order Methods
First order methods typically require access to an objective function's gradient or subgradient. These algorithms take the form \(x^{(t+1)} = x^{(t)} - \alpha^{(t)} g^{(t)}\) for some step size \(\alpha^{(t)}\) and descent direction \(g^{(t)}\). As such, each iteration takes approximately \(O(n)\) time. A sketch of this generic update follows the table.
Algorithm | Problem Formulation | Convex | Strongly Convex | Per-Iteration Cost | Notes |
---|---|---|---|---|---|
Subgradient Descent | $\displaystyle \min_{x \in \mathbb{R}^n} f(x)$ | $O(1 / \epsilon^{2})$[^blog-sd] | ... | $O(n)$ | Cannot be improved upon without further assumptions. |
Mirror Descent | $\displaystyle \min_{x \in \mathcal{C}} f(x)$ | $O(1 / \epsilon^{2} )$[^ee381-md] | $O(1 / \epsilon )$[^nedich-2013] | $O(n)$ | Different parameterizations result in gradient descent and exponentiated gradient descent. |
Dual Averaging | $\displaystyle \min_{x \in \mathcal{C}} f(x)$ | $O(1 / \epsilon^{2})$[^nesterov-2007] | ... | $O(n)$ | Cannot be improved upon without further assumptions. |
Gradient Descent | $\displaystyle \min_{x \in \mathbb{R}^n} f(x)$ | $O(1 / \epsilon)$[^blog-gd] | $O(\log (1 / \epsilon))$[^ee381-gd] | $O(n)$ | Applicable when $f(x)$ is differentiable. |
Accelerated Gradient Descent | $\displaystyle \min_{x \in \mathbb{R}^n} f(x)$ | $O(1 / \sqrt{\epsilon})$[^blog-agd] | $O(\log (1 / \epsilon))$[^bubeck-agd] | $O(n)$ | Applicable when $f(x)$ is differentiable. Cannot be improved upon without further assumptions. Has better constants than Gradient Descent for "Strongly Convex" case. |
Proximal Gradient Descent | $\displaystyle \min_{x \in \mathcal{C}} f(x) + g(x)$ | $O(1 / \epsilon)$[^blog-pgd] | $O(\log (1 / \epsilon))$[^mairal-2013] | $O(n)$ | Applicable when $f(x)$ is differentiable and $\text{prox}_{\tau_t g}(x)$ is easily computable. |
Proximal Accelerated Gradient Descent | $\displaystyle \min_{x \in \mathcal{C}} f(x) + g(x)$ | $O(1 / \sqrt{\epsilon})$[^blog-apgd] | $O(\log (1 / \epsilon))$[^mairal-2013] | $O(n)$ | Applicable when $f(x)$ is differentiable and $\text{prox}_{\tau_t g}(x)$ is easily computable. Has better constants than Proximal Gradient Descent for "Strongly Convex" case. |
Frank-Wolfe Algorithm / Conditional Gradient Algorithm | $\displaystyle \min_{x \in \mathcal{C}} f(x)$ | $O(1/\epsilon)$[^blog-fw] | $O(1/\sqrt{\epsilon})$[^garber-2014] | $O(n)$ | Applicable when $\mathcal{C}$ is bounded and the linear minimization oracle $h(g) = \arg\min_{x \in \mathcal{C}} \langle g, x \rangle$ is easily computable. Most useful when $\mathcal{C}$ is a polytope in a very high dimensional space with sparse extrema. |
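As a concrete instance of the generic first-order update above, here is a minimal gradient descent sketch with a constant step size, applied to a smooth least-squares objective. The objective, the \(1/L\) step size rule, and the iteration count are illustrative assumptions.

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size, n_iters=5000):
    """Run the generic first-order update x <- x - alpha * grad_f(x)
    with a constant step size (suitable when f is L-smooth and
    step_size <= 1/L)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - step_size * grad_f(x)
    return x

# Usage: least squares f(x) = 0.5 * ||Ax - b||^2 with gradient A^T (Ax - b).
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
L = np.linalg.norm(A, 2) ** 2  # gradient's Lipschitz constant: sigma_max(A)^2
x_hat = gradient_descent(lambda x: A.T @ (A @ x - b), np.zeros(2), 1.0 / L)
print(x_hat)  # close to np.linalg.lstsq(A, b, rcond=None)[0]
```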
## Second Order Methods
Second order methods either use or approximate the Hessian (\(\nabla^2 f(x)\)) of the objective function to achieve better-than-linear rates of convergence. As such, each iteration typically requires \(O(n^2)\) memory and between \(O(n^2)\) and \(O(n^3)\) computation. A sketch of Newton's method follows the table.
Algorithm | Problem Formulation | Convex | Strongly Convex | Per-Iteration Cost | Notes |
---|---|---|---|---|---|
Newton's Method | $\displaystyle \min_{x \in \mathbb{R}^n} f(x)$ | ... | $O(\log \log (1/\epsilon))$[^ee364a-unconstrained] | $O(n^3)$ | Only applicable when $f(x)$ is twice differentiable. Constraints can be incorporated via interior point methods. |
Conjugate Gradient Descent | $\displaystyle \min_{x \in \mathbb{R}^n} f(x)$ | ... | $O(n)$ | $O(n^2)$ | Converges in exactly $n$ steps for quadratic $f(x)$. May fail to converge for non-quadratic objectives. |
L-BFGS | $\displaystyle \min_{x \in \mathbb{R}^n} f(x)$ | ... | Between $O(\log (1/\epsilon))$ and $O(\log \log (1/\epsilon))$[^ee236c-qnewton] | $O(n^2)$ | Applicable when $f(x)$ is differentiable, but works best when twice differentiable. Convergence rate is not guaranteed. |
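Here is a minimal sketch of the pure Newton update, assuming a twice-differentiable objective with an easily computed Hessian. The example objective and fixed iteration count are illustrative; a practical implementation would add a line search or damping to guarantee global convergence.

```python
import numpy as np

def newtons_method(grad_f, hess_f, x0, n_iters=20):
    """Pure Newton iteration: x <- x - [hess_f(x)]^{-1} grad_f(x).
    Each step solves an n x n linear system, hence the O(n^3) cost."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - np.linalg.solve(hess_f(x), grad_f(x))
    return x

# Usage: f(x) = sum_i (exp(x_i) - x_i), a strictly convex function
# minimized at x = 0.
grad = lambda x: np.exp(x) - 1.0
hess = lambda x: np.diag(np.exp(x))
print(newtons_method(grad, hess, np.array([2.0, -1.5])))  # close to [0, 0]
```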
## Stochastic Methods
The following algorithms are specifically designed for supervised machine learning, where the objective decomposes into a sum of independent "loss" functions and a regularizer,

$$ \min_{x \in \mathbb{R}^{n}} \sum_{i} f_{i}(x) + \lambda g(x) $$
The intuition is that finding the optimal solution to this problem is unnecessary, as the true goal is to minimize the "risk" (read: error) with respect to the distribution from which the loss functions are sampled, not the samples themselves. Thus, the following algorithms' convergence rates hold for the expected suboptimality, as opposed to the above algorithms, whose rates upper bound the true suboptimality. A sketch of SGD follows the table.
Algorithm | Problem Formulation | Convex | Strongly Convex | Per-Iteration Cost | Notes |
---|---|---|---|---|---|
Stochastic Gradient Descent (SGD) | $\displaystyle \min_{x \in \mathbb{R}^n} \sum_{i} f_{i}(x) + \lambda g(x)$ | $O(n/\epsilon^2)$[^bach-2012] | $O(n/\epsilon)$[^bach-2012] | $O(n)$ | Assumes objective is differentiable. |
Stochastic Dual Coordinate Ascent (SDCA) | $\displaystyle \min_{x \in \mathbb{R}^n} \sum_{i} f_{i}(x) + \frac{\lambda}{2} \|x\|_2^2$ | $O(\frac{1}{\lambda \epsilon})$[^shalevshwartz-2012] | $O(\frac{1}{\lambda} \log (\frac{1}{\lambda \epsilon}))$[^shalevshwartz-2012] | $O(n)$ | |
Accelerated Proximal Stochastic Dual Coordinate Ascent (APSDCA) | $\displaystyle \min_{x \in \mathcal{C}} \sum_{i} f_{i}(x) + \lambda g(x)$ | $O(\min (\frac{1}{\lambda \epsilon}, \sqrt{\frac{N}{\lambda \epsilon}}))$[^shalevshwartz-2013] | $O(\min (\frac{1}{\lambda}, \sqrt{\frac{N}{\lambda}}) \log (\frac{1}{\epsilon}))$[^shalevshwartz-2013] | $O(n)$ | |
Stochastic Average Gradient (SAG) | $\displaystyle \min_{x \in \mathbb{R}^n} \sum_{i} f_{i}(x) + \lambda g(x)$ | $O(1 / \epsilon)$[^schmidt-2013] | $O(\log (1/\epsilon))$[^schmidt-2013] | $O(n)$ | Applicable when $f_{i}(x)$ is differentiable. |
Stochastic Variance Reduced Gradient (SVRG) | $\displaystyle \min_{x \in \mathbb{R}^n} \sum_{i} f_{i}(x) + \lambda g(x)$ | $O(1 / \epsilon)$[^johnson-2013] | $O(\log (1/\epsilon))$[^johnson-2013] | $O(n)$ | Applicable when $f_{i}(x)$ is differentiable. |
MISO | $\displaystyle \min_{x \in \mathbb{R}^n} \sum_{i} f_{i}(x) + \lambda g(x)$ | $O(1 / \epsilon)$[^mairal-2013] | $O(\log (1/\epsilon))$[^mairal-2013] | $O(n)$ | Applicable when $f_{i}(x)$ is differentiable. $g(x)$ may be used as a barrier function. |
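Here is a minimal SGD sketch on an \(\ell_2\)-regularized least-squares instance of the decomposed objective above (written here with a \(1/N\) average over the losses). The data, regularization strength, and the \(1/(\lambda t)\) step size schedule for the strongly convex case are illustrative assumptions.

```python
import numpy as np

def sgd(A, b, lam=0.1, n_iters=100000, seed=0):
    """SGD on (1/N) * sum_i 0.5 * (a_i^T x - b_i)^2 + 0.5 * lam * ||x||^2.
    Each iteration samples a single loss f_i, so the cost per step is
    independent of the number of samples N."""
    rng = np.random.default_rng(seed)
    N, n = A.shape
    x = np.zeros(n)
    for t in range(1, n_iters + 1):
        i = rng.integers(N)
        g = (A[i] @ x - b[i]) * A[i] + lam * x  # unbiased gradient estimate
        x -= g / (lam * t)  # 1/(lam * t) steps for the strongly convex case
    return x

# Usage: a tiny ridge-regression instance.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])
print(sgd(A, b))
```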
## Other Methods
The following methods do not fit well into any of the preceding categories. Included are meta-algorithms like ADMM, which are good for distributing computation across machines, and methods whose per-iteration complexity depends on the iteration count \(t\). A sketch of ADMM follows the table.
Algorithm | Problem Formulation | Convex | Strongly Convex | Per-Iteration Cost | Notes |
---|---|---|---|---|---|
Alternating Direction Method of Multipliers (ADMM) | $$ \begin{align*} \min_{x,z} \quad & f(x) + g(z) \\ \text{s.t.} \quad & Ax + Bz = c \end{align*} $$ | $O(1/\epsilon)$[^blog-admm] | $O(\log (1/\epsilon))$[^hong-2012] | $O(n)$ | The stated convergence rate for "Strongly Convex" only requires $f(x)$ to be strongly convex, not $g(x)$. This same rate can also be applied to the "Convex" case under several non-standard assumptions[^hong-2012]. Matrices $A$ and $B$ may also need to be full column rank[^deng-2012]. |
Bundle Method | $\displaystyle \min_{x \in \mathcal{C}} f(x)$ | $O(1/\epsilon)$[^smola-2007] | $O(\log (1 / \epsilon))$[^smola-2007] | $O(tn)$ | |
Center of Gravity Algorithm | $\displaystyle \min_{x \in \mathcal{C}} f(x)$ | $O(\log (1 / \epsilon))$[^ee236c-localization] | $O(\log (1 / \epsilon))$[^ee236c-localization] | At least $O(tn)$ | Applicable when $\mathcal{C}$ is bounded. Each iteration requires finding a near-central point in a convex set; this may be computationally expensive. |
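Here is a minimal ADMM sketch for the lasso, a standard instance of the formulation in the ADMM row above with \(f(x) = \frac{1}{2}\|Ax - b\|_2^2\), \(g(z) = \lambda \|z\|_1\), and constraint \(x - z = 0\) (so the constraint matrices are \(I\) and \(-I\), while \(A\) here denotes the data matrix). The data, penalty parameter \(\rho\), and iteration count are illustrative assumptions.

```python
import numpy as np

def admm_lasso(A, b, lam=0.1, rho=1.0, n_iters=200):
    """ADMM for min 0.5*||Ax - b||^2 + lam*||z||_1  s.t.  x - z = 0."""
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    # The x-update solves the same linear system every iteration,
    # so the matrix could be factored once up front.
    M = A.T @ A + rho * np.eye(n)
    Atb = A.T @ b
    for _ in range(n_iters):
        x = np.linalg.solve(M, Atb + rho * (z - u))  # minimize over x
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)  # prox of lam*||.||_1
        u = u + x - z  # scaled dual variable update
    return z

# Usage: a tiny lasso instance.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
print(admm_lasso(A, b))
```

The split is what makes ADMM attractive for distribution: the \(x\)-update and \(z\)-update each touch only one of the two objective terms, so they can run on separate machines with only \(x\), \(z\), and \(u\) exchanged between them.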