CHAPTER 3
Line Search Methods
Each iteration of a line search method computes a search direction pk and then decides how
far to move along that direction. The iteration is given by
x_{k+1} = x_k + α_k p_k,   (3.1)
where the positive scalar αk is called the step length. The success of a line search method
depends on effective choices of both the direction pk and the step length αk .
Most line search algorithms require pk to be a descent direction—one for which
p_k^T ∇f_k < 0—because this property guarantees that the function f can be reduced along
this direction, as discussed in the previous chapter. Moreover, the search direction often has
the form
p_k = −B_k^{-1} ∇f_k,   (3.2)
where Bk is a symmetric and nonsingular matrix. In the steepest descent method, Bk is
simply the identity matrix I, while in Newton's method, Bk is the exact Hessian ∇²f(x_k).
In quasi-Newton methods, Bk is an approximation to the Hessian that is updated at every
iteration by means of a low-rank formula. When pk is defined by (3.2) and Bk is positive
definite, we have
p_k^T ∇f_k = −∇f_k^T B_k^{-1} ∇f_k < 0,
and therefore pk is a descent direction.
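As a concrete numerical check of this algebra, the sketch below (our own construction, not from the text) forms p_k = −B_k^{-1} ∇f_k for a small symmetric positive definite B_k and confirms that the resulting direction satisfies p_k^T ∇f_k < 0. The quadratic objective and the matrix B are illustrative choices.

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x) = 0.5 x^T A x - b^T x for a fixed SPD matrix A (our choice).
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])
    return A @ x - b

x = np.array([2.0, -1.0])
g = grad_f(x)

# Any symmetric positive definite B_k works; here a simple diagonal scaling.
B = np.diag([2.0, 0.5])
p = -np.linalg.solve(B, g)     # p_k = -B_k^{-1} grad f_k, as in (3.2)

# p_k^T grad f_k = -grad f_k^T B_k^{-1} grad f_k < 0, so p_k is a descent direction.
print(p @ g)                   # prints -10.0 here: strictly negative
```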
In this chapter, we discuss how to choose αk and pk to promote convergence from
remote starting points. We also study the rate of convergence of steepest descent, quasi-
Newton, and Newton methods. Since the pure Newton iteration is not guaranteed to produce
descent directions when the current iterate is not close to a solution, we discuss modifications
in Section 3.4 that allow it to start from any initial point.
We now give careful consideration to the choice of the step-length parameter αk .
3.1 STEP LENGTH
In computing the step length αk , we face a tradeoff. We would like to choose αk to give
a substantial reduction of f , but at the same time we do not want to spend too much
time making the choice. The ideal choice would be the global minimizer of the univariate
function φ(·) defined by
φ(α) = f(x_k + α p_k),   α > 0,   (3.3)
but in general, it is too expensive to identify this value (see Figure 3.1). To find even a local
minimizer of φ to moderate precision generally requires too many evaluations of the objec-
tive function f and possibly the gradient ∇ f . More practical strategies perform an inexact
line search to identify a step length that achieves adequate reductions in f at minimal cost.
Typical line search algorithms try out a sequence of candidate values for α, stopping to
accept one of these values when certain conditions are satisfied. The line search is done in two
stages: A bracketing phase finds an interval containing desirable step lengths, and a bisection
or interpolation phase computes a good step length within this interval. Sophisticated line
search algorithms can be quite complicated, so we defer a full description until Section 3.5.
Figure 3.1 The ideal step length is the global minimizer.
We now discuss various termination conditions for line search algorithms and show
that effective step lengths need not lie near minimizers of the univariate function φ(α)
defined in (3.3).
A simple condition we could impose on αk is to require a reduction in f , that is,
f(x_k + α_k p_k) < f(x_k). That this requirement is not enough to produce convergence to
x* is illustrated in Figure 3.2, for which the minimum function value is f* = −1, but a
sequence of iterates {x_k} for which f(x_k) = 5/k, k = 1, 2, . . . , yields a decrease at each
iteration but has a limiting function value of zero. The insufficient reduction in f at each
step causes the sequence to fail to converge to the minimizer of this convex function. To avoid this
behavior we need to enforce a sufficient decrease condition, a concept we discuss next.
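The failure mode just described is easy to reproduce numerically. The sketch below (our construction, using the illustrative values f* = −1 and f(x_k) = 5/k) confirms that the function values decrease at every iteration yet converge to 0, far above the minimum value.

```python
# Illustrative check: the values f(x_k) = 5/k decrease at every step,
# yet they converge to 0, not to the minimum value f* = -1.
f_vals = [5.0 / k for k in range(1, 1001)]
decreasing = all(b < a for a, b in zip(f_vals, f_vals[1:]))
print(decreasing, f_vals[-1])   # True 0.005 -- monotone decrease, limit 0 > f* = -1
```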
Figure 3.2 Insufficient reduction in f.
THE WOLFE CONDITIONS
A popular inexact line search condition stipulates that αk should first of all give
sufficient decrease in the objective function f , as measured by the following inequality:
f(x_k + α p_k) ≤ f(x_k) + c_1 α ∇f_k^T p_k,   (3.4)
for some constant c1 ∈ (0, 1). In other words, the reduction in f should be proportional to
both the step length αk and the directional derivative ∇f_k^T p_k. Inequality (3.4) is sometimes
called the Armijo condition.
The sufficient decrease condition is illustrated in Figure 3.3. The right-hand side of
(3.4), which is a linear function, can be denoted by l(α). The function l(·) has negative slope
c_1 ∇f_k^T p_k, but because c1 ∈ (0, 1), it lies above the graph of φ for small positive values of
α. The sufficient decrease condition states that α is acceptable only if φ(α) ≤ l(α). The
intervals on which this condition is satisfied are shown in Figure 3.3. In practice, c1 is chosen
to be quite small, say c1 = 10^{-4}.
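The Armijo test is a one-line check in code. The sketch below (our own; the 1-D quadratic, the point x_k, the direction p_k, and the value of c1 are illustrative choices) shows a modest step passing condition (3.4) while an overlong step fails it.

```python
# A minimal check of the Armijo condition (3.4) on a 1-D quadratic.
f = lambda x: (x - 1.0) ** 2
df = lambda x: 2.0 * (x - 1.0)

xk, pk, c1 = 0.0, 1.0, 1e-4   # p_k = 1 is a descent direction since df(0) = -2

def armijo(alpha):
    # phi(alpha) <= f(x_k) + c1 * alpha * grad f_k^T p_k
    return f(xk + alpha * pk) <= f(xk) + c1 * alpha * df(xk) * pk

print(armijo(0.5), armijo(3.0))   # True False: a modest step passes, an overlong one fails
```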
The sufficient decrease condition is not enough by itself to ensure that the algorithm
makes reasonable progress because, as we see from Figure 3.3, it is satisfied for all sufficiently
small values of α. To rule out unacceptably short steps we introduce a second requirement,
called the curvature condition, which requires αk to satisfy
∇f(x_k + α_k p_k)^T p_k ≥ c_2 ∇f_k^T p_k,   (3.5)
for some constant c2 ∈ (c1, 1), where c1 is the constant from (3.4). Note that the left-hand
side is simply the derivative φ′(αk), so the curvature condition ensures that the slope of φ at
αk is greater than c2 times the initial slope φ′(0). This makes sense because if the slope φ′(α)
Figure 3.3 Sufficient decrease condition.
Figure 3.4 The curvature condition.
is strongly negative, we have an indication that we can reduce f significantly by moving
further along the chosen direction.
On the other hand, if φ′(αk) is only slightly negative or even positive, it is a sign that
we cannot expect much more decrease in f in this direction, so it makes sense to terminate
the line search. The curvature condition is illustrated in Figure 3.4. Typical values of c2 are
0.9 when the search direction pk is chosen by a Newton or quasi-Newton method, and 0.1
when pk is obtained from a nonlinear conjugate gradient method.
The sufficient decrease and curvature conditions are known collectively as the Wolfe
conditions. We illustrate them in Figure 3.5 and restate them here for future reference:
f(x_k + α_k p_k) ≤ f(x_k) + c_1 α_k ∇f_k^T p_k,   (3.6a)
∇f(x_k + α_k p_k)^T p_k ≥ c_2 ∇f_k^T p_k,   (3.6b)
with 0 < c1 < c2 < 1.
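Both Wolfe conditions can be verified together for a trial step. The sketch below (our construction; the 1-D quadratic and the constants c1 = 10⁻⁴, c2 = 0.9 are illustrative, though these constant values are the typical ones mentioned above) shows a tiny step failing the curvature condition (3.6b) and a larger step satisfying both.

```python
# Checking both Wolfe conditions (3.6) for a trial step on a 1-D quadratic.
f = lambda x: (x - 1.0) ** 2
df = lambda x: 2.0 * (x - 1.0)

xk, pk = 0.0, 1.0
c1, c2 = 1e-4, 0.9

def wolfe(alpha):
    sufficient = f(xk + alpha * pk) <= f(xk) + c1 * alpha * df(xk) * pk  # (3.6a)
    curvature = df(xk + alpha * pk) * pk >= c2 * df(xk) * pk             # (3.6b)
    return sufficient and curvature

# alpha = 0.01 decreases f but leaves the slope nearly unchanged, so (3.6b)
# rejects it; alpha = 0.5 satisfies both conditions.
print(wolfe(0.01), wolfe(0.5))   # False True
```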
A step length may satisfy the Wolfe conditions without being particularly close to a
minimizer of φ, as we show in Figure 3.5. We can, however, modify the curvature condition
to force αk to lie in at least a broad neighborhood of a local minimizer or stationary point
of φ. The strong Wolfe conditions require αk to satisfy
f(x_k + α_k p_k) ≤ f(x_k) + c_1 α_k ∇f_k^T p_k,   (3.7a)
|∇f(x_k + α_k p_k)^T p_k| ≤ c_2 |∇f_k^T p_k|,   (3.7b)
with 0 < c1 < c2 < 1. The only difference with the Wolfe conditions is that we no longer
allow the derivative φ′(αk) to be too positive. Hence, we exclude points that are far from
stationary points of φ.
Figure 3.5 Step lengths satisfying the Wolfe conditions.
It is not difficult to prove that there exist step lengths that satisfy the Wolfe conditions
for every function f that is smooth and bounded below.
Lemma 3.1.
Suppose that f : R^n → R is continuously differentiable. Let pk be a descent direction at
xk, and assume that f is bounded below along the ray {x_k + α p_k | α > 0}. Then if 0 < c1 <
c2 < 1, there exist intervals of step lengths satisfying the Wolfe conditions (3.6) and the strong
Wolfe conditions (3.7).
PROOF. Note that φ(α) = f(x_k + α p_k) is bounded below for all α > 0. Since 0 < c1 < 1,
the line l(α) = f(x_k) + α c_1 ∇f_k^T p_k is unbounded below and must therefore intersect the
graph of φ at least once. Let α′ > 0 be the smallest intersecting value of α, that is,
f(x_k + α′ p_k) = f(x_k) + α′ c_1 ∇f_k^T p_k.   (3.8)
The sufficient decrease condition (3.6a) clearly holds for all step lengths less than α′.
By the mean value theorem (see (A.55)), there exists α′′ ∈ (0, α′) such that
f(x_k + α′ p_k) − f(x_k) = α′ ∇f(x_k + α′′ p_k)^T p_k.   (3.9)
By combining (3.8) and (3.9), we obtain
∇f(x_k + α′′ p_k)^T p_k = c_1 ∇f_k^T p_k > c_2 ∇f_k^T p_k,   (3.10)
since c1 < c2 and ∇f_k^T p_k < 0. Therefore, α′′ satisfies the Wolfe conditions (3.6), and the
inequalities hold strictly in both (3.6a) and (3.6b). Hence, by our smoothness assumption
on f , there is an interval around α′′ for which the Wolfe conditions hold. Moreover, since
the term in the left-hand side of (3.10) is negative, the strong Wolfe conditions (3.7) hold in
the same interval. �
The Wolfe conditions are scale-invariant in a broad sense: Multiplying the objective
function by a constant or making an affine change of variables does not alter them. They can
be used in most line search methods, and are particularly important in the implementation
of quasi-Newton methods, as we see in Chapter 6.
THE GOLDSTEIN CONDITIONS
Like the Wolfe conditions, the Goldstein conditions ensure that the step length α
achieves sufficient decrease but is not too short. The Goldstein conditions can also be stated
as a pair of inequalities, in the following way:
f(x_k) + (1 − c) α_k ∇f_k^T p_k ≤ f(x_k + α_k p_k) ≤ f(x_k) + c α_k ∇f_k^T p_k,   (3.11)
with 0 < c < 1/2. The second inequality is the sufficient decrease condition (3.4), whereas
the first inequality is introduced to control the step length from below; see Figure 3.6.
A disadvantage of the Goldstein conditions vis-à-vis the Wolfe conditions is that the
first inequality in (3.11) may exclude all minimizers of φ. However, the Goldstein and Wolfe
conditions have much in common, and their convergence theories are quite similar. The
Goldstein conditions are often used in Newton-type methods but are not well suited for
quasi-Newton methods that maintain a positive definite Hessian approximation.
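The two-sided bracket in (3.11) is again easy to test in code. In the sketch below (our own; the 1-D quadratic, the point, and the constant c = 0.25 are illustrative choices), a short step fails the lower line, a long step fails the upper line, and an intermediate step is accepted.

```python
# Checking the Goldstein conditions (3.11) on a 1-D quadratic.
f = lambda x: (x - 1.0) ** 2
df = lambda x: 2.0 * (x - 1.0)

xk, pk, c = 0.0, 1.0, 0.25
g = df(xk) * pk                   # directional derivative at alpha = 0; here -2

def goldstein(alpha):
    phi = f(xk + alpha * pk)
    # lower line rules out short steps, upper line is sufficient decrease
    return f(xk) + (1.0 - c) * alpha * g <= phi <= f(xk) + c * alpha * g

# For this phi the acceptable interval works out to [0.5, 1.5].
print(goldstein(0.1), goldstein(1.0), goldstein(2.0))   # False True False
```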
Figure 3.6 The Goldstein conditions.
SUFFICIENT DECREASE AND BACKTRACKING
We have mentioned that the sufficient decrease condition (3.6a) alone is not sufficient
to ensure that the algorithm makes reasonable progress along the given search direction.
However, if the line search algorithm chooses its candidate step lengths appropriately, by
using a so-called backtracking approach, we can dispense with the extra condition (3.6b)
and use just the sufficient decrease condition to terminate the line search procedure. In its
most basic form, backtracking proceeds as follows.
Algorithm 3.1 (Backtracking Line Search).
Choose ᾱ > 0, ρ ∈ (0, 1), c ∈ (0, 1); set α ← ᾱ;
repeat until f(x_k + α p_k) ≤ f(x_k) + c α ∇f_k^T p_k
  α ← ρα;
end (repeat)
Terminate with α_k = α.
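A minimal Python transcription of Algorithm 3.1 (our sketch; the test objective f(x) = xᵀx and the parameter values ᾱ = 1, ρ = 0.5, c = 10⁻⁴ are illustrative choices, not prescribed by the text):

```python
import numpy as np

def backtracking(f, grad_fk, xk, pk, alpha_bar=1.0, rho=0.5, c=1e-4):
    """Algorithm 3.1: shrink alpha until sufficient decrease (3.6a) holds."""
    alpha = alpha_bar
    while f(xk + alpha * pk) > f(xk) + c * alpha * (grad_fk @ pk):
        alpha *= rho
    return alpha

# Usage on the convex quadratic f(x) = x^T x with a steepest descent direction.
f = lambda x: x @ x
xk = np.array([3.0, -2.0])
gk = 2.0 * xk                 # gradient of x^T x at xk
pk = -gk                      # steepest descent direction
alpha = backtracking(f, gk, xk, pk)
print(alpha, f(xk + alpha * pk) < f(xk))   # 0.5 True
```

For a descent direction the loop always terminates, since sufficiently small α satisfies the sufficient decrease condition.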
In this procedure, the initial step length ᾱ is chosen to be 1 in Newton and quasi-
Newton methods, but can have different values in other algorithms such as steepest descent
or conjugate gradient. An acceptable step length will be found after a finite number of
trials, because αk will eventually become small enough that the sufficient decrease condition
holds (see Figure 3.3). In practice, the contraction factor ρ is often allowed to vary at each
iteration of the line search. For example, it can be chosen by safeguarded interpolation, as
we describe later. We need ensure only that at each iteration we have ρ ∈ [ρlo, ρhi], for some
fixed constants 0 < ρlo < ρhi < 1.
The backtracking approach ensures either that the selected step length αk is some fixed
value (the initial choice α¯), or else that it is short enough to satisfy the sufficient decrease
condition but not too short. The latter claim holds because the accepted value αk is within
a factor ρ of the previous trial value, αk/ρ, which was rejected for violating the sufficient
decrease condition, that is, for being too long.
This simple and popular strategy for terminating a line search is well suited for Newton
methods but is less appropriate for quasi-Newton and conjugate gradient methods.
3.2 CONVERGENCE OF LINE SEARCH METHODS
To obtain global convergence, we must not only have well chosen step lengths but also well
chosen search directions pk . We discuss requirements on the search direction in this section,
focusing on one key property: the angle θk between pk and the steepest descent direction
−∇ fk , defined by
cos θ_k = −∇f_k^T p_k / (‖∇f_k‖ ‖p_k‖).   (3.12)
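Definition (3.12) is a one-liner in code; the sketch below (our own, with an illustrative gradient vector) confirms that the steepest descent direction gives cos θ_k = 1, while a direction orthogonal to the gradient gives cos θ_k = 0.

```python
import numpy as np

def cos_theta(grad_fk, pk):
    """cos(theta_k) from (3.12)."""
    return -(grad_fk @ pk) / (np.linalg.norm(grad_fk) * np.linalg.norm(pk))

g = np.array([3.0, 4.0])                     # an illustrative gradient
print(cos_theta(g, -g))                      # steepest descent direction: 1.0
print(cos_theta(g, np.array([-4.0, 3.0])))   # direction orthogonal to g: cos theta = 0
```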
The following theorem, due to Zoutendijk, has far-reaching consequences. It quantifies
the effect of properly chosen step lengths αk, and shows, for example, that the steepest descent
method is globally convergent. For other algorithms, it describes how far pk can deviate
from the steepest descent direction and still produce a globally convergent iteration. Various
line search termination conditions can be used to establish this result, but for concreteness
we will consider only the Wolfe conditions (3.6). Though Zoutendijk’s result appears at first
to be technical and obscure, its power will soon become evident.
Theorem 3.2.
Consider any iteration of the form (3.1), where pk is a descent direction and αk satisfies
the Wolfe conditions (3.6). Suppose that f is bounded below in R^n and that f is continuously
differentiable in an open set N containing the level set L := {x : f(x) ≤ f(x_0)}, where x0 is
the starting point of the iteration. Assume also that the gradient ∇ f is Lipschitz continuous on
N , that is, there exists a constant L > 0 such that
‖∇f(x) − ∇f(x̃)‖ ≤ L ‖x − x̃‖,   for all x, x̃ ∈ N.   (3.13)
Then
∑_{k≥0} cos² θ_k ‖∇f_k‖² < ∞.   (3.14)
PROOF. From (3.6b) and (3.1) we have that
(∇f_{k+1} − ∇f_k)^T p_k ≥ (c_2 − 1) ∇f_k^T p_k,
while the Lipschitz condition (3.13) implies that
(∇f_{k+1} − ∇f_k)^T p_k ≤ α_k L ‖p_k‖².
By combining these two relations, we obtain
α_k ≥ ((c_2 − 1)/L) (∇f_k^T p_k / ‖p_k‖²).
By substituting this inequality into the first Wolfe condition (3.6a), we obtain
f_{k+1} ≤ f_k − c_1 ((1 − c_2)/L) (∇f_k^T p_k)² / ‖p_k‖².
From the definition (3.12), we can write this relation as
f_{k+1} ≤ f_k − c cos² θ_k ‖∇f_k‖²,
where c = c_1 (1 − c_2)/L. By summing this expression over all indices less than or equal to
k, we obtain
f_{k+1} ≤ f_0 − c ∑_{j=0}^{k} cos² θ_j ‖∇f_j‖².   (3.15)
Since f is bounded below, we have that f_0 − f_{k+1} is less than some positive constant, for all
k. Hence, by taking limits in (3.15), we obtain
∑_{k=0}^{∞} cos² θ_k ‖∇f_k‖² < ∞,
which concludes the proof. �
Similar results to this theorem hold when the Goldstein conditions (3.11) or strong
Wolfe conditions (3.7) are used in place of the Wolfe conditions. For all these strategies, the
step length selection implies inequality (3.14), which we call the Zoutendijk condition.
Note that the assumptions of Theorem 3.2 are not too restrictive. If the function f were
not bounded below, the optimization problem would not be well defined. The smoothness
assumption—Lipschitz continuity of the gradient—is implied by many of the smoothness
conditions that are used in local convergence theorems (see Chapters 6 and 7) and are often
satisfied in practice.
The Zoutendijk condition (3.14) implies that
cos² θ_k ‖∇f_k‖² → 0.   (3.16)
This limit can be used in turn to derive global convergence results for line search algorithms.
If our method for choosing the search direction pk in the iteration (3.1) ensures that
the angle θk defined by (3.12) is bounded away from 90◦, there is a positive constant δ such
that
cos θ_k ≥ δ > 0,   for all k.   (3.17)
It follows immediately from (3.16) that
lim_{k→∞} ‖∇f_k‖ = 0.   (3.18)
In other words, we can be sure that the gradient norms ‖∇f_k‖ converge to zero, provided
that the search directions are never too close to orthogonality with the gradient. In particular,
the method of steepest descent (for which the search direction pk is parallel to the negative
gradient) produces a gradient sequence that converges to zero, provided that it uses a line
search satisfying the Wolfe or Goldstein conditions.
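This guarantee can be observed numerically. The sketch below (our construction; the quadratic, its matrix A, and the starting point are illustrative choices) runs steepest descent with the exact minimizing step along each direction, which for a convex quadratic is available in closed form and satisfies the Wolfe conditions. The partial sums of cos² θ_k ‖∇f_k‖² settle at a finite value, consistent with (3.14), while the gradient norm is driven to zero, consistent with (3.18).

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite (our choice)
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([5.0, -3.0])
zoutendijk_sum = 0.0
for k in range(200):
    g = grad(x)
    gnorm2 = g @ g
    if gnorm2 < 1e-30:
        break                            # converged to machine precision
    alpha = gnorm2 / (g @ A @ g)         # exact minimizer along p = -g
    zoutendijk_sum += gnorm2             # cos^2(theta_k) = 1 for steepest descent
    x = x - alpha * g

# Finite Zoutendijk sum, vanishing gradient norm.
print(zoutendijk_sum, np.linalg.norm(grad(x)))
```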
We use the term globally convergent to refer to algorithms for which the property
(3.18) is satisfied, but note that this term is sometimes used in other contexts to mean
different things. For line search methods of the general form (3.1), the limit (3.18) is the
strongest global convergence result that can be obtained: We cannot guarantee that the
method converges to a minimizer, but only that it is attracted by stationary points. Only
by making additional requirements on the search direction pk—by introducing negative
curvature information from the Hessian ∇2 f (xk), for example—can we strengthen these
results to include convergence to a local minimum. See the Notes and References at the end
of this chapter for further discussion of this point.
Consider now the Newton-like method (3.1), (3.2) and assume that the matrices Bk
are positive definite with a uniformly bounded condition number. That is, there is a constant
M such that
‖B_k‖ ‖B_k^{-1}‖ ≤ M,   for all k.
It is easy to show from the definition (3.12) that
cos θ_k ≥ 1/M   (3.19)
(see Exercise 3.5). By combining this bound with (3.16) we find that
lim_{k→∞} ‖∇f_k‖ = 0.   (3.20)
Therefore, we have shown that Newton and quasi-Newton methods are globally convergent
if the matrices Bk have a bounded condition number and are positive definite (which is
needed to ensure that pk is a descent direction), and if the step lengths satisfy the Wolfe
conditions.
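Bound (3.19) can be spot-checked numerically. The sketch below (our construction, anticipating Exercise 3.5 rather than reproducing it) draws random symmetric positive definite matrices B_k and random gradients, forms p_k = −B_k^{-1} ∇f_k, and verifies cos θ_k ≥ 1/(‖B_k‖ ‖B_k^{-1}‖) in every trial.

```python
import numpy as np

rng = np.random.default_rng(0)
ok = True
for _ in range(100):
    # Random symmetric positive definite B and random gradient g.
    Q = rng.standard_normal((3, 3))
    B = Q @ Q.T + 3.0 * np.eye(3)
    g = rng.standard_normal(3)
    p = -np.linalg.solve(B, g)                 # p_k = -B_k^{-1} grad f_k
    cos_theta = -(g @ p) / (np.linalg.norm(g) * np.linalg.norm(p))
    M = np.linalg.norm(B, 2) * np.linalg.norm(np.linalg.inv(B), 2)
    ok = ok and (cos_theta >= 1.0 / M - 1e-12)  # bound (3.19), up to roundoff
print(ok)                                       # True
```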
For some algorithms, such as conjugate gradient methods, we will not be able to prove
the limit (3.18), but only the weaker result
lim inf_{k→∞} ‖∇f_k‖ = 0.   (3.21)
In other words, just a subsequence of the gradient norms ‖∇f_{k_j}‖ converges to zero, rather
than the whole sequence (see Appendix A). This result, too, can be proved by using Zou-
tendijk’s condition (3.14), but instead of a constructive proof, we outline a proof by
contradiction. Suppose that (3.21) does not hold, so that the gradients remain bounded
away from zero, that is, there exists γ > 0 such that
‖∇f_k‖ ≥ γ,   for all k sufficiently large.   (3.22)
Then from (3.16) we conclude that
cos θ_k → 0,   (3.23)
that is, the entire sequence {cos θ_k} converges to 0. To establish (3.21), therefore, it is enough
to show that a subsequence {cos θ_{k_j}} is bounded away from zero. We will use this strategy in
Chapter 5 to study the convergence of nonlinear conjugate gradient methods.
By applying this proof technique, we can prove global convergence in the sense of
(3.20) or (3.21) for a general class of algorithms. Consider any algorithm for which (i) every
iteration produces a decrease in the objective function, and (ii) every mth iteration is a
steepest descent step, with step length chosen to satisfy the Wolfe or Goldstein conditions.
Then, since cos θ_k = 1 for the steepest descent steps, the result (3.21) holds. Of course, we
would design the algorithm so that it does something "better" than steepest descent at the
other m − 1 iterates. The occasional steepest descent steps may not make much progress,
but they at least guarantee overall global convergence.
Note that throughout this section we have used only the fact that Zoutendijk's condi-
tion implies the limit (3.16). In later chapters we will make use of the bounded sum condition
(3.14), which forces the sequence {cos² θ_k ‖∇f_k‖²} to converge to zero at a sufficiently rapid
rate.
3.3 RATE OF CONVERGENCE
It wou