CHAPTER 3
Line Search Methods
Each iteration of a line search method computes a search direction pk and then decides how
far to move along that direction. The iteration is given by
x_{k+1} = x_k + α_k p_k,   (3.1)
where the positive scalar αk is called the step length. The success of a line search method
depends on effective choices of both the direction pk and the step length αk .
Most line search algorithms require pk to be a descent direction—one for which
p_k^T ∇f_k < 0—because this property guarantees that the function f can be reduced along
this direction, as discussed in the previous chapter. Moreover, the search direction often has
the form
p_k = −B_k^{-1} ∇f_k,   (3.2)
where Bk is a symmetric and nonsingular matrix. In the steepest descent method, Bk is
simply the identity matrix I, while in Newton's method, Bk is the exact Hessian ∇²f(x_k).
In quasi-Newton methods, Bk is an approximation to the Hessian that is updated at every
iteration by means of a low-rank formula. When pk is defined by (3.2) and Bk is positive
definite, we have
p_k^T ∇f_k = −∇f_k^T B_k^{-1} ∇f_k < 0,
and therefore pk is a descent direction.
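As a concrete numerical check of this algebra, the sketch below (our own construction, not from the text) forms p_k = −B_k^{-1} ∇f_k for a small symmetric positive definite B_k and confirms that the resulting direction satisfies p_k^T ∇f_k < 0. The quadratic objective and the matrix B are illustrative choices.

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x) = 0.5 x^T A x - b^T x for a fixed SPD matrix A (our choice).
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])
    return A @ x - b

x = np.array([2.0, -1.0])
g = grad_f(x)

# Any symmetric positive definite B_k works; here a simple diagonal scaling.
B = np.diag([2.0, 0.5])
p = -np.linalg.solve(B, g)     # p_k = -B_k^{-1} grad f_k, as in (3.2)

# p_k^T grad f_k = -grad f_k^T B_k^{-1} grad f_k < 0, so p_k is a descent direction.
print(p @ g)                   # prints -10.0 here: strictly negative
```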
In this chapter, we discuss how to choose αk and pk to promote convergence from
remote starting points. We also study the rate of convergence of steepest descent, quasi-
Newton, and Newton methods. Since the pure Newton iteration is not guaranteed to produce
descent directions when the current iterate is not close to a solution, we discuss modifications
in Section 3.4 that allow it to start from any initial point.
We now give careful consideration to the choice of the step-length parameter αk .
3.1 STEP LENGTH
In computing the step length αk , we face a tradeoff. We would like to choose αk to give
a substantial reduction of f , but at the same time we do not want to spend too much
time making the choice. The ideal choice would be the global minimizer of the univariate
function φ(·) defined by
φ(α) = f(x_k + α p_k),   α > 0,   (3.3)
but in general, it is too expensive to identify this value (see Figure 3.1). To find even a local
minimizer of φ to moderate precision generally requires too many evaluations of the objec-
tive function f and possibly the gradient ∇ f . More practical strategies perform an inexact
line search to identify a step length that achieves adequate reductions in f at minimal cost.
Typical line search algorithms try out a sequence of candidate values for α, stopping to
accept one of these values when certain conditions are satisfied. The line search is done in two
stages: A bracketing phase finds an interval containing desirable step lengths, and a bisection
or interpolation phase computes a good step length within this interval. Sophisticated line
search algorithms can be quite complicated, so we defer a full description until Section 3.5.
Figure 3.1 The ideal step length is the global minimizer.
We now discuss various termination conditions for line search algorithms and show
that effective step lengths need not lie near minimizers of the univariate function φ(α)
defined in (3.3).
A simple condition we could impose on αk is to require a reduction in f , that is,
f(x_k + α_k p_k) < f(x_k). That this requirement is not enough to produce convergence to
x* is illustrated in Figure 3.2, for which the minimum function value is f* = −1, but a
sequence of iterates {x_k} for which f(x_k) = 5/k, k = 1, 2, . . . , yields a decrease at each
iteration but has a limiting function value of zero. The insufficient reduction in f at each
step causes the sequence to fail to converge to the minimizer of this convex function. To avoid this
behavior we need to enforce a sufficient decrease condition, a concept we discuss next.
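The failure mode just described is easy to reproduce numerically. The sketch below (our construction, using the illustrative values f* = −1 and f(x_k) = 5/k) confirms that the function values decrease at every iteration yet converge to 0, far above the minimum value.

```python
# Illustrative check: the values f(x_k) = 5/k decrease at every step,
# yet they converge to 0, not to the minimum value f* = -1.
f_vals = [5.0 / k for k in range(1, 1001)]
decreasing = all(b < a for a, b in zip(f_vals, f_vals[1:]))
print(decreasing, f_vals[-1])   # True 0.005 -- monotone decrease, limit 0 > f* = -1
```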
Figure 3.2 Insufficient reduction in f.
THE WOLFE CONDITIONS
A popular inexact line search condition stipulates that αk should first of all give
sufficient decrease in the objective function f , as measured by the following inequality:
f(x_k + α p_k) ≤ f(x_k) + c_1 α ∇f_k^T p_k,   (3.4)
for some constant c1 ∈ (0, 1). In other words, the reduction in f should be proportional to
both the step length αk and the directional derivative ∇f_k^T p_k. Inequality (3.4) is sometimes
called the Armijo condition.
The sufficient decrease condition is illustrated in Figure 3.3. The right-hand side of
(3.4), which is a linear function, can be denoted by l(α). The function l(·) has negative slope
c_1 ∇f_k^T p_k, but because c1 ∈ (0, 1), it lies above the graph of φ for small positive values of
α. The sufficient decrease condition states that α is acceptable only if φ(α) ≤ l(α). The
intervals on which this condition is satisfied are shown in Figure 3.3. In practice, c1 is chosen
to be quite small, say c1 = 10^{-4}.
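The Armijo test is a one-line check in code. The sketch below (our own; the 1-D quadratic, the point x_k, the direction p_k, and the value of c1 are illustrative choices) shows a modest step passing condition (3.4) while an overlong step fails it.

```python
# A minimal check of the Armijo condition (3.4) on a 1-D quadratic.
f = lambda x: (x - 1.0) ** 2
df = lambda x: 2.0 * (x - 1.0)

xk, pk, c1 = 0.0, 1.0, 1e-4   # p_k = 1 is a descent direction since df(0) = -2

def armijo(alpha):
    # phi(alpha) <= f(x_k) + c1 * alpha * grad f_k^T p_k
    return f(xk + alpha * pk) <= f(xk) + c1 * alpha * df(xk) * pk

print(armijo(0.5), armijo(3.0))   # True False: a modest step passes, an overlong one fails
```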
The sufficient decrease condition is not enough by itself to ensure that the algorithm
makes reasonable progress because, as we see from Figure 3.3, it is satisfied for all sufficiently
small values of α. To rule out unacceptably short steps we introduce a second requirement,
called the curvature condition, which requires αk to satisfy
∇f(x_k + α_k p_k)^T p_k ≥ c_2 ∇f_k^T p_k,   (3.5)
for some constant c2 ∈ (c1, 1), where c1 is the constant from (3.4). Note that the left-hand
side is simply the derivative φ′(αk), so the curvature condition ensures that the slope of φ at
αk is greater than c2 times the initial slope φ′(0). This makes sense because if the slope φ′(α)
Figure 3.3 Sufficient decrease condition.
Figure 3.4 The curvature condition.
is strongly negative, we have an indication that we can reduce f significantly by moving
further along the chosen direction.
On the other hand, if φ′(αk) is only slightly negative or even positive, it is a sign that
we cannot expect much more decrease in f in this direction, so it makes sense to terminate
the line search. The curvature condition is illustrated in Figure 3.4. Typical values of c2 are
0.9 when the search direction pk is chosen by a Newton or quasi-Newton method, and 0.1
when pk is obtained from a nonlinear conjugate gradient method.
The sufficient decrease and curvature conditions are known collectively as the Wolfe
conditions. We illustrate them in Figure 3.5 and restate them here for future reference:
f(x_k + α_k p_k) ≤ f(x_k) + c_1 α_k ∇f_k^T p_k,   (3.6a)
∇f(x_k + α_k p_k)^T p_k ≥ c_2 ∇f_k^T p_k,   (3.6b)
with 0 < c1 < c2 < 1.
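Both Wolfe conditions can be verified together for a trial step. The sketch below (our construction; the 1-D quadratic and the constants c1 = 10⁻⁴, c2 = 0.9 are illustrative, though these constant values are the typical ones mentioned above) shows a tiny step failing the curvature condition (3.6b) and a larger step satisfying both.

```python
# Checking both Wolfe conditions (3.6) for a trial step on a 1-D quadratic.
f = lambda x: (x - 1.0) ** 2
df = lambda x: 2.0 * (x - 1.0)

xk, pk = 0.0, 1.0
c1, c2 = 1e-4, 0.9

def wolfe(alpha):
    sufficient = f(xk + alpha * pk) <= f(xk) + c1 * alpha * df(xk) * pk  # (3.6a)
    curvature = df(xk + alpha * pk) * pk >= c2 * df(xk) * pk             # (3.6b)
    return sufficient and curvature

# alpha = 0.01 decreases f but leaves the slope nearly unchanged, so (3.6b)
# rejects it; alpha = 0.5 satisfies both conditions.
print(wolfe(0.01), wolfe(0.5))   # False True
```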
A step length may satisfy the Wolfe conditions without being particularly close to a
minimizer of φ, as we show in Figure 3.5. We can, however, modify the curvature condition
to force αk to lie in at least a broad neighborhood of a local minimizer or stationary point
of φ. The strong Wolfe conditions require αk to satisfy
f(x_k + α_k p_k) ≤ f(x_k) + c_1 α_k ∇f_k^T p_k,   (3.7a)
|∇f(x_k + α_k p_k)^T p_k| ≤ c_2 |∇f_k^T p_k|,   (3.7b)
with 0 < c1 < c2 < 1. The only difference with the Wolfe conditions is that we no longer
allow the derivative φ′(αk) to be too positive. Hence, we exclude points that are far from
stationary points of φ.
Figure 3.5 Step lengths satisfying the Wolfe conditions.
It is not difficult to prove that there exist step lengths that satisfy the Wolfe conditions
for every function f that is smooth and bounded below.
Lemma 3.1.
Suppose that f : R^n → R is continuously differentiable. Let pk be a descent direction at
xk, and assume that f is bounded below along the ray {x_k + α p_k | α > 0}. Then if 0 < c1 <
c2 < 1, there exist intervals of step lengths satisfying the Wolfe conditions (3.6) and the strong
Wolfe conditions (3.7).
PROOF. Note that φ(α) = f(x_k + α p_k) is bounded below for all α > 0. Since 0 < c1 < 1,
the line l(α) = f(x_k) + α c_1 ∇f_k^T p_k is unbounded below and must therefore intersect the
graph of φ at least once. Let α′ > 0 be the smallest intersecting value of α, that is,
f(x_k + α′ p_k) = f(x_k) + α′ c_1 ∇f_k^T p_k.   (3.8)
The sufficient decrease condition (3.6a) clearly holds for all step lengths less than α′.
By the mean value theorem (see (A.55)), there exists α′′ ∈ (0, α′) such that
f(x_k + α′ p_k) − f(x_k) = α′ ∇f(x_k + α′′ p_k)^T p_k.   (3.9)
By combining (3.8) and (3.9), we obtain
∇f(x_k + α′′ p_k)^T p_k = c_1 ∇f_k^T p_k > c_2 ∇f_k^T p_k,   (3.10)
since c1 < c2 and ∇f_k^T p_k < 0. Therefore, α′′ satisfies the Wolfe conditions (3.6), and the
inequalities hold strictly in both (3.6a) and (3.6b). Hence, by our smoothness assumption
on f , there is an interval around α′′ for which the Wolfe conditions hold. Moreover, since
the term in the left-hand side of (3.10) is negative, the strong Wolfe conditions (3.7) hold in
the same interval. �
The Wolfe conditions are scale-invariant in a broad sense: Multiplying the objective
function by a constant or making an affine change of variables does not alter them. They can
be used in most line search methods, and are particularly important in the implementation
of quasi-Newton methods, as we see in Chapter 6.
THE GOLDSTEIN CONDITIONS
Like the Wolfe conditions, the Goldstein conditions ensure that the step length α
achieves sufficient decrease but is not too short. The Goldstein conditions can also be stated
as a pair of inequalities, in the following way:
f(x_k) + (1 − c) α_k ∇f_k^T p_k ≤ f(x_k + α_k p_k) ≤ f(x_k) + c α_k ∇f_k^T p_k,   (3.11)
with 0 < c < 1/2. The second inequality is the sufficient decrease condition (3.4), whereas
the first inequality is introduced to control the step length from below; see Figure 3.6.
A disadvantage of the Goldstein conditions vis-à-vis the Wolfe conditions is that the
first inequality in (3.11) may exclude all minimizers of φ. However, the Goldstein and Wolfe
conditions have much in common, and their convergence theories are quite similar. The
Goldstein conditions are often used in Newton-type methods but are not well suited for
quasi-Newton methods that maintain a positive definite Hessian approximation.
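The two-sided bracket in (3.11) is again easy to test in code. In the sketch below (our own; the 1-D quadratic, the point, and the constant c = 0.25 are illustrative choices), a short step fails the lower line, a long step fails the upper line, and an intermediate step is accepted.

```python
# Checking the Goldstein conditions (3.11) on a 1-D quadratic.
f = lambda x: (x - 1.0) ** 2
df = lambda x: 2.0 * (x - 1.0)

xk, pk, c = 0.0, 1.0, 0.25
g = df(xk) * pk                   # directional derivative at alpha = 0; here -2

def goldstein(alpha):
    phi = f(xk + alpha * pk)
    # lower line rules out short steps, upper line is sufficient decrease
    return f(xk) + (1.0 - c) * alpha * g <= phi <= f(xk) + c * alpha * g

# For this phi the acceptable interval works out to [0.5, 1.5].
print(goldstein(0.1), goldstein(1.0), goldstein(2.0))   # False True False
```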
Figure 3.6 The Goldstein conditions.
SUFFICIENT DECREASE AND BACKTRACKING
We have mentioned that the sufficient decrease condition (3.6a) alone is not sufficient
to ensure that the algorithm makes reasonable progress along the given search direction.
However, if the line search algorithm chooses its candidate step lengths appropriately, by
using a so-called backtracking approach, we can dispense with the extra condition (3.6b)
and use just the sufficient decrease condition to terminate the line search procedure. In its
most basic form, backtracking proceeds as follows.
Algorithm 3.1 (Backtracking Line Search).
Choose ᾱ > 0, ρ ∈ (0, 1), c ∈ (0, 1); set α ← ᾱ;
repeat until f(x_k + α p_k) ≤ f(x_k) + c α ∇f_k^T p_k
  α ← ρα;
end (repeat)
Terminate with α_k = α.
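A minimal Python transcription of Algorithm 3.1 (our sketch; the test objective f(x) = xᵀx and the parameter values ᾱ = 1, ρ = 0.5, c = 10⁻⁴ are illustrative choices, not prescribed by the text):

```python
import numpy as np

def backtracking(f, grad_fk, xk, pk, alpha_bar=1.0, rho=0.5, c=1e-4):
    """Algorithm 3.1: shrink alpha until sufficient decrease (3.6a) holds."""
    alpha = alpha_bar
    while f(xk + alpha * pk) > f(xk) + c * alpha * (grad_fk @ pk):
        alpha *= rho
    return alpha

# Usage on the convex quadratic f(x) = x^T x with a steepest descent direction.
f = lambda x: x @ x
xk = np.array([3.0, -2.0])
gk = 2.0 * xk                 # gradient of x^T x at xk
pk = -gk                      # steepest descent direction
alpha = backtracking(f, gk, xk, pk)
print(alpha, f(xk + alpha * pk) < f(xk))   # 0.5 True
```

For a descent direction the loop always terminates, since sufficiently small α satisfies the sufficient decrease condition.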
In this procedure, the initial step length ᾱ is chosen to be 1 in Newton and quasi-
Newton methods, but can have different values in other algorithms such as steepest descent
or conjugate gradient. An acceptable step length will be found after a finite number of
trials, because αk will eventually become small enough that the sufficient decrease condition
holds (see Figure 3.3). In practice, the contraction factor ρ is often allowed to vary at each
iteration of the line search. For example, it can be chosen by safeguarded interpolation, as
we describe later. We need ensure only that at each iteration we have ρ ∈ [ρlo, ρhi], for some
fixed constants 0 < ρlo < ρhi < 1.
The backtracking approach ensures either that the selected step length αk is some fixed
value (the initial choice α¯), or else that it is short enough to satisfy the sufficient decrease
condition but not too short. The latter claim holds because the accepted value αk is within
a factor ρ of the previous trial value, αk/ρ, which was rejected for violating the sufficient
decrease condition, that is, for being too long.
This simple and popular strategy for terminating a line search is well suited for Newton
methods but is less appropriate for quasi-Newton and conjugate gradient methods.
3.2 CONVERGENCE OF LINE SEARCH METHODS
To obtain global convergence, we must not only have well chosen step lengths but also well
chosen search directions pk . We discuss requirements on the search direction in this section,
focusing on one key property: the angle θk between pk and the steepest descent direction
−∇ fk , defined by
cos θ_k = −∇f_k^T p_k / (‖∇f_k‖ ‖p_k‖).   (3.12)
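Definition (3.12) is a one-liner in code; the sketch below (our own, with an illustrative gradient vector) confirms that the steepest descent direction gives cos θ_k = 1, while a direction orthogonal to the gradient gives cos θ_k = 0.

```python
import numpy as np

def cos_theta(grad_fk, pk):
    """cos(theta_k) from (3.12)."""
    return -(grad_fk @ pk) / (np.linalg.norm(grad_fk) * np.linalg.norm(pk))

g = np.array([3.0, 4.0])                     # an illustrative gradient
print(cos_theta(g, -g))                      # steepest descent direction: 1.0
print(cos_theta(g, np.array([-4.0, 3.0])))   # direction orthogonal to g: cos theta = 0
```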
The following theorem, due to Zoutendijk, has far-reaching consequences. It quantifies
the effect of properly chosen step lengths αk, and shows, for example, that the steepest descent
method is globally convergent. For other algorithms, it describes how far pk can deviate
from the steepest descent direction and still produce a globally convergent iteration. Various
line search termination conditions can be used to establish this result, but for concreteness
we will consider only the Wolfe conditions (3.6). Though Zoutendijk’s result appears at first
to be technical and obscure, its power will soon become evident.
Theorem 3.2.
Consider any iteration of the form (3.1), where pk is a descent direction and αk satisfies
the Wolfe conditions (3.6). Suppose that f is bounded below in R^n and that f is continuously
differentiable in an open set N containing the level set L := {x : f(x) ≤ f(x_0)}, where x0 is
the starting point of the iteration. Assume also that the gradient ∇ f is Lipschitz continuous on
N , that is, there exists a constant L > 0 such that
‖∇f(x) − ∇f(x̃)‖ ≤ L ‖x − x̃‖,   for all x, x̃ ∈ N.   (3.13)
Then
∑_{k≥0} cos² θ_k ‖∇f_k‖² < ∞.   (3.14)
PROOF. From (3.6b) and (3.1) we have that
(∇f_{k+1} − ∇f_k)^T p_k ≥ (c_2 − 1) ∇f_k^T p_k,
while the Lipschitz condition (3.13) implies that
(∇f_{k+1} − ∇f_k)^T p_k ≤ α_k L ‖p_k‖².
By combining these two relations, we obtain
α_k ≥ ((c_2 − 1)/L) (∇f_k^T p_k / ‖p_k‖²).
By substituting this inequality into the first Wolfe condition (3.6a), we obtain
f_{k+1} ≤ f_k − c_1 ((1 − c_2)/L) (∇f_k^T p_k)² / ‖p_k‖².
From the definition (3.12), we can write this relation as
f_{k+1} ≤ f_k − c cos² θ_k ‖∇f_k‖²,
where c = c_1 (1 − c_2)/L. By summing this expression over all indices less than or equal to
k, we obtain
f_{k+1} ≤ f_0 − c ∑_{j=0}^{k} cos² θ_j ‖∇f_j‖².   (3.15)
Since f is bounded below, we have that f_0 − f_{k+1} is less than some positive constant, for all
k. Hence, by taking limits in (3.15), we obtain
∑_{k=0}^{∞} cos² θ_k ‖∇f_k‖² < ∞,
which concludes the proof. �
Similar results to this theorem hold when the Goldstein conditions (3.11) or strong
Wolfe conditions (3.7) are used in place of the Wolfe conditions. For all these strategies, the
step length selection implies inequality (3.14), which we call the Zoutendijk condition.
Note that the assumptions of Theorem 3.2 are not too restrictive. If the function f were
not bounded below, the optimization problem would not be well defined. The smoothness
assumption—Lipschitz continuity of the gradient—is implied by many of the smoothness
conditions that are used in local convergence theorems (see Chapters 6 and 7) and are often
satisfied in practice.
The Zoutendijk condition (3.14) implies that
cos² θ_k ‖∇f_k‖² → 0.   (3.16)
This limit can be used in turn to derive global convergence results for line search algorithms.
If our method for choosing the search direction pk in the iteration (3.1) ensures that
the angle θk defined by (3.12) is bounded away from 90◦, there is a positive constant δ such
that
cos θ_k ≥ δ > 0,   for all k.   (3.17)
It follows immediately from (3.16) that
lim_{k→∞} ‖∇f_k‖ = 0.   (3.18)
In other words, we can be sure that the gradient norms ‖∇f_k‖ converge to zero, provided
that the search directions are never too close to orthogonality with the gradient. In particular,
the method of steepest descent (for which the search direction pk is parallel to the negative
gradient) produces a gradient sequence that converges to zero, provided that it uses a line
search satisfying the Wolfe or Goldstein conditions.
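This guarantee can be observed numerically. The sketch below (our construction; the quadratic, its matrix A, and the starting point are illustrative choices) runs steepest descent with the exact minimizing step along each direction, which for a convex quadratic is available in closed form and satisfies the Wolfe conditions. The partial sums of cos² θ_k ‖∇f_k‖² settle at a finite value, consistent with (3.14), while the gradient norm is driven to zero, consistent with (3.18).

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite (our choice)
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([5.0, -3.0])
zoutendijk_sum = 0.0
for k in range(200):
    g = grad(x)
    gnorm2 = g @ g
    if gnorm2 < 1e-30:
        break                            # converged to machine precision
    alpha = gnorm2 / (g @ A @ g)         # exact minimizer along p = -g
    zoutendijk_sum += gnorm2             # cos^2(theta_k) = 1 for steepest descent
    x = x - alpha * g

# Finite Zoutendijk sum, vanishing gradient norm.
print(zoutendijk_sum, np.linalg.norm(grad(x)))
```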
We use the term globally convergent to refer to algorithms for which the property
(3.18) is satisfied, but note that this term is sometimes used in other contexts to mean
different things. For line search methods of the general form (3.1), the limit (3.18) is the
strongest global convergence result that can be obtained: We cannot guarantee that the
method converges to a minimizer, but only that it is attracted by stationary points. Only
by making additional requirements on the search direction pk—by introducing negative
curvature information from the Hessian ∇2 f (xk), for example—can we strengthen these
results to include convergence to a local minimum. See the Notes and References at the end
of this chapter for further discussion of this point.
Consider now the Newton-like method (3.1), (3.2) and assume that the matrices Bk
are positive definite with a uniformly bounded condition number. That is, there is a constant
M such that
‖B_k‖ ‖B_k^{-1}‖ ≤ M,   for all k.
It is easy to show from the definition (3.12) that
cos θ_k ≥ 1/M   (3.19)
(see Exercise 3.5). By combining this bound with (3.16) we find that
lim_{k→∞} ‖∇f_k‖ = 0.   (3.20)
Therefore, we have shown that Newton and quasi-Newton methods are globally convergent
if the matrices Bk have a bounded condition number and are positive definite (which is
needed to ensure that pk is a descent direction), and if the step lengths satisfy the Wolfe
conditions.
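Bound (3.19) can be spot-checked numerically. The sketch below (our construction, anticipating Exercise 3.5 rather than reproducing it) draws random symmetric positive definite matrices B_k and random gradients, forms p_k = −B_k^{-1} ∇f_k, and verifies cos θ_k ≥ 1/(‖B_k‖ ‖B_k^{-1}‖) in every trial.

```python
import numpy as np

rng = np.random.default_rng(0)
ok = True
for _ in range(100):
    # Random symmetric positive definite B and random gradient g.
    Q = rng.standard_normal((3, 3))
    B = Q @ Q.T + 3.0 * np.eye(3)
    g = rng.standard_normal(3)
    p = -np.linalg.solve(B, g)                 # p_k = -B_k^{-1} grad f_k
    cos_theta = -(g @ p) / (np.linalg.norm(g) * np.linalg.norm(p))
    M = np.linalg.norm(B, 2) * np.linalg.norm(np.linalg.inv(B), 2)
    ok = ok and (cos_theta >= 1.0 / M - 1e-12)  # bound (3.19), up to roundoff
print(ok)                                       # True
```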
For some algorithms, such as conjugate gradient methods, we will not be able to prove
the limit (3.18), but only the weaker result
lim inf_{k→∞} ‖∇f_k‖ = 0.   (3.21)
In other words, just a subsequence of the gradient norms ‖∇f_{k_j}‖ converges to zero, rather
than the whole sequence (see Appendix A). This result, too, can be proved by using Zou-
tendijk’s condition (3.14), but instead of a constructive proof, we outline a proof by
contradiction. Suppose that (3.21) does not hold, so that the gradients remain bounded
away from zero, that is, there exists γ > 0 such that
‖∇f_k‖ ≥ γ,   for all k sufficiently large.   (3.22)
Then from (3.16) we conclude that
cos θ_k → 0,   (3.23)
that is, the entire sequence {cos θ_k} converges to 0. To establish (3.21), therefore, it is enough
to show that a subsequence {cos θ_{k_j}} is bounded away from zero. We will use this strategy in
Chapter 5 to study the convergence of nonlinear conjugate gradient methods.
By applying this proof technique, we can prove global convergence in the sense of
(3.20) or (3.21) for a general class of algorithms. Consider any algorithm for which (i) every
iteration produces a decrease in the objective function, and (ii) every mth iteration is a
steepest descent step, with step length chosen to satisfy the Wolfe or Goldstein conditions.
Then, since cos θ_k = 1 for the steepest descent steps, the result (3.21) holds. Of course, we
would design the algorithm so that it does something "better" than steepest descent at the
other m − 1 iterates. The occasional steepest descent steps may not make much progress,
but they at least guarantee overall global convergence.
Note that throughout this section we have used only the fact that Zoutendijk's condi-
tion implies the limit (3.16). In later chapters we will make use of the bounded sum condition
(3.14), which forces the sequence {cos² θ_k ‖∇f_k‖²} to converge to zero at a sufficiently rapid
rate.
3.3 RATE OF CONVERGENCE
It wou