Log-Linear Models
Noah A. Smith∗
Department of Computer Science /
Center for Language and Speech Processing
Johns Hopkins University
nasmith@cs.jhu.edu
December 2004
Abstract
This is yet another introduction to log-linear (“maximum entropy”) models for NLP prac-
titioners, in the spirit of Berger (1996) and Ratnaparkhi (1997b). The derivations here are
similar to Berger’s, but more details are filled in and some errors are corrected. I do not address
iterative scaling (Darroch and Ratcliff, 1972), but rather give derivations of the gradient and
Hessian of the dual objective function (conditional likelihood).
Note: This is a draft; please contact the author if you have comments, and do not cite or
circulate this document.
1 Log-linear Models
Log-linear models1 have become a widely-used tool in NLP classification tasks (Berger et al.,
1996; Ratnaparkhi, 1998). Log-linear models assign joint probabilities to observation/label pairs
(x, y) ∈ X × Y as follows:

    Pr_θ(x, y) = exp(θ · f(x, y)) / Σ_{x′,y′} exp(θ · f(x′, y′))                    (1)
where θ is a real-valued vector of feature weights and f is a function that maps pairs (x, y) to a
nonnegative real-valued feature vector. These features can take any form; in particular, unlike
directed generative models (such as HMMs and PCFGs), the features may overlap, predicting parts
of the data more than once.2 Each feature fi has an associated weight θi.
Maximum likelihood parameter estimation (training) for such a model, with a set of labeled
examples, amounts to solving the following optimization problem. Let {(x1, y∗1), (x2, y∗2), ..., (xm, y∗m)}
be the labeled training set.

∗This document is a revised version of portions of the author's 2004 thesis research proposal, "Discovering grammatical structure in unannotated text: implicit negative evidence and dynamic feature selection."
1Such models have many names, including maximum-entropy models, exponential models, and Gibbs models; Markov random fields are structured log-linear models, and conditional random fields (Lafferty et al., 2001) are Markov random fields with a specific training criterion.
2The ability to handle arbitrary, overlapping features is an important advantage that log-linear models have over directed generative models (like HMMs and PCFGs). Importantly, for the latter type of model, maximum likelihood estimation is not straightforward when complicated feature dependencies are introduced that cannot be described as a series of generative steps (Abney, 1997). This comparison will be more fully explored in my thesis.
    θ∗ = argmax_θ ∏_{j=1}^m Pr_θ(xj, y∗j)                                           (2)

       = argmax_θ Σ_{j=1}^m log Pr_θ(xj, y∗j)                                       (3)

       = argmax_θ Σ_{j=1}^m log [ exp(θ · f(xj, y∗j)) / Σ_{x′,y′} exp(θ · f(x′, y′)) ]      (by Eq. 1)   (4)

       = argmax_θ [ Σ_{j=1}^m θ · f(xj, y∗j) − m log Z(θ) ]                         (5)

where Z(θ) denotes the partition function Σ_{x′,y′} exp(θ · f(x′, y′)), the denominator in Equation 4.
This function can be shown to be concave, with a single global maximum.3
Commonly the conditional probability of the label given the observation is computed; for one
example (x, y) it is given by:
    Pr_θ(y | x) = exp(θ · f(x, y)) / Σ_{y′} exp(θ · f(x, y′))                       (6)
Why is this quantity important? A second advantage of log-linear models is that they admit relatively
efficient conditional parameter estimation. Conditional training seeks to give maximum probability
(or weight, or score) to the correct label y∗ for a given observation x (known
from an annotated corpus), at the expense of competing possible labels. This kind of training can
be contrasted with joint maximum likelihood training (Equation 4), which seeks to maximize the
total probability of training pairs (xj , y∗j ).
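For instance, the conditional probability of Equation 6 is just a softmax over the label scores θ · f(x, y′). A minimal sketch, with made-up feature vectors:

```python
import numpy as np

# Equation 6 for a single observation x: a softmax over label scores.
# The feature rows and weights are invented for illustration.
def conditional_probs(theta, feats):
    """feats: array of shape (|Y|, d); row y holds f(x, y)."""
    scores = feats @ theta
    scores = scores - scores.max()   # subtract the max for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()         # normalize over labels only

feats = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
theta = np.array([0.3, -0.1])
p = conditional_probs(theta, feats)  # Pr_theta(y | x) for each label y
```

Subtracting the maximum score before exponentiating leaves the result unchanged (the factor cancels in the ratio) but avoids overflow for large weights.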
Conditional training for log-linear models amounts to maximizing the product of conditional
probabilities (as in Equation 6):4
    θ∗ = argmax_θ ∏_{j=1}^m Pr_θ(y∗j | xj)                                          (7)

       = argmax_θ Σ_{j=1}^m log Pr_θ(y∗j | xj)                                      (8)

       = argmax_θ [ Σ_{j=1}^m θ · f(xj, y∗j) − Σ_{j=1}^m log Σ_{y′} exp(θ · f(xj, y′)) ]    (9)

where the inner sum over y′ in the second term is denoted Z(θ, xj).
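The objective in Equation 9 is straightforward to evaluate. The sketch below uses invented feature vectors, with log Z(θ, xj) computed stably via the log-sum-exp trick:

```python
import numpy as np

# Conditional log-likelihood of Equation 9 for a toy training set.
# F[j] holds rows f(x_j, y') for every candidate label y'; gold[j] is y*_j.
def conditional_log_likelihood(theta, F, gold):
    total = 0.0
    for feats, y_star in zip(F, gold):
        scores = feats @ theta
        m = scores.max()                          # log-sum-exp for stability
        log_Z = m + np.log(np.exp(scores - m).sum())
        total += scores[y_star] - log_Z           # one term of Equation 9
    return total

F = [np.array([[1.0, 0.0], [0.0, 1.0]]),          # f(x_1, y') rows
     np.array([[1.0, 1.0], [0.0, 0.0]])]          # f(x_2, y') rows
gold = [0, 1]                                     # indices of y*_j
ll = conditional_log_likelihood(np.zeros(2), F, gold)
```

With all weights zero, every label is equally likely, so each of the two examples contributes log(1/2) to the objective.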
The argument for conditional training is that it is unnecessary to model x, the observed part
of the data, since (even at test time) it is given; we need only be able to determine which y fits x
best, and predicting x is extraneous effort imposed on the model.5

3In Section 3 I derive the Hessian and demonstrate that the function is everywhere concave.
4A discussion of conditional likelihood as a statistical estimator is given by Johnson et al. (1999). An earlier discussion is Nadas (1983).
There is also a practical reason for using the conditional likelihood Pr_θ(y∗ | x) rather than Pr_θ(x, y∗).
Consider the terms marked in Equations 4 and 9 as Z(θ) and Z(θ, xj), respectively; each is
known as a partition function. In ML training (Equation 4), computing it involves summing over
the entire space of observation/label pairs predicted by the model, while in conditional training
(Equation 9) one must sum only over the labels for each example xj. The former is in practice much
more computationally expensive, even with approximations.
Both optimization problems are unconstrained and convex, making them relatively straightfor-
ward to solve (modulo the partition function) using iterative hill-climbing methods. Iterative scaling
(Darroch and Ratcliff, 1972) is one approach; Malouf (2002) describes how Newtonian numerical
optimization algorithms using the gradient of the function can be used (see Section 3).6
With log-linear models of labeled data, the numerical part of learning is straightforward; the
difficult part is feature selection. Using too many features is likely to result in over-fitting of the
training data. One approach is to use features that have frequency in training data above some
threshold; Della Pietra et al. (1997) use a more sophisticated feature selection algorithm. Another
way of avoiding over-fitting is to smooth the model. This is often done using a prior (Khudanpur,
1995; Chen and Rosenfeld, 2000). One recent approach to smoothing, by Kazama and Tsujii (2003)
(extending Khudanpur (1995)), uses soft constraints and a penalty function (which
can be interpreted as a prior on parameter values); this has the side effect of driving many parameters
to zero,7 resulting in automatic feature selection.8 Of course, introducing priors requires
introducing more parameters (and perhaps auxiliary learning algorithms to estimate them). This
is an active area of research.
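As one illustration of smoothing with a prior, a Gaussian prior with variance σ² amounts to subtracting ‖θ‖²/(2σ²) from the log-likelihood objective (Chen and Rosenfeld, 2000). The function names below are hypothetical, and the flat likelihood is a stand-in used only to isolate the penalty:

```python
import numpy as np

# A Gaussian prior on the weights corresponds to subtracting a quadratic
# penalty from the log-likelihood objective; sigma2 (the prior variance)
# controls how strongly weights are pulled toward zero.
def penalized(theta, ll_fn, sigma2=1.0):
    return ll_fn(theta) - theta @ theta / (2.0 * sigma2)

flat = lambda th: 0.0                          # hypothetical flat likelihood
theta = np.array([1.0, -2.0])
strong = penalized(theta, flat, sigma2=0.5)    # heavy smoothing
weak = penalized(theta, flat, sigma2=10.0)     # light smoothing
```

Smaller σ² penalizes the same weight vector more heavily, trading training-set fit for robustness to over-fitting.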
Log-linear models have been applied to many supervised problems in NLP:
• Sentence boundary detection (Reynar and Ratnaparkhi, 1997)
• Parsing (Ratnaparkhi et al., 1994b; Ratnaparkhi, 1997a; Abney, 1997; Riezler et al., 2000;
Johnson and Riezler, 2000; Johnson, 2001)9
5Conditional training encompasses a number of approaches to parameter estimation and originated in the auto-
matic speech recognition community. In particular, maximum mutual information (MMI) estimation (Bahl et al.,
1986) can be shown to be equivalent to conditional maximum likelihood estimation. Another approach is to directly
minimize classification error (Juang and Katagiri, 1992).
6The learning task for log-linear models is usually described as follows. Given a vector of observed feature values
f̃, we seek model parameters θ that yield these expectations. That is, we wish the following constraint to be met for
all feature indices i:

    Σ_{x,y} Pr_θ(y | x) fi(x, y) = f̃i.                                             (10)
Now, there may be many solutions to this system of equations. The criterion used to decide among them is maximum
entropy — we seek the parameters that are closest to uninformative (uniform), subject to the expectation constraints.
This is motivated by an Occam’s razor argument that advocates choosing the simplest possible model that explains
the data; entropy can be roughly regarded as a measure of simplicity of the model.
It turns out that the dual of this constrained optimization problem has the form of the maximum likelihood problems
given in Equations 4 and 9, depending on whether the entropy to be maximized is the joint entropy H(Pr_θ(X, Y)) or the
total entropy of the conditional distributions for each xj, Σ_{j=1}^m H(Pr_θ(Y | xj)) (Berger et al., 1996). See Section 2
for a derivation of the dual in the latter case.
7A zero-weighted feature will have no effect on the probability of examples where the feature is present; see
Equation 1.
8To my knowledge, this technique has not been carefully tested as a feature selection technique per se, though
M. Osborne (personal communication) reports that its performance as such is reasonable in comparison with other
feature selection algorithms.
9Charniak (2000) used log-linear-like features to build a state-of-the-art parser, but without training via the maximum-entropy criterion.
• Priors over grammar rules (Eisner, 2001)
• Language modeling for automatic speech recognition (Rosenfeld, 1994; Khudanpur and Wu,
2000)
• Prepositional phrase attachment (Ratnaparkhi et al., 1994a)
• Part-of-speech tagging (Ratnaparkhi, 1996)
• Named entity recognition (Borthwick et al., 1998)
• Machine translation (Berger et al., 1996; Och and Ney, 2002)
2 Maximum Conditional Likelihood as the Dual of Maximum Entropy
Berger et al. (1996) did not explicitly derive the relationship between the maximum entropy es-
timation problem and the maximum likelihood problem; they gave only a sketch. Here I give a
derivation that shows how maximum conditional likelihood estimation is the dual of the maximum
entropy problem.
Notation is shown in Table 1.
The maximum entropy problem we seek to solve is:

    maximize_p   Σ_x p̃(x) H(p(Y | x))                                              (11)

    subject to   Σ_{(x,y)} p̃(x, y) fj(x, y) = Σ_x p̃(x) Σ_y p(y | x) fj(x, y),  ∀j  (12)

                 Σ_y p(y | x) = 1,  ∀x                                              (13)

                 p(y | x) ≥ 0,  ∀x, y                                               (14)

In short, we seek to maximize the empirically-averaged entropy over labels given each example x,
subject to the constraint that, on average over the examples, the model's feature expectations
match the empirical feature expectations.
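This program can be checked numerically on a toy instance with a single observation x. The feature and target expectation below are invented, and SciPy's SLSQP solver stands in for a proper maxent fitter; the point is that the constrained entropy maximizer comes out log-linear:

```python
import numpy as np
from scipy.optimize import minimize

# Toy instance of Equations 11-14: one observation x, three labels,
# a single feature f(x, y) = y, and a made-up target expectation 0.8.
f = np.array([0.0, 1.0, 2.0])
target = 0.8

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)          # guard the log at the boundary
    return float(np.sum(p * np.log(p)))

cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},   # Eq. 13
        {"type": "eq", "fun": lambda p: p @ f - target}]  # Eq. 12
res = minimize(neg_entropy, np.full(3, 1 / 3), method="SLSQP",
               bounds=[(0.0, 1.0)] * 3, constraints=cons)
p = res.x

# Log-linearity check: log p(y) should be affine in f(y), so the slope
# (log p(y+1) - log p(y)) / (f(y+1) - f(y)) is a constant (namely theta).
slopes = np.diff(np.log(p)) / np.diff(f)
```

The recovered slopes agree, exactly as the derivation below predicts: the entropy maximizer is exp(θ · f(x, y)) up to normalization.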
X    the space of unlabeled examples
Y    the space of labels
Y    a random variable over labels
p̃    the observed (empirical) distribution over X × Y
     (we also use p̃ to denote marginals over X)
p    the desired distribution over X × Y that we would like to estimate;
     here, p takes the form of a conditional distribution, p(y | x)
fj   the jth feature function, X × Y → R+

Table 1: Notation.
We define the conditional entropy as

    H(Y | x) = −Σ_y p(y | x) log p(y | x) = E_{p(Y|x)}[−log p(Y | x)]               (15)
Let θj be a Lagrange multiplier for constraint j. We can write the above as an unconstrained
problem, where θj is the cost of violating the jth constraint in line 12. Let μx be the Lagrange multiplier
corresponding to the constraint for x in line 13. Note that we have ignored the nonnegativity
constraints in line 14; they will ultimately fall out of the derivation.
    min_{μ,θ} max_p  Σ_x p̃(x) H(p(Y | x))
                     − Σ_j θj [ Σ_{(x,y)} p̃(x, y) fj(x, y) − Σ_x p̃(x) Σ_y p(y | x) fj(x, y) ]
                     − Σ_x μx ( Σ_y p(y | x) − 1 )                                  (16)
Next we hold the Lagrange multipliers θj and μx constant and maximize Equation 16 with
respect to p(y | x). This will give a closed-form solution for p(y | x) in terms of the Lagrange
multipliers.10
To solve Equation 16 with respect to p(y | x), we take the derivative with respect to p(y | x).
(Note that there are many x and many y; the form of the derivative with respect to each will be
the same, with different bindings for x, y.):
    ∂/∂p(y | x) = −p̃(x)(1 + log p(y | x)) + Σ_j θj p̃(x) fj(x, y) − μx              (17)
Setting this equal to 0 and solving for p(y | x), we have:

    −p̃(x)(1 + log p(y | x)) + Σ_j θj p̃(x) fj(x, y) − μx = 0

    p̃(x)(1 + log p(y | x)) = Σ_j θj p̃(x) fj(x, y) − μx

    1 + log p(y | x) = [ Σ_j θj p̃(x) fj(x, y) − μx ] / p̃(x)

    log p(y | x) = θ · f(x, y) − μx/p̃(x) − 1

    p(y | x) = exp(θ · f(x, y)) / exp(1 + μx/p̃(x))
The remarkable thing to notice at this point is that the optimal p(y | x) takes an exponential
form. Note that the numerator is exactly the numerator in Equation 6. The denominator is
a function only of μx; to enforce that the sum-to-one constraints (Equation 13) are met, we simply
let, for all x,

    Σ_{y′} exp(θ · f(x, y′)) = exp(1 + μx/p̃(x))                                    (18)

10The theory underlying these moves is hard to give at the right level of detail. Like most other NLP-oriented
expositions of log-linear models, this one will avoid that theory. The most rigorous treatment of this area is Bertsekas
(1999). The key point from the theory is that a maximum in Equation 11 will always correspond to a maximum
in Equation 16 (an unconstrained problem with more variables). For this problem, we can move the optimization
burden onto the Lagrange multipliers and forget about p(y | x), because the latter have a closed form in terms of the
former.
(Note that the constraints in line 14 are automatically satisfied, because the exponential function
takes only positive values.) We see now that the solution to the maximum entropy problem will always be a
log-linear model. This is the source of the name, “maximum entropy model,” though I maintain
that the name is not always appropriate because there are alternative training objectives for these
models.
Now we substitute this form back into Equation 16. We remove the µx terms, since we have
guaranteed by the form of the model that the sum-to-one constraints (Equation 13) are satisfied:
    Σ_x p̃(x) H(p(Y | x)) − Σ_j θj [ Σ_{(x,y)} p̃(x, y) fj(x, y) − Σ_x p̃(x) Σ_y p(y | x) fj(x, y) ]

    = −Σ_x p̃(x) Σ_y p(y | x) log p(y | x)
      − Σ_j θj [ Σ_{(x,y)} p̃(x, y) fj(x, y) − Σ_x p̃(x) Σ_y p(y | x) fj(x, y) ]

    = −Σ_x p̃(x) Σ_y p(y | x) log p(y | x)
      − Σ_{(x,y)} p̃(x, y) Σ_j θj fj(x, y) + Σ_x p̃(x) Σ_y p(y | x) Σ_j θj fj(x, y)

    = −Σ_x p̃(x) Σ_y p(y | x) log p(y | x)
      − Σ_{(x,y)} p̃(x, y) (θ · f(x, y)) + Σ_x p̃(x) Σ_y p(y | x) (θ · f(x, y))

Substituting log p(y | x) = θ · f(x, y) − log Z(θ, x), where Z(θ, x) = Σ_{y′} exp(θ · f(x, y′)):

    = −Σ_x p̃(x) Σ_y p(y | x) [ θ · f(x, y) − log Z(θ, x) ]
      − Σ_{(x,y)} p̃(x, y) (θ · f(x, y)) + Σ_x p̃(x) Σ_y p(y | x) (θ · f(x, y))

    = −Σ_x p̃(x) Σ_y p(y | x) (θ · f(x, y)) + Σ_x p̃(x) Σ_y p(y | x) log Z(θ, x)
      − Σ_{(x,y)} p̃(x, y) (θ · f(x, y)) + Σ_x p̃(x) Σ_y p(y | x) (θ · f(x, y))

The first and last terms cancel:

    = Σ_x p̃(x) Σ_y p(y | x) log Z(θ, x) − Σ_{(x,y)} p̃(x, y) (θ · f(x, y))

    = Σ_x p̃(x) log Z(θ, x) − Σ_{(x,y)} p̃(x, y) (θ · f(x, y))

    = −Σ_{(x,y)} p̃(x, y) log [ exp(θ · f(x, y)) / Z(θ, x) ]
This is the cross-entropy, and we seek to minimize it. Cross-entropy equals the negative
conditional log-likelihood (Equation 9) times a constant factor 1/m (where m is the size of the data),
so minimizing one is equivalent to minimizing the other.
3 A Bit About Training
I noted previously that Newtonian methods can be used to carry out maximum conditional likeli-
hood training of log-linear models. First-order Newtonian methods operate by iteratively improving
parameter values. Given parameter values θ(i), the objective L(θ(i)) and the gradient ∇L(θ(i)) are
computed. The optimization proceeds by letting θ(i+1) = θ(i) + α∇L(θ(i)), i.e., moving in the direction
of the gradient. (α, the step size, is determined using an auxiliary procedure that looks for the step
size that will increase L as much as possible.) Modern Newtonian methods may use an additional
term in the update, taking advantage of exact or approximate second-derivative information. Since
log-linear models usually have many parameters, it is not typically tractable to compute the second
derivative exactly (though we will see that it has a very simple form).
Here I derive the gradient and show that the function is everywhere concave. We start with the
objective to be maximized (Equation 9, repeated below):
    L(θ) = Σ_{j=1}^m θ · f(xj, y∗j) − Σ_{j=1}^m log Σ_{y′} exp(θ · f(xj, y′))       (19)

         = Σ_{j=1}^m Σ_k θk fk(xj, y∗j) − Σ_{j=1}^m log Σ_{y′} exp( Σ_k θk fk(xj, y′) )     (20)
Taking the derivative with respect to θℓ, we have:

    ∂L/∂θℓ = Σ_{j=1}^m fℓ(xj, y∗j)
             − Σ_{j=1}^m [ Σ_{y′} exp(Σ_k θk fk(xj, y′)) fℓ(xj, y′) ] / [ Σ_{y′} exp(Σ_k θk fk(xj, y′)) ]   (21)

           = Σ_{j=1}^m ( fℓ(xj, y∗j) − [ Σ_{y′} exp(Σ_k θk fk(xj, y′)) fℓ(xj, y′) ] / [ Σ_{y′} exp(Σ_k θk fk(xj, y′)) ] )   (22)

           = Σ_{j=1}^m ( fℓ(xj, y∗j) − Σ_{y′} Pr_θ(y′ | xj) fℓ(xj, y′) )            (23)

           = Σ_{j=1}^m ( fℓ(xj, y∗j) − E_{Pr_θ(Y|xj)}[fℓ(xj, Y)] )                  (24)

Therefore,

    ∇L(θ) = Σ_{j=1}^m ( f(xj, y∗j) − E_{Pr_θ(Y|xj)}[f(xj, Y)] )                     (25)
An important note: when the constraints in line 12 are met, this gradient is all zeroes. The
gradient is no more difficult to compute than the feature expectations; the time required is
O(mn|Y|), where n is the length of θ. In some cases, such as conditional random
fields (Lafferty et al., 2001), Y is a set whose cardinality grows exponentially in the length of the
input x. (In Lafferty et al.'s CRFs, x is a sequence.) In such cases a dynamic programming algorithm
can often be applied to compute the expectations efficiently.
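When Y is small enough to enumerate directly, the gradient of Equation 25 is easy to implement. The sketch below runs plain gradient ascent with a fixed step size on invented data; a real implementation would use a line search or a quasi-Newton method, as discussed above:

```python
import numpy as np

# Plain gradient ascent on the conditional log-likelihood, using the
# gradient of Equation 25: observed features minus expected features.
# The training data (feature rows per candidate label) are invented.
def grad(theta, F, gold):
    g = np.zeros_like(theta)
    for feats, y_star in zip(F, gold):
        scores = feats @ theta
        p = np.exp(scores - scores.max())
        p /= p.sum()                      # Pr_theta(y' | x_j) for each y'
        g += feats[y_star] - p @ feats    # f(x_j, y*_j) - E[f(x_j, Y)]
    return g

F = [np.array([[1.0, 0.0], [0.0, 1.0]]),   # f(x_1, y') rows
     np.array([[1.0, 1.0], [1.0, 0.0]])]   # f(x_2, y') rows
gold = [0, 0]                              # indices of y*_j
theta = np.zeros(2)
for _ in range(200):
    theta += 0.5 * grad(theta, F, gold)    # fixed step size for the sketch
```

At a maximum the gradient vanishes, which is exactly the expectation-matching condition noted above.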
Second derivative. We can now take the derivative of an element of the gradient with respect
to θp. We begin with Equation 22 for convenience:

    ∂²L/∂θℓ∂θp = −Σ_{j=1}^m [ ( Σ_{y′} e^{θ·f(xj,y′)} ) ( Σ_{y′} fℓ(xj, y′) fp(xj, y′) e^{θ·f(xj,y′)} )
                              − ( Σ_{y′} fℓ(xj, y′) e^{θ·f(xj,y′)} ) ( Σ_{y′} fp(xj, y′) e^{θ·f(xj,y′)} ) ]
                            / ( Σ_{y′} e^{θ·f(xj,y′)} )²                            (26)

               = −Σ_{j=1}^m ( E_{Pr_θ(Y|xj)}[fℓ(xj, Y) fp(xj, Y)] − E_{Pr_θ(Y|xj)}[fℓ(xj, Y)] E_{Pr_θ(Y|xj)}[fp(xj, Y)] )

               = −Σ_{j=1}^m cov_{Pr_θ(Y|xj)}( fℓ(xj, Y), fp(xj, Y) )                (27)
where cov denotes covariance. The Hessian matrix is therefore the negated sum of covariance
matrices of the feature values under the model. Covariance matrices are always positive
semi-definite, so the Hessian is negative semi-definite, and the function we are maximizing
is concave everywhere.
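The relationship in Equation 27 can be verified numerically by comparing the covariance-based Hessian against finite differences of the gradient (Equation 25); the data below are invented:

```python
import numpy as np

# Numerical check of Equation 27: the Hessian of the conditional
# log-likelihood equals the negated sum, over examples, of the
# covariance matrices of the feature values under the model.
def probs(theta, feats):
    s = feats @ theta
    p = np.exp(s - s.max())
    return p / p.sum()

def grad(theta, F, gold):
    # Equation 25: observed features minus expected features
    g = np.zeros_like(theta)
    for feats, y in zip(F, gold):
        g += feats[y] - probs(theta, feats) @ feats
    return g

def hessian(theta, F):
    # Equation 27: minus the per-example feature covariance, summed
    H = np.zeros((len(theta), len(theta)))
    for feats in F:
        p = probs(theta, feats)
        mean = p @ feats
        H -= feats.T @ (feats * p[:, None]) - np.outer(mean, mean)
    return H

F = [np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
     np.array([[0.0, 2.0], [1.0, 0.0]])]
gold = [0, 1]
theta = np.array([0.2, -0.3])

H = hessian(theta, F)
eps = 1e-6
H_fd = np.stack([(grad(theta + eps * e, F, gold) - grad(theta - eps * e, F, gold)) / (2 * eps)
                 for e in np.eye(2)])
```

Note that the linear term of the objective drops out of the second derivative, so the Hessian does not depend on the gold labels, consistent with Equation 27.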
References
S. P. Abney. Stochastic attribute-value grammars. Computational Linguistics, 23(4):597–617, 1997.
L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer. Maximum mutual information estimation
of hidden Markov model parameters for speech recognition. In Proc. of ICASSP, 1986.
A. Berger. A brief maxent tutorial, 1996.
A. Berger, S. Della Pietra, and V. Della Pietra. A maximum entropy approach to natural language
processing. Computational Linguistics, 22(1), 1996.
D. P. Bertsekas. Nonlinear Programming. Athena Scientific, second edition, 1999.
A. Borthwick, J. Sterling, E. Agichtein, and R. Grishman. Exploiting diverse knowledge sources
via maximum entropy in named entity recognition. In Proc. of VLC, 1998.
E. Charniak. A maximum-entropy-inspired parser. In Proc. of NAACL, 2000.
S. Chen and R. Rosenfeld. A survey of smoothing techniques for ME models. IEEE Transactions
on Speech and Audio Processing, 8(1):37–50, 2000.
J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. Annals of
Mathematical Statistics, 43:1470–80, 1972.
S. Della Pietra, V. J. Della Pietra, and J. D. Lafferty. Inducing features of random fields. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, 1997.
J. Eisner. Smoothing a Probabilistic Lexicon via Syntactic Transformations. PhD thesis, University
of Pennsylvania, 2001.
M. Johnson. Joint and conditional estimation of tagging and parsing models. In Proc. of ACL,
2001.
M. Johnson, S. Geman, S. Canon, Z. Chi, and S. Riezler. Estimators for stochastic “unification-
based” grammars. In Proc. of ACL, 1999.
M. Johnson and S. Riezler. Exploiting auxiliary distributions in stochastic unification-based gram-
mars. In Proc. of NAACL, 2000.
B. H. Juang and S. Katagiri. Discriminative learning for minimum error classification. IEEE
Transactions on Acoustics, Speech, and Signal Processing, 40(12):3043–54, 1992.
J. Kazama and J. Tsujii. Evaluation and extension of maximum entropy models with inequality
constraints. In Proc. of EMNLP, 2003.
S. Khudanpur. A method of ME estimation with relaxed constraints. In Proc. of JHU Language
Modeling Workshop, 1995.
S. Khudanpur and J. Wu