Log-Linear Models
Noah A. Smith∗
Department of Computer Science /
Center for Language and Speech Processing
Johns Hopkins University
nasmith@cs.jhu.edu
December 2004
Abstract
This is yet another introduction to log-linear (“maximum entropy”) models for NLP prac-
titioners, in the spirit of Berger (1996) and Ratnaparkhi (1997b). The derivations here are
similar to Berger’s, but more details are filled in and some errors are corrected. I do not address
iterative scaling (Darroch and Ratcliff, 1972), but rather give derivations of the gradient and
Hessian of the dual objective function (conditional likelihood).
Note: This is a draft; please contact the author if you have comments, and do not cite or
circulate this document.
1 Log-linear Models
Log-linear models1 have become a widely-used tool in NLP classification tasks (Berger et al.,
1996; Ratnaparkhi, 1998). Log-linear models assign joint probabilities to observation/label pairs
(x, y) ∈ X × Y as follows:

    Pr_θ(x, y) = exp(θ · f(x, y)) / Σ_{x′,y′} exp(θ · f(x′, y′))                    (1)
where θ is a real-valued vector of feature weights and f is a function that maps pairs (x, y) to a
nonnegative real-valued feature vector. These features can take any form; in particular, unlike
directed generative models (such as HMMs and PCFGs), the features may overlap, predicting parts
of the data more than once.2 Each feature fi has an associated weight θi.
Maximum likelihood parameter estimation (training) for such a model, with a set of labeled
examples, amounts to solving the following optimization problem. Let {(x1, y∗1), (x2, y∗2), ..., (xm, y∗m)}
be the labeled training set.

∗This document is a revised version of portions of the author's 2004 thesis research proposal, "Discovering grammatical structure in unannotated text: implicit negative evidence and dynamic feature selection."
1Such models have many names, including maximum-entropy models, exponential models, and Gibbs models; Markov random fields are structured log-linear models, and conditional random fields (Lafferty et al., 2001) are Markov random fields with a specific training criterion.
2The ability to handle arbitrary, overlapping features is an important advantage that log-linear models have over directed generative models (like HMMs and PCFGs). Importantly, for the latter type of model, maximum likelihood estimation is not straightforward when complicated feature dependencies are introduced that cannot be described as a series of generative steps (Abney, 1997). This comparison will be more fully explored in my thesis.
    θ∗ = argmax_θ ∏_{j=1}^m Pr_θ(xj, y∗j)                                           (2)

       = argmax_θ Σ_{j=1}^m log Pr_θ(xj, y∗j)                                       (3)

       = argmax_θ Σ_{j=1}^m log [ exp(θ · f(xj, y∗j)) / Σ_{x′,y′} exp(θ · f(x′, y′)) ]      (by Eq. 1)   (4)

       = argmax_θ [ Σ_{j=1}^m θ · f(xj, y∗j) − m log Z(θ) ]                         (5)

where Z(θ) denotes the partition function Σ_{x′,y′} exp(θ · f(x′, y′)), the denominator in Equation 4.
This function can be shown to be concave, with a single global maximum.3
Commonly the conditional probability of the label given the observation is computed; for one
example (x, y) it is given by:
    Pr_θ(y | x) = exp(θ · f(x, y)) / Σ_{y′} exp(θ · f(x, y′))                       (6)
Why is this quantity important? A second advantage of log-linear models is that they admit relatively
efficient conditional parameter estimation. Conditional training seeks to give maximum probability
(or weight, or score) to the correct label y∗ for a given observation x (known
from an annotated corpus), at the expense of competing possible labels. This kind of training can
be contrasted with joint maximum likelihood training (Equation 4), which seeks to maximize the
total probability of training pairs (xj , y∗j ).
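For instance, the conditional probability of Equation 6 is just a softmax over the label scores θ · f(x, y′). A minimal sketch, with made-up feature vectors:

```python
import numpy as np

# Equation 6 for a single observation x: a softmax over label scores.
# The feature rows and weights are invented for illustration.
def conditional_probs(theta, feats):
    """feats: array of shape (|Y|, d); row y holds f(x, y)."""
    scores = feats @ theta
    scores = scores - scores.max()   # subtract the max for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()         # normalize over labels only

feats = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
theta = np.array([0.3, -0.1])
p = conditional_probs(theta, feats)  # Pr_theta(y | x) for each label y
```

Subtracting the maximum score before exponentiating leaves the result unchanged (the factor cancels in the ratio) but avoids overflow for large weights.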
Conditional training for log-linear models amounts to maximizing the product of conditional
probabilities (as in Equation 6):4
    θ∗ = argmax_θ ∏_{j=1}^m Pr_θ(y∗j | xj)                                          (7)

       = argmax_θ Σ_{j=1}^m log Pr_θ(y∗j | xj)                                      (8)

       = argmax_θ [ Σ_{j=1}^m θ · f(xj, y∗j) − Σ_{j=1}^m log Σ_{y′} exp(θ · f(xj, y′)) ]    (9)

where the inner sum over y′ in the second term is denoted Z(θ, xj).
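The objective in Equation 9 is straightforward to evaluate. The sketch below uses invented feature vectors, with log Z(θ, xj) computed stably via the log-sum-exp trick:

```python
import numpy as np

# Conditional log-likelihood of Equation 9 for a toy training set.
# F[j] holds rows f(x_j, y') for every candidate label y'; gold[j] is y*_j.
def conditional_log_likelihood(theta, F, gold):
    total = 0.0
    for feats, y_star in zip(F, gold):
        scores = feats @ theta
        m = scores.max()                          # log-sum-exp for stability
        log_Z = m + np.log(np.exp(scores - m).sum())
        total += scores[y_star] - log_Z           # one term of Equation 9
    return total

F = [np.array([[1.0, 0.0], [0.0, 1.0]]),          # f(x_1, y') rows
     np.array([[1.0, 1.0], [0.0, 0.0]])]          # f(x_2, y') rows
gold = [0, 1]                                     # indices of y*_j
ll = conditional_log_likelihood(np.zeros(2), F, gold)
```

With all weights zero, every label is equally likely, so each of the two examples contributes log(1/2) to the objective.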
The argument for conditional training is that it is unnecessary to model x, the observed part
of the data, since (even at test time) it is given; we need only be able to determine which y fits x
best, and predicting x is extraneous effort imposed on the model.5

3In Section 3 I derive the Hessian and demonstrate that the function is everywhere concave.
4A discussion of conditional likelihood as a statistical estimator is given by Johnson et al. (1999). An earlier discussion is Nadas (1983).
There is also a practical reason for using the conditional likelihood Pr_θ(y∗ | x) rather than Pr_θ(x, y∗).
Consider the terms marked in Equations 4 and 9 as Z(θ) and Z(θ, xj), respectively; each is
known as a partition function. In ML training (Equation 4), computing it involves summing over
the entire space of observation/label pairs predicted by the model, while in conditional training
(Equation 9) one must sum only over the labels for each example xj. The former is in practice much
more computationally expensive, even with approximations.
Both optimization problems are unconstrained and convex, making them relatively straightfor-
ward to solve (modulo the partition function) using iterative hill-climbing methods. Iterative scaling
(Darroch and Ratcliff, 1972) is one approach; Malouf (2002) describes how Newtonian numerical
optimization algorithms using the gradient of the function can be used (see Section 3).6
With log-linear models of labeled data, the numerical part of learning is straightforward; the
difficult part is feature selection. Using too many features is likely to result in over-fitting of the
training data. One approach is to use features that have frequency in training data above some
threshold; Della Pietra et al. (1997) use a more sophisticated feature selection algorithm. Another
way of avoiding over-fitting is to smooth the model. This is often done using a prior (Khudanpur,
1995; Chen and Rosenfeld, 2000). One recent approach to smoothing, by Kazama and Tsujii (2003)
(extending Khudanpur (1995)), uses soft constraints and a penalty function (which
can be interpreted as a prior on parameter values); this has the side effect of driving many parameters
to zero,7 resulting in automatic feature selection.8 Of course, introducing priors requires
introducing more parameters (and perhaps auxiliary learning algorithms to estimate them). This
is an active area of research.
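As one illustration of smoothing with a prior, a Gaussian prior with variance σ² amounts to subtracting ‖θ‖²/(2σ²) from the log-likelihood objective (Chen and Rosenfeld, 2000). The function names below are hypothetical, and the flat likelihood is a stand-in used only to isolate the penalty:

```python
import numpy as np

# A Gaussian prior on the weights corresponds to subtracting a quadratic
# penalty from the log-likelihood objective; sigma2 (the prior variance)
# controls how strongly weights are pulled toward zero.
def penalized(theta, ll_fn, sigma2=1.0):
    return ll_fn(theta) - theta @ theta / (2.0 * sigma2)

flat = lambda th: 0.0                          # hypothetical flat likelihood
theta = np.array([1.0, -2.0])
strong = penalized(theta, flat, sigma2=0.5)    # heavy smoothing
weak = penalized(theta, flat, sigma2=10.0)     # light smoothing
```

Smaller σ² penalizes the same weight vector more heavily, trading training-set fit for robustness to over-fitting.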
Log-linear models have been applied to many supervised problems in NLP:
• Sentence boundary detection (Reynar and Ratnaparkhi, 1997)
• Parsing (Ratnaparkhi et al., 1994b; Ratnaparkhi, 1997a; Abney, 1997; Riezler et al., 2000;
Johnson and Riezler, 2000; Johnson, 2001)9
5Conditional training encompasses a number of approaches to parameter estimation and originated in the auto-
matic speech recognition community. In particular, maximum mutual information (MMI) estimation (Bahl et al.,
1986) can be shown to be equivalent to conditional maximum likelihood estimation. Another approach is to directly
minimize classification error (Juang and Katagiri, 1992).
6The learning task for log-linear models is usually described as follows. Given a vector of observed feature values
f̃, we seek model parameters θ that yield these expectations. That is, we wish the following constraint to be met for
all feature indices i:

    Σ_{x,y} Pr_θ(y | x) fi(x, y) = f̃i.                                             (10)
Now, there may be many solutions to this system of equations. The criterion used to decide among them is maximum
entropy — we seek the parameters that are closest to uninformative (uniform), subject to the expectation constraints.
This is motivated by an Occam’s razor argument that advocates choosing the simplest possible model that explains
the data; entropy can be roughly regarded as a measure of simplicity of the model.
It turns out that the dual of this constrained optimization problem has the form of the maximum likelihood problems
given in Equations 4 and 9, depending on whether the entropy to be maximized is the joint entropy H(Pr_θ(X, Y)) or the
total entropy of the conditional distributions for each xj, Σ_{j=1}^m H(Pr_θ(Y | xj)) (Berger et al., 1996). See Section 2
for a derivation of the dual in the latter case.
7A zero-weighted feature will have no effect on the probability of examples where the feature is present; see
Equation 1.
8To my knowledge, this technique has not been carefully tested as a feature selection technique per se, though
M. Osborne (personal communication) reports that its performance as such is reasonable in comparison with other
feature selection algorithms.
9Charniak (2000) used log-linear-like features to build a state-of-the-art parser, but without training via the maximum-entropy criterion.
• Priors over grammar rules (Eisner, 2001)
• Language modeling for automatic speech recognition (Rosenfeld, 1994; Khudanpur and Wu,
2000)
• Prepositional phrase attachment (Ratnaparkhi et al., 1994a)
• Part-of-speech tagging (Ratnaparkhi, 1996)
• Named entity recognition (Borthwick et al., 1998)
• Machine translation (Berger et al., 1996; Och and Ney, 2002)
2 Maximum Conditional Likelihood as the Dual of Maximum Entropy
Berger et al. (1996) did not explicitly derive the relationship between the maximum entropy es-
timation problem and the maximum likelihood problem; they gave only a sketch. Here I give a
derivation that shows how maximum conditional likelihood estimation is the dual of the maximum
entropy problem.
Notation is shown in Table 1.
The maximum entropy problem we seek to solve is:

    maximize_p   Σ_x p̃(x) H(p(Y | x))                                              (11)

    subject to   Σ_{(x,y)} p̃(x, y) fj(x, y) = Σ_x p̃(x) Σ_y p(y | x) fj(x, y),  ∀j  (12)

                 Σ_y p(y | x) = 1,  ∀x                                              (13)

                 p(y | x) ≥ 0,  ∀x, y                                               (14)

In short, we seek to maximize the empirically-averaged entropy over labels given each example x,
subject to the constraint that, on average over the examples, the model's feature expectations
match the empirical feature expectations.
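This program can be checked numerically on a toy instance with a single observation x. The feature and target expectation below are invented, and SciPy's SLSQP solver stands in for a proper maxent fitter; the point is that the constrained entropy maximizer comes out log-linear:

```python
import numpy as np
from scipy.optimize import minimize

# Toy instance of Equations 11-14: one observation x, three labels,
# a single feature f(x, y) = y, and a made-up target expectation 0.8.
f = np.array([0.0, 1.0, 2.0])
target = 0.8

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)          # guard the log at the boundary
    return float(np.sum(p * np.log(p)))

cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},   # Eq. 13
        {"type": "eq", "fun": lambda p: p @ f - target}]  # Eq. 12
res = minimize(neg_entropy, np.full(3, 1 / 3), method="SLSQP",
               bounds=[(0.0, 1.0)] * 3, constraints=cons)
p = res.x

# Log-linearity check: log p(y) should be affine in f(y), so the slope
# (log p(y+1) - log p(y)) / (f(y+1) - f(y)) is a constant (namely theta).
slopes = np.diff(np.log(p)) / np.diff(f)
```

The recovered slopes agree, exactly as the derivation below predicts: the entropy maximizer is exp(θ · f(x, y)) up to normalization.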
X    the space of unlabeled examples
Y    the space of labels
Y    a random variable over labels
p̃    the observed (empirical) distribution over X × Y
     (we also use p̃ to denote marginals over X)
p    the desired distribution over X × Y that we would like to estimate;
     here, p takes the form of a conditional distribution, p(y | x)
fj   the jth feature function, X × Y → R+

Table 1: Notation.
We define the conditional entropy as

    H(Y | x) = −Σ_y p(y | x) log p(y | x) = E_{p(Y|x)}[−log p(Y | x)]               (15)
Let θj be a Lagrange multiplier for constraint j. We can write the above as an unconstrained
problem, where θj is the cost of violating the jth constraint in line 12. Let μx be the Lagrange multiplier
corresponding to the constraint for x in line 13. Note that we have ignored the nonnegativity
constraints in line 14; they will ultimately fall out of the derivation.
    min_{μ,θ} max_p  Σ_x p̃(x) H(p(Y | x))
                     − Σ_j θj [ Σ_{(x,y)} p̃(x, y) fj(x, y) − Σ_x p̃(x) Σ_y p(y | x) fj(x, y) ]
                     − Σ_x μx ( Σ_y p(y | x) − 1 )                                  (16)
Next we hold the Lagrange multipliers θj and μx constant and maximize Equation 16 with
respect to p(y | x). This will give a closed-form solution for p(y | x) in terms of the Lagrange
multipliers.10
To solve Equation 16 with respect to p(y | x), we take the derivative with respect to p(y | x).
(Note that there are many x and many y; the form of the derivative with respect to each will be
the same, with different bindings for x, y.):
    ∂/∂p(y | x) = −p̃(x)(1 + log p(y | x)) + Σ_j θj p̃(x) fj(x, y) − μx              (17)
Setting this equal to 0 and solving for p(y | x), we have:

    −p̃(x)(1 + log p(y | x)) + Σ_j θj p̃(x) fj(x, y) − μx = 0

    p̃(x)(1 + log p(y | x)) = Σ_j θj p̃(x) fj(x, y) − μx

    1 + log p(y | x) = [ Σ_j θj p̃(x) fj(x, y) − μx ] / p̃(x)

    log p(y | x) = θ · f(x, y) − μx/p̃(x) − 1

    p(y | x) = exp(θ · f(x, y)) / exp(1 + μx/p̃(x))
The remarkable thing to notice at this point is that the optimal p(y | x) takes an exponential
form. Note that the numerator is exactly the numerator in Equation 6. The denominator is
a function only of μx; to enforce that the sum-to-one constraints (Equation 13) are met, we simply
let, for all x,

    Σ_{y′} exp(θ · f(x, y′)) = exp(1 + μx/p̃(x))                                    (18)

10The theory underlying these moves is hard to give at the right level of detail. Like most other NLP-oriented
expositions of log-linear models, this one will avoid that theory. The most rigorous treatment of this area is Bertsekas
(1999). The key point from the theory is that a maximum in Equation 11 will always correspond to a maximum
in Equation 16 (an unconstrained problem with more variables). For this problem, we can move the optimization
burden onto the Lagrange multipliers and forget about p(y | x), because the latter have a closed form in terms of the
former.
(Note that the constraints in line 14 are automatically satisfied, because the exponential function
takes only positive values.) We see now that the solution to the maximum entropy problem will always be a
log-linear model. This is the source of the name, “maximum entropy model,” though I maintain
that the name is not always appropriate because there are alternative training objectives for these
models.
Now we substitute this form back into Equation 16. We remove the µx terms, since we have
guaranteed by the form of the model that the sum-to-one constraints (Equation 13) are satisfied:
    Σ_x p̃(x) H(p(Y | x)) − Σ_j θj [ Σ_{(x,y)} p̃(x, y) fj(x, y) − Σ_x p̃(x) Σ_y p(y | x) fj(x, y) ]

    = −Σ_x p̃(x) Σ_y p(y | x) log p(y | x)
      − Σ_j θj [ Σ_{(x,y)} p̃(x, y) fj(x, y) − Σ_x p̃(x) Σ_y p(y | x) fj(x, y) ]

    = −Σ_x p̃(x) Σ_y p(y | x) log p(y | x)
      − Σ_{(x,y)} p̃(x, y) Σ_j θj fj(x, y) + Σ_x p̃(x) Σ_y p(y | x) Σ_j θj fj(x, y)

    = −Σ_x p̃(x) Σ_y p(y | x) log p(y | x)
      − Σ_{(x,y)} p̃(x, y) (θ · f(x, y)) + Σ_x p̃(x) Σ_y p(y | x) (θ · f(x, y))

Substituting log p(y | x) = θ · f(x, y) − log Z(θ, x), where Z(θ, x) = Σ_{y′} exp(θ · f(x, y′)):

    = −Σ_x p̃(x) Σ_y p(y | x) [ θ · f(x, y) − log Z(θ, x) ]
      − Σ_{(x,y)} p̃(x, y) (θ · f(x, y)) + Σ_x p̃(x) Σ_y p(y | x) (θ · f(x, y))

    = −Σ_x p̃(x) Σ_y p(y | x) (θ · f(x, y)) + Σ_x p̃(x) Σ_y p(y | x) log Z(θ, x)
      − Σ_{(x,y)} p̃(x, y) (θ · f(x, y)) + Σ_x p̃(x) Σ_y p(y | x) (θ · f(x, y))

The first and last terms cancel:

    = Σ_x p̃(x) Σ_y p(y | x) log Z(θ, x) − Σ_{(x,y)} p̃(x, y) (θ · f(x, y))

    = Σ_x p̃(x) log Z(θ, x) − Σ_{(x,y)} p̃(x, y) (θ · f(x, y))

    = −Σ_{(x,y)} p̃(x, y) log [ exp(θ · f(x, y)) / Z(θ, x) ]
This is the cross-entropy, and we seek to minimize it. Cross-entropy equals the negative
conditional log-likelihood (Equation 9) times a constant factor 1/m (where m is the size of the data),
so minimizing one is equivalent to minimizing the other.
3 A Bit About Training
I noted previously that Newtonian methods can be used to carry out maximum conditional likeli-
hood training of log-linear models. First-order Newtonian methods operate by iteratively improving
parameter values. Given parameter values θ(i), the objective L(θ(i)) and the gradient ∇L(θ(i)) are
computed. The optimization proceeds by letting θ(i+1) = θ(i) + α∇L(θ(i)), i.e., moving in the direction
of the gradient. (α, the step size, is determined using an auxiliary procedure that looks for the step
size that will increase L as much as possible.) Modern Newtonian methods may use an additional
term in the update, taking advantage of exact or approximate second-derivative information. Since
log-linear models usually have many parameters, it is not typically tractable to compute the second
derivative exactly (though we will see that it has a very simple form).
Here I derive the gradient and show that the function is everywhere concave. We start with the
objective to be maximized (Equation 9, repeated below):
    L(θ) = Σ_{j=1}^m θ · f(xj, y∗j) − Σ_{j=1}^m log Σ_{y′} exp(θ · f(xj, y′))       (19)

         = Σ_{j=1}^m Σ_k θk fk(xj, y∗j) − Σ_{j=1}^m log Σ_{y′} exp( Σ_k θk fk(xj, y′) )     (20)
Taking the derivative with respect to θℓ, we have:

    ∂L/∂θℓ = Σ_{j=1}^m fℓ(xj, y∗j)
             − Σ_{j=1}^m [ Σ_{y′} exp(Σ_k θk fk(xj, y′)) fℓ(xj, y′) ] / [ Σ_{y′} exp(Σ_k θk fk(xj, y′)) ]   (21)

           = Σ_{j=1}^m ( fℓ(xj, y∗j) − [ Σ_{y′} exp(Σ_k θk fk(xj, y′)) fℓ(xj, y′) ] / [ Σ_{y′} exp(Σ_k θk fk(xj, y′)) ] )   (22)

           = Σ_{j=1}^m ( fℓ(xj, y∗j) − Σ_{y′} Pr_θ(y′ | xj) fℓ(xj, y′) )            (23)

           = Σ_{j=1}^m ( fℓ(xj, y∗j) − E_{Pr_θ(Y|xj)}[fℓ(xj, Y)] )                  (24)

Therefore,

    ∇L(θ) = Σ_{j=1}^m ( f(xj, y∗j) − E_{Pr_θ(Y|xj)}[f(xj, Y)] )                     (25)
An important note: when the constraints in line 12 are met, this gradient is all zeroes. The
gradient is no more difficult to compute than the feature expectations; the time required is
O(mn|Y|), where n is the length of θ. In some cases, such as conditional random
fields (Lafferty et al., 2001), Y is a set whose cardinality grows exponentially in the length of the
input x. (In Lafferty et al.'s CRFs, x is a sequence.) In such cases a dynamic programming algorithm
can often be applied to compute the expectations efficiently.
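When Y is small enough to enumerate directly, the gradient of Equation 25 is easy to implement. The sketch below runs plain gradient ascent with a fixed step size on invented data; a real implementation would use a line search or a quasi-Newton method, as discussed above:

```python
import numpy as np

# Plain gradient ascent on the conditional log-likelihood, using the
# gradient of Equation 25: observed features minus expected features.
# The training data (feature rows per candidate label) are invented.
def grad(theta, F, gold):
    g = np.zeros_like(theta)
    for feats, y_star in zip(F, gold):
        scores = feats @ theta
        p = np.exp(scores - scores.max())
        p /= p.sum()                      # Pr_theta(y' | x_j) for each y'
        g += feats[y_star] - p @ feats    # f(x_j, y*_j) - E[f(x_j, Y)]
    return g

F = [np.array([[1.0, 0.0], [0.0, 1.0]]),   # f(x_1, y') rows
     np.array([[1.0, 1.0], [1.0, 0.0]])]   # f(x_2, y') rows
gold = [0, 0]                              # indices of y*_j
theta = np.zeros(2)
for _ in range(200):
    theta += 0.5 * grad(theta, F, gold)    # fixed step size for the sketch
```

At a maximum the gradient vanishes, which is exactly the expectation-matching condition noted above.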
Second derivative. We can now take the derivative of an element of the gradient with respect
to θp. We begin with Equation 22 for convenience:

    ∂²L/∂θℓ∂θp = −Σ_{j=1}^m [ ( Σ_{y′} e^{θ·f(xj,y′)} ) ( Σ_{y′} fℓ(xj, y′) fp(xj, y′) e^{θ·f(xj,y′)} )
                              − ( Σ_{y′} fℓ(xj, y′) e^{θ·f(xj,y′)} ) ( Σ_{y′} fp(xj, y′) e^{θ·f(xj,y′)} ) ]
                            / ( Σ_{y′} e^{θ·f(xj,y′)} )²                            (26)

               = −Σ_{j=1}^m ( E_{Pr_θ(Y|xj)}[fℓ(xj, Y) fp(xj, Y)] − E_{Pr_θ(Y|xj)}[fℓ(xj, Y)] E_{Pr_θ(Y|xj)}[fp(xj, Y)] )

               = −Σ_{j=1}^m cov_{Pr_θ(Y|xj)}( fℓ(xj, Y), fp(xj, Y) )                (27)
where cov denotes covariance. The Hessian matrix is therefore the negated sum of covariance
matrices of the feature values under the model. Covariance matrices are always positive
semi-definite, so the Hessian is negative semi-definite, and the function we are maximizing
is concave everywhere.
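The relationship in Equation 27 can be verified numerically by comparing the covariance-based Hessian against finite differences of the gradient (Equation 25); the data below are invented:

```python
import numpy as np

# Numerical check of Equation 27: the Hessian of the conditional
# log-likelihood equals the negated sum, over examples, of the
# covariance matrices of the feature values under the model.
def probs(theta, feats):
    s = feats @ theta
    p = np.exp(s - s.max())
    return p / p.sum()

def grad(theta, F, gold):
    # Equation 25: observed features minus expected features
    g = np.zeros_like(theta)
    for feats, y in zip(F, gold):
        g += feats[y] - probs(theta, feats) @ feats
    return g

def hessian(theta, F):
    # Equation 27: minus the per-example feature covariance, summed
    H = np.zeros((len(theta), len(theta)))
    for feats in F:
        p = probs(theta, feats)
        mean = p @ feats
        H -= feats.T @ (feats * p[:, None]) - np.outer(mean, mean)
    return H

F = [np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
     np.array([[0.0, 2.0], [1.0, 0.0]])]
gold = [0, 1]
theta = np.array([0.2, -0.3])

H = hessian(theta, F)
eps = 1e-6
H_fd = np.stack([(grad(theta + eps * e, F, gold) - grad(theta - eps * e, F, gold)) / (2 * eps)
                 for e in np.eye(2)])
```

Note that the linear term of the objective drops out of the second derivative, so the Hessian does not depend on the gold labels, consistent with Equation 27.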
References
S. P. Abney. Stochastic attribute-value grammars. Computational Linguistics, 23(4):597–617, 1997.
L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer. Maximum mutual information estimation
of hidden Markov model parameters for speech recognition. In Proc. of ICASSP, 1986.
A. Berger. A brief maxent tutorial, 1996.
A. Berger, S. Della Pietra, and V. Della Pietra. A maximum entropy approach to natural language
processing. Computational Linguistics, 22(1), 1996.
D. P. Bertsekas. Nonlinear Programming. Athena Scientific, second edition, 1999.
A. Borthwick, J. Sterling, E. Agichtein, and R. Grishman. Exploiting diverse knowledge sources
via maximum entropy in named entity recognition. In Proc. of VLC, 1998.
E. Charniak. A maximum-entropy-inspired parser. In Proc. of NAACL, 2000.
S. Chen and R. Rosenfeld. A survey of smoothing techniques for ME models. IEEE Transactions
on Speech and Audio Processing, 8(1):37–50, 2000.
J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. Annals of
Mathematical Statistics, 43:1470–80, 1972.
S. Della Pietra, V. J. Della Pietra, and J. D. Lafferty. Inducing features of random fields. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, 1997.
J. Eisner. Smoothing a Probabilistic Lexicon via Syntactic Transformations. PhD thesis, University
of Pennsylvania, 2001.
M. Johnson. Joint and conditional estimation of tagging and parsing models. In Proc. of ACL,
2001.
M. Johnson, S. Geman, S. Canon, Z. Chi, and S. Riezler. Estimators for stochastic “unification-
based” grammars. In Proc. of ACL, 1999.
M. Johnson and S. Riezler. Exploiting auxiliary distributions in stochastic unification-based gram-
mars. In Proc. of NAACL, 2000.
B. H. Juang and S. Katagiri. Discriminative learning for minimum error classification. IEEE
Transactions on Acoustics, Speech, and Signal Processing, 40(12):3043–54, 1992.
J. Kazama and J. Tsujii. Evaluation and extension of maximum entropy models with inequality
constraints. In Proc. of EMNLP, 2003.
S. Khudanpur. A method of ME estimation with relaxed constraints. In Proc. of JHU Language
Modeling Workshop, 1995.
S. Khudanpur and J. Wu