computer 30
COVER FE ATURE
Published by the IEEE Computer Society 0018-9162/09/$26.00 © 2009 IEEE
Such systems are particularly useful for entertainment
products such as movies, music, and TV shows. Many cus-
tomers will view the same movie, and each customer is
likely to view numerous different movies. Customers have
proven willing to indicate their level of satisfaction with
particular movies, so a huge volume of data is available
about which movies appeal to which customers. Com-
panies can analyze this data to recommend movies to
particular customers.
RecommendeR system stRategies
Broadly speaking, recommender systems are based
on one of two strategies. The content filtering approach
creates a profile for each user or product to characterize
its nature. For example, a movie profile could include at-
tributes regarding its genre, the participating actors, its
box office popularity, and so forth. User profiles might
include demographic information or answers provided
on a suitable questionnaire. The profiles allow programs
to associate users with matching products. Of course,
content-based strategies require gathering external infor-
mation that might not be available or easy to collect.
A known successful realization of content filtering is
the Music Genome Project, which is used for the Internet
radio service Pandora.com. A trained music analyst scores
M
odern consumers are inundated with
choices. Electronic retailers and content
providers offer a huge selection of prod-
ucts, with unprecedented opportunities
to meet a variety of special needs and
tastes. Matching consumers with the most appropriate
products is key to enhancing user satisfaction and loy-
alty. Therefore, more retailers have become interested in
recommender systems, which analyze patterns of user
interest in products to provide personalized recommenda-
tions that suit a user’s taste. Because good personalized
recommendations can add another dimension to the user
experience, e-commerce leaders like Amazon.com and
Netflix have made recommender systems a salient part
of their websites.
As the Netflix Prize competition has dem-
onstrated, matrix factorization models
are superior to classic nearest-neighbor
techniques for producing product recom-
mendations, allowing the incorporation
of additional information such as implicit
feedback, temporal effects, and confidence
levels.
Yehuda Koren, Yahoo Research
Robert Bell and Chris Volinsky, AT&T Labs—Research
MATRIX
FACTORIZATION
TECHNIQUES FOR
RECOMMENDER
SYSTEMS
31AuGuSt 2009
well-defined dimensions such as depth of character de-
velopment or quirkiness; or completely uninterpretable
dimensions. For users, each factor measures how much
the user likes movies that score high on the correspond-
ing movie factor.
Figure 2 illustrates this idea for a simplified example
in two dimensions. Consider two hypothetical dimen-
sions characterized as female- versus male-oriented and
serious versus escapist. The figure shows where several
well-known movies and a few fictitious users might fall on
these two dimensions. For this model, a user’s predicted
rating for a movie, relative to the movie’s average rating,
would equal the dot product of the movie’s and user’s lo-
cations on the graph. For example, we would expect Gus
to love Dumb and Dumber, to hate The Color Purple, and
to rate Braveheart about average. Note that some mov-
ies—for example, Ocean’s 11—and users—for example,
Dave—would be characterized as fairly neutral on these
two dimensions.
matRix factoRization methods
Some of the most successful realizations of latent factor
models are based on matrix factorization. In its basic form,
matrix factorization characterizes both items and users by
vectors of factors inferred from item rating patterns. High
correspondence between item and user factors leads to a
each song in the Music Genome Project
based on hundreds of distinct musical
characteristics. These attributes, or genes,
capture not only a song’s musical identity
but also many significant qualities that are
relevant to understanding listeners’ musi-
cal preferences.
An alternative to content filtering relies
only on past user behavior—for example,
previous transactions or product ratings—
without requiring the creation of explicit
profiles. This approach is known as col-
laborative filtering, a term coined by the
developers of Tapestry, the first recom-
mender system.1 Collaborative filtering
analyzes relationships between users and
interdependencies among products to
identify new user-item associations.
A major appeal of collaborative fil-
tering is that it is domain free, yet it can
address data aspects that are often elusive
and difficult to profile using content filter-
ing. While generally more accurate than
content-based techniques, collaborative
filtering suffers from what is called the cold
start problem, due to its inability to ad-
dress the system’s new products and users. In this aspect,
content filtering is superior.
The two primary areas of collaborative filtering are the
neighborhood methods and latent factor models. Neighbor-
hood methods are centered on computing the relationships
between items or, alternatively, between users. The item-
oriented approach evaluates a user’s preference for an
item based on ratings of “neighboring” items by the same
user. A product’s neighbors are other products that tend
to get similar ratings when rated by the same user. For
example, consider the movie Saving Private Ryan. Its
neighbors might include war movies, Spielberg movies,
and Tom Hanks movies, among others. To predict a par-
ticular user’s rating for Saving Private Ryan, we would look
for the movie’s nearest neighbors that this user actually
rated. As Figure 1 illustrates, the user-oriented approach
identifies like-minded users who can complement each
other’s ratings.
Latent factor models are an alternative approach
that tries to explain the ratings by characterizing both
items and users on, say, 20 to 100 factors inferred from
the ratings patterns. In a sense, such factors comprise a
computerized alternative to the aforementioned human-
created song genes. For movies, the discovered factors
might measure obvious dimensions such as comedy versus
drama, amount of action, or orientation to children; less
Joe
#2
#3
#1
#4
figure 1. The user-oriented neighborhood method. Joe likes the three
movies on the left. To make a prediction for him, the system finds similar
users who also liked those movies, and then determines which other movies
they liked. In this case, all three liked Saving Private Ryan, so that is the first
recommendation. Two of them liked Dune, so that is next, and so on.
COVER FE ATURE
computer 32
vector q
i
∈ R f, and each user u is associ-
ated with a vector p
u
∈ R f. For a given item
i, the elements of q
i
measure the extent to
which the item possesses those factors,
positive or negative. For a given user u,
the elements of p
u
measure the extent of
interest the user has in items that are high
on the corresponding factors, again, posi-
tive or negative. The resulting dot product,
q
i
T p
u
, captures the interaction between user
u and item i—the user’s overall interest in
the item’s characteristics. This approximates
user u’s rating of item i, which is denoted by
r
ui
, leading to the estimate
rˆui
= q
i
T p
u
. (1)
The major challenge is computing the map-
ping of each item and user to factor vectors
q
i
, p
u
∈ R f. After the recommender system
completes this mapping, it can easily esti-
mate the rating a user will give to any item
by using Equation 1.
Such a model is closely related to singular value decom-
position (SVD), a well-established technique for identifying
latent semantic factors in information retrieval. Applying
SVD in the collaborative filtering domain requires factoring
the user-item rating matrix. This often raises difficulties
due to the high portion of missing values caused by sparse-
ness in the user-item ratings matrix. Conventional SVD is
undefined when knowledge about the matrix is incom-
plete. Moreover, carelessly addressing only the relatively
few known entries is highly prone to overfitting.
Earlier systems relied on imputation to fill in missing
ratings and make the rating matrix dense.2 However, im-
putation can be very expensive as it significantly increases
the amount of data. In addition, inaccurate imputation
might distort the data considerably. Hence, more recent
works3-6 suggested modeling directly the observed rat-
ings only, while avoiding overfitting through a regularized
model. To learn the factor vectors (p
u
and q
i
), the system
minimizes the regularized squared error on the set of
known ratings:
min
* *,q p ( , )u i ∈
∑
κ
(r
ui
- q
i
Tp
u
)2 + λ(|| q
i
||2 + || p
u
||2) (2)
Here, κ is the set of the (u,i) pairs for which r
ui
is known
(the training set).
The system learns the model by fitting the previously
observed ratings. However, the goal is to generalize those
previous ratings in a way that predicts future, unknown
ratings. Thus, the system should avoid overfitting the
observed data by regularizing the learned parameters,
whose magnitudes are penalized. The constant λ controls
recommendation. These methods have become popular in
recent years by combining good scalability with predictive
accuracy. In addition, they offer much flexibility for model-
ing various real-life situations.
Recommender systems rely on different types of
input data, which are often placed in a matrix with one
dimension representing users and the other dimension
representing items of interest. The most convenient data
is high-quality explicit feedback, which includes explicit
input by users regarding their interest in products. For
example, Netflix collects star ratings for movies, and TiVo
users indicate their preferences for TV shows by pressing
thumbs-up and thumbs-down buttons. We refer to explicit
user feedback as ratings. Usually, explicit feedback com-
prises a sparse matrix, since any single user is likely to
have rated only a small percentage of possible items.
One strength of matrix factorization is that it allows
incorporation of additional information. When explicit
feedback is not available, recommender systems can infer
user preferences using implicit feedback, which indirectly
reflects opinion by observing user behavior, including pur-
chase history, browsing history, search patterns, or even
mouse movements. Implicit feedback usually denotes the
presence or absence of an event, so it is typically repre-
sented by a densely filled matrix.
a Basic matRix factoRization modeL
Matrix factorization models map both users and items
to a joint latent factor space of dimensionality f, such that
user-item interactions are modeled as inner products in
that space. Accordingly, each item i is associated with a
Geared
toward
males
Serious
Escapist
The Princess
Diaries
Braveheart
Lethal Weapon
Independence
Day
Ocean’s 11
Sense and
Sensibility
Gus
Dave
Geared
toward
females
Amadeus
The Lion King
Dumb and
Dumber
The Color Purple
figure 2. A simplified illustration of the latent factor approach, which
characterizes both users and movies using two axes—male versus female
and serious versus escapist.
33AuGuSt 2009
data aspects and other application-specific requirements.
This requires accommodations to Equation 1 while staying
within the same learning framework. Equation 1 tries to cap-
ture the interactions between users and items that produce
the different rating values. However, much of the observed
variation in rating values is due to effects associated with
either users or items, known as biases or intercepts, indepen-
dent of any interactions. For example, typical collaborative
filtering data exhibits large systematic tendencies for some
users to give higher ratings than others, and for some items
to receive higher ratings than others. After all, some products
are widely perceived as better (or worse) than others.
Thus, it would be unwise to explain the full rating value
by an interaction of the form q
i
Tp
u
. Instead, the system tries
to identify the portion of these values that individual user or
item biases can explain, subjecting only the true interaction
portion of the data to factor modeling. A first-order approxi-
mation of the bias involved in rating r
ui
is as follows:
b
ui
= µ + b
i
+ b
u
(3)
The bias involved in rating r
ui
is denoted by b
ui
and ac-
counts for the user and item effects. The overall average
rating is denoted by µ; the parameters b
u
and b
i
indicate
the observed deviations of user u and item i, respectively,
from the average. For example, suppose that you want
a first-order estimate for user Joe’s rating of the movie
Titanic. Now, say that the average rating over all movies, µ,
is 3.7 stars. Furthermore, Titanic is better than an average
movie, so it tends to be rated 0.5 stars above the average.
On the other hand, Joe is a critical user, who tends to rate
0.3 stars lower than the average. Thus, the estimate for
Titanic’s rating by Joe would be 3.9 stars (3.7 + 0.5 - 0.3).
Biases extend Equation 1 as follows:
rˆui
= µ+ b
i
+ b
u
+ q
i
Tp
u
(4)
Here, the observed rating is broken down into its four
components: global average, item bias, user bias, and user-
item interaction. This allows each component to explain
only the part of a signal relevant to it. The system learns
by minimizing the squared error function:4,5
min
* * *, ,p q b ( , )u i ∈
∑
κ
(r
ui
- µ - b
u
- b
i
- p
u
Tq
i
)2 + λ
(|| p
u
||2 + || q
i
||2 + b
u
2 + b
i
2) (5)
Since biases tend to capture much of the observed
signal, their accurate modeling is vital. Hence, other works
offer more elaborate bias models.11
additionaL inPUt soURces
Often a system must deal with the cold start problem,
wherein many users supply very few ratings, making it
the extent of regularization and is usually determined by
cross-validation. Ruslan Salakhutdinov and Andriy Mnih’s
“Probabilistic Matrix Factorization”7 offers a probabilistic
foundation for regularization.
LeaRning aLgoRithms
Two approaches to minimizing Equation 2 are stochastic
gradient descent and alternating least squares (ALS).
stochastic gradient descent
Simon Funk popularized a stochastic gradient descent
optimization of Equation 2 (http://sifter.org/~simon/
journal/20061211.html) wherein the algorithm loops
through all ratings in the training set. For each given
training case, the system predicts r
ui
and computes the
associated prediction error
e
ui
=
def
r
ui
- q
i
T p
u
.
Then it modifies the parameters by a magnitude pro-
portional to g in the opposite direction of the gradient,
yielding:
• q q e p qi i ui u i← + ⋅ ⋅ − ⋅γ λ( )
• p p e q pu u ui i u← + ⋅ ⋅ − ⋅γ λ( )
This popular approach4-6 combines implementation
ease with a relatively fast running time. Yet, in some cases,
it is beneficial to use ALS optimization.
alternating least squares
Because both q
i
and p
u
are unknowns, Equation 2 is not
convex. However, if we fix one of the unknowns, the op-
timization problem becomes quadratic and can be solved
optimally. Thus, ALS techniques rotate between fixing the
q
i
’s and fixing the p
u
’s. When all p
u
’s are fixed, the system
recomputes the q
i
’s by solving a least-squares problem, and
vice versa. This ensures that each step decreases Equation
2 until convergence.8
While in general stochastic gradient descent is easier
and faster than ALS, ALS is favorable in at least two cases.
The first is when the system can use parallelization. In ALS,
the system computes each q
i
independently of the other
item factors and computes each p
u
independently of the
other user factors. This gives rise to potentially massive
parallelization of the algorithm.9 The second case is for
systems centered on implicit data. Because the training
set cannot be considered sparse, looping over each single
training case—as gradient descent does—would not be
practical. ALS can efficiently handle such cases.10
adding Biases
One benefit of the matrix factorization approach to col-
laborative filtering is its flexibility in dealing with various
COVER FE ATURE
computer 34
prove accuracy. Decomposing ratings into distinct terms
allows the system to treat different temporal aspects sepa-
rately. Specifically, the following terms vary over time: item
biases, b
i
(t); user biases, b
u
(t); and user preferences, p
u
(t).
The first temporal effect addresses the fact that an
item’s popularity might change over time. For example,
movies can go in and out of popularity as triggered by
external events such as an actor’s appearance in a new
movie. Therefore, these models treat the item bias b
i
as a
function of time. The second temporal effect allows users
to change their baseline ratings over time. For example, a
user who tended to rate an average movie “4 stars” might
now rate such a movie “3 stars.” This might reflect several
factors including a natural drift in a user’s rating scale,
the fact that users assign ratings relative to other recent
ratings, and the fact that the rater’s identity within a house-
hold can change over time. Hence, in these models, the
parameter b
u
is a function of time.
Temporal dynamics go beyond this; they also affect
user preferences and therefore the interaction between
users and items. Users change their preferences over time.
For example, a fan of the psychological thrillers genre
might become a fan of crime dramas a year later. Simi-
larly, humans change their perception of certain actors
and directors. The model accounts for this effect by taking
the user factors (the vector p
u
) as a function of time. On
the other hand, it specifies static item characteristics, q
i
,
because, unlike humans, items are static in nature.
Exact parameterizations of time-varying parameters11
lead to replacing Equation 4 with the dynamic prediction
rule for a rating at time t:
rˆui
(t) = µ + b
i
(t) + b
u
(t) + q
i
T p
u
(t) (7)
inPUts With VaRying confidence LeVeLs
In several setups, not all observed ratings deserve the
same weight or confidence. For example, massive adver-
tising might influence votes for certain items, which do
not aptly reflect longer-term characteristics. Similarly, a
system might face adversarial users that try to tilt the rat-
ings of certain items.
Another example is systems built around implicit
feedback. In such systems, which interpret ongoing
user behavior, a user’s exact preference level is hard to
quantify. Thus, the system works with a cruder binary
representation, stating either “probably likes the product”
or “probably not interested in the product.” In such cases,
it is valuable to attach confidence scores with the esti-
mated preferences. Confidence can stem from available
numerical values that describe the frequency of actions,
for example, how much time the user watched a certain
show or how frequently a user bought a certain item. These
numerical values indicate the confidence in each obser-
vation. Various factors that have nothing to do with user
difficult to reach general conclusions on their taste. A way
to relieve this problem is to incorporate additional sources
of information about the users. Recommender systems can
use implicit feedback to gain insight into user preferences.
Indeed, they can gather behavioral information regardless
of the user’s willingness to provide explicit ratings. A re-
tailer can use its customers’ purchases or browsing history
to learn their tendencies, in addition to the ra