Report
Graph Databases
CS341 Distributed Information Systems
University of Basel
Lukas Beck
lukas.beck@stud.unibas.ch
21. December 2012
Contents
1 Introduction & Motivation 2
2 Graph Theory 3
2.1 Single-Relational Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Multi-Relational Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Property Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3 Graph Databases 5
3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.3 Graph Traversal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.3.1 Local Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.3.2 Local Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3.3 Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.4 Real World Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4 Implementations 9
5 Conclusion 9
2
1 Introduction & Motivation
When working with large graphs, one realizes that traditional databases – for example
relational databases – reach their limits. The main problem lies in the storage of graphs.
To illustrate the problem, let us look at an example:
Figure 1 illustrates our graph and problem. User Alex is interested in further rec-
ommendations for Pizza places. The figure shows one possible (graphical) solution by
looking at other user’s recommendations based on Alex’ favorites.
Figure 1: An example graph illustrating a graphical solution for finding additional rec-
ommendations for user Alex based on his existing recommendations.
user recommends
Alex C
Alex D
Luke D
Luke E
Table 1
Table 1 shows the recommends edges for the presented graph in a relational database.
By a simple lookup, we find all recommendations of Alex. Then, we have to reach
other users that recommended the same places (Luke in our case). This is done by
self-joining the table. Now, we again have to use a self-join to ‘traverse’ from Luke to his
recommendations. The following SQL query shows a possible solution:
3
SELECT * FROM recommendations r1
JOIN recommendations r2 ON
r2.user <> r1.user AND
r2.recommends = r1.recommends
JOIN recommendations r3 ON
r3.user = r2.user
Notice that the filtering of already recommended places is omitted for reasons of
readability.
The main problem with this approach are the very expensive join operations. Imagine
that there are hundred of thousands of recommendations in the database and the database
system has to do several self-joins to answer the query. Thus, we need another approach
to answer such queries efficiently: graph databases. However, before introducing these
types of databases we first need to have a look at different types of graphs used in this
context.
2 Graph Theory
There exists a variety of graph models. This section serves as an introduction to the
most widely known and relevant ones used in graph databases.
Notice that the list of presented types is by no means complete.
2.1 Single-Relational Graph
This type of graph is the classical textbook graph. Formally, it is defined as follows:
Definition 1. A single-relational graph is given as a pair G = ⟨V,E⟩ with
V : a set of vertices, also called nodes
E: a set of pairs of vertices denoted as edges
If the pairs of E are ordered, the graph is directed, otherwise it is undirected.
Figure 2: A single-relational graph with undirected edges. Nodes are represented by
circles. Edges connect different nodes.
4
The main limitation of this graph is that it only models one type of relation – thus the
name single-relational. This means that there is only one type of nodes and one type of
edges. Graphs of this type are very restraining to use in real world problems.
2.2 Multi-Relational Graph
Multi-relational graphs overcome the main limitation of single-relational graphs by
introducing labels to every edge. This enables to model explicitly different types of edges.
Also, these labels can introduce different types of nodes implicitly.
Formally, it is defined as a single-relational graph G = ⟨V,E⟩ with a function ` that
maps edges to labels. In graph theory, G is called an edge-labeled graph.
Figure 3: An example figure of a directed multi-relational graph. In this example, the
labeling is defined as ` ∶ E → {follows, recommends, employs}. The types of
nodes are implicitly given by the labels where follows maps users to users,
recommends users to places and employs places to users.
2.3 Property Graph
Multi-relational graph are already very suited to use as a basis for modelling problems.
However, it is complicated to store additional, non-relational data in these graphs1.
Property graphs solve exactly this problem by enabling to store additional key-value
pairs to vertices and edges.
1This is done by introducing additional nodes and relating them to the initial node.
5
Figure 4: An example of a directed property graph. In addition to labels, we can store
key-value pairs for each node and edge.
Property graphs are very powerful and can be used to model other graph types. This
or similar types of graphs are most often used in actual graph database implementations.
Property graphs – as defined here – are binary edge graphs which means that an edge
connects two nodes. This simplifies the actual implementation as from a node’s point of
view, an edge is reduced to a link to another node instead of a list of nodes.
However, it is possible but cumbersome to model n-ary edge graphs like hypergraphs.
3 Graph Databases
3.1 Definition
Sadly, there is currently no textbook definition of what a graph database is. Additionally,
almost every database can model graphs and in particular property graphs too2. This
fact obviously raises the question what graph databases are if every database is able to
model graphs. Rodriguez [2010] provided a very natural definition of graph databases to
capture the basic intention:
“A graph database is any storage system that provides index-free adjacency.”
[Rodriguez, 2010, Slide 27]
Effectively, this means that a graph database provides some sort of pointer to follow
edges to other nodes. As we saw in the introduction, in relational databases this is not
the case. An edge is saved using some sort of identifier which is then used to find the
actual edge or node using an index. Figure 5 illustrates this difference between explicit
and implicit adjacency:
In both figures, a index is built to access the nodes. However, when storing the graph
implicitly, in each node only reference ids to adjacent nodes are saved. When following
an edge, this id is used to find the actual node B through the index. This lookup scales
logarithmically with the number of nodes in the index.
2We saw one example for relational databases in the introduction.
6
On the other hand, when storing the graph explicitly there exists a direct ‘pointer’ to
adjacent nodes. In this case, we do not have to search through the index again but just
‘follow’ the pointer. As this lookup is done locally, it has constant time complexity.
(a)
(b)
Figure 5: Difference between storing edges of graphs implicitly (5a) and explicitly (5b).
3.2 Properties
Out of the above definition, we can derive directly some of the properties that graph
databases provide. First and most importantly graph databases emphasis on relation
between things. Compared to expensive join operations known in relational databases,
graph databases scale better when looking up adjacent elements as the explicit storage
enables a local and not a global lookup.
Because of the underlying, flexible graph model, graph databases are suited for fast
changing and evolving data sets.
Graph databases enable to address and solve graph problems in a more natural way.
Examples of such problems are path finding algorithms or graph traversal. Some examples
of graph traversals are shown in the following section.
3.3 Graph Traversal
Graph databases provide a natural way to solve problems that can be solved by traversing
over the graph. This sections illustrates some of the possible problems. To give an idea
of how an actual query looks like, examples using Gremlin3 are provided as well.
3.3.1 Local Search
Local search is a simple problem where one is interested (is searching) for nodes around
the neighborhood of given nodes. For example, Figure 6 answers the question what Pizza
Places did users that are being followed by Mike recommend. The traversal starts at user
Mike, follows every edge labeled follows and then follows every recommends edge. The
result contains every node found at the end of the traversal.
3Gremlin is graph traversal language which provides a simple interface to work with graphs. More
information on Gremlin can be found at http://gremlin.tinkerpop.com.
7
In Gremlin, the following query answers the example:
gremlin > v.outE(’follows ’).inV.outE(’recommends ’).inV.name
==>A
==>B
Figure 6: An example of locally searching on a graph. We are interested in the Pizza
places that users followed by Mike did recommend.
3.3.2 Local Recommendation
A more sophisticated example assumes that if you like a Pizza place, you will also like
other recommendations of users that like your Pizza place. Using this assumption, we
can provide recommendations given a Pizza place as seen in Figure 7. To answer such
a query, we follow every recommendation and list every other recommendation of the
found users as a result:
gremlin > v.inE(’recommends ’).outV.outE(’recommends ’).
inV.filter{it != v}.name
==>C
==>E
8
Figure 7: An example of a simple recommendation query: “What Pizza Places may you
also like if you like D?”
3.3.3 Collaborative Filtering
Instead of defining a given Pizza place for recommendation, we can use the recommenda-
tions an user made directly. This is one possible way of defining collaborative filtering
and can be used to give users automatically additional recommendations based on their
existing ones. This is the example we used in the introduction in Figure 1 where we were
interested in recommendations for user Alex.
Effectively, we simply add a step before doing local recommendation and use every
Pizza place the user recommended as an ‘input’ for local recommendation:
gremlin > v.outE(’recommends ’).inV.inE(’recommends ’).
outV.filter{it != v}.outE(’recommends ’).inV.name
==>E
3.4 Real World Use Cases
Several characteristics of the data or the problems can lead in using graph databases. If
data is highly connected and/or emphasizes on relational aspects, a graph databases may
be a suitable solution. Similarly if the problem can be solved in traversing the graph or
exploiting relational properties, a graph database may be optimal.
One real world example which relies on graph database is fraud detection. The problem
in detecting fraudulent activities involves following multiple connections among persons
which is natural task for graph databases. A similar problem faces criminal detection
and identifying groups or communication of criminals across multiple hops.
9
4 Implementations
There exists a variety of graph database implementations. Wikipedia4 lists already 20
different projects around graph databases.
The most notable (or known) is arguably Neo4j5 which uses the property graph as its
underlying model.
Important to notice is that there is no standardized access to the graph database.
The most prominent approaches are Cypher6 – the query language used in Neo4j – and
Blueprints7 – a generic Java API similar to JDBC. Cypher is only supported in Neo4j
whereas Blueprint provides several implementations – among others Neo4j, OrientDB
and Dex.
5 Conclusion
We showed several problems with traditional databases when working with large graphs.
To approach these problems, we then introduced graph databases, their most common
graph model and their properties. To further illustrate these systems, we presented
several graph traversal examples that can be addressed and solved efficiently in graph
databases. Afterwards, we presented real world problems that use such systems.
4http://en.wikipedia.org/wiki/Graph_database#Graph_database_projects
5http://neo4j.org
6http://docs.neo4j.org/chunked/milestone/cypher-query-lang.html
7http://blueprints.tinkerpop.com/
10
References
Marko A. Rodriguez. The graph traversal programming pattern. WindyCity DB confer-
ence, 2010. URL http://slideshare.net/slidarko/graph-windycitydb2010.
Introduction & Motivation
Graph Theory
Single-Relational Graph
Multi-Relational Graph
Property Graph
Graph Databases
Definition
Properties
Graph Traversal
Local Search
Local Recommendation
Collaborative Filtering
Real World Use Cases
Implementations
Conclusion