Telecommun Syst (2012) 50:1–13
DOI 10.1007/s11235-010-9384-1
Large-scale IP network behavior anomaly detection
and identification using substructure-based approach
and multivariate time series mining
Weisong He · Guangmin Hu · Yingjie Zhou
Published online: 4 August 2010
© Springer Science+Business Media, LLC 2010
Abstract In this paper, a substructure-based network be-
havior anomaly detection approach, called WFS (Weighted
Frequent Subgraphs), is proposed to detect the anomalies
of a large-scale IP networks. With application of WFS, an
entire graph is examined, unusual substructures of which
are reported. Due to additional information given by the
graph, the anomalies are able to be detected more accurately.
With multivariate time series motif association rules min-
ing (MTSMARM), the patterns of abnormal traffic behavior
are able to be obtained. In order to verify the above propos-
als, experiments are conducted and, together with applica-
tion of backbone networks (Internet2) Netflow data, show
some positive results.
Keywords Anomaly detection and identification ·
Weighted frequent subgraphs · Multivariate time series
motif association rules mining
1 Introduction
Large-scale IP network behavior analysis is important for
network monitoring and network management. Network
traffic which exhibits multi-timescale properties within tem-
poral domain, however, is highly dynamic. large-scale IP
W. He (�) · G. Hu · Y. Zhou
Key Lab of Optical Fiber Sensing and Communication,
University of Electronic Science and Technology of China
(UESTC), 611731 Chengdu, P.R. China
e-mail: weisonghe@uestc.edu.cn
G. Hu (�)
e-mail: hgm@uestc.edu.cn
Y. Zhou
e-mail: yjzhou@uestc.edu.cn
network traffic anomalies which feature sudden erupting
without preknown signs, often bring great damage to net-
work equipments or computers of network in a short time.
Therefore, one of the prepositions to ensure trustworthy
networks is to detect and locate network traffic anomalies
quickly and accurately, determine the reasons that cause
them and make reasonable response to them in time.
Anomaly detection, which refers to the topic of finding
patterns in data that do not conform to expected behavior
or can be defined as follows: “Given a set of n data points
or objects and the number p of expected outliers, find the
top p objects that are considerably dissimilar, exceptional,
or inconsistent with respect to the remaining data” [1]. Net-
work behavior anomaly detection refers to the issue of find-
ing network behavior patterns in network data that do not
conform to expected behavior. An anomalous traffic pattern
in a computer network could mean that a hacked computer
is sending out sensitive data to an unauthorized destination
[4]. Based on the extent to which the labels are available,
anomaly detection techniques can operate in one of the fol-
lowing three modes: supervised anomaly detection, semi-
supervised anomaly detection and unsupervised anomaly
detection. Some anomaly detection techniques used in net-
work intrusion detection systems are listed as follows: Sta-
tistical Profiling using Histograms [5–19], Parametric Sta-
tistical Modeling [20–23], Non-parametric Statistical Mod-
eling [24], Bayesian Networks [25–28], Neural Networks
[13–15, 29–34], Support Vector Machines [35], Rule-Based
Systems [36–43], Clustering Based Method [35, 43, 45, 46],
Nearest Neighbor based Method [35, 44, 47], Spectral
and Graph Method [48, 49, 51, 53], Information Theoretic
Method [53, 54].
On the foundation of entropy-based analysis of traffic
feature distributions, this paper puts forward some new ideas
which go as follows. Firstly, two kinds of traffic feature dis-
2 W. He et al.
tributions are selected: one is address and port distributions,
the other is protocol and octets distributions. The two kinds
of distributions complement each other and provide differ-
ent insight into the underlying structure. Secondly, aimed
at detecting of large-scale IP network behavior anomalies
more accurately, in this paper multivariate time series are
represented by graphs which are produced by a given pair of
graphs using Cartesian Product, and a substructure-based
network behavior anomaly detection algorithm, named as
WFS, was proposed to detect the anomalies of a large-scale
IP network flow data.
The paper is organized as follows. We review related
work in Sect. 2. Then, in Sect. 3, we describe how to rep-
resent the real large-scale IP network flow data with graph
and propose WFS algorithm. In Sect. 4 we identify abnormal
patterns using MTSMARM. In Sect. 5 experiments are con-
ducted to evaluate our proposed method. Finally, we con-
clude in Sect. 6.
2 Related work
There is considerable interest in using entropy-based analy-
sis of traffic feature distributions for anomaly detection [49,
50]. Entropy-based metrics are tempting since they provide
more fine-grained insights into traffic structure than tradi-
tional traffic volume analysis.While previous work has illu-
minated the benefits of using the entropy of different traffic
distributions in isolation to detect anomalies, there has been
little effort in comprehensively understanding the detection
power provided by entropy-based analysis of multiple traf-
fic distribution used in connection with each other. Few re-
searchers focus on the detection power which is affected by
on choice of feature. Choosing traffic features distributions
that complement one another will benefit anomaly detection
capabilities greatly.
The detection of anomalies in multivariate time series is
more challenging due to that establishing a clear definition
of an anomaly is difficult. Analogous to univariate time se-
ries, some anomalies may correspond to abnormally high (or
low) values or unusual subsequences (discord [59]) in one
or more time series. Others may correspond to unexpected
changes in the relationships among a set of variables.
Recently there has been an urge towards analyzing data
using graph theory methods. Not to be confused with the
mechanisms for analyzing spatial data, graph-based data
mining methods are to analyze data that can be represented
as a graph (i.e., vertices and edges). Yet, while there has
been much written as it belongs to graph-based data mining
[60], very little research has been good at the area of graph-
based anomaly detection. In 2003 [53] used the SUBDUE
application to consider the problem of anomaly detection
from both the anomalous substructure and anomalous sub-
graph perspective. They were able to provide measurements
of anomalous behavior as they are applied to graphs from
two different views. Anomalous substructure detection dealt
with the unusual substructures that were discovered in an en-
tire graph. In order to discriminate an anomalous substruc-
ture from the other substructures, they proposed a simple
measurement whereby the value associated with a substruc-
ture indicated a degree of anomaly. They also introduced
the idea of anomalous subgraph detection which dealt with
how anomalous a subgraph (i.e., a substructure that is part
of a larger graph) was to other subgraphs. The idea was that
subgraphs that contained many ordinary substructures were
generally less anomalous than subgraphs that contained few
ordinary substructures. In addition, they also investigate the
idea of conditional entropy and data regularity using net-
work intrusion data as well as some artificially created data.
[64] utilized rarity measurements to the discovery of uncom-
mon links within a graph. Using various metrics to define
the commonality of paths between nodes, the user was able
to confirm whether a path between two nodes were interest-
ing or not, without having any preconceived notions of sig-
nificant patterns. One of the disadvantages of this approach
was that while it was domain independent, it assumed that
the user was querying the system to discover interesting pat-
terns regarding certain nodes. In other words, the uncommon
patterns had to originate or terminate from a user-defined
node. The AutoPart system presented a non-parametric ap-
proach to finding outliers in graph-based data [63]. Part of
Chakrabarti’s approach was to detect outliers by analyzing
how edges that were removed from the whole structure af-
fected the minimum descriptive length (MDL) of the graph.
Representing the graph as an adjacency matrix, and using
a compression technique to encode node groupings of the
graph, he searched the groups that cut down the compres-
sion cost as much as possible. Nodes were put in groups
based on their entropy. In 2005, the concept of entropy was
also used by [61] in their analysis of a real-world data set:
the famous Enron scandal. They used “event based graph
entropy” to determine the place of the most interesting peo-
ple in an Enron e-mail data set. Using a measure analogous
to what [53] had proposed, they supposed that the important
nodes (or people) were the ones who had the biggest effect
on the entropy of the graph when they were removed. Thus,
the most interesting node was the one that brought about the
maximum change to the graph’s entropy. However, they ne-
glected the relations between nodes which provided more
information about frequent subgraph. In 2005, by using just
bipartite graphs [52] presented a model for scoring the nor-
mality of nodes as they connect to the other nodes. Next,
by using an adjacency matrix they gave a “relevance score”
such that every node Vi had a relevance score to every node
Vj , whereby the higher the score the more related the two
nodes. The idea was that the nodes with the lower normality
score to node Vi were the more anomalous ones to that node.
Large-scale IP network behavior anomaly detection and identification using substructure-based approach 3
Table 1 The flow record contents
Basic information Related to routing
srcaddr, dstaddr, srcport, dstport nexthop
dpkts, doctets src_as, dst_as
first, last src_mask, dst_mask
tos, tcp_flags
The two drawbacks with this approach were that it only dealt
with bipartite graphs and it only found anomalous nodes,
rather than what could be anomalous substructures. In [62],
they also went after anomalous links, this time via a statis-
tical approach. By using a Katz measurement they used the
link structure to statistically predict the likelihood of a link.
While it worked on a small dataset of author-paper pairs,
their single measurement just analyzed the links in a graph.
3 Anomaly detection using substructure-based
approach
The graph representation of large-scale IP network flow data
can be divided into three steps. Firstly, large-scale IP net-
work flow data are compressed using Shannon entropy; Sec-
ondly, the entropy time series (a time series is composed
of entropy) is normalized, and the time series is converted
into an discrete symbolic sequence; Finally, multiple dimen-
sional data at each time point are represented by a graph.
3.1 Netflow, entropy and anomaly detection metrics
3.1.1 Netflow
With Netflow, applications on the network are able to be
monitored, and malicious traffic informations are able to
be identified. The Netflow, which is a passive monitoring
tool includes: statistics about groups of related packets (e.g.,
same TCP/IP headers and close in time); records header in-
formation, counts, and time. As can be seen from Table 1,
the flow record contents can be divided into two parts: one
is the basic information about the flow; the other is informa-
tion related to routing. The basic information about the flow
goes as follows: the first information is source and destina-
tion, IP address and port; the second one is packet and byte
counts; the third one is start and end times; the fourth one is
ToS and TCP flags. The informations related to routing are
listed as follows: the first next-hop IP address; source and
destination AS; input and output.
3.1.2 Shannon entropy and anomaly detection metrics
Entropy denoted by H , is a metric that captures the degree of
dispersal or concentration of a distribution and is a measure
Table 2 Description of six anomaly detection metrics at flow level
Metrics type Description
H(srcPort) Entropy of source port distribution
H(dstPort) Entropy of destination port distribution
H(srcIP) Entropy of source IP address distribution
H(dstIP) Entropy of destination IP address distribution
H(octets) Entropy of octets distribution
H(prot) Entropy of protocol type distribution
of the uncertainty of a random variable [2, 3, 49]. A wide va-
riety of anomalies will impact on the distribution of one of
the discussed IP features. Let X be a discrete random vari-
able with alphabet X representing the distribution of val-
ues which a particular network traffic feature can take, then
probability mass function is p(x) = Pr{X = x}, x ∈ X . The
entropy H(X) of a discrete random variable X is defined by
H(X) = −
∑
x∈X
p(x) logp(x) (1)
where p(x) is the probability of event x ∈ X occurring. For
example, the probability of seeing IP 129.173.192.0 is de-
fined to be the number of packets using IP 129.173.192.0
divided by the total number of packets in the given time in-
terval.
In this paper, entropy can be applied to obtain a com-
pressed representation of the large-scale IP network flow
data that is much smaller in volume, yet closely maintain
the integrity of the original data, which is called lossless
distribution information reduction. Anomaly detection met-
rics refer to metrics which are used for anomaly detection.
Network anomaly detection can be operated in three differ-
ent granularity such as packet level, flow level and network-
wide (OD traffic) level. Since flow level analysis is a good
compromise for traffic analysis based on network-wide and
packet in its performance and accuracy, the analysis method
of this paper is based on features distribution of Netflow.
The empirical evaluation of [50] shown that port and ad-
dress distribution are highly correlated. Besides port and ad-
dress, in order to explore the underlying traffic structure,
other features are chosen in this paper. The six flow level
metrics used in this paper are as follows: source address
(sometimes called source IP and denoted srcIP), destination
address (or destination IP, denoted dstIP), source port (src-
Port), destination port (dstPort), octets and type of protocol,
which are shown in Table 2.
3.2 Symbolic representation of multivariate time series
We use min-max normalization to map the entropy time se-
ries in the range [0,1] by computing
4 W. He et al.
Fig. 1 The representation of
graphs time series
h′i =
hi − hmin
hmax − hmin (2)
where hmin and hmax are the minimum and maximum values
of entropy time series. Then, we divide the normalized en-
tropy series into 2n (n ≥ 2) parts equally, with 2n alphabets
denoting the value of 2n parts respectively, and symbolic
sequence is obtained. For example, by using four alphabets
{A,B,C,D} denote the value of four parts respectively, and
symbolic sequence si is obtained.
si =
⎧
⎪⎪⎪⎨
⎪⎪⎪⎩
A, h′i ≤ 0.25
B, 0.25 < h′i ≤ 0.5
C, 0.5 < h′i ≤ 0.75
D, else
(3)
3.3 Graph representation of multivariate time series
Consider a time series H = h1h2 . . . hT , which is an or-
dered sequence of measurements taken for a real-valued
variable H at timestamps 1,2, . . . , T . A multivariate time
series D = {Hi}6i=1 is a collection of time series that corre-
sponds to measurements for six real-valued variables span-
ning the same time interval.
In order to represent the multivariate time series D, a new
undirected graph, G = (V ,E), V = {v1, v2, . . . , vn}, E =
{e1, e2, . . . , em} ⊆ V ×V which is derived from a given pair
of graphs TN and K6 using Cartesian Product TN × K6,
is made up. The representation of graphs time series is
shown in Fig. 1. Each vertex in the graph denotes a data
point of a Netflow feature and each vertex can be assigned
a value within {A,B,C,D}. Edge denotes the relation be-
tween two network anomaly detection metrics. As multi-
ple network anomaly metrics distribution may affect each
other, network anomaly metrics distribution time series are
represented with complete graph to describe the relations
of all network anomaly detection metrics. V (g) denotes
the vertex set of graph g and E(g) denotes the edge set.
Given a sequence GTs of n graphs {GT1 ,GT2, . . . ,GTn} with
GTn = (VTna,ETnb), 1 ≤ a ≤ 6,1 ≤ b ≤ 15, GTn denotes the
graph at time Tn, htc denotes the value for cth vertex Vc in
Gt , 1 ≤ t ≤ n,1 ≤ c ≤ 6. ETnp denotes the pth edge of GTn
at time Tn. We define GTs to be a graphs time series for
all 1 ≤ Ts ≤ n, and define Supt (i, j, . . .) as the numbers of
Large-scale IP network behavior anomaly detection and identification using substructure-based approach 5
graphs in GTs where g (consists of k vertexes) is a subgraph.
For example, Supt (i, j) ( i, j denote two different vertices,
t denotes the time point of the graph) represents the support
of 2-itemsets. A frequent subgraph is a graph whose support
is no less than a minimum support threshold, min_sup. Ac-
cording to different values of the nodes, the support of graph
pattern of each time point is counted. The fewer the sup-
ports are, the more anomalous the subgraphs. Covariance
is a measure of how much two traffic features change to-
gether. The corresponding edge weight is given according
to the covariance value. Let w be a discrete random vari-
able with alphabet W representing the weight coefficient of
subgraph can take, W(i, j) denotes the edge weight of 2 ver-
texes (i, j).
Consider m×1 random vector x(ζ ) = [x1(ζ ), x2(ζ ), . . . ,
xm(ζ )]T , then the mathematical expectation μx is
μx = [μ1,μ2, . . . ,μm]T (4)
Every random variable have m sample, then sample ma-
trix is
M = [α1,α2, . . . ,αn]T = [β1,β2, . . . ,βm] (5)
where βj (j = 1,2, . . . ,m) is the sample vector of j th ran-
dom variable, αi (i = 1,2, . . . , n) is the vector composed of
all the sample values of ith random variable.
The covariance matrix of random variable x(ζ ),y(ζ ) can
be defined as
Cxy = E{[x(ζ ) − μx][y(ζ ) − μy]H } (6)
The estimation covariance Cˆxy can be obtained based on the
sample value:
cˆxi ,yj =
1
n − 1
n∑
k=1
[(
Mki − 1
n
n∑
p=1
xpi
)
×
(
Mkj − 1
n
n∑
q=1
yqj
)]
(7)
The weight coefficient of subgraph can be computed as fol-
lows:
W(x1, x2) = exp(−cˆ1/2x1,x2)
W(x1, x2, x3) = exp(−cˆ1/2x1,x2 − cˆ1/2x2,x3 − cˆ1/2x3,x1)
W(x1, x2, x3, x4) = exp(−cˆ1/2x1,x2 − cˆ1/2x2,x3 − cˆ1/2x3,x4 − cˆ1/2x4,x1)
W(x1, x2, x3, x4, x5)
= exp(−cˆ1/2x1,x2 − cˆ1/2x2,x3 − cˆ1/2x3,x4 − cˆ1/2x4,x5 − cˆ1/2x5,x1) (8)
where 1 ≤ x1, x2, x3, x4, x5 ≤ 4. θτ denotes the contribution
factor (impact degree to anomaly)of subgraph G(i, j, . . .︸ ︷︷ ︸
τ
).
For example, θ2 denotes the contribution factor of sub-
graph G(i, j). According to F2(S,G) = Size(S) ∗
Instances(S,G), Size(S) is the number of vertices in S,
Instances(S,G) is the frequency that S appears in the
graph G, which is described in [53], small substructures
appear frequently, large substructures appears infrequently.
In general, 1 = θ2 > θ3 > θ4 > θ5 > 0. The abnormal index
Pt is represented as in (9):
Pt = log
⎛
⎜⎜⎜⎜⎜⎜⎜⎝
[θ2 θ3 θ4 θ5]
⎡
⎢⎢⎢⎢⎢⎢⎢⎣
∑
1≤x1