1
VIPS: a Vision-based Page Segmentation Algorithm
Deng Cai
Shipeng Yu
Ji-Rong Wen
Wei-Ying Ma
Nov. 1, 2003
Technical Report
MSR-TR-2003-79
Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
2
VIPS: a Vision-based Page Segmentation Algorithm
Deng Cai1*, Shipeng Yu 2*, Ji-Rong Wen* and Wei-Ying Ma*
*Microsoft Research Asia
5F, Beijing Sigma Center
No.49, Zhichun Road Haidian District
Beijing, P.R.China
{jrwen, wyma}@microsoft.com
1Tsinghua University
Beijing, P.R.China
cai_deng@yahoo.com
2Peking University
Beijing, P.R.China
ysp@is.pku.edu.cn
3
VIPS: a Vision-based Page Segmentation Algorithm
Abstract
A new web content structure analysis based on visual representation is proposed in
this paper. Many web applications such as information retrieval, information
extraction and automatic page adaptation can benefit from this structure. This paper
presents an automatic top-down, tag-tree independent approach to detect web content
structure. It simulates how a user understands web layout structure based on his visual
perception. Comparing to other existing techniques, our approach is independent to
underlying documentation representation such as HTML and works well even when
the HTML structure is far different from layout structure. Experiments show
satisfactory results.
1 Introduction
Today the Web has become the largest information source for people. Most
information retrieval systems on the Web consider web pages as the smallest and
undividable units, but a web page as a whole may not be appropriate to represent a
single semantic. A web page usually contains various contents such as navigation,
decoration, interaction and contact information, which are not related to the topic of
the web-page. Furthermore, a web page often contains multiple topics that are not
necessarily relevant to each other. Therefore, detecting the semantic content structure
of a web page could potentially improve the performance of web information retrieval.
Many web applications can utilize the semantic content structures of web pages.
For example, in web information accessing, to overcome the limitations of browsing
and keyword searching, some researchers have been trying to use database techniques
and build wrappers to structure the web data [1-3, 18]. In building wrappers, it is
necessary to divide the web documents into different information chunks [2, 17].
Previous work uses ad hoc methods to deal with different types of web pages. If we
can get a semantic content structure of the web page, wrappers can be more easily
built and information can be more easily extracted. Moreover, Link analysis has
received much attention in recent years. Traditionally different links in a page are
treated identically. The basic assumption of link analysis is that if there is a link
4
between two pages, there is some relationship between the two whole pages. But in
most cases, a link from page A to page B just indicates that there might be some
relationship between some certain part of page A and some certain part of page B.
Also, the existence of large quantify of noisy links will cause the topic drift problem
in HITS algorithm [7, 20]. Recent works on topic distillation [11, 12] and focused
crawling [13] strengthen our observation. However, these works are based on DOM
(Document Object Model)1 tree of the web page which has no sufficient power to
semantically segment the web page as we show in the experimental section.
Furthermore, efficient browsing of large web pages on small handheld devices also
necessitates semantically segmentation of web pages [19].
Much recent work [11, 13, 14, 17] try to extract the structure information from
HTML DOM tree. However, because of the flexibility of HTML syntax, a lot of web
pages do not obey the W3C html specifications, which might cause mistakes in DOM
tree structure. Moreover, DOM tree is initially introduced for presentation in the
browser rather than description of the semantic structure of the web page. For
example, even though two nodes in the DOM tree have the same parent, it might not
be the case that the two nodes are more semantically related to each other than to
other nodes. Two examples are shown in the experimental section. To provide a better
description of the semantic structure of the web page content, XML2 is introduced.
However, as we can observe, the majority of the web pages are written in HTML
rather than XML.
In the sense of human perception, it is always the case that people view a web
page as different semantic objects rather than a single object. Some research efforts
show that users always expect that certain functional part of a web page (e.g.
navigational links, advertisement bar) appears at certain position of that page [6].
Actually, when a web page is presented to the user, the spatial and visual cues can
help the user to unconsciously divide the web page into several semantic parts.
Therefore, it might be possible to automatically segment the web pages by using the
spatial and visual cues. Throughout this paper, we use block to denote the semantic
part of the web page.
1
http://www.w3c.org/DOM/
2
http://www.w3c.org/XML/
5
In this paper, we propose VIPS (VIsion-based Page Segmentation) algorithm to
extract the semantic structure for a web page. Such semantic structure is a hierarchical
structure in which each node will correspond to a block. Each node will be assigned a
value (Degree of Coherence) to indicate how coherent of the content in the block
based on visual perception. The VIPS algorithm makes full use of page layout feature:
it first extracts all the suitable blocks from the html DOM tree, then it tries to find the
separators between these extracted blocks. Here, separators denote the horizontal or
vertical lines in a web page that visually cross with no blocks. Finally, based on these
separators, the semantic structure for the web page is constructed. VIPS algorithm
employs a top-down approach, which is very effective.
The rest of the paper is organized as follows. Section 2 provides an overview on
the related works. In Section 3, we define the web page content structure based on
vision. The VIPS algorithm is introduced in Section 4. The experimental results are
shown in Section 5. Finally, we give concluding remarks in Section 6.
2 Related Works
The applications mentioned in the introduction indicate the need of techniques for
extracting the content structure of a web page. Many researchers have considered
using the tag information and dividing the page based on the type of the tags. Useful
tags include
(paragraph),
(table), (list), ~ (heading),
etc. Diao [15] treats segments of web pages in a learning based web query processing
system and deals with these major types of tags. Lin [21] only considers
tag and its offspring as a content block and uses entropy based approach to discover
informative ones. For adaptive content delivery, Kaasinen [19] and Buyukkokten [10]
split the web page by some easy tags such as ,
and for further
conversion or summarization. Wong [26] defines tag types for page segmentation and
gives a label to each part of the web page for classification.
Besides the tag tree, some other algorithms also make use of the content or link
information. Embley [17] and Buttler [9] use some heuristic rules to discover record
boundaries within a page, which assist data extraction from the web page. Chakrabarti
[11] [12] addresses the fine-grained topic distillation and dis-aggregates hubs into
regions by analyzing link structure as well as intra-page text distribution. Recently,
Bar-Yossef [5] proposes the template detection problem and presents an easy
6
algorithm only based on link information. Also, Rahman gives some comments on
page decomposition in [22] but does not show details on the technique.
A Function-based Object Model (FOM) of a web page is proposed by Chen [14]
for content understanding and adaptation. Every undividable element in the tag tree is
called a basic object and can be grouped into a composite object. A function type can
be defined to each object and helps to build a hierarchical structure for the page.
However, the grouping rules and the functions are hard to define accurately, and thus
make the whole tree-constructing process very inflexible.
All of the above methods fail to take into account visual structure of the web page.
In this paper, we propose a new algorithm, call VIPS (VIsion-based Page
Segmentation), to semantically segment the web page. In the next section, we first
describe our defined vision-based content structure for web pages.
3 Vision-based Content Structure for Web Pages
Similar to [14], we define the basic object as the leaf node in the DOM tree that can
not be decomposed any more. In this paper, we propose the vision-based content
structure, where every node, called a block, is a basic object or a set of basic objects.
It is important to note that, the nodes in the vision-based content structure do not
necessarily correspond to the nodes in the DOM tree.
Similar to the description of document representation in [25], the basic model of
vision-based content structure for web pages is described as follows.
A web page � is represented as a triple ( ), ,O δΩ = Φ . { }1 2, ,..., NO = Ω Ω Ω is a
finite set of blocks. All these blocks are not overlapped. Each block can be recursively
viewed as a sub-web-page associated with sub-structure induced from the whole page
structure. { }1 2, ,..., Tϕ ϕ ϕΦ = is a finite set of separators, including horizontal
separators and vertical separators. Every separator has a weight indicating its visibility,
and all the separators in the same � have the same weight. � is the relationship of
every two blocks in O and can be expressed as: { }O O NULLδ = × → Φ ∪ . For
example, suppose �i and �j are two objects in �, ( ),i j NULLδ Ω Ω ≠ indicates that
�i and �j are exactly separated by the separator ( ),i jδ Ω Ω or we can say the two
objects are adjacent to each other, otherwise there are other objects between the two
blocks �i and �j.
7
1ϕ
2ϕ
3ϕ
1
2ϕ 22ϕ
(a) (b)
(c)
( )1, 2, 3, 4VB VB VB VBΟ =
{ }1 2 3, ,ϕ ϕ ϕΦ =
( )
( )
( )
1
2
3
1, 2
2, 3
3, 4
VB VB
VB VB
VB VB
else NULL
ϕ
ϕδ
ϕ
� �� �
� �� �
� �� �
=
� �� �
� �� �� � � �
� � � �
( )2 2 _1, 2 _ 2, 2 _ 3VB VB VB VB=
{ }2 1 22 2,ϕ ϕΦ =
( )
( )
1
2
2 2
2
2 _1, 2 _ 2
2 _ 2, 2 _ 3
VB VB
VB VB
NULLelse
ϕ
δ ϕ
� �� �
� �� �
= � �� �
� �� � � �
� � � �
(d) (e)
Figure 1. The layout structure and vision-based content structure of an example page. (d)
and (e) show the corresponding specification of vision-based content structure.
Since each �i is a sub-web-page of the original page, it has similar content structure
as �. Recursively, we have ( ), ,t t t ts s s sO δΩ = Φ , { }1 2, ,..., stNts st st stO = Ω Ω Ω ,
8
{ }1 2, ,..., stTts st st stϕ ϕ ϕΦ = and { }t t t ts s s sO O NULLδ = × → Φ ∪ where tsΩ is the tth object in
the sub-web-page level s, Nst and Tst are the number of objects in tsO and number of
separators in tsΦ .
Figure 1 shows an example of vision-based content structure for a web page of
Yahoo! Auctions (http://list.auctions.shopping.yahoo.com/21600-
category.html?alocale=0us). It illustrates the layout structure and the vision-based
content structure of the page. In the first level, the original web page has four objects or
visual blocks VB1~VB4 and three separators 1ϕ ~ 3ϕ , as specified in Figure 1(d). Then
we can further construct sub content structure for each sub web page. For example, VB2
has three offspring objects and two separators. It can be further analyzed as shown in
Figure 1(e).
For each visual block, the Degree of Coherence (DoC) is defined to measure how
coherent it is. DoC has the following properties:
� The greater the DoC value, the more consistent the content within the block;
� In the hierarchy tree, the DoC of the child is not smaller than that of its parent.
In our algorithm, DoC values are integers ranging from 1 to 10, although
alternatively different ranges (e.g., real numbers, etc.) could be used.
We can pre-define the Permitted Degree of Coherence (PDoC?to achieve different
granularities of content structure for different applications. The smaller the PDoC is, the
coarser the content structure would be. For example in Figure 1(a), the visual block
VB2_1 may not be further partitioned with an appropriate PDoC. Different application
can use VIPS to segment web page to a different granularity with proper PDoC.
The vision-based content structure is more likely to provide a semantic partitioning
of the page. Every node of the structure is likely to convey certain semantics. For
instance, in Figure 1(a) we can see that VB2_1_1 denotes the category links of Yahoo!
Shopping auctions, and that VB2_2_1 and VB2_2_2 show details of the two different
comics.
4 The VIPS Algorithm
9
Figure 2. The vision-based page segmentation algorithm
In this section, we introduce our VIPS algorithm. Basically, the vision-based content
structure of a page is obtained by combining the DOM structure and the visual cues.
The segmentation process is illustrated in Figure 2. It has three steps: block extraction,
separator detection and content structure construction. These three steps as a whole are
regarded as a round. The algorithm is top-down. The web page is firstly segmented into
several big blocks and the hierarchical structure of this level is recorded. For each big
block, the same segmentation process is carried out recursively until we get sufficiently
small blocks whose DoC values are greater than pre-defined PDoC.
(a) (b)
Figure 3. (a) Page Layout (b) DOM tree of the page
For each round, the DOM tree with its visual information corresponded to the
current block (page for the first round) is obtained from a web browser, as we show in
Figure 3. Then, from the root node(s) of the DOM tree (e.g. DOM node 1 in Figure 3b),
the block extraction process is started to extract blocks from the DOM tree based on
visual cues. Every DOM node (node 1, 2, 3, 4, 5, 6, 7 as shown in figure 3b) is checked
10
to judge whether it forms a single block or not. If not (node 1, 3, 4 in figure 3b), its
children will be processed in the same way. We will assign a DoC value to each
extracted block (node 2, 5, 6, 7 in figure 3b) based on the block’s visual property. When
all blocks of the current round in current page or sub-page are extracted, they are put
into a pool. Separators among these blocks are identified and the weight of a separator
is set based on properties of its neighboring blocks. The layout hierarchy was
constructed based on these separators. After constructing the layout hierarchy of the
current round, each leaf node of the content structure is checked to see whether or not it
meets the granularity requirement. If not, this leaf node will be treated as a sub-page and
will be further segmented similarly. For example, if the Block C in figure 3 does not
meet the requirement, we treat this block as a sub-page and it will be further segmented
into two parts, C1 and C2, as shown in Figure 4a, 4b.
(a) (b)
Figure 4. (a) Layout of the block C (b) DOM tree of block C
Figure 5. Vision-based content structure
After all blocks are processed, the final vision-based content structure for the web page
is outputted. In the above example, we finally obtain a vision-based content structure
11
tree as shown in Figure 5. In the following three subsections, we will describe the block
extraction, separator detection and content structure construction, respectively.
4.1 Visual Block Extraction
In this step, we aim at finding all appropriate visual blocks contained in the current sub-
page. In general, every node in the DOM tree can represent a visual block. However,
some “huge” nodes such as and are used only for organization purpose
and are not appropriate to represent a single visual block. In these cases, the current
node should be further divided and replaced by its children. Due to the flexibility of
HTML grammar, many web pages do not fully obey the W3C HTML specification, so
the DOM tree can not always reflect the true relationship of the different DOM node.
For each extracted node that represents a visual block, its DoC value is set
according to its intra visual difference. This process is iterated until all appropriate
nodes are found to represent the visual blocks in the current sub-page.
Figure 6. The visual block extraction algorithm
We judge if a DOM node can be divided based on following considerations:
� The properties of the DOM node itself. For example, the HTML tag of this
node, the background color of this node, the size and shape of this block
corresponding to this DOM node.
� The properties of the children of the DOM node. For example, the HTML tags
of children nodes, background color of children and size of the children. The
number of different kinds of children is also a consideration.
Algorithm DivideDomtree(pNode, nLevel)
{
IF (Dividable(pNode, nLevel) == TRUE){
FOR EACH child OF pNode {
DivideDomtree(child, nLevel);
}
} ELSE {
Put the sub-tree (pNode) into the
pool as a block;
}
}
12
Based on WWW html specification 4.011, we classify the DOM node into two
categories, inline node and line-break node:
� Inline node: the DOM node with inline text HTML tags, which affect the
appearance of text and can be applied to a string of characters without
introducing line break. Such tags include , , , , ,
, , etc.
� Line-break Node: the node with tag other than inline text tags.
Based on the appearance of the node on the browser and the children properties of
the node, we give some definitions:
� Valid node: a node that can be seen through the browser. The node’s width and
height are not equal to zero. (For example the second and fourth child of
is not a valid node in Figure 7)
� Text node: the DOM node corresponding to free text, which does not have an
html tag.
� Virtual text node (recursive definition):
� Inline node with only text node children is a virtual text node.
� Inline node with only text node and virtual text node children is a virtual
text node.
The visual block extraction algorithm DivideDomtree is illustrated in Figure 6.
Some important cues which are used to produce heuristic rules in the algorithm are:
� Tag cue:
1. Tags such as
are often used to separate different topics from visual
perspective. Therefore we prefer to divide a DOM node if it contains these
tags.
2. If an inline node has a child which is line-break node, we divide this inline
node.
� Color cue: We prefer to divide a DOM node if its background color is different
from one of its children’s. At the same time, the child node with different
background color will not be divided in this round.
1
http://www.w3.org/TR/html4/
13
� Text cue: If most of the children of a DOM node are text nodes or virtual text
node, we prefer not to divide it.
� Size cue: We predefine a relative size threshold (the node size compared with
the size of the whole page or sub-page) for different tags (the threshold varies
with the DOM nodes having different HTML tags). If the relative size of the
node is small than the th
本文档为【页面拆分算法】,请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑,
图片更改请在作品中右键图片并更换,文字修改请直接点击文字进行修改,也可以新增和删除文档中的内容。
[版权声明] 本站所有资料为用户分享产生,若发现您的权利被侵害,请联系客服邮件isharekefu@iask.cn,我们尽快处理。
本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权,请谨慎使用。
网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传,仅限个人学习分享使用,禁止用于任何广告和商用目的。