arXiv:cs/9907003v1 [cs.CL] 5 Jul 1999
Annotation Graphs as a Framework for
Multidimensional Linguistic Data Analysis
Steven Bird and Mark Liberman
Linguistic Data Consortium, University of Pennsylvania
3615 Market St, Philadelphia, PA 19104-2608, USA
{sb,myl}@ldc.upenn.edu
Abstract
In recent work we have presented a formal
framework for linguistic annotation based on
labeled acyclic digraphs. These ‘annotation graphs’
offer a simple yet powerful method for representing
complex annotation structures incorporating
hierarchy and overlap. Here, we motivate and
illustrate our approach using discourse-level
annotations of text and speech data drawn from
the CALLHOME, COCONUT, MUC-7, DAMSL
and TRAINS annotation schemes. With the help
of domain specialists, we have constructed a hybrid
multi-level annotation for a fragment of the Boston
University Radio Speech Corpus which includes
the following levels: segment, word, breath, ToBI,
Tilt, Treebank, coreference and named entity. We
show how annotation graphs can represent hybrid
multi-level structures which derive from a diverse
set of file formats. We also show how the approach
facilitates substantive comparison of multiple
annotations of a single signal based on different
theoretical models. The discussion shows how
annotation graphs open the door to wide-ranging
integration of tools, formats and corpora.
1 Annotation Graphs
When we examine the kinds of speech transcription
and annotation found in many existing ‘communi-
ties of practice’, we see commonality of abstract
form along with diversity of concrete format. Our
survey of annotation practice (Bird and Liberman,
1999) attests to this commonality amidst diversity.
(See [www.ldc.upenn.edu/annotation] for pointers to
online material.) We observed that all annotations
of recorded linguistic signals require one unavoidable
basic action: to associate a label, or an ordered
sequence of labels, with a stretch of time in the
recording(s). Such annotations also typically distin-
guish labels of different types, such as spoken words
vs. non-speech noises. Different types of annota-
tion often span different-sized stretches of recorded
time, without necessarily forming a strict hierarchy:
thus a conversation contains (perhaps overlapping)
conversational turns, turns contain (perhaps inter-
rupted) words, and words contain (perhaps shared)
phonetic segments. Some types of annotation are
systematically incommensurable with others: thus
disfluency structures (Taylor, 1995) and focus struc-
tures (Jackendoff, 1972) often cut across conversa-
tional turns and syntactic constituents.
A minimal formalization of this basic set of prac-
tices is a directed graph with fielded records on the
arcs and optional time references on the nodes. We
have argued that this minimal formalization in fact
has sufficient expressive capacity to encode, in a
reasonably intuitive way, all of the kinds of linguis-
tic annotations in use today. We have also argued
that this minimal formalization has good properties
with respect to creation, maintenance and searching
of annotations. We believe that these advantages
are especially strong in the case of discourse anno-
tations, because of the prevalence of cross-cutting
structures and the need to compare multiple anno-
tations representing different purposes and perspec-
tives.
Translation into annotation graphs does not mag-
ically create compatibility among systems whose
semantics are different. For instance, there are many
different approaches to transcribing filled pauses in
English – each will translate easily into an annota-
tion graph framework, but their semantic incompati-
bility is not thereby erased. However, it does enable
us to focus on the substantive differences without
having to be concerned with diverse formats, and
without being forced to recode annotations in an
agreed, common format. Therefore, we focus on the
structure of annotations, independently of domain-
specific concerns about permissible tags, attributes,
and values.
As reference corpora are published for a wider
range of spoken language genres, annotation
work is increasingly reusing the same primary
data. For instance, the Switchboard corpus
[www.ldc.upenn.edu/Catalog/LDC93S7.html] has
been marked up for disfluency (Taylor, 1995).
See [www.cis.upenn.edu/~treebank/switchboard-
sample.html] for an example, which also includes a
separate part-of-speech annotation and a Treebank-
style annotation. Hirschman and Chinchor (1997)
give an example of MUC-7 coreference annotation
applied to an existing TRAINS dialog annotation
marking speaker turns and overlap. We shall
encounter a number of such cases here.
The Formalism
As we said above, we take an annotation label to
be a fielded record. A minimal but sufficient set of
fields would be:
type: the level of an annotation, such as the segment, word and discourse levels;
label: a contentful property, such as a particular word, a speaker's name, or a discourse function;
class: an optional field which permits the arcs of an annotation graph to be co-indexed as members of an equivalence class.1
One might add further fields for holding comments,
annotator id, update history, and so on.
Let T be a set of types, L be a set of labels, and
C be a set of classes. Let R = {〈t, l, c〉 | t ∈ T, l ∈
L, c ∈ C}, the set of records over T, L,C. Let N be
a set of nodes. Annotation graphs (AGs) are now
defined as follows:
Definition 1 An annotation graph G over R, N is a set of triples having the form 〈n1, r, n2〉, r ∈ R, n1, n2 ∈ N, which satisfies the following conditions:
1. 〈N, {〈n1, n2〉 | 〈n1, r, n2〉 ∈ G}〉 is a labelled acyclic digraph.
2. τ : N ⇀ ℜ is an order-preserving map assigning times to (some of) the nodes.
For detailed discussion of these structures, see
(Bird and Liberman, 1999). Here we present a frag-
ment (taken from Figure 8 below) to illustrate the
definition. For convenience the components of the
fielded records which decorate the arcs are separated
using the slash symbol. The example contains two
word arcs, and a discourse tag encoding ‘influence
on speaker’. No class fields are used. Not all nodes
have a time reference.
[Figure: the example fragment as a graph. Node 1 (time 52.46) is linked to node 2 by the arc W/oh/, and node 2 to node 3 (time 53.14) by the arc W/okay/; a third arc, D/IOS:Commit/, spans from node 1 to node 3.]
1 We have avoided using explicit pointers since we prefer
not to associate formal identifiers to the arcs. Equivalence
classes will be exemplified later.
The minimal annotation graph for this structure is
as follows:
T = {W, D}
L = {oh, okay, IOS:Commit}
C = ∅
N = {1, 2, 3}
τ = {〈1, 52.46〉 , 〈3, 53.14〉}
A = { 〈1, W/oh/, 2〉, 〈2, W/okay/, 3〉, 〈1, D/IOS:Commit/, 3〉 }
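The definition translates almost directly into code. The following is a minimal sketch in Python (the concrete representation and function names are ours, not the paper's), encoding the example fragment and checking the order-preservation condition arc by arc:

```python
# An annotation graph as a set of triples <n1, (type, label, class), n2>,
# together with a partial time map tau on the nodes (Definition 1).
arcs = {
    (1, ("W", "oh", None), 2),
    (2, ("W", "okay", None), 3),
    (1, ("D", "IOS:Commit", None), 3),
}
tau = {1: 52.46, 3: 53.14}  # node 2 carries no time reference

def order_preserving(arcs, tau):
    """Check, arc by arc, that tau never decreases from source to target.

    A necessary condition for tau to be order-preserving over the
    partial order induced by the acyclic digraph.
    """
    return all(tau[n1] <= tau[n2]
               for n1, _, n2 in arcs
               if n1 in tau and n2 in tau)

print(order_preserving(arcs, tau))  # True
```

Because node 2 is untimed, only the arc from node 1 to node 3 constrains τ in this fragment.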
XML is a natural ‘surface representation’ for
annotation graphs and could provide the primary
exchange format. A particularly simple XML encoding of the above structure is possible, though one might choose to use a richer XML encoding in practice.
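The paper's XML figure is not reproduced in this copy. As an illustration only, here is one possible minimal encoding of the fragment, generated with Python's xml.etree.ElementTree; the element and attribute names are our assumptions, not the authors' actual format:

```python
import xml.etree.ElementTree as ET

# Nodes carry an id and an optional time; arcs carry type, label and
# endpoint ids.  All names here are illustrative.
ag = ET.Element("annotation")
for ident, time in [("1", "52.46"), ("2", None), ("3", "53.14")]:
    node = ET.SubElement(ag, "node", id=ident)
    if time is not None:
        node.set("time", time)
for src, typ, label, dst in [("1", "W", "oh", "2"),
                             ("2", "W", "okay", "3"),
                             ("1", "D", "IOS:Commit", "3")]:
    # "from" is a Python keyword, so pass the attributes as a dict
    ET.SubElement(ag, "arc", {"from": src, "to": dst,
                              "type": typ, "label": label})

xml_text = ET.tostring(ag, encoding="unicode")
print(xml_text)
```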
2 AGs and Discourse Markup
2.1 LDC Telephone Speech Transcripts
The LDC-published CALLHOME corpora include
digital audio, transcripts and lexicons for telephone
conversations in several languages, and are
designed to support research on speech recognition
[www.ldc.upenn.edu/Catalog/LDC96S46.html]. The
transcripts exhibit abundant overlap between
speaker turns. What follows is a typical fragment
of an annotation. Each stretch of speech consists of
a begin time, an end time, a speaker designation,
and the transcription for the cited stretch of time.
We have augmented the annotation with + and *
to indicate partial and total overlap (respectively)
with the previous speaker turn.
[Figure 1: Graph Structure for LDC Telephone Speech Example. Each speaker turn is a disjoint chain of W/ arcs spanned by a speaker/ arc; turn-initial and turn-final nodes carry time references (e.g. speaker B's totally overlapped "yeah" runs from 994.19 to 994.46). The full graph is not reproducible in this copy.]
[Figure 2: Visualization for LDC Telephone Speech Example. A timeline from roughly 994 to 998 seconds with aligned speaker/ and W/ tiers for speakers A and B.]
962.68 970.21 A: He was changing projects every couple
of weeks and he said he couldn’t keep on top of it.
He couldn’t learn the whole new area
* 968.71 969.00 B: %mm.
970.35 971.94 A: that fast each time.
* 971.23 971.42 B: %mm.
972.46 979.47 A: %um, and he says he went in and had some
tests, and he was diagnosed as having attention deficit
disorder. Which
980.18 989.56 A: you know, given how he’s how far he’s
gotten, you know, he got his degree at &Tufts and all,
I found that surprising that for the first time as an
adult they’re diagnosing this. %um
+ 989.42 991.86 B: %mm. I wonder about it. But anyway.
+ 991.75 994.65 A: yeah, but that’s what he said. And %um
* 994.19 994.46 B: yeah.
995.21 996.59 A: He %um
+ 996.51 997.61 B: Whatever’s helpful.
+ 997.40 1002.55 A: Right. So he found this new job as a
financial consultant and seems to be happy with that.
1003.14 1003.45 B: Good.
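Turns in this format can be parsed mechanically. A minimal sketch (the field layout is inferred from the fragment above; the function is ours):

```python
import re

# Optional overlap marker (+ partial, * total), begin time, end time,
# speaker designation, then the transcription of that stretch.
LINE = re.compile(r"^([*+]\s+)?(\d+\.\d+)\s+(\d+\.\d+)\s+(\w+):\s+(.*)$")

def parse_turn(line):
    m = LINE.match(line.strip())
    if m is None:
        return None  # a wrapped continuation line, not a new turn
    marker, begin, end, speaker, text = m.groups()
    return {
        "overlap": {"+": "partial", "*": "total"}.get(
            marker.strip() if marker else "", "none"),
        "begin": float(begin),
        "end": float(end),
        "speaker": speaker,
        "words": text.split(),
    }

print(parse_turn("* 994.19 994.46 B: yeah."))
```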
Long turns (e.g. the period from 972.46 to 989.56
seconds) were broken up into shorter stretches for
the convenience of the annotators and to provide
additional time references. A section of this anno-
tation which includes an example of total overlap is
represented in annotation graph form in Figure 1,
with the accompanying visualization shown in Fig-
ure 2. (We have no commitment to this particular
visualization; the graph structures can be visualized
in many ways and the perspicuity of a visualization
format will be somewhat domain-specific.)
The turns are attributed to speakers using the
speaker/ type. All of the words, punctuation and
disfluencies are given the W/ type, though we could
easily opt for a more refined version in which these
are assigned different types. The class field is not
used here. Observe that each speaker turn is a dis-
joint piece of graph structure, and that hierarchical
organisation uses the ‘chart construction’ (Gazdar
and Mellish, 1989, 179ff). Thus, we make a logical distinction between the situation where the endpoints of two pieces of annotation necessarily coincide (by sharing the same node) and the situation where endpoints merely happen to coincide (by having distinct nodes which carry the same time reference). The former possibility is required for hierarchical structure, and the latter is required for overlapping speaker turns, where words spoken by different speakers may happen to share the same boundary.
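The distinction is easy to state operationally; a small sketch (the node identifiers are ours): two arcs coincide necessarily when they share a node, and only accidentally when distinct nodes happen to carry the same time.

```python
# Distinct end nodes for two overlapping speakers' words, which
# nevertheless carry the same time reference.
tau = {10: 995.21, 20: 995.21}

def same_node(a, b):
    """Necessary coincidence: one shared node (hierarchical structure)."""
    return a == b

def same_time(a, b, tau):
    """Accidental coincidence: distinct timed nodes with equal times."""
    return a != b and a in tau and b in tau and tau[a] == tau[b]

print(same_node(10, 20), same_time(10, 20, tau))  # False True
```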
2.2 Dialogue Annotation in COCONUT
The COCONUT corpus is a set of dialogues in which
the two conversants collaborate on a task of deciding
what furniture to buy for a house (Di Eugenio et al.,
1998). The coding scheme augments the DAMSL
scheme (Allen and Core, 1997) by having some new
top-level tags and by further specifying some exist-
ing tags. An example is given in Figure 3.
The example shows five utterance pieces, identi-
fied (a-e), four produced by speaker S1 and one pro-
duced by speaker S2. The discourse annotations can
be glossed as follows: Accept - the speaker is agreeing
to a possible action or a claim; Commit - the speaker
potentially commits to intend to perform a future
specific action, and the commitment is not contin-
gent upon the assent of the addressee; Offer - the
speaker potentially commits to intend to perform a
future specific action, and the commitment is contin-
gent upon the assent of the addressee; Open-Option
- the speaker provides an option for the addressee’s
future action; Action-Directive - the utterance is
designed to cause the addressee to undertake a spe-
cific action.
Accept, Commit            S1: (a) Let's take the blue rug for 250,
                                 (b) my rug wouldn't match
Open-Option                      (c) which is yellow for 150.
Action-Directive          S2: (d) we don't have to match...
Accept(d), Offer, Commit  S1: (e) well then let's use mine for 150
Figure 3: Dialogue with COCONUT Coding Scheme
[Figure 4: Visualization of Annotation Graph for COCONUT Example. Three layers of arcs, typed Sp/, Utt/ and D/: speaker arcs S1, S2, S1; utterance arcs for fragments a-e; and discourse arcs Accept, Commit, Open-Option, Action-Directive, Accept/d, Offer and Commit. Not fully reproducible in this copy.]
In utterance (e) of Figure 3, speaker S1 simultaneously accepts the meta-action in (d) of not having matching colors and the regular action of using S1's yellow rug. The latter acceptance is not explicitly represented in the original notation, so we shall only consider the former.
In representing this dialogue structure using anno-
tation graphs, we will be concerned to achieve the
following: (i) to treat multiple annotations of the
same utterance fragment as an unordered set, rather
than a list, to simplify indexing and query; (ii) to
explicitly link speaker S1 to utterances (a-c); (iii)
to formalize the relationship between Accept(d) and
utterance (d); and (iv) to formalize the rest of the
annotation structure which is implicit in the textual
representation.
We adopt the types Sp (speaker), Utt (utterance)
and D (discourse). A more refined type system
could include other levels of representation, it could
distinguish forward versus backward communicative
function, and so on. For the names we employ:
speaker identifiers S1, S2; discourse tags Offer,
Commit, Accept, Open-Option, Action-Directive; and
orthographic strings representing the utterances.
For the classes (the third, optional field) we employ
the utterance identifiers a, b, c, d, e.
An annotation graph representation of the
COCONUT example can now be represented as in
Figure 4. The arcs are structured into three layers,
one for each type, where the types are written on
the left. If the optional class field is specified, this
information follows the name field, separated by a
slash. The Accept/d arc refers to the S2 utterance
simply by virtue of the fact that both share the
same class field.
Observe that the Commit and Accept tags for (a)
are unordered, unlike the original annotation, and
that speaker S1 is associated with all utterances (a-
c), rather than being explicitly linked to (a) and
implicitly linked to (b) and (c) as in Figure 3.
To make the referent of the Accept tag clear, we
make use of the class field. Recall that the third
component of the fielded records, the class field, per-
mits arcs to refer to each other. Both the referring
and the referenced arcs are assigned to equivalence
class d.
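Co-indexing by class needs no arc identifiers at all: resolving Accept(d) is just a matter of collecting the arcs whose class field is d. A minimal sketch (the arc encoding is ours; labels abbreviated from Figure 3):

```python
# Arcs as (n1, (type, label, cls), n2); the optional class field
# places referring and referenced arcs in the same equivalence class.
arcs = [
    (3, ("Utt", "we don't have to match...", "d"), 4),
    (4, ("Utt", "well then let's use mine for 150", "e"), 5),
    (4, ("D", "Accept", "d"), 5),
    (4, ("D", "Offer", "e"), 5),
]

def coindexed(arcs, cls):
    """All arcs belonging to equivalence class cls."""
    return [arc for arc in arcs if arc[1][2] == cls]

for n1, (typ, label, _), n2 in coindexed(arcs, "d"):
    print(typ, label)
```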
2.3 Coreference Annotation in MUC-7
The MUC-7 Message Understanding Conference
specified tasks for information extraction, named
entity and coreference. Coreferring expressions
are to be linked using SGML markup with
ID and REF tags (Hirschman and Chinchor,
1997). Figure 5 is a sample of text from
the Boston University Radio Speech Corpus
[www.ldc.upenn.edu/Catalog/LDC96S36.html],
marked up with coreference tags. (We are grateful
to Lynette Hirschman for providing us with this
annotation.)
Noun phrases participating in coreference are
wrapped with <COREF>...</COREF> tags, which can
bear the attributes ID, REF, TYPE and MIN. Each such
phrase is given a unique identifier, which may be
referenced by a REF attribute somewhere else. Our
example contains the following references: 3 → 2,
4 → 2, 6 → 5, 7 → 5, 8 → 5, 12 → 11, 15 → 13.
The TYPE attribute encodes the relationship between
the anaphor and the antecedent. Currently, only
the identity relation is marked, and so coreferences
form an equivalence class. Accordingly, our example
contains the following equivalence classes: {2, 3, 4},
{5, 6, 7, 8}, {11, 12}, {13, 15}.
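Since only the identity relation is marked, the REF links can be collapsed into these equivalence classes with a standard union-find; a sketch using the references listed above:

```python
# Each REF link (anaphor -> antecedent) merges two IDs into one class.
links = [(3, 2), (4, 2), (6, 5), (7, 5), (8, 5), (12, 11), (15, 13)]

parent = {}

def find(x):
    """Root of x's class, with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for anaphor, antecedent in links:
    union(anaphor, antecedent)

classes = {}
for x in parent:
    classes.setdefault(find(x), set()).add(x)

print(sorted(classes.values(), key=min))
```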
In our AG representation we choose the first num-
ber from each of these sets as the identifier for the
equivalence class. MUC-7 also contains a specifica-
tion for named entity annotation. Figure 7 gives an
example, to be discussed in §3.2. This uses empty tags to get around the problem of cross-cutting hierarchies. This problem does not arise in the annotation graph formalism; see (Bird and Liberman, 1999, §2.7).
[Figure 5: Coreference Annotation for BU Example. The marked-up passage reads: "This woman receives three hundred dollars a month under General Relief, plus four hundred dollars a month in A.F.D.C. benefits for her son, who is a U.S. citizen. She's among an estimated five hundred illegal aliens on General Relief out of the state's total illegal immigrant population of one hundred thousand. General Relief is for needy families and unemployable adults who don't qualify for other public assistance. Welfare Department spokeswoman Michael Reganburg says the state will save about one million dollars a year if illegal aliens are denied General Relief." The SGML coreference tags are not reproduced in this copy.]
[Figure 6: Annotation Graph for Coreference Example. Word arcs with time-stamped nodes (0.0 to 8.96 seconds), spanned by coreference arcs such as CR/woman/2, CR/son/9, CR/four hundred dollars/16, CR/benefits/16 and CR/citizen/9, where the class field carries the equivalence-class identifier.]
3 Hybrid Annotations
There are many cases where a given corpus is anno-
tated at several levels, from discourse to phonetics.
While a uniform structure is sometimes imposed,
as with Partitur (Schiel et al., 1998), established
practice and existing tools may give rise to corpora
transcribed using different formats for different lev-
els. Two examples of hybrid annotation will be dis-
cussed here: a TRAINS+DAMSL annotation, and
an eight-level annotation of the Boston University
Radio Speech Corpus.
3.1 DAMSL annotation of TRAINS
The TRAINS corpus (Heeman and Allen, 1993) is a
collection of about 100 dialogues containing a total
of 5,900 speaker turns [www.ldc.upenn.edu/Catalog
/LDC95S25.html]. Part of a transcript is shown below, where s and u designate the two speakers, <sil> denotes silent periods, and + denotes boundaries of speaker overlaps.
utt1 : s: hello can I help you
utt2 : u: yes um I have a problem here
utt3 : I need to transport one tanker of orange juice
to Avon and a boxcar of bananas to
Corning by three p.m.
utt4 : and I think it’s midnight now
utt5 : s: uh right it’s midnight
utt6 : u: okay so we need to
um get a tanker of OJ to Avon is the first
thing we need to do
utt7 : + so +
utt8 : s: + okay +
utt9 : so we have to make orange juice first
utt10 : u: mm-hm okay so we’re gonna pick up
an engine two from Elmira
utt11 : go to Corning pick up the tanker
utt12 : s: mm-hm
utt13 : u: go back to Elmira to get pick up
the orange juice
utt14 : s: alright um well we also need to
make the orange juice so we need to get
+ oranges to Elmira +
utt15 : u: + oh we need to pick up + oranges oh + okay +
utt16 : s: + yeah +
utt17 : u: alright so engine number two is going to
pick up a boxcar
Ac