arXiv:cs/9907003v1 [cs.CL] 5 Jul 1999

Annotation Graphs as a Framework for Multidimensional Linguistic Data Analysis

Steven Bird and Mark Liberman
Linguistic Data Consortium, University of Pennsylvania
3615 Market St, Philadelphia, PA 19104-2608, USA
{sb,myl}@ldc.upenn.edu

Abstract

In recent work we have presented a formal framework for linguistic annotation based on labeled acyclic digraphs. These ‘annotation graphs’ offer a simple yet powerful method for representing complex annotation structures incorporating hierarchy and overlap. Here, we motivate and illustrate our approach using discourse-level annotations of text and speech data drawn from the CALLHOME, COCONUT, MUC-7, DAMSL and TRAINS annotation schemes. With the help of domain specialists, we have constructed a hybrid multi-level annotation for a fragment of the Boston University Radio Speech Corpus which includes the following levels: segment, word, breath, ToBI, Tilt, Treebank, coreference and named entity. We show how annotation graphs can represent hybrid multi-level structures which derive from a diverse set of file formats. We also show how the approach facilitates substantive comparison of multiple annotations of a single signal based on different theoretical models. The discussion shows how annotation graphs open the door to wide-ranging integration of tools, formats and corpora.

1 Annotation Graphs

When we examine the kinds of speech transcription and annotation found in many existing ‘communities of practice’, we see commonality of abstract form along with diversity of concrete format. Our survey of annotation practice (Bird and Liberman, 1999) attests to this commonality amidst diversity. (See [www.ldc.upenn.edu/annotation] for pointers to online material.) We observed that all annotations of recorded linguistic signals require one unavoidable basic action: to associate a label, or an ordered sequence of labels, with a stretch of time in the recording(s). Such annotations also typically distinguish labels of different types, such as spoken words vs. non-speech noises. Different types of annotation often span different-sized stretches of recorded time, without necessarily forming a strict hierarchy: thus a conversation contains (perhaps overlapping) conversational turns, turns contain (perhaps interrupted) words, and words contain (perhaps shared) phonetic segments. Some types of annotation are systematically incommensurable with others: thus disfluency structures (Taylor, 1995) and focus structures (Jackendoff, 1972) often cut across conversational turns and syntactic constituents.

A minimal formalization of this basic set of practices is a directed graph with fielded records on the arcs and optional time references on the nodes. We have argued that this minimal formalization in fact has sufficient expressive capacity to encode, in a reasonably intuitive way, all of the kinds of linguistic annotations in use today. We have also argued that this minimal formalization has good properties with respect to creation, maintenance and searching of annotations. We believe that these advantages are especially strong in the case of discourse annotations, because of the prevalence of cross-cutting structures and the need to compare multiple annotations representing different purposes and perspectives.

Translation into annotation graphs does not magically create compatibility among systems whose semantics are different.
For instance, there are many different approaches to transcribing filled pauses in English – each will translate easily into an annotation graph framework, but their semantic incompatibility is not thereby erased. However, it does enable us to focus on the substantive differences without having to be concerned with diverse formats, and without being forced to recode annotations in an agreed, common format. Therefore, we focus on the structure of annotations, independently of domain-specific concerns about permissible tags, attributes, and values.

As reference corpora are published for a wider range of spoken language genres, annotation work is increasingly reusing the same primary data. For instance, the Switchboard corpus [www.ldc.upenn.edu/Catalog/LDC93S7.html] has been marked up for disfluency (Taylor, 1995). See [www.cis.upenn.edu/~treebank/switchboard-sample.html] for an example, which also includes a separate part-of-speech annotation and a Treebank-style annotation. Hirschman and Chinchor (1997) give an example of MUC-7 coreference annotation applied to an existing TRAINS dialog annotation marking speaker turns and overlap. We shall encounter a number of such cases here.

The Formalism

As we said above, we take an annotation label to be a fielded record. A minimal but sufficient set of fields would be:

type: this represents a level of an annotation, such as the segment, word and discourse levels;

label: this is a contentful property, such as a particular word, a speaker’s name, or a discourse function;

class: this is an optional field which permits the arcs of an annotation graph to be co-indexed as members of an equivalence class. (We have avoided using explicit pointers since we prefer not to associate formal identifiers to the arcs. Equivalence classes will be exemplified later.)

One might add further fields for holding comments, annotator id, update history, and so on.

Let T be a set of types, L be a set of labels, and C be a set of classes. Let R = {〈t, l, c〉 | t ∈ T, l ∈ L, c ∈ C}, the set of records over T, L, C. Let N be a set of nodes. Annotation graphs (AGs) are now defined as follows:

Definition 1. An annotation graph G over R, N is a set of triples having the form 〈n1, r, n2〉, r ∈ R, n1, n2 ∈ N, which satisfies the following conditions:

1. 〈N, {〈n1, n2〉 | 〈n1, r, n2〉 ∈ A}〉 is a labelled acyclic digraph.

2. τ : N ⇀ ℜ is an order-preserving map assigning times to (some of) the nodes.

For detailed discussion of these structures, see (Bird and Liberman, 1999). Here we present a fragment (taken from Figure 8 below) to illustrate the definition. For convenience the components of the fielded records which decorate the arcs are separated using the slash symbol. The example contains two word arcs, and a discourse tag encoding ‘influence on speaker’. No class fields are used. Not all nodes have a time reference.

[Figure: three nodes 1, 2, 3, with node 1 at time 52.46 and node 3 at time 53.14; word arcs W/oh/ from 1 to 2 and W/okay/ from 2 to 3, and a discourse arc D/IOS:Commit/ from 1 to 3.]

The minimal annotation graph for this structure is as follows:

T = {W, D}
L = {oh, okay, IOS:Commit}
C = ∅
N = {1, 2, 3}
τ = {〈1, 52.46〉, 〈3, 53.14〉}
A = {〈1, W/oh/, 2〉, 〈2, W/okay/, 3〉, 〈1, D/IOS:Commit/, 3〉}

XML is a natural ‘surface representation’ for annotation graphs and could provide the primary exchange format. A particularly simple XML encoding of the above structure is shown below; one might choose to use a richer XML encoding in practice.
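One possible encoding is sketched here in Python using the standard xml.etree.ElementTree module; the element and attribute names it uses (annotation, node, arc, id, time, type, label, class) are illustrative assumptions rather than the encoding actually proposed in the paper, which may differ.

# A sketch of one possible flat XML surface form for the graph above.
# Element and attribute names are illustrative assumptions only.
import xml.etree.ElementTree as ET

tau = {1: 52.46, 3: 53.14}                    # partial map from nodes to times
arcs = [
    (1, ("W", "oh", None), 2),                # <n1, type/label/class, n2>
    (2, ("W", "okay", None), 3),
    (1, ("D", "IOS:Commit", None), 3),
]

root = ET.Element("annotation")
for node in sorted({n for n1, _, n2 in arcs for n in (n1, n2)}):
    attrs = {"id": str(node)}
    if node in tau:
        attrs["time"] = str(tau[node])        # only some nodes carry a time
    ET.SubElement(root, "node", attrs)
for n1, (typ, label, cls), n2 in arcs:
    attrs = {"from": str(n1), "to": str(n2), "type": typ, "label": label}
    if cls is not None:
        attrs["class"] = cls
    ET.SubElement(root, "arc", attrs)

print(ET.tostring(root, encoding="unicode"))

Running the sketch prints a single annotation element containing one node element per node (with a time attribute only where τ is defined) and one arc element per fielded arc.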
2 AGs and Discourse Markup

2.1 LDC Telephone Speech Transcripts

The LDC-published CALLHOME corpora include digital audio, transcripts and lexicons for telephone conversations in several languages, and are designed to support research on speech recognition [www.ldc.upenn.edu/Catalog/LDC96S46.html]. The transcripts exhibit abundant overlap between speaker turns. What follows is a typical fragment of an annotation. Each stretch of speech consists of a begin time, an end time, a speaker designation, and the transcription for the cited stretch of time. We have augmented the annotation with + and * to indicate partial and total overlap (respectively) with the previous speaker turn.

962.68 970.21 A: He was changing projects every couple of weeks and he said he couldn’t keep on top of it. He couldn’t learn the whole new area
* 968.71 969.00 B: %mm.
970.35 971.94 A: that fast each time.
* 971.23 971.42 B: %mm.
972.46 979.47 A: %um, and he says he went in and had some tests, and he was diagnosed as having attention deficit disorder. Which
980.18 989.56 A: you know, given how he’s how far he’s gotten, you know, he got his degree at &Tufts and all, I found that surprising that for the first time as an adult they’re diagnosing this. %um
+ 989.42 991.86 B: %mm. I wonder about it. But anyway.
+ 991.75 994.65 A: yeah, but that’s what he said. And %um
* 994.19 994.46 B: yeah.
995.21 996.59 A: He %um
+ 996.51 997.61 B: Whatever’s helpful.
+ 997.40 1002.55 A: Right. So he found this new job as a financial consultant and seems to be happy with that.
1003.14 1003.45 B: Good.

Long turns (e.g. the period from 972.46 to 989.56 seconds) were broken up into shorter stretches for the convenience of the annotators and to provide additional time references. A section of this annotation which includes an example of total overlap is represented in annotation graph form in Figure 1, with the accompanying visualization shown in Figure 2. (We have no commitment to this particular visualization; the graph structures can be visualized in many ways and the perspicuity of a visualization format will be somewhat domain-specific.)

[Figure 1: Graph Structure for LDC Telephone Speech Example — speaker/ and W/ arcs for the overlapping turns between 991.75 and 1002.55 seconds.]

[Figure 2: Visualization for LDC Telephone Speech Example.]

The turns are attributed to speakers using the speaker/ type. All of the words, punctuation and disfluencies are given the W/ type, though we could easily opt for a more refined version in which these are assigned different types. The class field is not used here. Observe that each speaker turn is a disjoint piece of graph structure, and that hierarchical organisation uses the ‘chart construction’ (Gazdar and Mellish, 1989, 179ff). Thus, we make a logical distinction between the situation where the endpoints of two pieces of annotation necessarily coincide (by sharing the same node) and the situation where endpoints happen to coincide (by having distinct nodes which contain the same time reference). The former possibility is required for hierarchical structure, and the latter possibility is required for overlapping speaker turns, where words spoken by different speakers may happen to share the same boundary.
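As a rough illustration of this distinction (ours, not Figure 1 itself), the two overlapping turns “He %um” and “Whatever’s helpful.” from the fragment above can be written as arcs of the kind introduced in Definition 1; the node identifiers are arbitrary, and only the partial time map relates the two turns to each other.

# Rough sketch (not Figure 1) of two overlapping CALLHOME turns as
# disjoint pieces of graph structure. Node ids are arbitrary; the
# partial map tau supplies time references for the turn boundaries.
tau = {
    1: 995.21, 3: 996.59,     # speaker A's turn: "He %um"
    4: 996.51, 6: 997.61,     # speaker B's overlapping turn
}
arcs = [
    (1, ("speaker", "A"), 3),
    (1, ("W", "He"), 2),
    (2, ("W", "%um"), 3),
    (4, ("speaker", "B"), 6),
    (4, ("W", "Whatever's"), 5),
    (5, ("W", "helpful."), 6),
]
# The two turns overlap in time (996.51 < 996.59) but share no nodes;
# annotations that had to end together would share a node instead.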
2.2 Dialogue Annotation in COCONUT

The COCONUT corpus is a set of dialogues in which the two conversants collaborate on a task of deciding what furniture to buy for a house (Di Eugenio et al., 1998). The coding scheme augments the DAMSL scheme (Allen and Core, 1997) by having some new top-level tags and by further specifying some existing tags. An example is given in Figure 3.

Accept, Commit      S1: (a) Let’s take the blue rug for 250,
                        (b) my rug wouldn’t match
Open-Option             (c) which is yellow for 150.
Action-Directive    S2: (d) we don’t have to match...
Accept(d), Offer,   S1: (e) well then let’s use mine for 150
Commit

Figure 3: Dialogue with COCONUT Coding Scheme

The example shows five utterance pieces, identified (a-e), four produced by speaker S1 and one produced by speaker S2. The discourse annotations can be glossed as follows:

Accept - the speaker is agreeing to a possible action or a claim;
Commit - the speaker potentially commits to intend to perform a future specific action, and the commitment is not contingent upon the assent of the addressee;
Offer - the speaker potentially commits to intend to perform a future specific action, and the commitment is contingent upon the assent of the addressee;
Open-Option - the speaker provides an option for the addressee’s future action;
Action-Directive - the utterance is designed to cause the addressee to undertake a specific action.

In utterance (e) of Figure 3, speaker S1 simultaneously accepts the meta-action in (d) of not having matching colors, and the regular action of using S1’s yellow rug. The latter acceptance is not explicitly represented in the original notation, so we shall only consider the former.

In representing this dialogue structure using annotation graphs, we will be concerned to achieve the following: (i) to treat multiple annotations of the same utterance fragment as an unordered set, rather than a list, to simplify indexing and query; (ii) to explicitly link speaker S1 to utterances (a-c); (iii) to formalize the relationship between Accept(d) and utterance (d); and (iv) to formalize the rest of the annotation structure which is implicit in the textual representation.

We adopt the types Sp (speaker), Utt (utterance) and D (discourse). A more refined type system could include other levels of representation, distinguish forward versus backward communicative function, and so on. For the names we employ: speaker identifiers S1, S2; discourse tags Offer, Commit, Accept, Open-Option, Action-Directive; and orthographic strings representing the utterances. For the classes (the third, optional field) we employ the utterance identifiers a, b, c, d, e. An annotation graph representation of the COCONUT example can now be represented as in Figure 4. The arcs are structured into three layers, one for each type, where the types are written on the left. If the optional class field is specified, this information follows the name field, separated by a slash.

[Figure 4: Visualization of Annotation Graph for COCONUT Example — three layers of arcs typed Sp/, Utt/ and D/, with the utterance identifiers a-e as classes.]
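To make the layered structure of Figure 4 concrete, the following is a rough sketch (ours, not the paper’s figure) of part of it as 〈n1, type/label/class, n2〉 triples; utterances (b) and (c) and the Open-Option arc are omitted, and the node numbers 0-5 are arbitrary.

# Rough sketch of part of the Figure 4 structure as <n1, type/label/class, n2>
# triples; node numbers 0-5 are arbitrary and utterances (b), (c) are omitted.
arcs = [
    (0, ("Sp", "S1", None), 3),
    (3, ("Sp", "S2", None), 4),
    (4, ("Sp", "S1", None), 5),
    (0, ("Utt", "Let's take the blue rug for 250,", "a"), 1),
    (3, ("Utt", "we don't have to match...", "d"), 4),
    (4, ("Utt", "well then let's use mine for 150", "e"), 5),
    (0, ("D", "Accept", None), 1),          # unordered with the Commit below
    (0, ("D", "Commit", None), 1),
    (3, ("D", "Action-Directive", None), 4),
    (4, ("D", "Accept", "d"), 5),           # co-indexed with utterance (d)
    (4, ("D", "Offer", None), 5),
    (4, ("D", "Commit", None), 5),
]

def coindexed(arcs, cls):
    # All arcs whose class field places them in equivalence class `cls`.
    return [a for a in arcs if a[1][2] == cls]

print(coindexed(arcs, "d"))   # the Accept arc and the S2 utterance arc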
The Accept/d arc refers to the S2 utterance simply by virtue of the fact that both share the same class field. Observe that the Commit and Accept tags for (a) are unordered, unlike the original annotation, and that speaker S1 is associated with all utterances (a-c), rather than being explicitly linked to (a) and implicitly linked to (b) and (c) as in Figure 3. To make the referent of the Accept tag clear, we make use of the class field. Recall that the third component of the fielded records, the class field, permits arcs to refer to each other. Both the referring and the referenced arcs are assigned to equivalence class d.

2.3 Coreference Annotation in MUC-7

The MUC-7 Message Understanding Conference specified tasks for information extraction, named entity and coreference. Coreferring expressions are to be linked using SGML markup with ID and REF tags (Hirschman and Chinchor, 1997). Figure 5 is a sample of text from the Boston University Radio Speech Corpus [www.ldc.upenn.edu/Catalog/LDC96S36.html], marked up with coreference tags. (We are grateful to Lynette Hirschman for providing us with this annotation.)

[Figure 5: Coreference Annotation for BU Example (the coreference markup itself is not reproduced in this extract). The underlying text reads: “This woman receives three hundred dollars a month under General Relief, plus four hundred dollars a month in A.F.D.C. benefits for her son, who is a U.S. citizen. She’s among an estimated five hundred illegal aliens on General Relief out of the state’s total illegal immigrant population of one hundred thousand. General Relief is for needy families and unemployable adults who don’t qualify for other public assistance. Welfare Department spokeswoman Michael Reganburg says the state will save about one million dollars a year if illegal aliens are denied General Relief.”]

Noun phrases participating in coreference are wrapped with <COREF>...</COREF> tags, which can bear the attributes ID, REF, TYPE and MIN. Each such phrase is given a unique identifier, which may be referenced by a REF attribute somewhere else. Our example contains the following references: 3 → 2, 4 → 2, 6 → 5, 7 → 5, 8 → 5, 12 → 11, 15 → 13. The TYPE attribute encodes the relationship between the anaphor and the antecedent. Currently, only the identity relation is marked, and so coreferences form an equivalence class. Accordingly, our example contains the following equivalence classes: {2, 3, 4}, {5, 6, 7, 8}, {11, 12}, {13, 15}. In our AG representation we choose the first number from each of these sets as the identifier for the equivalence class.

[Figure 6: Annotation Graph for Coreference Example — time-aligned word arcs over the BU text, with CR/ arcs whose class fields encode the coreference equivalence classes.]

MUC-7 also contains a specification for named entity annotation. Figure 7 gives an example, to be discussed in §3.2. This uses empty tags to get around the problem of cross-cutting hierarchies. This problem does not arise in the annotation graph formalism; see (Bird and Liberman, 1999, 2.7).
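The grouping of the ID → REF links above into equivalence classes, and the choice of the smallest ID as the class identifier, can be made concrete with a short sketch (ours, not part of the MUC-7 specification):

# Sketch: group MUC-7 style ID -> REF links into equivalence classes and
# pick the smallest ID in each class as its identifier (as described above).
from collections import defaultdict

refs = [(3, 2), (4, 2), (6, 5), (7, 5), (8, 5), (12, 11), (15, 13)]

parent = {}
def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]      # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

for anaphor, antecedent in refs:
    union(anaphor, antecedent)

classes = defaultdict(set)
for x in list(parent):
    classes[find(x)].add(x)

# Re-key each class by its smallest member, the identifier used in the AG.
print({min(members): sorted(members) for members in classes.values()})
# -> {2: [2, 3, 4], 5: [5, 6, 7, 8], 11: [11, 12], 13: [13, 15]}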
3 Hybrid Annotations

There are many cases where a given corpus is annotated at several levels, from discourse to phonetics. While a uniform structure is sometimes imposed, as with Partitur (Schiel et al., 1998), established practice and existing tools may give rise to corpora transcribed using different formats for different levels.

Two examples of hybrid annotation will be discussed here: a TRAINS+DAMSL annotation, and an eight-level annotation of the Boston University Radio Speech Corpus.

3.1 DAMSL annotation of TRAINS

The TRAINS corpus (Heeman and Allen, 1993) is a collection of about 100 dialogues containing a total of 5,900 speaker turns [www.ldc.upenn.edu/Catalog/LDC95S25.html]. Part of a transcript is shown below, where s and u designate the two speakers, <sil> denotes silent periods, and + denotes boundaries of speaker overlaps.

utt1 : s: hello can I help you
utt2 : u: yes um I have a problem here
utt3 : I need to transport one tanker of orange juice to Avon and a boxcar of bananas to Corning by three p.m.
utt4 : and I think it’s midnight now
utt5 : s: uh right it’s midnight
utt6 : u: okay so we need to um get a tanker of OJ to Avon is the first thing we need to do
utt7 : + so +
utt8 : s: + okay +
utt9 : so we have to make orange juice first
utt10 : u: mm-hm okay so we’re gonna pick up an engine two from Elmira
utt11 : go to Corning pick up the tanker
utt12 : s: mm-hm
utt13 : u: go back to Elmira to get pick up the orange juice
utt14 : s: alright um well we also need to make the orange juice so we need to get + oranges to Elmira +
utt15 : u: + oh we need to pick up + oranges oh + okay +
utt16 : s: + yeah +
utt17 : u: alright so engine number two is going to pick up a boxcar Ac
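As a rough sketch (ours, not the paper’s actual conversion) of how line-oriented turns like these can be mapped into annotation graph arcs, the following gives each utterance a speaker/ arc spanning one W/ arc per token, keeping each turn a disjoint piece of graph as in Section 2.1; the use of the utterance identifier as the class field is an assumption made here purely for illustration.

# Rough sketch of turning two TRAINS-style utterance lines into AG arcs:
# one speaker/ arc spanning each utterance plus one W/ arc per token.
def utterance_to_arcs(utt_id, speaker, text, start_node):
    # The utterance identifier is used here as the (optional) class field.
    tokens = text.split()
    end_node = start_node + len(tokens)
    arcs = [(start_node, ("speaker", speaker, utt_id), end_node)]
    arcs += [(start_node + i, ("W", tok, utt_id), start_node + i + 1)
             for i, tok in enumerate(tokens)]
    return arcs, end_node

arcs, node = [], 0
for utt_id, speaker, text in [
    ("utt1", "s", "hello can I help you"),
    ("utt2", "u", "yes um I have a problem here"),
]:
    turn_arcs, end = utterance_to_arcs(utt_id, speaker, text, node)
    arcs.extend(turn_arcs)
    node = end + 1          # keep each turn a disjoint piece of graph (cf. 2.1)

for arc in arcs:
    print(arc)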