|
Due: September 28
Theme: Implementation of referential expression interpretation
Background
This assignment is designed to get you thinking about how to implement
some of the discourse processes we've read about and discussed in class
using a programming language of your choice. The point of this assignment
is to have you explore what demands are placed on work with discourse by
a computational environment. The particular discourse processing
you will implement has to do with the interpretation of referential expressions.
Please note that you are welcome to complete this assignment in pairs,
and we suggest that those of you who are not strong programmers choose
this option!
Procedure:
You will implement the basic technique for identifying the antecedents
of pronouns as described in section 14.2 - "A Simple Model of Anaphora
Based on History Lists" (Allen 1995). The primary structure is a
History
List that contains all previously evoked discourse entities, with the
most recent one listed first. For this exercise we will only be concerned
with Noun Phrase entities. We will further restrict the problem by
ignoring semantic forms and working without a knowledge base.
This means that when an NP is processed, you will either create for it
a new referent on the fly, giving it a name such as REF1 (as mentioned
on page 431, this will typically happen with indefinite NPs), or you will
find a link to a previous NP in the History List (definite NPs including
pronouns usually predict such a link). So in a sense, the the only
knowledge
you have is constructed as you go.
The algorithm should implement the recency constraint, which
states that the antecendent should be the most recently mentioned object
that satisfies all of a number of constraints. It is up to you how
many constraints you implement (see 14.2), but minimally the NPs should
be linked based on agreement featuers (number, person and gender).
Since we are not using a knowledge base, you will not be able to link on
semantic features. Thus, your program should be able to link "a client"
and "he" in the following sentences, but not "it" and "the flight" (since
without some semantic analysis, "it" could easilty refer to "Chicago" as
well as "the flight").
A client took the flight to Chicago yesterday.
He almost missed it.
Because of the restrictions we've imposed on the problem, we don't expect
your program to produce the correct result in every case. You should,
however, be prepared to explain why your program fails in some cases.
Input
You will be given a set of input utterances that have already been
parsed or you can use any set of utterance from the data you collected
that are interesting for this purpose. If you use your own data,
simply parse it by hand before you feed it to your system. A parsing
of the previous example might look something like the following in Allen's
representational scheme:
( (S SUBJ (NP1 DET a
HEAD client
AGR (3 SING UNSPEC))
MAIN-V take
TENSE past
OBJ (NP2 DET the
HEAD flight
AGR (3 SING NEUT)
MODS (PP PREP to
POBJ (NP3 NAME chicago)))
MODS (ADV yesterday))
(S SUBJ (NP4 PRO he
AGR (3 SING MASC))
MAIN-V miss
TENSE past
OBJ (NP5 PRO it
AGR (3 SING NEUT))))
To simplify the problem, you can assume that the input to your program
consists of a list of sentences where each sentence is a list of its Noun
Phrases. Each sentence may have one SUBJ NP, one OBJ NPs and
other NPs embedded in modifiers such as prepositional phrases. The
input for the example above would have the following form:
( (CLAUSE :SUBJ (NP :ID 1 :DET a
:HEAD client
:AGR (AGR :PERSON 3 :NUMBER SING :GENDER UNSPEC))
:OBJ (NP :ID 2 :DET the
:HEAD flight
:AGR (AGR :PERSON 3 :NUMBER SING :GENDER NEUT)
:MODS (NP :ID 3 :NAME chicago
:AGR (AGR :PERSON 3 :NUMBER SING :GENDER NEUT)))))
(CLAUSE :SUBJ (NP :ID 4 :PRO he
:AGR (AGR :PERSON 3 :NUMBER SING :GENDER MASC))
:OBJ (NP :ID 5 :PRO it
:AGR (AGR :PERSON 3 :NUMBER SING :GENDER NEUT))))
For each NP, the following slots may be included:
-
:NAME, when the object has an identifying name (e.g. Jack)
-
:PRO, when the object is referenced by a pronoun
-
:DET, from which definiteness can be derived
-
:HEAD, a noun
-
:AGR, the agreement feature (always given)
-
:MODS
The AGR slots include person (3,2 or 1), number (SING or PLUR) and gender
(MASC, FEM, NEUT or UNSPEC). You can assume that the NEUT gender is always
specified for NPs representing non-human entities. MASC and FEM are specified
for NPs representing human entities when the gender is known. UNSPEC is
given for NPs representing humans when the gender is not known (e.g. client).
The MODS slots may contain any number of OBJ or ADJ slots. An OBJ slot
contains an NP that represents the object of a preposition. An ADJ slot
contains an adjectival modifier.
Input Files
Example 1
Example 2
Example 3
Output
The OUTPUT of your program should consist of two lists. The first should
be a list of the form ((NP1 REF1) (NP2 REF2) ... (NPn REFn)), where REFx
is a unique identifier for each entity mentioned in the discourse. The
second list should contain a description of each referent. You may wish
to simply use the most complete referring expression for each referent.
So, the output for the example above might be:
((NP1 ref1) (NP2 ref2) (NP3 ref3) (NP4 ref1) (NP5 unknown))
((ref1 DET a HEAD client)
(ref2 DET the HEAD flight)
(ref3 NAME chicago))
Note that these lists can be printed in any format (i.e., without parentheses)
as long as the above-specified information is provided.
Discussion:
For each case where your program fails to produce the proper result,
explain why your program fails and what kinds of information your program
would need to succeed. Finally, briefly explain (i.e. these can be embedded
comments) the algorithmic and representational choices you made (how items
are represented on the history list, how you choose among competing antecedents,
etc.). You should hand in a hardcopy of your program (written in the language
you choose) along with at least three sample inputs (in parsed form) and
outputs.
Programming Notes & Hints
-
You can use any language you like. LISP yields by far the easiest solution.
Note, however, that if you choose a non-symbolic language (e.g., C) you
will spend most of your effort in creating the appropriate data structures
and parsing the input files.
-
If you choose to do the assignment in Java, we have created some simple
data structures and a file parser for you.
-
LObject.java - represents an object in the input
data file, where an object can be a CLAUSE, an NP, etc. Each object has
a type (String type) and a list of properties (Hashtable properties), indexed
by the property name (of the form ":property"). Each property is either
a String, an Integer, an LObject, or a list of these values, represented
as a Vector of Strings, Integers and/or LObjects.
-
LParser.java - parses a data file and returns
a Vector of top-level LObjects found in the file. The top-level call is
LParser.parse(String filename). You must handle all exceptions. There is
also a pretty-print function and a sample "main" showing how these are
called.
-
Note that the order of the object constituents is lost following the parse,
however the NP IDs retain the order they were found in the text.
-
Feel free to modify the parser as you need, or write your own.
-
Hint: Your first task should be to form a list of the NPs in the order
they appear in the text.
|