On Techniques and Strategies of Text Matching

The Victorian Laptop poses an important question about the nature of the interaction between a user and a text retrieval system. Unlike traditional information retrieval, where the user's need is specifically defined, the Victorian Laptop's text matching system attempts to go beyond the traditional measures of recall and precision to produce what is "relevant."

The difficulty of this task is evident from the subjectivity of the term "relevant." What exactly is the user looking for, if anything? What will stimulate him or her to write more? Such questions can only be answered through experimentation. Furthermore, the texts being matched are narratives, which differ significantly in genre from the texts in conventional story matching problems: most text matching systems handle news text.

The Victorian Laptop currently extracts three types of information from user texts.

  • Proper Nouns: By matching on proper nouns extracted from the corpus of stories, the general subjects of user text and reference text can most often be mapped together. Proper nouns consisting of more than one word (such as "Boston Common") are scored more highly than those with only one word (such as "Boston"). The appearance of proper nouns, most often as names of locations and people, is frequently enough to establish a general link between user input and reference text.
  • Keywords: Using third-party text analysis tools, keywords that carry the meaning of the text are extracted. The keywords are collated into groups based on topic (the groups were determined by experimentation). The matcher assigns a topic to the user text based on the topic keywords found in it, then returns as a match the story from the corpus that falls under the same topic and contains the most keywords found in the user story.
  • Dates: Another strategy employed by the Victorian Laptop is text matching based on dates. Experiments show that a temporal link between user and text is often very important.
The Victorian Laptop's text matching and retrieval system uses a blend of the above three techniques to produce what is most "relevant."
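
The three techniques above can be illustrated with a minimal Python sketch. This is not the Victorian Laptop's actual code. The topic groups, the `Text` record, the word-count weighting for multi-word proper nouns, the shared-year date score, and the equal-weight blend are all assumptions made for illustration, since the page does not say how the scores are computed or combined.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical topic groups; the real groups were determined by experimentation.
TOPICS = {
    "travel": {"train", "journey", "ship", "voyage"},
    "home": {"kitchen", "garden", "family", "supper"},
}

@dataclass
class Text:
    title: str
    proper_nouns: set            # already extracted, e.g. {"Boston Common"}
    keywords: set                # already extracted by some analysis tool
    years: set                   # years mentioned in the text
    topic: Optional[str] = None  # set for corpus stories only

def proper_noun_score(user, story):
    # Assumed weighting: a shared name counts once per word, so
    # "Boston Common" (2) outscores "Boston" (1).
    return sum(len(name.split()) for name in user.proper_nouns & story.proper_nouns)

def assign_topic(keywords):
    # Pick the topic whose keyword group overlaps the text the most.
    hits = {topic: len(words & keywords) for topic, words in TOPICS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] else None

def keyword_score(user, story):
    # Only stories filed under the user's topic can score on keywords.
    topic = assign_topic(user.keywords)
    if topic is None or story.topic != topic:
        return 0
    return len(user.keywords & story.keywords)

def date_score(user, story):
    # Assumed temporal link: number of years both texts mention.
    return len(user.years & story.years)

def best_match(user, stories):
    # Assumed blend: an unweighted sum of the three scores.
    def total(story):
        return (proper_noun_score(user, story)
                + keyword_score(user, story)
                + date_score(user, story))
    return max(stories, key=total)

user = Text("user entry", {"Boston Common", "Boston"}, {"train", "journey"}, {1890})
stories = [
    Text("A Walk on the Common", {"Boston Common", "Boston"},
         {"garden", "family"}, {1890}, topic="home"),
    Text("The Night Train", {"Boston"},
         {"train", "journey", "ship"}, {1875}, topic="travel"),
]
print(best_match(user, stories).title)
```

With these made-up inputs, the first story wins on proper nouns and dates even though the second shares the user's topic. In a real system the relative weights of the three scores would need tuning by experiment.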



Page maintained by Petra Chong.