
Delmonte R. Computational Linguistic Text Processing: Lexicon, Grammar, Parsing, and Anaphora Resolution

  • djvu file, 3.27 MB
Nova Science Publishers, 2008, 401 pp.
GETARUNS (General Text And Reference UNderstander System) is a system for text understanding. The aim of the system is to build a model world in which the relations and entities introduced and referred to in the text are asserted, searched for and ranked according to their relevance. In addition, the system is able to generate text, in the form of answers to queries and in the form of short paraphrases or summaries of the input text(s). In some cases, it can also generate stories and Questions and Answers randomly from a plan and a Discourse Model.
GETARUNS is a general multilingual text and reference understander which represents a linguistically based approach to text understanding and embodies a number of general strategies on how to implement linguistic principles in a running system. The system addresses one main issue: the need to restrict access to extralinguistic knowledge of the world by contextual reasoning, i.e. reasoning from linguistically available cues.
Another important issue addressed by the system is multilinguality. In GETARUNS the user may switch from one language to another simply by unloading the current lexicon and uploading the lexicon for the new language; at present Italian, German and English are implemented. Multilinguality has been implemented to support the theoretical linguistic subdivision of Universal Grammar into a Core and a Peripheral set of rules. The system is organized around another fundamental assumption: the architecture of such a system must be modular, requiring a pipeline of sequentially feeding processes, each module providing one chunk of knowledge, with backtracking barred at the intermodular level and allowed only within each single module. The architecture of the system is organized in such a way as to allow feedback into the parser from Anaphoric Binding; however, once pronominals have finally been bound or left free, no further changes are allowed on the f-structure output of the parser.
Thus we can think of the system as being subdivided into two main meta-modules or levels: Low Level System, containing all modules that operate at Sentence Level; High Level System, containing all the modules that operate at Discourse and Text Level by updating the Discourse Model.
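The two-level pipeline described above can be sketched schematically as follows. This is a minimal illustrative sketch, not GETARUNS' actual code (the system is not open in this form): all class, method and lexicon names here are hypothetical stand-ins for the modular design the preface describes - a swappable lexicon for multilinguality, sentence-level modules feeding forward, and a discourse-level module updating the Discourse Model.

```python
# Hypothetical sketch of the modular two-level architecture described in the text.
# Names and data structures are illustrative assumptions, not GETARUNS' API.

# Multilinguality: one lexicon per language, swapped in and out as a whole.
LEXICONS = {
    "english": {"dog": "noun", "runs": "verb"},
    "italian": {"cane": "noun", "corre": "verb"},
}

class Pipeline:
    def __init__(self, language):
        self.lexicon = LEXICONS[language]

    def switch_language(self, language):
        # "Unload the current lexicon and upload the lexicon for the new language."
        self.lexicon = LEXICONS[language]

    # --- Low Level System: sentence-level modules, strictly feed-forward ---
    def tag(self, tokens):
        return [(t, self.lexicon.get(t, "unknown")) for t in tokens]

    def parse(self, tagged):
        # Stand-in for f-structure building; no backtracking across modules.
        return {"f_structure": tagged}

    # --- High Level System: discourse-level module updates the Discourse Model ---
    def update_discourse_model(self, f_structure, model):
        model.append(f_structure)
        return model

pipe = Pipeline("english")
model = []
f = pipe.parse(pipe.tag("dog runs".split()))
model = pipe.update_discourse_model(f, model)
```

The key design point the sketch tries to capture is that each module hands a finished chunk of knowledge to the next: once the sentence-level output is fixed, only the Discourse Model changes.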
The books are organized as an experimental exercise: they contain both the theoretical background and the output of the system, GETARUNS, which enacts and applies the theory. The architecture of the system is strictly related to the structure of the books. To better describe it, we decided to dedicate one book to the lower-level part of the system and another book to the higher-level system components. In this way, each component or module is presented in at least one chapter of the book.
Thus, we can think of the book as being organized around two scientifically distinct but in fact strictly interrelated fields of research:
- sentence level linguistic phenomena
- text or discourse level linguistic phenomena
the former to be described by means of grammatical theories, the latter requiring the intervention of extralinguistic knowledge, i.e. knowledge of the world. This distinction is usually drawn for scientific purposes and is obviously an artificial one: the sentence is at the same time the smallest domain to which rigorous linguistic analysis can hopefully be applied, and also the basic complete semantic unit whereby meaning can be conveyed, depending on the text/discourse context. We are aware that this subdivision is mainly worked out for scientific reasons and does not really imply that such a neat division of tasks can actually be envisaged in real text processing. As will be discussed in detail in the books, semantic issues need to be tackled from the very beginning. This notwithstanding, the separation has its own "raison d'être" and we will try to validate it in the books.
Book 1 - the current book - addresses sentence grammar, or what is usually referred to as such by theoretical linguists. It does so by dividing up - somewhat ideally and sometimes arbitrarily - what must or needs to be computed at sentence level from what need not or cannot be computed at that level, and consequently belongs to discourse grammar. In that sense, the subdivision is not a totally arbitrary one, even though overlaps are normal and will be discussed where needed.
The book also indirectly draws another (un)intended subdivision: the one between syntax and semantics. Again, it would be impossible not to deal with semantically related issues when talking about syntax or the lexicon. However, Semantics with an uppercase S is only treated in Book 2 - already published - where discourse- and text-level grammar is tackled.
This book, then, deals with the lexicon, morphology, tagging, treebanks, parsing, quantifiers and anaphoric or pronominal binding - in other words, everything that concerns the level of sentence grammar in a computational environment.
An important contribution the books make is the argument against the simplistic idea that texts are a "bag of words", or that they can be processed satisfactorily using treebank-derived statistical approaches. This is not to say that treebanks are useless as sources of grammatical information; rather, as will be discussed in a chapter of the book, it does not follow that all the grammar there is to learn is contained in a single treebank.
Conversely, it cannot be proven that statistics and "bag of words" approaches are useless for NLP tasks. On the contrary, in some cases they constitute the only appropriate and sensible approach - and more than one chapter will discuss the pros and cons at length. The question is simply wrongly posed: statistics cannot be treated as a panacea for all the problems raised by half a century of linguistic studies and represented by a(ny) text.
Sentence-level parsing covers, in our perspective, all the issues tackled in this book. In this sense, it speaks against those approaches - the majority in present-day computational linguistics - that reduce sentence-level parsing to a phrase-structure parenthesized representation problem, with word tags and constituency labels in the style proposed and made into a de facto standard by the Penn Treebank initiative. Nor can it be represented by Dependency Structure, with or without grammatical relation labels.
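For readers unfamiliar with the two representations at issue, the contrast can be shown on a toy sentence. This is an illustrative sketch only: the tags and relation labels below follow common Penn-Treebank and dependency-grammar conventions, not the output of any particular parser, and certainly not GETARUNS' own richer f-structure representation.

```python
# Toy example: the same sentence in the two representations discussed above.
sentence = "John sees Mary"

# Phrase-structure (constituency) bracketing, Penn-Treebank style:
constituency = "(S (NP (NNP John)) (VP (VBZ sees) (NP (NNP Mary))))"

# Dependency structure with grammatical-relation labels,
# as (head, relation, dependent) triples:
dependencies = [
    ("sees", "nsubj", "John"),  # John is the subject of "sees"
    ("sees", "obj", "Mary"),    # Mary is its object
]

# A dependency structure maps each dependent to exactly one head:
heads = {dep: head for head, rel, dep in dependencies}
```

The author's point is precisely that neither of these shallow encodings, by itself, exhausts what sentence-level grammar has to account for.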
Sentence-level Grammar - as posited in linguistic theories - takes care of all the grammatical and linguistic relations that belong to that level. Knowledge of the world and semantic disambiguation do not interfere with the rules of sentence grammar and can be thought of as a separate level of computation, provided that the lexicon is structured in such a way as to allow such a subdivision of tasks.
Inducing Fully Specified Lexical Representations
Treebanking: From Phrase Structure to Dependency Representation
Parsing 1: The Partial Parser for Arguments and Adjuncts
Parsing 2: Deep Linguistically-Based Parsing
Parsing 3: Deep Parser between Grammar and Structure
Anaphoric Binding
Quantifiers and Anaphora
Discourse Anaphora Resolution
Linguistic Information Extraction for Text Correction and Summarization