Springer, 2014. — 477 p.
Modern communication technologies, such as television and the Internet, have made massive amounts of information readily available in many languages. More such data is generated in real time, 24 hours a day and 7 days a week, aided by social networking sites such as Facebook and Twitter. This information explosion takes the form of multilingual audio, video, and Web content, and processing it demands effective, scalable, multilingual media processing, monitoring, indexing, and search solutions. Natural Language Processing (NLP) technologies have long been used to address this task, and in the last two decades NLP researchers have developed exciting algorithms for processing large amounts of text in many different languages. Today the English language holds the lion's share of both available resources and developed NLP technical solutions.

In this book, we address another group of interesting and challenging languages for NLP research: the Semitic languages. The Semitic languages have existed in written form since a very early date, with texts written in a script adapted from Sumerian cuneiform. Most scripts used to write Semitic languages are abjads, a type of alphabetic script that omits some or all of the vowels. This is feasible for these languages because the consonants are the primary carriers of meaning. Semitic languages have interesting morphology, in which word roots are not themselves syllables or words, but isolated sets of consonants (usually three). Words are composed from a root by inserting vowels between the root consonants (often with prefixes and suffixes added as well). For example, in Arabic, the root meaning write has the form k-t-b.
From this root, words are formed by filling in the vowels, e.g., kitAb book, kutub books, kAtib writer, kuttAb writers, kataba he wrote, yaktubu he writes, etc.

Semitic languages, as stated in Wikipedia, are spoken by more than 270 million people. The most widely spoken Semitic languages today are Arabic (206 million native speakers), Amharic (27 million), Hebrew (7 million), Tigrinya (6.7 million), Syriac (1 million), and Maltese (419 thousand). NLP research applied to Semitic languages has been the focus of attention of many researchers for more than a decade, and several technical solutions have been proposed. This is especially true of Arabic NLP, where a very large body of research has been accomplished; accordingly, Arabic takes the largest share of this book. Hebrew has also been the center of attention of several NLP research efforts, though to a smaller degree than Arabic; most of the key published research in Hebrew NLP is discussed in this book. For Amharic, Maltese, and Syriac, because very little NLP research is publicly available, we did not limit ourselves to presenting key techniques, but also propose solutions inspired by Arabic and Hebrew. Our aim for this book is to provide a one-stop shop for all the requisite background and practical advice needed when building NLP applications for Semitic languages. While this is quite a tall order, we hope that, at a minimum, you find this book a useful resource.
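The root-and-pattern word formation described above can be sketched as a small program. The following is a toy illustration only: the vocalic patterns and glosses follow the Arabic k-t-b examples in the text, but the interdigitation scheme is a deliberate simplification (it does not handle gemination, as in kuttAb, or prefixes and suffixes).

```python
# Toy illustration of Semitic root-and-pattern morphology.
# A root of consonants is interdigitated with a vocalic pattern,
# where each "C" slot is filled by the next root consonant.

def interdigitate(root, pattern):
    """Fill the C slots of a vocalic pattern with the root consonants."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

root = ("k", "t", "b")  # the Arabic root meaning "write"

# Patterns and glosses taken from the examples in the text.
patterns = {
    "CiCAC":  "book (kitAb)",
    "CuCuC":  "books (kutub)",
    "CACiC":  "writer (kAtib)",
    "CaCaCa": "he wrote (kataba)",
}

for pattern, gloss in patterns.items():
    print(f"{interdigitate(root, pattern):8s} {gloss}")
```

Running this prints kitAb, kutub, kAtib, and kataba, matching the forms listed above; a real morphological analyzer must invert this process, recovering the root and pattern from a surface form.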
Similar to English, the dominant approach in NLP for Semitic languages has been to build a statistical model that can learn from examples. In this way, a model can be robust to changes in the type of text, and even the language of the text, on which it operates. With the right design choices, the same model can be trained to work in a new domain simply by providing new examples in that domain. This approach also obviates the need for researchers to lay out, in painstaking fashion, all the rules that govern the problem at hand and the manner in which those rules must be combined. A statistical system typically allows researchers to provide an abstract expression of possible features of the input, whose relative importance can be learned during the training phase and applied to new text during the decoding, or inference, phase. While this book devotes some attention to cutting-edge algorithms and techniques, its primary purpose is a thorough explication of best practices in the field. Furthermore, every chapter describes how the techniques discussed apply to Semitic languages.
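The train-then-decode workflow described above can be sketched in a few lines. This is a minimal illustration, not a method from the book: the task (spotting a hypothetical "-At"-like suffix), the feature functions, and the toy data are all invented for the example, and a simple perceptron stands in for the more sophisticated learners the book covers.

```python
# Minimal sketch of the feature-based statistical approach:
# the researcher declares abstract features; training learns each
# feature's weight; decoding applies the learned weights to new text.

def features(word):
    """Abstract feature expression; the learner decides how much
    each feature matters (the features here are hypothetical)."""
    return {"suffix=" + word[-2:]: 1.0, "is_long": float(len(word) > 5)}

def train(examples, epochs=10):
    """Simple perceptron: on each mistake, bump weights toward the label."""
    weights = {}
    for _ in range(epochs):
        for word, label in examples:  # label is +1 or -1
            score = sum(weights.get(f, 0.0) * v
                        for f, v in features(word).items())
            if score * label <= 0:    # mistake-driven update
                for f, v in features(word).items():
                    weights[f] = weights.get(f, 0.0) + label * v
    return weights

def predict(weights, word):
    """Decoding/inference phase: score unseen text with learned weights."""
    score = sum(weights.get(f, 0.0) * v for f, v in features(word).items())
    return 1 if score > 0 else -1

# Hypothetical toy task and data: does a word end in the plural-like -At?
train_data = [("kalimAt", 1), ("kitAb", -1), ("mudarrisAt", 1), ("qalam", -1)]
w = train(train_data)
print(predict(w, "sayyArAt"))  # an unseen word ending in -At
```

Retargeting the model to a new domain or language means swapping in new training examples (and possibly new feature functions), with no hand-written rules to rewrite; that is the portability argument made above.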
Natural Language Processing Core Technologies.
Linguistic Introduction: The Orthography, Morphology and Syntax of Semitic Languages.
Morphological Processing of Semitic Languages.
Syntax and Parsing of Semitic Languages.
Semantic Processing of Semitic Languages.
Language Modeling.
Natural Language Processing Applications.
Statistical Machine Translation.
Named Entity Recognition.
Anaphora Resolution.
Relation Extraction.
Information Retrieval.
Question Answering.
Automatic Summarization.
Automatic Speech Recognition.