WSDM 2021

A Decade in Quotes: Minimally Supervised Quotation Attribution in Massive News Corpora

Timote Vaucher¹, Andreas Spitz¹, Michele Catasta², Robert West¹
¹ Ecole Polytechnique Fédérale de Lausanne, Switzerland
² Stanford University, USA

Identifying and attributing quotations is a key component of knowledge extraction from Web sources. Prior contributions have largely relied on heuristic solutions, manually created patterns, or labeled training data. To fully benefit from the performance of neural models at Web scale, we introduce Quobert, a minimally supervised framework for extracting and attributing quotations from massive corpora. The framework requires no manual input and instead exploits the redundancy of the corpus, using bootstrapping to extract training data for a deep neural model. Quobert is language- and corpus-agnostic, and correctly attributes 87% of quotations in our experiments. We apply this framework to a corpus of 129 million news articles to create Quotebank, an open repository of quotation-speaker pairs from a decade of news, for use by the community.
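
To illustrate how corpus redundancy can be exploited through bootstrapping, the following is a minimal sketch of pattern-based bootstrapping on a toy corpus. It is not the Quobert implementation and omits the neural model entirely. The template syntax (`[Q]` and `[S]` slots), the function names, the capitalized-name heuristic, and the example sentences are all assumptions made for illustration. The idea shown is that a quotation attributed by a known pattern may reappear elsewhere in a different phrasing, which yields a new pattern that in turn attributes further quotations.

```python
import re
from collections import Counter

# Hypothetical slot regexes: a double-quoted span and a capitalized multi-word name.
QUOTE_RE = r'"(?P<q>[^"]+)"'
SPEAKER_RE = r'(?P<s>[A-Z][a-z]+(?: [A-Z][a-z]+)+)'

def template_to_regex(template):
    """Turn a template such as '[Q] said [S].' into a compiled regex."""
    escaped = re.escape(template)
    escaped = escaped.replace(re.escape('[Q]'), QUOTE_RE)
    escaped = escaped.replace(re.escape('[S]'), SPEAKER_RE)
    return re.compile(escaped)

def extract_pairs(sentences, templates):
    """Collect (quotation, speaker) pairs from sentences that fully match a template."""
    pairs = set()
    for template in templates:
        regex = template_to_regex(template)
        for sentence in sentences:
            match = regex.fullmatch(sentence)
            if match:
                pairs.add((match.group('q'), match.group('s')))
    return pairs

def induce_templates(sentences, pairs):
    """Find known pairs in other sentences and abstract those sentences into templates."""
    counts = Counter()
    for quote, speaker in pairs:
        quoted = f'"{quote}"'
        for sentence in sentences:
            if quoted in sentence and speaker in sentence:
                template = sentence.replace(quoted, '[Q]', 1).replace(speaker, '[S]', 1)
                counts[template] += 1
    return set(counts)

def bootstrap(sentences, seed_templates, max_rounds=5):
    """Alternate extraction and template induction until nothing new is found."""
    templates, pairs = set(seed_templates), set()
    for _ in range(max_rounds):
        new_pairs = extract_pairs(sentences, templates)
        new_templates = induce_templates(sentences, new_pairs)
        if new_pairs <= pairs and new_templates <= templates:
            break
        pairs |= new_pairs
        templates |= new_templates
    return pairs, templates

# Toy corpus (invented): the first quotation appears twice, in two phrasings.
corpus = [
    '"We will win" said Jane Doe.',
    'Jane Doe told reporters: "We will win"',
    'John Smith told reporters: "Taxes are too high"',
]
pairs, templates = bootstrap(corpus, {'[Q] said [S].'})
```

In this toy run, the single seed template attributes only the first sentence. Because the same quotation recurs in the second sentence, a second template is induced, and that template then attributes the third sentence, which the seed alone could not handle. In a framework of the kind the abstract describes, pairs collected this way would serve as training data for a neural model rather than as the final output.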