Towards a Cleaner Document-Oriented Multilingual Crawled Corpus

We take the existing multilingual web corpus OSCAR and its pipeline Ungoliant, which extracts and classifies data from Common Crawl at the line level, and propose a set of improvements and automatic annotations to produce a new document-oriented version of OSCAR.

Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot
