Humongous Corpus




OSCAR, or Open Super-large Crawled Aggregated coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.

Grab the latest OSCAR release here!
Join our Discord community here!

Data is distributed by language in both original and deduplicated form. There are currently 166 different languages available. If you use OSCAR, please consider giving us some feedback by writing to the mail address below. Also consider citing our papers.

If you want to contribute to OSCAR, for example by tokenizing one of the corpora for a particular language, or by helping us translate our webpage, please open a pull request here.

The corpus was put together by Pedro Ortiz Suarez, Julien Abadji, Benoît Sagot, and Laurent Romary.

If you are interested in OSCAR and would like to access the corpus, send us a mail at the address below, with “OSCAR Access Request” as the subject. Please include your first name, last name, affiliation, contact details, the languages you need, and a brief description of how you intend to use OSCAR.

Even though OSCAR is not Postcardware, we do appreciate it when our users send us a postcard. If you want to send us one, you can find the address in the contact section below.