Humongous Corpus




OSCAR, or Open Super-large Crawled ALMAnaCH coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

OSCAR is currently shuffled at the line level, and no metadata is provided. It is therefore mainly intended for training unsupervised language models for NLP.

Data is distributed by language in both original and deduplicated form. There are currently 166 different languages available. If you use OSCAR, please consider giving us feedback via the contact form below, and consider citing our paper.
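To illustrate what the deduplicated form means, here is a minimal sketch of exact line-level deduplication, which keeps only the first occurrence of each line. This is an illustration of the idea, not the actual goclassy implementation; the function name and sample data are our own.

```python
# Sketch of exact line-level deduplication: each distinct line is
# kept once, in order of first appearance. (Hypothetical helper,
# not the goclassy pipeline itself.)
def deduplicate_lines(lines):
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique

corpus = ["A crawled sentence.", "Another sentence.", "A crawled sentence."]
print(deduplicate_lines(corpus))  # the repeated line appears only once
```

The original variant of each language's corpus retains such repeated lines; the deduplicated variant does not, which typically shrinks the data considerably for web-crawled text.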

If you want to contribute to OSCAR, for example by tokenizing one of the corpora for a particular language, or by helping us translate our webpage, please open a pull request here.

The corpus was put together by Pedro J. Ortiz, Benoît Sagot, and Laurent Romary.