OSCAR

Open Source Project on Multilingual Resources for Machine Learning

The OSCAR project (Open Super-large Crawled Aggregated coRpus) is an Open Source project that aims to provide web-based multilingual resources and datasets for Machine Learning (ML) and Artificial Intelligence (AI) applications. The project focuses specifically on providing large quantities of unannotated raw data of the kind commonly used in the pre-training of large deep learning models. The OSCAR project has developed high-performance data pipelines specifically designed to classify and filter large amounts of web data. The project also pays special attention to improving the data quality of web-based corpora and to providing data for low-resource languages, so that these new ML/AI technologies are accessible to as many communities as possible.

The new OSCAR 23.01 is finally available. Check it out here! 🚀

Join our Discord community here! 💬

Data is distributed by language in both original and deduplicated form. There are currently 166 different languages available. If you use OSCAR, please consider giving us some feedback by writing to the email address below. Also consider citing our papers.

If you want to contribute to OSCAR, please open a pull request here.

Since 2019, the OSCAR Project has been funded by Inria (project-team ALMAnaCH) and the PRAIRIE institute. Starting in 2023, DFKI and the German Federal Ministry for Economic Affairs and Climate Action (BMWK), through the project OpenGPT-X, have joined Inria, ALMAnaCH and the PRAIRIE institute in providing funding for the OSCAR Project. During 2022 and at the beginning of 2023, OSCAR was also briefly funded by the University of Mannheim.

If you are interested in OSCAR and would like to access the corpus, send us an email at the address below, with “OSCAR Access Request” as the subject. Please include your first name, last name, affiliation, contact details, the languages you need and a brief description of how you intend to use OSCAR.

Blog posts

News: OSCAR 23.01 Release

New filters, new categories, more compression, a fresh corpus and more

Pedro Ortiz Suarez, Julien Abadji

Last updated on Feb 22, 2023 1 min read news

OSCAR News: September 2021

New tools, metadata, and a fresh corpus.

Julien Abadji, Pedro Ortiz Suarez

Last updated on Feb 22, 2023 2 min read news

Publications

Towards a Cleaner Document-Oriented Multilingual Crawled Corpus

We take the existing multilingual web corpus OSCAR and its pipeline Ungoliant, which extracts and classifies data from Common Crawl at the line level, and propose a set of improvements and automatic annotations in order to produce a new document-oriented version of OSCAR.

Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot

Ungoliant: An Optimized Pipeline for the Generation of a Very Large-Scale Multilingual Web Corpus

We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data.

Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

We audit 5 multilingual corpora, finding that lower-resource corpora have systematic issues.

Isaac Caswell, Julia Kreutzer, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Rubungo Andre Niyongabo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi

A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

We explore the impact of the training corpus on contextualized word embeddings in five mid-resource languages.

Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot

Establishing a New State-of-the-Art for French Named Entity Recognition

We convert the NER annotations of the French TreeBank to a more user-friendly format and establish a new state of the art for French NER.

Pedro Ortiz Suarez, Yoann Dupont, Benjamin Muller, Laurent Romary, Benoît Sagot

Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures

We propose a new pipeline to filter, clean and classify Common Crawl by language. We publish the final corpus under the name OSCAR.

Pedro Ortiz Suarez, Benoît Sagot, Laurent Romary

Talks

Ungoliant: An Optimized Pipeline for the Generation of a Very Large-Scale Multilingual Web Corpus

We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data.

Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot

Last updated on Sep 3, 2021

A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

We explore the impact of the training corpus on contextualized word embeddings in five mid-resource languages.

Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot

Last updated on Sep 3, 2021

Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures

We propose a new pipeline to filter, clean and classify Common Crawl by language. We publish the final corpus under the name OSCAR.

Pedro Ortiz Suarez, Benoît Sagot, Laurent Romary

Last updated on Sep 3, 2021

License

Corpus License

These data are released under this licensing scheme:

We do not own any of the text from which these data have been extracted.
We license the actual packaging and annotations of these data under the Creative Commons CC0 license (“no rights reserved”).
To the extent possible under French law, Inria has waived all copyright and related or neighboring rights to OSCAR.
To the extent possible under German law, DFKI GmbH and Universität Mannheim have waived all copyright and related or neighboring rights to OSCAR.
This work is published from: France.

Code Licenses

All of the software repositories produced by the OSCAR Project are available on GitHub and include repository-specific licensing information. For more information please visit the OSCAR Project Organization on GitHub.

Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
Clearly identify the copyrighted work claimed to be infringed.
Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
And use the contact form below.

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

Contact