OSCAR 2019

OSCAR 2019 is the original 2019 release of the OSCAR corpus. It was generated from the Common Crawl corpus using the goclassy architecture.

Features

OSCAR 2019 is shuffled at the line level and comes with no metadata, so it is mainly intended for training unsupervised language models for NLP.

Data is distributed by language in both original and deduplicated form.

If you need the unshuffled version of OSCAR, please contact us using the contact form. Please include your name, affiliation, contact details, the languages you need, and a brief description of how you intend to use OSCAR. You can also download it using HuggingFace’s datasets library.

Even though OSCAR is not Postcardware, we do appreciate when our users send us a postcard. If you want to send us one, you can find the address in the contact section down below.

Citing OSCAR

If you use OSCAR to train a language model, a text generation model, or any other ML model, please consider citing our latest paper:

@inproceedings{ortiz-suarez-etal-2020-monolingual,
    title = "A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages",
    author = "Ortiz Su{\'a}rez, Pedro Javier  and
      Romary, Laurent  and
      Sagot, Beno{\^\i}t",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.156",
    pages = "1703--1714",
    abstract = "We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.",
}

The Unshuffled OSCAR

If you need a copy of any of the unshuffled sub-corpora, please contact us using the contact form down below. Please include your name, affiliation, contact details, the languages you need, and a brief description of how you intend to use OSCAR. We will evaluate your request and answer accordingly.

The unshuffled OSCAR is now available in HuggingFace’s datasets library

HuggingFace has obtained our permission to redistribute the unshuffled OSCAR, and their library lets users download a corpus all at once rather than file by file. For more information about downloading OSCAR with their library, visit OSCAR’s dataset card.
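
As a rough sketch of that route, the snippet below streams one sub-corpus through the `datasets` library. The dataset identifier `oscar` and the configuration naming pattern `unshuffled_<variant>_<language code>` are assumptions based on the public dataset card and may not hold for every library version. `oscar_config_name`, `stream_oscar` and `preview` are hypothetical helpers written for this example, not part of any library.

```python
from itertools import islice


def oscar_config_name(lang: str, deduplicated: bool = True) -> str:
    """Build a configuration name such as 'unshuffled_deduplicated_af'.

    The naming pattern is an assumption based on the dataset card.
    """
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{lang}"


def stream_oscar(lang: str, deduplicated: bool = True):
    """Return an iterable over one sub-corpus without downloading it whole."""
    # Imported lazily so that oscar_config_name works without the library.
    from datasets import load_dataset

    return load_dataset(
        "oscar",  # assumed dataset identifier on the Hugging Face Hub
        oscar_config_name(lang, deduplicated),
        split="train",
        streaming=True,
    )


def preview(lang: str, n: int = 3) -> None:
    """Print the first n records of a sub-corpus (requires network access)."""
    for record in islice(stream_oscar(lang), n):
        print(record)
```

Calling `preview("af")` would print the first few Afrikaans records. Streaming avoids fetching every shard up front, which matters for the larger languages in the table below.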

Downloading OSCAR

All the data is distributed by language, and both the original and the deduplicated versions are available. To download a file, just click the desired link in the table below. Each language is split into standalone shards of around 700MB. A plain text file with checksums is also provided.
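
Downloaded shards can be checked against the checksum file before use. The sketch below assumes SHA-256 digests and one `<hex digest>  <file name>` pair per line, the layout produced by `sha256sum`. Neither the algorithm nor the layout is confirmed here, so adjust both to match the file you actually receive. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large shards need not fit in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text: str) -> dict[str, str]:
    """Parse lines of the assumed form '<hex digest>  <file name>'."""
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            digest, name = parts
            # sha256sum marks binary mode with a leading '*' on the name.
            expected[name.lstrip("*")] = digest.lower()
    return expected


def verify_shards(checksum_file: Path, shard_dir: Path) -> dict[str, bool]:
    """Map each listed shard to True if its digest matches, False otherwise.

    Shards that are listed but missing from shard_dir are reported as False.
    """
    expected = parse_checksums(checksum_file.read_text(encoding="utf-8"))
    results = {}
    for name, digest in expected.items():
        shard = shard_dir / name
        results[name] = shard.is_file() and file_digest(shard) == digest
    return results
```

Because the shards are standalone, a failed check only requires re-downloading the affected shard rather than the whole language.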

The OSCAR corpus is yet to be filtered, so please be careful when using it, especially for text generation tasks! To see which sub-corpora have been audited, please refer to the list of publications above.

You’ll be asked to create a HumanID account in order to download a corpus. This is intended: we do it to limit traffic and reduce abuse of the infrastructure. The OSCAR corpus is hosted by Huma-Num; you can read more about them on their website.

All sizes are for the uncompressed files.

Language Words original Size original File original Words deduplicated Size deduplicated File deduplicated
Afrikaans 43,482,801 241M af 29,533,437 163M af
Albanian 374,196,110 2.3G sq 186,856,699 1.2G sq
Alemannic 841,750 5.0M als 459,001 2.8M als
Amharic 28,301,601 360M am 16,086,628 206M am
Arabic 8,117,162,828 82G ar 3,171,221,354 32G ar
Aragonese 52,896 1.3M an 45,669 801K an
Armenian 273,919,388 3.7G hy 110,196,043 1.5G hy
Assamese 6,956,663 113M as 4,366,570 71M as
Asturian 381,005 2.4M ast 325,237 2.0M ast
Avaric 24,720 409K av 19,478 324K av
Azerbaijani 322,641,710 2.8G az 167,742,296 1.5G az
Bashkir 9,796,764 128M ba 6,922,589 90M ba
Basque 120,456,652 848M eu 45,359,710 342M eu
Bavarian 399 503 bar 399 503 bar
Belarusian 144,579,630 1.8G be 83,499,037 1.1G be
Bengali 623,575,733 11G bn 363,766,143 5.8G bn
Bihari 8,848 110K bh 2,875 34K bh
Bishnupriya 198,286 4.1M bpy 96,940 1.7M bpy
Bosnian 106,448 447K bs 20,485 116K bs
Breton 5,013,241 29M br 2,890,384 16M br
Bulgarian 2,947,648,106 32G bg 1,268,114,977 14G bg
Burmese 56,111,184 1.9G my 30,102,173 1.1G my
Catalan 1,360,212,450 8.0G ca 729,333,440 4.3G ca
Cebuano 6,603,567 39M ceb 3,675,024 24M ceb
Central Bikol 312 885 bcl 312 885 bcl
Central Khmer 20,690,610 1.1G km 10,082,245 581M km
Central Kurdish 48,478,334 487M ckb 18,726,721 226M ckb
Chavacano 130 520 cbk 130 520 cbk
Chechen 711,051 8.3M ce 568,146 6.7M ce
Chinese 14,986,424,850 508G zh 6,350,215,113 249G zh
Chuvash 3,041,614 39M cv 2,054,810 26M cv
Cornish 8,329 44K kw 2,704 14K kw
Croatian 34,232,765 226M hr 16,727,640 110M hr
Czech 7,715,977,441 53G cs 3,540,997,509 24G cs
Danish 2,637,463,889 16G da 1,620,091,317 9.5G da
Dhivehi 7,559,472 126M dv 4,726,660 79M dv
Dimli 19 146 diq 19 146 diq
Dutch 13,020,136,373 78G nl 6,598,786,137 39G nl
Eastern Mari 565,992 7.2M mhr 469,297 6.0M mhr
Egyptian Arabic 7,305,151 66M arz 3,659,419 33M arz
Emilian-Romagnol 6,376 25K eml 6,121 24K eml
English 418,187,793,408 2.3T en 215,841,256,971 1.2T en
Erzya 90 1.4K myv 78 1.2K myv
Esperanto 48,486,161 299M eo 37,324,446 228M eo
Estonian 643,163,730 4.8G et 309,931,463 2.3G et
Finnish 3,196,666,419 27G fi 1,597,855,468 13G fi
French 46,896,036,417 282G fr 23,206,776,649 138G fr
Galician 102,011,291 620M gl 63,600,602 384M gl
Georgian 171,950,621 3.6G ka 91,569,739 1.9G ka
German 44,878,908,446 308G de 21,529,164,172 145G de
Goan Konkani 124,277 2.2M gom 102,306 1.8M gom
Guarani 7,382 36K gn 4,680 24K gn
Gujarati 72,045,701 1.1G gu 50,023,432 722M gu
Haitian 1,014 3.9K ht 832 3.3K ht
Hebrew 2,067,753,528 20G he 1,032,018,056 9.8G he
Hindi 1,372,234,782 17G hi 745,774,934 8.9G hi
Hungarian 5,163,936,345 40G hu 2,339,127,555 18G hu
Icelandic 219,900,094 1.5G is 129,818,331 846M is
Ido 25,702 147K io 22,773 130K io
Iloko 142,942 874K ilo 105,564 636K ilo
Indonesian 4,574,692,265 30G id 2,394,957,629 16G id
Interlingua 180,231 662K ia 100,019 360K ia
Interlingue 5,352 24K ie 602 1.6K ie
Irish 14,483,593 88M ga 10,017,303 60M ga
Italian 22,248,707,341 137G it 11,250,012,896 69G it
Japanese 4,962,979,182 216G ja 1,123,067,063 106G ja
Javanese 104,896 659K jv 86,654 583K jv
Kalmyk 10,277 113K xal 10,155 112K xal
Kannada 81,186,863 1.7G kn 49,343,462 1.1G kn
Karachay-Balkar 185,436 2.6M krc 166,496 2.3M krc
Kazakh 191,126,469 2.7G kk 108,388,743 1.5G kk
Kirghiz 44,194,823 600M ky 28,982,620 388M ky
Komi 201,404 2.3M kv 95,243 1.2M kv
Korean 2,368,765,142 24G ko 1,120,375,149 12G ko
Kurdish 15,561,003 94M ku 9,946,440 60M ku
Lao 4,133,311 174M lo 2,583,342 114M lo
Latin 4,122,201 26M la 1,328,038 8.3M la
Latvian 520,761,977 4.0G lv 236,428,905 1.8G lv
Lezghian 247,646 3.3M lez 224,871 3.0M lez
Limburgan 4,730 29K li 4,283 27K li
Lithuanian 1,159,661,742 8.8G lt 516,183,525 3.9G lt
Lojban 154,330 736K jbo 141,973 678K jbo
Lombard 75,229 443K lmo 73,665 433K lmo
Low German 2,906,347 18M nds 2,146,417 13M nds
Lower Sorbian 1,787 13K dsb 966 7.1K dsb
Luxembourgish 4,403,577 29M lb 3,087,650 21M lb
Macedonian 189,289,873 2.1G mk 102,849,595 1.2G mk
Maithili 69,161 317K mai 874 11K mai
Malagasy 3,068,360 21M mg 1,872,044 13M mg
Malay 16,696,882 111M ms 6,045,753 42M ms
Malayalam 189,534,472 4.9G ml 95,892,551 2.5G ml
Maltese 2,995,654 24M mt 2,163,358 17M mt
Marathi 162,609,404 2.7G mr 82,130,803 1.4G mr
Mazanderani 73,870 691K mzn 64,481 602K mzn
Minangkabau 5,682 608K min 4,825 310K min
Mingrelian 299,098 5.8M xmf 228,629 4.4M xmf
Mirandese 171 1.2K mwl 152 1.1K mwl
Modern Greek 5,479,180,137 62G el 2,412,419,435 27G el
Mongolian 181,307,167 2.2G mn 68,362,013 838M mn
Nahuatl languages 1,234 12K nah 1,193 11K nah
Neapolitan 5,282 17K nap 4,147 13K nap
Nepali 107,448,208 1.8G ne 71,628,317 1.2G ne
Newari 564,697 5.5M new 288,995 4.1M new
Northern Frisian 1,516 4.4K frr 1,516 4.4K frr
Northern Luri 8,022 76K lrc 6,740 63K lrc
Norwegian 1,344,326,388 8.0G no 804,894,377 4.7G no
Norwegian Nynorsk 14,764,980 85M nn 9,435,139 54M nn
Occitan 750,301 5.8M oc 512,678 3.7M oc
Oriya 14,938,567 248M or 11,321,740 188M or
Ossetian 1,031,268 13M os 878,765 11M os
Pampanga 130 760 pam 52 304 pam
Panjabi 61,847,806 763M pa 37,555,835 460M pa
Persian 9,096,554,121 79G fa 4,363,505,319 38G fa
Piemontese 362,013 2.1M pms 337,246 1.9M pms
Polish 15,277,255,137 109G pl 6,708,709,674 47G pl
Portuguese 20,641,903,898 124G pt 10,751,156,918 64G pt
Pushto 46,559,441 361M ps 31,347,348 242M ps
Quechua 10,186 78K qu 8,691 67K qu
Romanian 3,984,317,058 25G ro 1,741,794,069 11G ro
Romansh 1,093 7.4K rm 960 6.5K rm
Russia Buriat 963 13K bxr 809 11K bxr
Russian 92,522,407,837 1.2T ru 46,692,691,520 568G ru
Sanskrit 4,331,569 93M sa 1,713,930 37M sa
Scottish Gaelic 310,689 1.9M gd 207,110 1.3M gd
Serbian 364,395,411 3.9G sr 207,561,168 2.2G sr
Serbo-Croatian 5,292,184 25M sh 1,040,573 5.8M sh
Sicilian 554 3.3K scn 468 2.8K scn
Sindhi 43,530,158 347M sd 33,028,015 263M sd
Sinhala 93,053,465 1.4G si 50,864,857 802M si
Slovak 1,322,247,763 9.1G sk 656,346,179 4.5G sk
Slovenian 387,399,700 2.5G sl 193,926,684 1.3G sl
Somali 1,202 61K so 472 16K so
South Azerbaijani 2,175,054 27M azb 1,528,709 19M azb
Spanish 47,545,122,279 278G es 25,928,290,729 149G es
Sundanese 30,321 211K su 20,278 141K su
Swahili 2,211,927 13M sw 1,376,963 8.1M sw
Swedish 7,155,994,312 44G sv 4,106,120,608 25G sv
Tagalog 98,949,299 573M tl 70,121,601 407M tl
Tajik 31,758,142 379M tg 21,029,893 249M tg
Tamil 420,537,132 9.3G ta 226,013,330 5.1G ta
Tatar 51,034,893 670M tt 23,825,695 305M tt
Telugu 123,711,517 2.5G te 79,094,167 1.6G te
Thai 951,743,087 36G th 368,965,202 16G th
Tibetan 1,483,589 187M bo 936,556 138M bo
Turkish 7,577,388,700 60G tr 3,365,734,289 27G tr
Turkmen 1,113,869 11M tk 752,326 6.8M tk
Tuvinian 759 12K tyv 540 7.9K tyv
Uighur 8,657,141 122M ug 5,852,225 83M ug
Ukrainian 4,204,381,276 53G uk 2,252,380,351 28G uk
Upper Sorbian 545,351 4.2M hsb 236,867 1.8M hsb
Urdu 331,817,982 2.7G ur 218,030,228 1.7G ur
Uzbek 2,450,256 21M uz 1,381,644 12M uz
Venetian 3,492 18K vec 3,199 17K vec
Vietnamese 12,036,845,359 68G vi 5,577,159,843 32G vi
Volapük 321,121 2.0M vo 318,568 2.0M vo
Walloon 50,720 273K wa 37,543 203K wa
Waray 397,315 2.5M war 336,311 2.2M war
Welsh 37,422,441 213M cy 23,574,673 133M cy
Western Frisian 5,691,077 35M fy 4,223,816 26M fy
Western Mari 93,338 1.2M mrj 87,780 1.1M mrj
Western Panjabi 1,426,986 12M pnb 1,111,112 9.0M pnb
Wu Chinese 11,189 109K wuu 4,333 32K wuu
Yakut 2,547,623 42M sah 1,789,174 26M sah
Yiddish 13,834,320 141M yi 8,212,970 84M yi
Yoruba 8,906 55K yo 3,518 27K yo
Yue Chinese 186 3.7K yue 128 2.2K yue
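
To give a feel for how much deduplication removes, the short sketch below computes the fraction of words retained for three rows copied from the table above. The helper name `retained_fraction` is hypothetical, and the choice of rows is purely illustrative.

```python
# (language, words in original, words in deduplicated), copied from the table
ROWS = [
    ("Afrikaans", 43_482_801, 29_533_437),
    ("Breton", 5_013_241, 2_890_384),
    ("Bavarian", 399, 399),
]


def retained_fraction(original_words: int, deduplicated_words: int) -> float:
    """Fraction of words that survive deduplication."""
    return deduplicated_words / original_words


for language, original, deduplicated in ROWS:
    share = retained_fraction(original, deduplicated)
    print(f"{language}: {share:.1%} of words retained")
```

Very small sub-corpora such as Bavarian can be unchanged by deduplication, while larger ones often lose a substantial share of their words.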

License

These data are released under this licensing scheme:

  • We do not own any of the text from which these data have been extracted.
  • We license the actual packaging of these data under the Creative Commons CC0 license (“no rights reserved”).
  • To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR.
  • This work is published from: France.

Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
  • And use the contact form below.

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

Models

Here is a list of some language models that have been trained using the OSCAR corpus or that are part of the OSCAR project:

Model Language Corpus Authors Paper Files License
ELMo Bulgarian OSCAR Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 bg.zip MIT
ELMo Bulgarian Wikipedia Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 bg.zip MIT
ELMo Catalan OSCAR Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 ca.zip MIT
ELMo Catalan Wikipedia Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 ca.zip MIT
ELMo Danish OSCAR Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 da.zip MIT
ELMo Danish Wikipedia Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 da.zip MIT
ELMo French OSCAR Pedro J. Ortiz, Yoann Dupont, Benjamin Muller, Laurent Romary and Benoît Sagot LREC 2020 fr.zip MIT
ELMo Finnish OSCAR Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 fi.zip MIT
ELMo Finnish Wikipedia Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 fi.zip MIT
ELMo Indonesian OSCAR Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 id.zip MIT
ELMo Indonesian Wikipedia Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 id.zip MIT

Here is a list of language models trained by the community:

Model Language Cased Corpus Authors Paper Website Files License
AraBERT Arabic Cased OSCAR, Wikipedia, 1.5B words Arabic Corpus, OSIAN, Assafir Wissam Antoun, Fady Baly and Hazem Hajj ACL Anthology GitHub Hugging Face N/A
Arabic-BERT Arabic Cased OSCAR and Wikipedia Ali Safaya ArXiv GitHub Hugging Face MIT
AraELECTRA Arabic Cased OSCAR, Wikipedia, 1.5B words Arabic Corpus, OSIAN, Assafir Wissam Antoun, Fady Baly and Hazem Hajj ArXiv GitHub Hugging Face N/A
AraGPT2 Arabic Cased OSCAR, Wikipedia, 1.5B words Arabic Corpus, OSIAN, Assafir Wissam Antoun, Fady Baly and Hazem Hajj ArXiv GitHub Hugging Face N/A
CamemBERT French Cased OSCAR Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot ACL 2020 camembert-model.fr camembert-base.tar.gz MIT
CamemBERT French Cased Subsample of OSCAR (4 GB of text) Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot ACL 2020 camembert-model.fr camembert-base-oscar-4gb.tar.gz MIT
LePetit French Cased Subsample of OSCAR (2 GB of text) Vincent Micheli, Martin d’Hoffschmidt, Quentin Heinrich Medium blog illuin.tech Hugging Face MIT
GigaBERT Persian Cased and Uncased OSCAR, Wikipedia, Gigaword Wuwei Lan, Yang Chen, Wei Xu, Alan Ritter ACL Anthology GitHub Hugging Face MIT
ELECTRA Norwegian Cased OSCAR and OPUS Viktor Alm N/A Hugging Face Hugging Face N/A
BERT Romanian Cased OSCAR, Wikipedia and OPUS Dumitrescu Stefan and Andrei Avram SOON GitHub Hugging Face MIT
BERT Romanian Uncased OSCAR, Wikipedia and OPUS Dumitrescu Stefan and Andrei Avram SOON GitHub Hugging Face MIT
RoBERTa Sinhala N/A OSCAR Keshan Sodimana N/A Hugging Face Hugging Face N/A
BERT Turkish Cased and Uncased OSCAR, Wikipedia and OPUS Stefan Schweter Zenodo GitHub Hugging Face MIT
ELECTRA Turkish Cased OSCAR, Wikipedia and OPUS Stefan Schweter Zenodo GitHub Hugging Face MIT

If you have trained a model using the OSCAR corpus and would like to have it featured here, please open a pull request in our GitHub repo. Help us grow the community!

Pedro Ortiz Suarez
PhD Student

I’m a PhD student in Computer Science at Sorbonne Université and at the ALMAnaCH research team at Inria.