OSCAR

OSCAR, or Open Super-large Crawled ALMAnaCH coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

OSCAR is currently shuffled at the line level and no metadata is provided. Thus, it is mainly intended for training unsupervised language models for NLP.

Data is distributed by language in both original and deduplicated form. There are currently 166 different languages available. If you use OSCAR, please consider giving us some feedback using the contact form down below, and consider citing our paper.
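
Since each corpus is distributed as a single gzipped plain-text file with one shuffled line per entry, preparing it for language-model training is mostly a matter of iterating over lines. Below is a minimal Python sketch, assuming a file such as af_dedup.txt.gz (taken from the download table further down) has already been downloaded to the working directory:

import gzip

# Stream lines from a downloaded OSCAR file without decompressing it on disk.
# "af_dedup.txt.gz" is just an example file name from the download table.
def iter_oscar_lines(path):
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line

# Example: count lines and whitespace-separated words, e.g. as a rough
# sanity check against the figures reported in the download table.
n_lines = 0
n_words = 0
for line in iter_oscar_lines("af_dedup.txt.gz"):
    n_lines += 1
    n_words += len(line.split())
print(n_lines, "lines,", n_words, "words")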

If you want to contribute to OSCAR, for example by tokenizing one of the corpora for a particular language, or by helping us translate our webpage, please open a pull request here.

The corpus was put together by Pedro J. Ortiz, Benoît Sagot, and Laurent Romary.

New: If you need the unshuffled version of OSCAR, please contact us using the contact form down below. Please include your name, affiliation, contact details, which languages you need, and a brief description of how you intend to use OSCAR.

Corpus

Citing OSCAR

If you use OSCAR to train a language model, a text generation model, or any other ML model, please consider citing our latest paper:

@inproceedings{ortiz-suarez-etal-2020-monolingual,
    title = "A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages",
    author = "Ortiz Su{\'a}rez, Pedro Javier  and
      Romary, Laurent  and
      Sagot, Beno{\^\i}t",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.156",
    pages = "1703--1714",
    abstract = "We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.",
}

If, however, you use the goclassy pipeline, or use OSCAR for any non-machine-learning application, please consider citing the original paper:

@inproceedings{OrtizSuarezSagotRomary2019,
  author    = {Pedro Javier {Ortiz Su{\'a}rez} and Beno{\^i}t Sagot and Laurent Romary},
  title     = {Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures},
  series = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019},
  editor    = {Piotr Bański and Adrien Barbaresi and Hanno Biber and Evelyn Breiteneder and Simon Clematide and Marc Kupietz and Harald L{\"u}ngen and Caroline Iliadi},
  publisher = {Leibniz-Institut f{\"u}r Deutsche Sprache},
  address   = {Mannheim},
  doi       = {10.14618/ids-pub-9021},
  url       = {http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215},
  pages     = {9 -- 16},
  year      = {2019},
  abstract  = {Common Crawl is a considerably large, heterogeneous multilingual corpus comprised of crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files where each contains many documents written in a wide variety of languages. Even though each document has a metadata block associated to it, this data lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.},
  language  = {en}
}

Citing our papers can help us with both funding and project visibility, so please consider doing it!

The Unshuffled OSCAR

For ethical and copyright reasons, the unshuffled OSCAR is not currently distributed freely. If you need a copy of any of the unshuffled sub-corpora, please contact us using the contact form down below. Please include your name, affiliation, contact details, which languages you need, and a brief description of how you intend to use OSCAR. We will evaluate your request and answer accordingly.

Send us a postcard!

Even though OSCAR is not Postcardware, we do appreciate it when our users send us a postcard. If you want to send us one, you can find the address in the contact section down below.

Downloading OSCAR

All the data is distributed by language; both the original and the deduplicated versions are available. To download a file, just click the desired link in the table below. We recommend using pigz to decompress the bigger files in OSCAR; a short Python alternative is sketched just after the table.

All sizes are for the uncompressed files.

Language Words (original) Size (original) File (original) Words (deduplicated) Size (deduplicated) File (deduplicated)
Afrikaans 43,482,801 241M af.txt.gz 29,533,437 163M af_dedup.txt.gz
Albanian 374,196,110 2.3G sq.txt.gz 186,856,699 1.2G sq_dedup.txt.gz
Amharic 28,301,601 360M am.txt.gz 16,086,628 206M am_dedup.txt.gz
Arabic 8,117,162,828 82G ar.txt.gz 3,171,221,354 32G ar_dedup.txt.gz
Aragonese 52,896 1.3M an.txt.gz 45,669 801K an_dedup.txt.gz
Armenian 273,919,388 3.7G hy.txt.gz 110,196,043 1.5G hy_dedup.txt.gz
Assamese 6,956,663 113M as.txt.gz 4,366,570 71M as_dedup.txt.gz
Asturian 381,005 2.4M ast.txt.gz 325,237 2.0M ast_dedup.txt.gz
Avaric 24,720 409K av.txt.gz 19,478 324K av_dedup.txt.gz
Azerbaijani 322,641,710 2.8G az.txt.gz 167,742,296 1.5G az_dedup.txt.gz
Bashkir 9,796,764 128M ba.txt.gz 6,922,589 90M ba_dedup.txt.gz
Basque 120,456,652 848M eu.txt.gz 45,359,710 342M eu_dedup.txt.gz
Bavarian 399 503 bar.txt.gz 399 503 bar_dedup.txt.gz
Belarusian 144,579,630 1.8G be.txt.gz 83,499,037 1.1G be_dedup.txt.gz
Bengali 623,575,733 11G bn.txt.gz 363,766,143 5.8G bn_dedup.txt.gz
Bihari 8,848 110K bh.txt.gz 2,875 34K bh_dedup.txt.gz
Bishnupriya 198,286 4.1M bpy.txt.gz 96,940 1.7M bpy_dedup.txt.gz
Bosnian 106,448 447K bs.txt.gz 20,485 116K bs_dedup.txt.gz
Breton 5,013,241 29M br.txt.gz 2,890,384 16M br_dedup.txt.gz
Bulgarian 2,947,648,106 32G bg.txt.gz 1,268,114,977 14G bg_dedup.txt.gz
Burmese 56,111,184 1.9G my.txt.gz 30,102,173 1.1G my_dedup.txt.gz
Catalan 1,360,212,450 8.0G ca.txt.gz 729,333,440 4.3G ca_dedup.txt.gz
Cebuano 6,603,567 39M ceb.txt.gz 3,675,024 24M ceb_dedup.txt.gz
Central Bikol 312 885 bcl.txt.gz 312 885 bcl_dedup.txt.gz
Central Khmer 20,690,610 1.1G km.txt.gz 10,082,245 581M km_dedup.txt.gz
Central Kurdish 48,478,334 487M ckb.txt.gz 18,726,721 226M ckb_dedup.txt.gz
Chavacano 130 520 cbk.txt.gz 130 520 cbk_dedup.txt.gz
Chechen 711,051 8.3M ce.txt.gz 568,146 6.7M ce_dedup.txt.gz
Chinese 14,986,424,850 508G zh.txt.gz 6,350,215,113 249G zh_dedup.txt.gz
Chuvash 3,041,614 39M cv.txt.gz 2,054,810 26M cv_dedup.txt.gz
Cornish 8,329 44K kw.txt.gz 2,704 14K kw_dedup.txt.gz
Croatian 34,232,765 226M hr.txt.gz 16,727,640 110M hr_dedup.txt.gz
Czech 7,715,977,441 53G cs.txt.gz 3,540,997,509 24G cs_dedup.txt.gz
Danish 2,637,463,889 16G da.txt.gz 1,620,091,317 9.5G da_dedup.txt.gz
Dhivehi 7,559,472 126M dv.txt.gz 4,726,660 79M dv_dedup.txt.gz
Dimli 19 146 diq.txt.gz 19 146 diq_dedup.txt.gz
Dutch 13,020,136,373 78G nl.txt.gz 6,598,786,137 39G nl_dedup.txt.gz
Eastern Mari 565,992 7.2M mhr.txt.gz 469,297 6.0M mhr_dedup.txt.gz
Egyptian Arabic 7,305,151 66M arz.txt.gz 3,659,419 33M arz_dedup.txt.gz
Emilian-Romagnol 6,376 25K eml.txt.gz 6,121 24K eml_dedup.txt.gz
English 418,187,793,408 2.3T en.txt.gz 215,841,256,971 1.2T en_dedup.txt.gz
Erzya 90 1.4K myv.txt.gz 78 1.2K myv_dedup.txt.gz
Esperanto 48,486,161 299M eo.txt.gz 37,324,446 228M eo_dedup.txt.gz
Estonian 643,163,730 4.8G et.txt.gz 309,931,463 2.3G et_dedup.txt.gz
Finnish 3,196,666,419 27G fi.txt.gz 1,597,855,468 13G fi_dedup.txt.gz
French 46,896,036,417 282G fr.txt.gz 23,206,776,649 138G fr_dedup.txt.gz
Galician 102,011,291 620M gl.txt.gz 63,600,602 384M gl_dedup.txt.gz
Georgian 171,950,621 3.6G ka.txt.gz 91,569,739 1.9G ka_dedup.txt.gz
German 44,878,908,446 308G de.txt.gz 21,529,164,172 145G de_dedup.txt.gz
Goan Konkani 124,277 2.2M gom.txt.gz 102,306 1.8M gom_dedup.txt.gz
Guarani 7,382 36K gn.txt.gz 4,680 24K gn_dedup.txt.gz
Gujarati 72,045,701 1.1G gu.txt.gz 50,023,432 722M gu_dedup.txt.gz
Haitian 1,014 3.9K ht.txt.gz 832 3.3K ht_dedup.txt.gz
Hebrew 2,067,753,528 20G he.txt.gz 1,032,018,056 9.8G he_dedup.txt.gz
Hindi 1,372,234,782 17G hi.txt.gz 745,774,934 8.9G hi_dedup.txt.gz
Hungarian 5,163,936,345 40G hu.txt.gz 2,339,127,555 18G hu_dedup.txt.gz
Icelandic 219,900,094 1.5G is.txt.gz 129,818,331 846M is_dedup.txt.gz
Ido 25,702 147K io.txt.gz 22,773 130K io_dedup.txt.gz
Iloko 142,942 874K ilo.txt.gz 105,564 636K ilo_dedup.txt.gz
Indonesian 4,574,692,265 30G id.txt.gz 2,394,957,629 16G id_dedup.txt.gz
Interlingua 180,231 662K ia.txt.gz 100,019 360K ia_dedup.txt.gz
Interlingue 5,352 24K ie.txt.gz 602 1.6K ie_dedup.txt.gz
Irish 14,483,593 88M ga.txt.gz 10,017,303 60M ga_dedup.txt.gz
Italian 22,248,707,341 137G it.txt.gz 11,250,012,896 69G it_dedup.txt.gz
Japanese 4,962,979,182 216G ja.txt.gz 1,123,067,063 106G ja_dedup.txt.gz
Javanese 104,896 659K jv.txt.gz 86,654 583K jv_dedup.txt.gz
Kalmyk 10,277 113K xal.txt.gz 10,155 112K xal_dedup.txt.gz
Kannada 81,186,863 1.7G kn.txt.gz 49,343,462 1.1G kn_dedup.txt.gz
Karachay-Balkar 185,436 2.6M krc.txt.gz 166,496 2.3M krc_dedup.txt.gz
Kazakh 191,126,469 2.7G kk.txt.gz 108,388,743 1.5G kk_dedup.txt.gz
Kirghiz 44,194,823 600M ky.txt.gz 28,982,620 388M ky_dedup.txt.gz
Komi 201,404 2.3M kv.txt.gz 95,243 1.2M kv_dedup.txt.gz
Korean 2,368,765,142 24G ko.txt.gz 1,120,375,149 12G ko_dedup.txt.gz
Kurdish 15,561,003 94M ku.txt.gz 9,946,440 60M ku_dedup.txt.gz
Lao 4,133,311 174M lo.txt.gz 2,583,342 114M lo_dedup.txt.gz
Latin 4,122,201 26M la.txt.gz 1,328,038 8.3M la_dedup.txt.gz
Latvian 520,761,977 4.0G lv.txt.gz 236,428,905 1.8G lv_dedup.txt.gz
Lezghian 247,646 3.3M lez.txt.gz 224,871 3.0M lez_dedup.txt.gz
Limburgan 4,730 29K li.txt.gz 4,283 27K li_dedup.txt.gz
Lithuanian 1,159,661,742 8.8G lt.txt.gz 516,183,525 3.9G lt_dedup.txt.gz
Lojban 154,330 736K jbo.txt.gz 141,973 678K jbo_dedup.txt.gz
Lombard 75,229 443K lmo.txt.gz 73,665 433K lmo_dedup.txt.gz
Low German 2,906,347 18M nds.txt.gz 2,146,417 13M nds_dedup.txt.gz
Lower Sorbian 1,787 13K dsb.txt.gz 966 7.1K dsb_dedup.txt.gz
Luxembourgish 4,403,577 29M lb.txt.gz 3,087,650 21M lb_dedup.txt.gz
Macedonian 189,289,873 2.1G mk.txt.gz 102,849,595 1.2G mk_dedup.txt.gz
Maithili 69,161 317K mai.txt.gz 874 11K mai_dedup.txt.gz
Malagasy 3,068,360 21M mg.txt.gz 1,872,044 13M mg_dedup.txt.gz
Malay 16,696,882 111M ms.txt.gz 6,045,753 42M ms_dedup.txt.gz
Malayalam 189,534,472 4.9G ml.txt.gz 95,892,551 2.5G ml_dedup.txt.gz
Maltese 2,995,654 24M mt.txt.gz 2,163,358 17M mt_dedup.txt.gz
Marathi 162,609,404 2.7G mr.txt.gz 82,130,803 1.4G mr_dedup.txt.gz
Mazanderani 73,870 691K mzn.txt.gz 64,481 602K mzn_dedup.txt.gz
Minangkabau 5,682 608K min.txt.gz 4,825 310K min_dedup.txt.gz
Mingrelian 299,098 5.8M xmf.txt.gz 228,629 4.4M xmf_dedup.txt.gz
Mirandese 171 1.2K mwl.txt.gz 152 1.1K mwl_dedup.txt.gz
Modern Greek 5,479,180,137 62G el.txt.gz 2,412,419,435 27G el_dedup.txt.gz
Mongolian 181,307,167 2.2G mn.txt.gz 68,362,013 838M mn_dedup.txt.gz
Nahuatl languages 1,234 12K nah.txt.gz 1,193 11K nah_dedup.txt.gz
Neapolitan 5,282 17K nap.txt.gz 4,147 13K nap_dedup.txt.gz
Nepali 107,448,208 1.8G ne.txt.gz 71,628,317 1.2G ne_dedup.txt.gz
Newari 564,697 5.5M new.txt.gz 288,995 4.1M new_dedup.txt.gz
Northern Frisian 1,516 4.4K frr.txt.gz 1,516 4.4K frr_dedup.txt.gz
Northern Luri 8,022 76K lrc.txt.gz 6,740 63K lrc_dedup.txt.gz
Norwegian 1,344,326,388 8.0G no.txt.gz 804,894,377 4.7G no_dedup.txt.gz
Norwegian Nynorsk 14,764,980 85M nn.txt.gz 9,435,139 54M nn_dedup.txt.gz
Occitan 750,301 5.8M oc.txt.gz 512,678 3.7M oc_dedup.txt.gz
Oriya 14,938,567 248M or.txt.gz 11,321,740 188M or_dedup.txt.gz
Ossetian 1,031,268 13M os.txt.gz 878,765 11M os_dedup.txt.gz
Pampanga 130 760 pam.txt.gz 52 304 pam_dedup.txt.gz
Panjabi 61,847,806 763M pa.txt.gz 37,555,835 460M pa_dedup.txt.gz
Persian 9,096,554,121 79G fa.txt.gz 4,363,505,319 38G fa_dedup.txt.gz
Piemontese 362,013 2.1M pms.txt.gz 337,246 1.9M pms_dedup.txt.gz
Polish 15,277,255,137 109G pl.txt.gz 6,708,709,674 47G pl_dedup.txt.gz
Portuguese 20,641,903,898 124G pt.txt.gz 10,751,156,918 64G pt_dedup.txt.gz
Pushto 46,559,441 361M ps.txt.gz 31,347,348 242M ps_dedup.txt.gz
Quechua 10,186 78K qu.txt.gz 8,691 67K qu_dedup.txt.gz
Romanian 3,984,317,058 25G ro.txt.gz 1,741,794,069 11G ro_dedup.txt.gz
Romansh 1,093 7.4K rm.txt.gz 960 6.5K rm_dedup.txt.gz
Russia Buriat 963 13K bxr.txt.gz 809 11K bxr_dedup.txt.gz
Russian 92,522,407,837 1.2T ru.txt.gz 46,692,691,520 568G ru_dedup.txt.gz
Sanskrit 4,331,569 93M sa.txt.gz 1,713,930 37M sa_dedup.txt.gz
Scottish Gaelic 310,689 1.9M gd.txt.gz 207,110 1.3M gd_dedup.txt.gz
Serbian 364,395,411 3.9G sr.txt.gz 207,561,168 2.2G sr_dedup.txt.gz
Serbo-Croatian 5,292,184 25M sh.txt.gz 1,040,573 5.8M sh_dedup.txt.gz
Sicilian 554 3.3K scn.txt.gz 468 2.8K scn_dedup.txt.gz
Sindhi 43,530,158 347M sd.txt.gz 33,028,015 263M sd_dedup.txt.gz
Sinhala 93,053,465 1.4G si.txt.gz 50,864,857 802M si_dedup.txt.gz
Slovak 1,322,247,763 9.1G sk.txt.gz 656,346,179 4.5G sk_dedup.txt.gz
Slovenian 387,399,700 2.5G sl.txt.gz 193,926,684 1.3G sl_dedup.txt.gz
Somali 1,202 61K so.txt.gz 472 16K so_dedup.txt.gz
South Azerbaijani 2,175,054 27M azb.txt.gz 1,528,709 19M azb_dedup.txt.gz
Spanish 47,545,122,279 278G es.txt.gz 25,928,290,729 149G es_dedup.txt.gz
Sundanese 30,321 211K su.txt.gz 20,278 141K su_dedup.txt.gz
Swahili 2,211,927 13M sw.txt.gz 1,376,963 8.1M sw_dedup.txt.gz
Swedish 7,155,994,312 44G sv.txt.gz 4,106,120,608 25G sv_dedup.txt.gz
Tagalog 98,949,299 573M tl.txt.gz 70,121,601 407M tl_dedup.txt.gz
Tajik 31,758,142 379M tg.txt.gz 21,029,893 249M tg_dedup.txt.gz
Tamil 420,537,132 9.3G ta.txt.gz 226,013,330 5.1G ta_dedup.txt.gz
Tatar 51,034,893 670M tt.txt.gz 23,825,695 305M tt_dedup.txt.gz
Telugu 123,711,517 2.5G te.txt.gz 79,094,167 1.6G te_dedup.txt.gz
Thai 951,743,087 36G th.txt.gz 368,965,202 16G th_dedup.txt.gz
Tibetan 1,483,589 187M bo.txt.gz 936,556 138M bo_dedup.txt.gz
Tosk Albanian 841,750 5.0M als.txt.gz 459,001 2.8M als_dedup.txt.gz
Turkish 7,577,388,700 60G tr.txt.gz 3,365,734,289 27G tr_dedup.txt.gz
Turkmen 1,113,869 11M tk.txt.gz 752,326 6.8M tk_dedup.txt.gz
Tuvinian 759 12K tyv.txt.gz 540 7.9K tyv_dedup.txt.gz
Uighur 8,657,141 122M ug.txt.gz 5,852,225 83M ug_dedup.txt.gz
Ukrainian 4,204,381,276 53G uk.txt.gz 2,252,380,351 28G uk_dedup.txt.gz
Upper Sorbian 545,351 4.2M hsb.txt.gz 236,867 1.8M hsb_dedup.txt.gz
Urdu 331,817,982 2.7G ur.txt.gz 218,030,228 1.7G ur_dedup.txt.gz
Uzbek 2,450,256 21M uz.txt.gz 1,381,644 12M uz_dedup.txt.gz
Venetian 3,492 18K vec.txt.gz 3,199 17K vec_dedup.txt.gz
Vietnamese 12,036,845,359 68G vi.txt.gz 5,577,159,843 32G vi_dedup.txt.gz
Volapük 321,121 2.0M vo.txt.gz 318,568 2.0M vo_dedup.txt.gz
Walloon 50,720 273K wa.txt.gz 37,543 203K wa_dedup.txt.gz
Waray 397,315 2.5M war.txt.gz 336,311 2.2M war_dedup.txt.gz
Welsh 37,422,441 213M cy.txt.gz 23,574,673 133M cy_dedup.txt.gz
Western Frisian 5,691,077 35M fy.txt.gz 4,223,816 26M fy_dedup.txt.gz
Western Mari 93,338 1.2M mrj.txt.gz 87,780 1.1M mrj_dedup.txt.gz
Western Panjabi 1,426,986 12M pnb.txt.gz 1,111,112 9.0M pnb_dedup.txt.gz
Wu Chinese 11,189 109K wuu.txt.gz 4,333 32K wuu_dedup.txt.gz
Yakut 2,547,623 42M sah.txt.gz 1,789,174 26M sah_dedup.txt.gz
Yiddish 13,834,320 141M yi.txt.gz 8,212,970 84M yi_dedup.txt.gz
Yoruba 8,906 55K yo.txt.gz 3,518 27K yo_dedup.txt.gz
Yue Chinese 186 3.7K yue.txt.gz 128 2.2K yue_dedup.txt.gz
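
As an alternative to decompressing a whole file with pigz, the gzipped corpora can also be downloaded and decompressed programmatically. The following is a minimal Python sketch; the URL is a placeholder and should be replaced by the actual download link for the language you need from the table above:

import gzip
import shutil
import urllib.request

# Placeholder URL: substitute the real download link from the table above.
url = "https://example.org/oscar/af_dedup.txt.gz"

# Download the compressed file, then decompress it to a plain-text file.
urllib.request.urlretrieve(url, "af_dedup.txt.gz")
with gzip.open("af_dedup.txt.gz", "rb") as src, open("af_dedup.txt", "wb") as dst:
    shutil.copyfileobj(src, dst)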

License

These data are released under this licensing scheme:

  • We do not own any of the text from which these data have been extracted.
  • We license the actual packaging of these data under the Creative Commons CC0 license (“no rights reserved”).
  • To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR.
  • This work is published from: France.


Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
  • And use the contact form below.

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

Models

Here is a list of some language models that have been trained using the OSCAR corpus or that are part of the OSCAR project:

Model Language Corpus Authors Paper Files License
ELMo Bulgarian OSCAR Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 bg.zip MIT
ELMo Bulgarian Wikipedia Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 bg.zip MIT
ELMo Catalan OSCAR Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 ca.zip MIT
ELMo Catalan Wikipedia Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 ca.zip MIT
ELMo Danish OSCAR Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 da.zip MIT
ELMo Danish Wikipedia Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 da.zip MIT
ELMo French OSCAR Pedro Javier Ortiz Suárez, Yoann Dupont, Benjamin Muller, Laurent Romary and Benoît Sagot LREC 2020 fr.zip MIT
ELMo Finnish OSCAR Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 fi.zip MIT
ELMo Finnish Wikipedia Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 fi.zip MIT
ELMo Indonesian OSCAR Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 id.zip MIT
ELMo Indonesian Wikipedia Pedro J. Ortiz, Benoît Sagot and Laurent Romary ACL 2020 id.zip MIT

Here is a list of language models trained by the community (a short example of loading one of them follows the table):

Model Language Cased Corpus Authors Paper Website Files License
BERT Arabic Cased OSCAR and Wikipedia Ali Safaya SOON GitHub Hugging Face MIT
CamemBERT French Cased OSCAR Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot ACL 2020 camembert-model.fr camembert-base.tar.gz MIT
CamemBERT French Cased Subsample of OSCAR (4 GB of text) Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot ACL 2020 camembert-model.fr camembert-base-oscar-4gb.tar.gz MIT
ELECTRA Norwegian Cased OSCAR and OPUS Viktor Alm N/A Hugging Face Hugging Face N/A
BERT Romanian Cased OSCAR, Wikipedia and OPUS Dumitrescu Stefan and Andrei Avram SOON GitHub Hugging Face MIT
BERT Romanian Uncased OSCAR, Wikipedia and OPUS Dumitrescu Stefan and Andrei Avram SOON GitHub Hugging Face MIT
BERT Turkish Cased and Uncased OSCAR, Wikipedia and OPUS Stefan Schweter Zenodo GitHub Hugging Face MIT
ELECTRA Turkish Cased OSCAR, Wikipedia and OPUS Stefan Schweter Zenodo GitHub Hugging Face MIT
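
Most of the community models above are published on the Hugging Face Hub, so they can usually be loaded with the transformers library. Below is a minimal sketch, assuming a recent version of transformers and that camembert-base is the Hub identifier of the CamemBERT model listed above (other models use their own identifiers):

from transformers import AutoTokenizer, AutoModelForMaskedLM

# "camembert-base" is assumed to be the Hub identifier of the CamemBERT
# model from the table above; replace it with the model you want to use.
tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModelForMaskedLM.from_pretrained("camembert-base")

# Encode a French sentence and run a forward pass.
inputs = tokenizer("OSCAR est un corpus multilingue.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)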

If you have trained a model using the OSCAR corpus and would like to have it featured here, please contact us using the form below. Help us grow the community!

Contact