We explore the impact of the training corpus on contextualized word embeddings in five mid-resource languages.
We convert the NER annotations of the French TreeBank to a more user-friendly format and establish a new state of the art for French NER.
We propose a new pipeline to filter, clean and classify Common Crawl by language, and we publish the final corpus under the name OSCAR.