Ungoliant v1

The first version of the new pipeline generation tool.

Ungoliant v1.0.0

This is the first release of Ungoliant, a project that provides tools to generate corpora from CommonCrawl. Ungoliant also ships with established pipelines, in particular for generating OSCAR-like corpora.

Ungoliant also replaces goclassy.

Get the release from the Releases tab or via cargo: cargo install ungoliant.

Features

  • Feature: Downloading of CommonCrawl. Ungoliant features an asynchronous multithreaded downloader that is faster than the previous solution used for OSCAR.
  • Feature: Generation of both OSCAR v1 and OSCAR v1.1 corpora. The new OSCAR v1.1 is a backward-compatible corpus that includes metadata.
  • Feature: Deduplication using runiq. Ungoliant currently uses a fork of runiq that makes it usable as a library.
  • Feature: Splitting, compression and packaging. These three operations make it easier to prepare generated corpora for later distribution. Note that they are not yet performed on the fly and may require a large amount of free disk space.
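To make the deduplication step more concrete, here is a minimal sketch of line-level deduplication in plain Rust. It is not runiq's API or its actual implementation: the function name `dedup_lines` is hypothetical, and the choice to store only a 64-bit hash per line is an assumption made to keep memory use low.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Keep the first occurrence of each line and drop later exact duplicates.
/// Only a 64-bit hash of each line is stored, which trades a very small
/// risk of hash collisions for a lower memory footprint.
/// (Hypothetical helper for illustration; not part of runiq or Ungoliant.)
fn dedup_lines<'a>(lines: impl IntoIterator<Item = &'a str>) -> Vec<&'a str> {
    let mut seen: HashSet<u64> = HashSet::new();
    let mut kept = Vec::new();
    for line in lines {
        let mut hasher = DefaultHasher::new();
        line.hash(&mut hasher);
        // `insert` returns false when the hash was already in the set.
        if seen.insert(hasher.finish()) {
            kept.push(line);
        }
    }
    kept
}
```

Calling `dedup_lines("a\nb\na".lines())` would keep the first `a` and the `b`, in their original order.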

Changes

These changes are feature evolutions from goclassy:

  • Pipelines and tools are available through the ungoliant command-line interface: downloading, corpus generation, deduplication and packaging can all be run from it.
  • Downloading and compiling fastText is no longer needed. Be sure to have cmake installed if you plan on compiling Ungoliant yourself.
  • General performance improvements when using the implemented pipelines. These were made possible by finer-grained multithreading, using rayon.
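As an illustration of what finer-grained multithreading can look like, the sketch below spreads work over individual records rather than whole files. It uses scoped threads from the standard library as a stand-in for rayon so that it has no dependencies. The unit of work (one record), the function names and the token-counting task are all assumptions for the example, not Ungoliant's actual code.

```rust
use std::thread;

/// Hypothetical per-record work: count whitespace-separated tokens.
fn process_record(record: &str) -> usize {
    record.split_whitespace().count()
}

/// Process records in parallel, one chunk of records per worker thread.
/// Results come back in the same order as the input records.
fn process_all(records: &[&str], workers: usize) -> Vec<usize> {
    let workers = workers.max(1);
    // Round up so that every record lands in a chunk; never use a chunk size of 0.
    let chunk_size = ((records.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        // Spawn all workers first, then join them in order.
        let handles: Vec<_> = records
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk.iter().map(|r| process_record(r)).collect::<Vec<usize>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

With rayon, the same shape is usually expressed with a parallel iterator such as `par_iter()`, which also takes care of balancing the work across threads.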
Julien Abadji
Research Engineer

I’m a research engineer at the ALMAnaCH research team at Inria.

Pedro Ortiz Suarez
PhD Student

I’m a PhD student in Computer Science at Sorbonne Université and at the ALMAnaCH research team at Inria.