MaSS Multilingual Speech-to-Speech Corpus

An aligned speech extension of the CMU multilingual corpus

About

We extended the CMU multilingual corpus to speech by aligning its Bible text with the corresponding audiobooks. The result is a parallel speech dataset of 8,130 sentences across eight languages: English, Spanish, French, Hungarian, Romanian, Basque, Russian, and Finnish. The dataset supports speech-to-speech translation between any pair of these languages. The data is shared under the MIT License.
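To make the any-to-any coverage concrete, the short sketch below enumerates the translation directions implied by the eight languages listed above. It is only an illustration: the function name `translation_directions` and the use of full language names as identifiers are assumptions made here, not part of the released code or the dataset's file layout.

```python
from itertools import permutations

# The eight languages listed in the description above.
# Using full names as identifiers is an assumption for illustration only.
LANGUAGES = [
    "English", "Spanish", "French", "Hungarian",
    "Romanian", "Basque", "Russian", "Finnish",
]


def translation_directions(languages):
    """Return every ordered (source, target) pair with source != target."""
    return list(permutations(languages, 2))


directions = translation_directions(LANGUAGES)
print(len(directions))  # 8 * 7 = 56 ordered source/target directions
```

Each of the 28 unordered language pairs appears twice here, once per direction, since a speech-to-speech system is normally trained and evaluated per direction.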

Downloading the data

  • Dataset and code: GitHub

Citing us

When using our dataset, please cite the following paper:

@inproceedings{zanon-boito-etal-2020-mass,
    title = "{M}a{SS}: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the {B}ible",
    author = "Boito, Marcely Zanon  and
      Havard, William  and
      Garnerin, Mahault  and
      Le Ferrand, {\'E}ric  and
      Besacier, Laurent",
    editor = "Calzolari, Nicoletta  and
      B{\'e}chet, Fr{\'e}d{\'e}ric  and
      Blache, Philippe  and
      Choukri, Khalid  and
      Cieri, Christopher  and
      Declerck, Thierry  and
      Goggi, Sara  and
      Isahara, Hitoshi  and
      Maegaard, Bente  and
      Mariani, Joseph  and
      Mazo, H{\'e}l{\`e}ne  and
      Moreno, Asuncion  and
      Odijk, Jan  and
      Piperidis, Stelios",
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.799/",
    pages = "6486--6493",
    language = "eng",
    ISBN = "979-10-95546-34-4",
}