Marcely Zanon Boito | MaSS Multilingual Speech-to-Speech Corpus

About

We extended the CMU multilingual corpus to speech by aligning the content from the Bible with the corresponding audiobooks. The result is a parallel speech dataset of 8,130 sentences across eight languages: English, Spanish, French, Hungarian, Romanian, Basque, Russian, and Finnish. The dataset supports speech-to-speech translation between any pair of these languages. The data is shared under the MIT License.
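
To make the any-to-any coverage concrete, here is a minimal Python sketch that enumerates the ordered source/target pairs among the eight languages. The names `LANGUAGES` and `translation_directions` are illustrative assumptions and are not part of the released code.

```python
from itertools import permutations

# The eight languages listed above (names only; no file layout or
# language codes from the actual release are assumed here).
LANGUAGES = [
    "English", "Spanish", "French", "Hungarian",
    "Romanian", "Basque", "Russian", "Finnish",
]


def translation_directions(languages):
    """Return every ordered (source, target) pair of distinct languages."""
    return list(permutations(languages, 2))


if __name__ == "__main__":
    directions = translation_directions(LANGUAGES)
    # 8 languages x 7 possible targets each = 56 directions
    print(len(directions))
    for source, target in directions[:3]:
        print(f"{source} -> {target}")
```

Ordered pairs are used because translating from English to French and from French to English are distinct speech-to-speech tasks.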

Downloading the data

Dataset and code:

Citing us

When using our dataset, please cite the following paper:

@inproceedings{zanon-boito-etal-2020-mass,
    title = "{M}a{SS}: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the {B}ible",
    author = "Boito, Marcely Zanon  and
      Havard, William  and
      Garnerin, Mahault  and
      Le Ferrand, {\'E}ric  and
      Besacier, Laurent",
    editor = "Calzolari, Nicoletta  and
      B{\'e}chet, Fr{\'e}d{\'e}ric  and
      Blache, Philippe  and
      Choukri, Khalid  and
      Cieri, Christopher  and
      Declerck, Thierry  and
      Goggi, Sara  and
      Isahara, Hitoshi  and
      Maegaard, Bente  and
      Mariani, Joseph  and
      Mazo, H{\'e}l{\`e}ne  and
      Moreno, Asuncion  and
      Odijk, Jan  and
      Piperidis, Stelios",
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.799/",
    pages = "6486--6493",
    language = "eng",
    ISBN = "979-10-95546-34-4",
}