MaSS Multilingual Speech-to-Speech Corpus
An aligned speech extension of the CMU dataset
About
We extended the CMU multilingual corpus to speech by aligning Bible content with the corresponding audiobooks. The result is a parallel speech dataset of 8,130 sentences across eight languages: English, Spanish, French, Hungarian, Romanian, Basque, Russian, and Finnish. Because the sentences are aligned across all eight languages, the dataset supports any-to-any speech-to-speech translation. The data is shared under the MIT License.
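To make the any-to-any coverage concrete, the sketch below enumerates every translation direction among the eight languages. The names LANGUAGES and directions are illustrative only and are not part of the dataset's files or tooling; the count follows from simple arithmetic (8 source languages times 7 possible targets).

```python
from itertools import permutations

# The eight languages listed above (illustrative constant, not a dataset API).
LANGUAGES = [
    "English", "Spanish", "French", "Hungarian",
    "Romanian", "Basque", "Russian", "Finnish",
]

# Every ordered (source, target) pair of distinct languages.
directions = list(permutations(LANGUAGES, 2))

print(len(directions))  # 8 * 7 = 56 translation directions
```

Each unordered language pair appears twice in this list, once per direction, so there are 28 distinct language pairs.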
Downloading the data
Citing us
When using our dataset, please cite the following paper:
@inproceedings{zanon-boito-etal-2020-mass,
title = "{M}a{SS}: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the {B}ible",
author = "Boito, Marcely Zanon and
Havard, William and
Garnerin, Mahault and
Le Ferrand, {\'E}ric and
Besacier, Laurent",
editor = "Calzolari, Nicoletta and
B{\'e}chet, Fr{\'e}d{\'e}ric and
Blache, Philippe and
Choukri, Khalid and
Cieri, Christopher and
Declerck, Thierry and
Goggi, Sara and
Isahara, Hitoshi and
Maegaard, Bente and
Mariani, Joseph and
Mazo, H{\'e}l{\`e}ne and
Moreno, Asuncion and
Odijk, Jan and
Piperidis, Stelios",
booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2020.lrec-1.799/",
pages = "6486--6493",
language = "eng",
ISBN = "979-10-95546-34-4",
}