Mboshi-French Parallel Speech Corpus
5,130 utterances from a true documentation setting
About
This speech corpus was collected during a realistic language documentation process. It consists of 5,130 speech utterances in Mboshi (Bantu C25), each aligned to a French text translation. Speech transcriptions are also provided, written in a non-standard graphemic form that stays close to the language's phonology. The dataset is released to the community for reproducible computational language documentation experiments and their evaluation. The data was collected in the context of the BULB project.
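As an illustration of how the three aligned views of each utterance (audio, transcription, French translation) might be gathered, here is a minimal Python sketch. It assumes, hypothetically, that the three files of an utterance share a base name and differ only by extension; the extensions `.wav`, `.mb`, and `.fr`, the `Utterance` class, and the `pair_utterances` helper are illustrative assumptions, not a documented layout or API of this release.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Iterable, List

# Hypothetical extensions: adjust them to the layout of the files you downloaded.
AUDIO_EXT = ".wav"          # speech recording
TRANSCRIPTION_EXT = ".mb"   # Mboshi graphemic transcription
TRANSLATION_EXT = ".fr"     # French translation
REQUIRED = (AUDIO_EXT, TRANSCRIPTION_EXT, TRANSLATION_EXT)


@dataclass(frozen=True)
class Utterance:
    utt_id: str
    audio: Path
    transcription: Path
    translation: Path


def pair_utterances(paths: Iterable) -> List[Utterance]:
    """Group files sharing a directory and base name into aligned triples."""
    by_key: Dict[Path, Dict[str, Path]] = {}
    for raw in paths:
        p = Path(raw)
        # Key on the path without its extension so that equal base names
        # in different directories are not mixed together.
        by_key.setdefault(p.with_suffix(""), {})[p.suffix] = p

    utterances = []
    for key in sorted(by_key):
        files = by_key[key]
        # Keep only utterances for which all three files are present.
        if all(ext in files for ext in REQUIRED):
            utterances.append(
                Utterance(
                    utt_id=key.name,
                    audio=files[AUDIO_EXT],
                    transcription=files[TRANSCRIPTION_EXT],
                    translation=files[TRANSLATION_EXT],
                )
            )
    return utterances
```

With a local copy of the corpus, a call such as `pair_utterances(Path("corpus_root").rglob("*"))` (where `corpus_root` is a placeholder directory name) would return one `Utterance` per complete triple. If the assumed layout holds, the length of that list can be compared against the 5,130 utterances stated above as a sanity check.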
Downloading the data
Citing us
When using our dataset, please cite the following papers:
@inproceedings{godard-etal-2018-low,
title = "A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments",
author = "Godard, Pierre and
Adda, Gilles and
Adda-Decker, Martine and
Benjumea, Juan and
Besacier, Laurent and
Cooper-Leavitt, Jamison and
Kouarata, Guy-Noel and
Lamel, Lori and
Maynard, H{\'e}l{\`e}ne and
Mueller, Markus and
Rialland, Annie and
Stueker, Sebastian and
Yvon, Fran{\c{c}}ois and
Boito, Marcely Zanon",
editor = "Calzolari, Nicoletta and
Choukri, Khalid and
Cieri, Christopher and
Declerck, Thierry and
Goggi, Sara and
Hasida, Koiti and
Isahara, Hitoshi and
Maegaard, Bente and
Mariani, Joseph and
Mazo, H{\'e}l{\`e}ne and
Moreno, Asuncion and
Odijk, Jan and
Piperidis, Stelios and
Tokunaga, Takenobu",
booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
month = may,
year = "2018",
address = "Miyazaki, Japan",
publisher = "European Language Resources Association (ELRA)",
url = "https://aclanthology.org/L18-1531/",
}
@inproceedings{zanonboito:hal-02895895,
TITLE = {How Does Language Influence Documentation Workflow? Unsupervised Word Discovery Using Translations in Multiple Languages},
AUTHOR = {Boito, Marcely Zanon and Villavicencio, Aline and Besacier, Laurent},
URL = {https://hal.science/hal-02895895},
BOOKTITLE = {Journ{\'e}es Scientifiques du Groupement de Recherche: Linguistique Informatique, Formelle et de Terrain (LIFT).},
ADDRESS = {Orl{\'e}ans, France},
YEAR = {2019},
MONTH = Nov,
KEYWORDS = {multilingual approaches ; language documentation ; unsupervised word discovery ; approches multilingues ; documentation des langues ; d{\'e}couverte non supervis{\'e}e du lexique},
PDF = {https://hal.science/hal-02895895v1/file/1910.05154.pdf},
HAL_ID = {hal-02895895},
HAL_VERSION = {v1},
}