Mboshi-French Parallel Speech Corpus

5,130 utterances from a true documentation setting

About

This speech corpus was collected during a realistic language documentation process. It comprises 5,130 speech utterances in Mboshi (Bantu C25), each aligned to a French text translation. Transcriptions of the speech are also provided, in a non-standard graphemic form close to the language's phonology. The dataset is released to the community to support reproducible computational language documentation experiments and their evaluation. The data were collected as part of the BULB project.

Downloading the data

  • Dataset: GitHub
  • Multilingual extension: GitHub
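
Once the data is downloaded, each utterance needs to be matched with its French translation and, where available, its Mboshi transcription. The following is a minimal Python sketch of that pairing step. It assumes, without confirmation from this page, that the files for one utterance share a base name and differ only by extension; the extensions used here (.wav, .fr, .mb) and the function name pair_utterances are illustrative assumptions, so adjust them to the layout of the repository you actually download.

```python
from pathlib import Path


def pair_utterances(root, audio_ext=".wav", translation_ext=".fr",
                    transcription_ext=".mb"):
    """Collect utterance records from files under `root` that share a base name.

    ASSUMPTION: the extensions are placeholders for whatever the downloaded
    corpus actually uses; they are not taken from the dataset description.
    """
    records = []
    for audio in sorted(Path(root).rglob(f"*{audio_ext}")):
        translation = audio.with_suffix(translation_ext)
        transcription = audio.with_suffix(transcription_ext)
        # Skip audio files that have no aligned French translation.
        if not translation.exists():
            continue
        records.append({
            "id": audio.stem,
            "audio": audio,
            "french": translation.read_text(encoding="utf-8").strip(),
            # The transcription is optional in this sketch.
            "mboshi": (transcription.read_text(encoding="utf-8").strip()
                       if transcription.exists() else None),
        })
    return records
```

Calling pair_utterances on the directory holding the downloaded corpus returns a list of dictionaries, one per utterance, which can then be passed to whatever alignment or word-discovery tool is being evaluated.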

Citing us

When using our dataset, please cite the following paper:

@inproceedings{godard-etal-2018-low,
    title = "A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments",
    author = "Godard, Pierre  and
      Adda, Gilles  and
      Adda-Decker, Martine  and
      Benjumea, Juan  and
      Besacier, Laurent  and
      Cooper-Leavitt, Jamison  and
      Kouarata, Guy-Noel  and
      Lamel, Lori  and
      Maynard, H{\'e}l{\`e}ne  and
      Mueller, Markus  and
      Rialland, Annie  and
      Stueker, Sebastian  and
      Yvon, Fran{\c{c}}ois  and
      Boito, Marcely Zanon",
    editor = "Calzolari, Nicoletta  and
      Choukri, Khalid  and
      Cieri, Christopher  and
      Declerck, Thierry  and
      Goggi, Sara  and
      Hasida, Koiti  and
      Isahara, Hitoshi  and
      Maegaard, Bente  and
      Mariani, Joseph  and
      Mazo, H{\'e}l{\`e}ne  and
      Moreno, Asuncion  and
      Odijk, Jan  and
      Piperidis, Stelios  and
      Tokunaga, Takenobu",
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L18-1531/",
}

Use the following BibTeX entry to cite the mmboshi corpus:

@inproceedings{zanonboito:hal-02895895,
  TITLE = {How Does Language Influence Documentation Workflow? Unsupervised Word Discovery Using Translations in Multiple Languages},
  AUTHOR = {Boito, Marcely Zanon and Villavicencio, Aline and Besacier, Laurent},
  URL = {https://hal.science/hal-02895895},
  BOOKTITLE = {Journ{\'e}es Scientifiques du Groupement de Recherche: Linguistique Informatique, Formelle et de Terrain (LIFT).},
  ADDRESS = {Orl{\'e}ans, France},
  YEAR = {2019},
  MONTH = Nov,
  KEYWORDS = {multilingual approaches ; language documentation ; unsupervised word discovery ; approches multilingues ; documentation des langues ; d{\'e}couverte non supervis{\'e}e du lexique},
  PDF = {https://hal.science/hal-02895895v1/file/1910.05154.pdf},
  HAL_ID = {hal-02895895},
  HAL_VERSION = {v1},
}