Adding phonetic transcripts to under-resourced languages in the DoReCo project

Motivation

It’s a chicken-and-egg problem: developing speech technology depends crucially on large annotated speech databases, and creating such databases in turn requires speech technology. The project “Language Documentation Reference Corpus” (DoReCo, https://doreco.info) is a bold attempt to create one unified database covering a large number of under-resourced languages. This database will then be used to develop or adapt speech processing tools to the needs of these languages, thus facilitating the creation of new and extended resources.

To date, DoReCo has received more than 50 language documentation speech corpora from all over the world, ranging from Sino-Tibetan Anal to Yurakaré from central Bolivia, and from N||ng in South Africa to Gurindji in Australia. Some of these languages are endangered; others have only small speaker populations. The state of documentation varies from language to language, as do the degree of annotation and the formats used, and none of these languages is targeted by market-driven speech technology.

Objectives

The aim of DoReCo is to technically validate the corpora it has received and to add two time-aligned annotation tiers to the existing documentation: an orthographic tier containing the words, and a phonetic tier containing the individual speech sound segments with their IPA or SAMPA labels. The result is a suite of well-structured, machine-readable speech data sets in formats that allow content search, the mapping of linguistic and phonetic properties, and the application of signal processing for feature extraction.
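To make the two-tier structure concrete, the following minimal sketch models a word tier and a phonetic tier as lists of time-stamped segments and groups the sound segments under the word that contains them. The class and function names (Segment, phones_by_word) and the toy labels are hypothetical and chosen for illustration only; they are not DoReCo's actual data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One time-aligned annotation unit (times in seconds)."""
    label: str
    start: float
    end: float


def phones_by_word(words, phones, tol=1e-6):
    """Group phonetic segments under the word segment that contains them."""
    grouped = []
    for word in words:
        inside = [p for p in phones
                  if p.start >= word.start - tol and p.end <= word.end + tol]
        grouped.append((word, inside))
    return grouped


# Invented toy data, not taken from any DoReCo corpus.
word_tier = [Segment("kata", 0.10, 0.50), Segment("mi", 0.50, 0.72)]
phone_tier = [
    Segment("k", 0.10, 0.18), Segment("a", 0.18, 0.30),
    Segment("t", 0.30, 0.38), Segment("a", 0.38, 0.50),
    Segment("m", 0.50, 0.60), Segment("i", 0.60, 0.72),
]

aligned = phones_by_word(word_tier, phone_tier)
# Map each word label to the concatenated labels of its sound segments.
transcripts = {w.label: "".join(p.label for p in ps) for w, ps in aligned}
```

Once both tiers share a time axis in this way, a content search such as "all sound segments inside a given word" reduces to a simple interval comparison.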

Solution

DoReCo relies on the web services provided by the Bavarian Archive for Speech Signals (BAS), a CLARIN-D centre with a focus on spoken language. The workflow consists of both manual and automatic processing steps. First, language experts transcribe the audio signal orthographically. This orthographic transcript is then fed into a pipeline web service that combines a grapheme-to-phoneme (G2P) converter with the MAUS segmentation system. The result is a sequence of word segments, which are then corrected manually. In a second call of the web service, the word boundaries are left unchanged, and MAUS computes the phoneme segments anew for every word. The resulting highly consistent time-aligned annotation is then returned to extend the existing corpus documentation.
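The control flow of the second pass can be sketched as follows. This is only an illustration under stated assumptions: the function name resegment_within_words is hypothetical, the pronunciation dictionary stands in for G2P output, and an even split of each word's duration stands in for the acoustic alignment that MAUS actually performs. The one property the sketch does reproduce is that the manually corrected word boundaries stay fixed while the sound segments inside each word are recomputed.

```python
def resegment_within_words(words, pronunciations):
    """Second pass: keep word boundaries, recompute sound segments per word.

    words          -- list of (label, start, end) tuples, manually corrected
    pronunciations -- dict mapping a word label to its list of phone labels
                      (a stand-in for G2P output)
    """
    phones = []
    for label, start, end in words:
        symbols = pronunciations[label]
        if not symbols:
            continue  # nothing to align for this word
        # Placeholder for MAUS: split the word's duration evenly.
        step = (end - start) / len(symbols)
        for i, symbol in enumerate(symbols):
            seg_start = start + i * step
            # Pin the last segment to the word's end to avoid float drift.
            is_last = i == len(symbols) - 1
            seg_end = end if is_last else start + (i + 1) * step
            phones.append((symbol, seg_start, seg_end))
    return phones


# Invented toy data, not taken from any DoReCo corpus.
corrected_words = [("kata", 0.10, 0.50), ("mi", 0.50, 0.70)]
g2p_output = {"kata": ["k", "a", "t", "a"], "mi": ["m", "i"]}
second_pass_phones = resegment_within_words(corrected_words, g2p_output)
```

Because every sound segment is computed inside its own word interval, no segment can cross a word boundary, which is what keeps the two tiers consistent with each other.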

Review by Community

As of June 2020, DoReCo had fully processed five language corpora (Arapaho, Kamas, Svan, Urum, and Yongning Na), and these databases are the basis for “several exciting phonetic and morphological studies”.

References

Paschen, Ludger; Delafontaine, François; Draxler, Christoph; Fuchs, Susanne; Stave, Matthew; Seifart, Frank (2020): Building a Time-Aligned Cross-Linguistic Reference Corpus from Language Documentation Data (DoReCo). In: Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020), May 2020, pp. 2657-2666.