Motivation

In face-to-face interaction, speakers take turns at remarkable speed. When speaker A finishes speaking, speaker B typically starts within 200 milliseconds. Sometimes two speakers overlap at a turn transition, but the overlap is usually minimal. How do we do this? How do we know when the current speaker in a conversation is going to stop? Rapid, smooth turn transition is an extremely complex, almost miraculous human ability, and we are still a long way from understanding exactly how it works. It is currently widely assumed that all humans in all societies can do this, and do it in the same way, but this assumption has never been thoroughly tested. Testing it requires specimens of everyday face-to-face interaction from all around the globe, from small and large speech communities, mountain dwellers and seafarers, etc., to see in what ways conversational exchanges are universally the same, and where culture-specific routines are found.

The basis for addressing this research challenge is provided by collections of language data nowadays known as language documentations, i.e. corpora of audio-visual recordings from as many different speech communities as possible; the requirements such corpora must meet are detailed in the Solution section below.

As a researcher in linguistic typology interested in interaction, I want to investigate universals in conversation by creating annotated audio-visual corpora that allow a cross-linguistic analysis of the fine-grained temporal mechanisms of turn-taking. (104-01 Allgemeine und Vergleichende Sprachwissenschaft, Typologie, Außereuropäische Sprachen)

Richly annotated audio-visual corpora containing natural speech data are still very scarce and overwhelmingly represent major languages, making an investigation of linguistically diverse data, and thus of universal structures of language, all but impossible.

Objectives

Basic principles of interaction can be studied in a truly cross-linguistic dataset: questions concerning turn duration, transition times between speakers, and the temporal distribution of overlaps and backchannels can be analysed to establish a typologically valid baseline for timing in conversation. This type of analysis requires access to annotated audio-visual corpora, as the sketch below illustrates.
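
As a minimal sketch of such an analysis, the following Python snippet computes the offset between one speaker's turn end and the next speaker's turn start, a negative value indicating overlap and a positive value a gap. The tuple format and the toy data are illustrative assumptions; a real study would additionally need to exclude backchannels and handle multi-party talk.

    def floor_transfer_offsets(utterances):
        """Offsets (ms) between a turn's end and the next speaker's
        turn start: negative = overlap, positive = gap."""
        turns = sorted(utterances, key=lambda u: u[1])  # sort by start time
        offsets = []
        for (spk_a, _, end_a), (spk_b, start_b, _) in zip(turns, turns[1:]):
            if spk_a != spk_b:  # only speaker changes count as transitions
                offsets.append(start_b - end_a)
        return offsets

    # Toy data: (speaker, start_ms, end_ms) per utterance.
    utterances = [("A", 0, 1800), ("B", 1950, 3200), ("A", 3100, 4500)]
    for fto in floor_transfer_offsets(utterances):
        print(f"{fto:+d} ms ({'overlap' if fto < 0 else 'gap'})")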

Unlocking data collections from language documentation for corpus linguistic research into conversation will provide a way to address these questions.

Solution

To fulfil this user story, audio-visual corpora should be findable via relevant portals (e.g. VLO, OLAC), and their metadata and annotations should be openly available and searchable. Metadata should follow community standards (e.g. the CMDI profiles BLAM or lat-session). Annotations should be provided in ELAN or EXMARaLDA file formats. Especially promising is the application of speech recognition, speaker diarisation, overlap detection and forced alignment to these corpora: such processing would ease the “transcription bottleneck” and unlock language documentation data for the analysis of turn-taking. The repositories should facilitate this type of analysis.
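
To illustrate what such facilitation could look like on the annotation side, the following Python sketch extracts time-aligned turns from an ELAN (.eaf) file using only the standard library. The file name is a placeholder, and only time-aligned annotations are handled; symbolic annotations without time slots are skipped. The resulting tuples feed directly into the timing analysis sketched under Objectives.

    import xml.etree.ElementTree as ET

    def read_eaf_turns(path):
        """Read (speaker, start_ms, end_ms) tuples from an ELAN .eaf file."""
        root = ET.parse(path).getroot()
        # Map time-slot IDs to millisecond values; slots without a value are skipped.
        times = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
                 for ts in root.iter("TIME_SLOT") if ts.get("TIME_VALUE")}
        turns = []
        for tier in root.iter("TIER"):
            speaker = tier.get("PARTICIPANT") or tier.get("TIER_ID")
            for ann in tier.iter("ALIGNABLE_ANNOTATION"):
                t1, t2 = ann.get("TIME_SLOT_REF1"), ann.get("TIME_SLOT_REF2")
                if t1 in times and t2 in times:
                    turns.append((speaker, times[t1], times[t2]))
        return turns

    turns = read_eaf_turns("recording.eaf")  # placeholder file name

In practice, a dedicated library such as pympi-ling offers a more complete EAF reader; the point here is only that openly available, time-aligned ELAN annotations make this kind of cross-corpus extraction straightforward.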