Towards improved research results in computational linguistics via Text+

Motivation

As digitalization and Artificial Intelligence become infused throughout society, increasing amounts of data become available. This is also the case for the field of Computational Linguistics and Natural Language Processing (DFG-Fachsystematik: 104-04 Computational Linguistics/104 Linguistics), which has recently seen tremendous advances. These advances lead to an ever-increasing number of corpora (i.e., written and spoken language texts) and lexical resources. Thus, the need for a centralized infrastructure managing high-quality, open-access, easily accessible, annotated, multi-genre, multilingual, multimodal, unbiased and comparable data is stronger than ever.

We observe this need in our own research fields: Computational Historical Linguistics, Argumentation Mining and Question-Answering. In Computational Historical Linguistics, research is impeded by the natural sparsity of historical data. Moreover, there is no uniform annotation scheme and a lack of metadata about time periods, languages and sources. The same gaps are observed in Argumentation Mining, along with a further challenge: the shortage of multi-genre and multimodal corpora. Multi-genre corpora help ensure that findings are not an artifact of a single genre, and multimodal corpora can lead to improved research results. Multimodality is also particularly relevant in Question-Answering, where distinguishing between types of questions and answers requires more “context” than a mono-modal corpus offers.

Objectives

Community members already work towards such data, but their efforts cannot be fully exploited without a consistent, coordinated approach.

First, current resources are available for different languages and different purposes, and thus feature various characteristics. This leads to an abundance of annotation formats encoding different kinds of linguistic information. Some resources even hold multiple layers of annotation (e.g., phonetic, syntactic, semantic) or different modalities of the data (e.g., recordings, videos, transcripts). This variety of annotation standards is cognitively demanding for researchers and hinders the use of computational tools. Moreover, different annotation formats impede the comparability of results.

A further problem is that the available resources are typically spread across institutions, which end up maintaining different versions of a resource. Thus, it is often difficult to gather the relevant data, and unclear which version of a corpus a given study used. Overall, this affects research results, again impeding comparability. Furthermore, this issue feeds into the general reproducibility crisis of scientific results.

Computational linguistic resources contain data of different genres and time periods, produced by speakers of different varieties. These factors bias the resulting corpora, since they influence the linguistic features involved. It is therefore vital to collect extra-linguistic properties and provide them as metadata along with the resource.

Another challenge arises from applications whose results depend on grounding the data within a broader “context”. First, many available corpora do not make their full content usable, because their user interfaces only allow queries that follow specific search patterns, i.e., specific keywords. Second, there is currently a lack of multimodal corpora, whose different modalities can act as broadened “context”, e.g., audio and transcribed data.

Thus, we identify the following demands on an infrastructure like Text+:

Centralized, open-access platform for data storage, management and maintenance, including version control and licensing support
Unified, multilayered annotation scheme/standard
Metadata management system
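
To make the metadata demand more concrete, the following is a minimal sketch of what a metadata record and a simple catalogue query could look like. It is an illustration under our own assumptions, not a proposal for the actual Text+ schema: the class ResourceMetadata, the function find_resources, all field names and the two catalogue entries are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class ResourceMetadata:
    """Hypothetical metadata record; all field names are illustrative."""
    identifier: str
    version: str                      # supports version control of resources
    license: str                      # supports licensing guidance
    languages: Tuple[str, ...]
    genre: str
    period: Tuple[int, int]           # (first year, last year) covered
    modalities: Tuple[str, ...] = ("text",)
    annotation_layers: Tuple[str, ...] = ()


def find_resources(
    catalogue: Iterable[ResourceMetadata],
    language: Optional[str] = None,
    modality: Optional[str] = None,
    year: Optional[int] = None,
) -> List[ResourceMetadata]:
    """Return the records that satisfy every filter that was supplied."""
    hits = []
    for record in catalogue:
        if language is not None and language not in record.languages:
            continue
        if modality is not None and modality not in record.modalities:
            continue
        if year is not None and not (record.period[0] <= year <= record.period[1]):
            continue
        hits.append(record)
    return hits


# Invented example entries, used only to exercise the query function.
catalogue = [
    ResourceMetadata("corpus-a", "1.0", "CC BY 4.0", ("de",), "newspaper", (1900, 1950)),
    ResourceMetadata(
        "corpus-b", "2.1", "CC BY-SA 4.0", ("de", "en"), "debate", (1990, 2020),
        modalities=("text", "audio"), annotation_layers=("syntactic",),
    ),
]

for record in find_resources(catalogue, language="de", modality="audio"):
    print(record.identifier, record.version, record.license)
```

Recording genre, time period, language and modality explicitly is what would allow researchers to assemble comparable multi-genre or multimodal subsets, and the version field documents which state of a resource a study relied on.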

Solution

Text+ can aid researchers in several ways. First, it can define a uniform annotation scheme that all contributed resources should conform to. Here, Text+ could build on established initiatives such as the Text Encoding Initiative (TEI), which targets precisely the humanities, social sciences and linguistics. A uniform format for data, metadata and annotation also improves the management and usability of resources: it allows for more effective data curation and provenance tracking, and it raises the quality of research results because data become comparable and easily linkable. In particular, Text+ can organize resources so that similar data and metadata are linked in a large network, following initiatives like the Linked Open Data Cloud (LODC). In this way, researchers can locate similar types of resources and easily assemble multi-genre, multilingual, multimodal, extended corpora.

Second, consistent data allows for the implementation of suitable analysis and search tools. Text+ can provide tools that perform basic but indispensable tasks, e.g., frequency metrics, pattern-based search and part-of-speech tagging. This functionality should reduce the time researchers spend repeating the same action for different resources.

Third, beyond the uniform data format, Text+ should provide a version control infrastructure to support data provenance and curation. This can also increase the visibility of curated resources, as researchers stay up to date with improved or newly added data. Additionally, Text+ should provide guidance on licensing, through a user interface listing suitable options for different purposes. Overall, these and further functionalities can contribute to better quality control of the whole data management system.
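
As an illustration of why a uniform format helps with the basic tools mentioned above, the sketch below applies the same frequency count and pattern-based search to two resources without any format-specific code. It assumes, purely for the sake of the example, that every resource is available as a list of (token, part-of-speech tag) pairs; the toy corpora, the tag labels and the function names token_frequencies and pattern_search are our own and do not describe an existing Text+ service.

```python
import re
from collections import Counter

# Assumed uniform representation: a resource is a list of (token, POS tag) pairs.
# Both toy corpora are invented for illustration.
corpus_a = [("The", "DET"), ("court", "NOUN"), ("argues", "VERB"),
            ("the", "DET"), ("claim", "NOUN")]
corpus_b = [("Why", "ADV"), ("argue", "VERB"), ("the", "DET"), ("point", "NOUN")]


def token_frequencies(resource):
    """Count case-insensitive token frequencies in one resource."""
    return Counter(token.lower() for token, _ in resource)


def pattern_search(resource, pattern, pos=None):
    """Return tokens that fully match a regular expression, optionally
    restricted to a given part-of-speech tag."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [
        token
        for token, tag in resource
        if regex.fullmatch(token) and (pos is None or tag == pos)
    ]


# The same two functions work on every resource that follows the format.
for name, resource in {"corpus_a": corpus_a, "corpus_b": corpus_b}.items():
    print(name,
          token_frequencies(resource).most_common(1),
          pattern_search(resource, r"argu\w*", pos="VERB"))
```

With heterogeneous annotation formats, each of these functions would need a separate reader per resource; with a shared format, the analysis code is written once and results become directly comparable.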

Challenges

Like all similar initiatives in their infancy, Text+ will struggle with contributions that do not conform to the predetermined standards. This can result from unwillingness to conform or from plain technical difficulty in doing so. The former should diminish over time, as the benefits of uniform, centralized infrastructures become clearer. The latter can be addressed through support from Text+, especially for contributors with no expertise in the computational field. A further, more challenging issue is posed by linguistic uncertainty. The uncertain, often overlapping and fine-grained nuances of natural language will complicate the adoption of a uniform annotation framework. To address this, experts from all subfields should be consulted on the annotation needs of their subfield, and the infrastructure should be flexible enough to be extended accordingly.

Review by community

Yes, we are willing to review the services provided by Text+.