CLiGS: Textbox

Motivation

The Digital Humanities junior research group “Computational Literary Genre Stylistics” (CLiGS) was funded by the German Federal Ministry of Education and Research (BMBF) between 2014 and 2020. It was affiliated with the Department for Literary Computing at the University of Würzburg (Prof. Dr. Fotis Jannidis) and led by Christof Schöch. Within the group, a series of corpora was gathered and curated. A selection of them was released as the “CLiGS Textbox” in an early phase of the project. It comprises nine corpora in Romance languages (Spanish, French, Italian, and Portuguese). They were released on GitHub and archived in Zenodo. These platforms were chosen because of their robustness and wide acceptance within computational studies.

The corpora have been annotated in XML-TEI, a common format for collections and editions within the DH community. Several types of metadata (administrative, descriptive, procedural) are included in each TEI file. Further files containing the project-specific TEI schema and the validation of the metadata fields were also made available. However, neither the release platform (GitHub) nor the archive (Zenodo) is specific to the community. They are not necessarily the first place to look for literary corpora, and they offer no special features for texts in general or literary texts in particular, such as presenting a work together with its other versions (different editions, different languages). Neither the texts nor the metadata are properly indexed by these platforms, and they do not make it easy to move the texts into other DH tools.

Objectives

As researchers, we would need Text+ to cover the following aspects:

the corpora should be archived for the long term in a community-specific repository, for example TextGrid or DariahRep
the corpora should be citable and referenced unambiguously
the repository should fully index the metadata and data of the corpora
it should clearly mark important distinctions, such as the language of the text or the degree of modernization of the edition, i.e. whether the text keeps the original orthography and punctuation or has been modernized by a specialist and, if so, which modifications were made (this is especially important for texts from the medieval and early modern periods)
it should let other users download either single texts or entire corpora in one click
it should facilitate combining these texts with other corpora (with third party corpora within the repository or with their own texts)
it should facilitate sending the texts to further tools, both single texts and entire corpora
it should facilitate the conversion to other formats (txt, xhtml, epub, pdf)
it should convert the metadata into RDF formats, or help with that conversion
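
To illustrate the last objective, the following is a minimal sketch of how basic metadata from a TEI header could be expressed as RDF in N-Triples syntax, using only the Python standard library. Several points are our assumptions and not part of any existing Text+ or TextGrid service: the function name `tei_header_to_ntriples`, the subject URI under `example.org`, the choice of the Dublin Core 1.1 vocabulary, and the expectation that the language is given as `xml:lang` on the root element. Real TEI headers are richer and would need a fuller mapping.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
DC = "http://purl.org/dc/elements/1.1/"  # vocabulary choice is an assumption


def literal(value: str) -> str:
    """Escape a string as a plain N-Triples literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def tei_header_to_ntriples(tei_xml: str, subject_uri: str) -> list[str]:
    """Map title, author and language of a TEI file to Dublin Core triples."""
    root = ET.fromstring(tei_xml)
    title_stmt = root.find(f"{TEI}teiHeader/{TEI}fileDesc/{TEI}titleStmt")
    triples = []
    if title_stmt is not None:
        for tag, prop in (("title", "title"), ("author", "creator")):
            for elem in title_stmt.findall(f"{TEI}{tag}"):
                # Normalize whitespace, including text nested in child elements
                text = " ".join("".join(elem.itertext()).split())
                if text:
                    triples.append(f"<{subject_uri}> <{DC}{prop}> {literal(text)} .")
    # Assumption: the language is declared as xml:lang on the root element
    lang = root.get(XML_LANG)
    if lang:
        triples.append(f"<{subject_uri}> <{DC}language> {literal(lang)} .")
    return triples


# Hypothetical sample file and subject URI, for illustration only
SAMPLE_HEADER = """<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="es">
  <teiHeader><fileDesc><titleStmt>
    <title>Sample novel</title>
    <author>Sample Author</author>
  </titleStmt></fileDesc></teiHeader>
  <text><body><p>Texto.</p></body></text>
</TEI>"""

for triple in tei_header_to_ntriples(SAMPLE_HEADER, "https://example.org/text/sample"):
    print(triple)
```

A sketch like this also shows why an explicit language declaration matters: if the file does not state its language, no language triple can be produced and the text cannot be retrieved by language.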

Solution

Within Text+, the service that is closest to our requirements is clearly TextGrid (Rep and Lab). Some of the objectives stated above are already satisfied (long-term repository, index of text, distinction of several editions of the same work, download option, combination of corpora, sending texts to further tools, conversion of texts into some formats). In our opinion, however, others still need improvement.

One aspect is that the portal contains predominantly texts in German. This was so obvious that the language is stated explicitly for almost none of the texts, and it is therefore impossible to retrieve the few texts that are in other languages. The information about the language should be added retrospectively, at least to the “Digitale Bibliothek”.

A further problem is that many metadata fields are not indexed, which makes them invisible to the search engine.

It would be necessary to offer workshops at conferences, both in the German-speaking community (Germanistentag, Romanistentag, Anglistentag, Bibliothekartag, DHd…) and at an international level (DH, TEI, EADH). For our case, workshops about importing data would be especially interesting. Such workshops could also take the form of tutorials (perhaps as videos) or blog posts in which projects show how they have imported their files. For these formats, exemplary files would be of great help, that is, files that show how data needs to be structured in order to be uploaded successfully.

For reading purposes, PDF is a standard format. The texts should also be converted into this format.

The options in TextGrid for sending the texts to further tools (CLARIN switchboard, Voyant) are of great use. However, almost none of these tools accepts TEI as an import format; they require the texts to be converted to plain text (txt) first.
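
As an illustration of the conversion step that users currently have to perform themselves, here is a minimal sketch that extracts plain text from a TEI file with the Python standard library. It is not how TextGrid converts texts; the function name `tei_to_plain_text` and the decision to keep only `<head>`, `<p>` and `<l>` elements are our assumptions, and a production conversion would have to handle notes, page breaks, speaker labels and similar elements explicitly.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
# Assumption: only headings, paragraphs and verse lines carry the running text
TEXT_TAGS = {f"{TEI_NS}head", f"{TEI_NS}p", f"{TEI_NS}l"}


def tei_to_plain_text(tei_xml: str) -> str:
    """Return the running text of a TEI document, one block per line."""
    root = ET.fromstring(tei_xml)
    text_elem = root.find(f"{TEI_NS}text")  # skips the teiHeader
    if text_elem is None:
        return ""
    lines = []
    for elem in text_elem.iter():
        if elem.tag in TEXT_TAGS:
            # itertext() also collects text inside inline markup such as <hi>
            content = " ".join("".join(elem.itertext()).split())
            if content:
                lines.append(content)
    return "\n".join(lines)


# Hypothetical sample document, for illustration only
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body><div>
    <head>Chapter I</head>
    <p>This is a <hi rend="italic">short</hi> sample
       paragraph.</p>
  </div></body></text>
</TEI>"""

print(tei_to_plain_text(SAMPLE_TEI))
```

If the repository offered such a conversion on export, texts could be sent to the connected tools without this intermediate step on the user's side.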

It would be of great interest to connect more linguistic tools, for example tools for training models such as topic models or word embeddings on the corpora.

More linguistic tools that work for languages other than German and English should be considered, especially for other European languages.

Challenges

Long-term preservation is a commitment that needs to be supported by regular funding. If Text+ is not funded, the stability of one of the largest Open Access corpora is in serious danger.