Building Corpora for Comparative Analysis to Identify Declarations of Love in Letters

Motivation

In the master's exercise course “Computer-aided functional text analysis with materials from the love letter archive”, a cooperative online course run by the Institute for German Studies at the University of Koblenz-Landau with support from the Institute for Linguistics and Literature Studies at the Technical University of Darmstadt, I am asked to transcribe, annotate and analyse sample material from the love letter archive (LBA) according to scholarly standards. In this exercise I use the TextGrid Laboratory (TextGridLab) to build a text corpus so that I can formulate linguistic questions about the sample material and examine them with computational philology tools such as Voyant and AntConc. In a double-keying process, my seminar group and I follow the existing material-specific transcription guidelines to build the full-text corpus in the TextGridLab. The result is a set of corrected full-text master files of the letters, in which I then mark the salutation, the declarations of love and the greeting formula(s). During this work, first questions about the use and construction of declarations of love arise, which I then try to sharpen and test with the suggested tools.

One challenge in this exercise is to faithfully reproduce the heterogeneous sample material, i.e. letters from the entire German-speaking area written over long periods of time. Up to now, this has been done exclusively by hand, as there is no comprehensive, already transcribed reference corpus of the individual manuscripts. A second challenge is the annotation of the material in TEI-XML. TEI does not yet define an element for declarations of love. The study is therefore exploratory and is intended to produce a definition of a new element <declove> and thus extend the existing guidelines. A third challenge arises from the distributed technical infrastructure and the analysis tools used. The TextGridLab supports collaborative work on text resources, but it does not provide an interface to Voyant or AntConc. So although my group and I can comfortably create the test corpora together, even without in-depth knowledge of XML, I have to export the corpus and analyse it outside of TextGrid.
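
The export step described above can be illustrated with a small sketch. It assumes that the exported master file is TEI-XML in the standard TEI namespace and that the proposed <declove> element lives in that same namespace; the sample letter and the function names are invented for illustration and are not part of TextGrid or the LBA material.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented sample letter; <declove> is the proposed, not yet standardised element.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <opener><salute>Mein liebes Herz,</salute></opener>
    <p>das Wetter ist schön. <declove>Ich liebe Dich <hi>sehr</hi></declove>, das weißt Du.</p>
    <closer><salute>Dein Karl</salute></closer>
  </body></text>
</TEI>"""


def normalise(text):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return " ".join(text.split())


def extract_declarations(xml_text):
    """Return the text content of every <declove> element in the document."""
    root = ET.fromstring(xml_text)
    # itertext() includes text inside nested children such as <hi>,
    # but not the text that follows the closing </declove> tag.
    return [normalise("".join(el.itertext()))
            for el in root.iter(f"{{{TEI_NS}}}declove")]


def plain_text(xml_text):
    """Return the whole letter as plain text, e.g. for upload to an external tool."""
    root = ET.fromstring(xml_text)
    return normalise("".join(root.itertext()))


if __name__ == "__main__":
    print(extract_declarations(SAMPLE))
    print(plain_text(SAMPLE))
```

The plain-text output is the kind of file that could then be loaded into Voyant or AntConc by hand; the sketch does not replace the missing interface, it only shows how little is needed to bridge it.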

For the collaborative seminar exercise it would therefore help if a) high-quality full texts could be generated more easily, faster and in larger quantities using automatic text recognition, b) an extended TEI-XML validation schema that includes the new <declove> element already existed and was available, and c) the low-threshold tools mentioned above, or a similar feature set, were integrated into the TextGridLab.

Objectives

Working with the TextGridLab allows for efficient collaboration across different locations, and the analysis tools used are available online or as open source. However, using them also involves certain entry barriers, especially for students from non-DH programmes, which in turn means a high supervision effort for the teaching staff. To support the online course, specific step-by-step tutorials and introductions along the processing pipeline are now available, but these primarily require independent virtual work, both alone and in consultation with the group.

For students, the TextGridLab is well suited to producing uniform, UTF-8-encoded full-text transcriptions of high quality. However, due to the heterogeneity of the material and the characteristics of the individual handwritings, this work step is time-consuming and costly. Since an adequate reference corpus is missing, automatic text recognition tools do not yet provide satisfactory results. Therefore, only limited amounts of text can be created and edited during a seminar. This also means that the indexing of the love letter archive must be handled consistently over a longer period of time on the one hand, and flexibly with regard to its constant expansion on the other.
Creating valid TEI-XML annotated transcriptions requires basic knowledge of XML, which can neither be assumed nor taught in this exercise. Therefore, the students can only produce rudimentarily marked-up transcriptions in this exercise. The main purpose of these transcriptions is to develop and advance a definition of the concept “declaration of love, <declove>”. In this case it would be helpful if a schema extension were available in TextGrid or if valid TEI-XML documents could be generated from the rudimentarily marked-up documents by means of an automatic transformation.
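
A minimal sketch of what such an automatic transformation might look like is given below. It assumes that the students' fragment is well-formed, uses element names without namespaces, and should simply be wrapped in a TEI skeleton with the mandatory header parts; `wrap_in_tei` and the placeholder header texts are hypothetical. The output would only validate against a TEI schema that has been extended to declare <declove>.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialise TEI as the default namespace


def q(name):
    """Return the namespace-qualified tag name for a TEI element."""
    return f"{{{TEI_NS}}}{name}"


def wrap_in_tei(fragment, title):
    """Wrap a loosely marked-up letter fragment in a minimal TEI skeleton."""
    # Parsing raises ParseError if the fragment is not well-formed XML.
    content = ET.fromstring(f"<div>{fragment}</div>")
    # Assumption: the fragment uses no namespaces, so every element
    # (including the proposed <declove>) is moved into the TEI namespace.
    for el in content.iter():
        el.tag = q(el.tag)

    tei = ET.Element(q("TEI"))
    file_desc = ET.SubElement(ET.SubElement(tei, q("teiHeader")), q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    # Placeholder texts; a real project would fill these from its metadata.
    pub = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(pub, q("p")).text = "Seminar working copy"
    src = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(src, q("p")).text = "Transcribed from manuscript"

    body = ET.SubElement(ET.SubElement(tei, q("text")), q("body"))
    body.append(content)
    return ET.tostring(tei, encoding="unicode")


if __name__ == "__main__":
    print(wrap_in_tei("<p><declove>Ich liebe Dich</declove></p>", "Brief 1"))
```

The design choice here is to fail early on markup that is not well-formed, so that students get feedback on their tagging before any schema validation takes place.
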
In particular, the range of functions of Voyant and AntConc (both tools are named here only as examples of the spectrum of digital analysis options) supports answering the research question and defining declarations of love. So far, the TextGridLab neither allows a corpus selection to be imported directly into one of these tools nor allows visualisations or quantitative linguistic analyses to be carried out within the environment.
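
To make the kind of analysis meant here more concrete, the following sketch shows a very simple keyword-in-context (KWIC) listing, one of the basic views that concordance tools offer. It is not how Voyant or AntConc are implemented; the function name, the tokenisation rule and the sample sentence are assumptions made for illustration.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, hit, right context) for each occurrence of keyword.

    Tokens are runs of word characters; matching ignores case.
    `width` is the number of tokens shown on each side of the hit.
    """
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    # Invented sample sentence.
    sample = "Ich liebe Dich sehr und ich liebe auch Deine Briefe"
    for left, hit, right in kwic(sample, "liebe", width=2):
        print(f"{left:>12} | {hit} | {right}")
```

Applied to the plain text of a letter, or only to the passages marked as declarations of love, such a listing shows at a glance in which constructions a word like "liebe" occurs.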

Although the TextGridLab provides a tool for collaborative corpus creation, it offers no interfaces to widely used low-threshold analysis tools and no simple way to use these tools directly. Such options would be especially useful for newcomers to digital work. Furthermore, there is still a lack of tools for capturing individual text witnesses that are heterogeneous in terms of handwriting and language variety; training and reference corpora would likewise be helpful.

Projects like this give rise to very specific requirements for extending standards in line with their research questions. It would be helpful, and would motivate participation in standardisation initiatives, if such requirements were bundled and brought to the appropriate committees (e.g. for TEI extensions).

Solution

For my task in the master’s seminar it would in principle be desirable if larger amounts of text were already available and usable, since generating full texts and carrying out the associated quality controls takes a lot of time. It would also be useful to extend the material-specific processing pipeline in the TextGridLab with low-threshold visualisation and linguistic analysis tools, to correct existing software errors and to relieve the support team with short webinars and tutorials on XML, TEI, linguistic analysis, etc.

Challenges

One risk is that my research group and I get lost along the processing pipeline, so that collaborative work can no longer continue. This is especially true if I have to leave the TextGridLab as a working environment.

Review by the community

The services offered could be evaluated, for example, in further seminars.