Soldiers’ letters of the 18th and 19th centuries: From the PDF edition to reusable, interoperable research data

Motivation

Investigating linguistic phenomena requires dedicated corpora that cover particular text types, periods or groups of authors. Because no suitable data was available for his research on everyday writing, in particular on syntax in soldiers’ letters of the 18th and 19th centuries, Dr. Marko Neumann created such a corpus specifically for his doctoral thesis. To this end he transcribed unique sources in laborious archival work and supplemented them with selected letters from older, sufficiently reliable editions. This alone is not common practice. It is even less common for such data to be made freely available for reuse by other researchers, not least because doing so takes a great deal of effort while bringing comparatively little recognition in return.

Dr. Neumann nevertheless took exactly this step: he published the corpus of 170 letters, which is relevant to research in various disciplines, as an appendix to his dissertation, free of charge via the website of the Universitätsverlag Winter in Heidelberg, where the dissertation was also published. The transcribed letters can be downloaded in full as a PDF document, together with the most important metadata (the writer’s military rank, the dialect or regional variety of German used in the letter, and its time of origin).

Besides enabling reuse of the letter transcriptions, the motivation behind this is to ensure the greatest possible transparency of the author’s own research results. Unfortunately, there are obstacles that should be addressed within the framework of the NFDI (see Objectives and Challenges). In this respect the user story outlined here is prototypical and transferable to many other cases.

Objectives

There are at least two major obstacles to the reuse of the data, which both the data provider and the wider community expressly want:

Legal uncertainty or (unintentional) restrictions: Although the data is available free of charge on the publisher’s website, it carries the standard notice “© 2019 Universitätsverlag Winter GmbH Heidelberg”. Reuse without the publisher’s permission is therefore not permitted, even where the author wants or intends it and has given his own permission. The publisher would at the very least demand royalties for reuse and would not grant a general exemption.
Technical or format-specific obstacles: The 170 letters have been published in a single continuous PDF file that is not subdivided into sections for each letter, and text and metadata are not distinguished in a machine-readable manner. A green text box with metadata on the respective letter is always followed by a text field that is not subdivided any further. Paragraph and line boundaries, highlighting, superscripts, etc. are rendered typographically but not encoded in machine-readable form. Editorial marks such as square brackets, as well as editor’s comments, appear in the middle of the transcribed text and cannot be formally distinguished from it. Currency signs, abbreviation signs and other special characters are likewise not encoded in machine-readable form (e.g. as Unicode characters).

The problem, which is not very specific but very widespread, is first of all that research data published in this way is simply not usable as such, despite the exemplary willingness and considerable efforts of the data provider to make reuse possible. NFDI, together with the scholarly community, should therefore raise awareness that researchers do not have to accept unwanted licensing barriers for (publishing-house) publications. Secondly, continuous consultation, training and technical support are required during implementation in order to achieve the common goal of preparing data in a standard-compliant manner that follows best practices, so that it can be reused without considerable, largely manual post-processing. In the case described here, as in many similar cases, many resources could have been saved during the creation of the corpus and even more so during its publication, and important research data could have been made directly usable for the community.

Solution

Instead of being published exclusively in print or in a layout-oriented, unstructured PDF format, valuable research data, such as the collection of primary sources used here as an example, should be handled as follows:

It should be published under a clearly stated license that permits reuse and editing (the latter is inevitably necessary for data curation), instead of under the usual copyright protection that the publisher secures together with the publication rights.
It should be provided parallel to the publishing house publication in a format that is as open, reusable, documented, rich in structure and information as possible—for the letter transcriptions this would ideally be TEI-XML (e.g. following DTABf for manuscripts), but at least DOCX, ODF or similar; for the letter metadata this would ideally also be TEI-XML/DTABf, CMIF, Dublin Core, EAD or at least XLSX, CSV or similar.
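
The separation of metadata and transcription that a structured format allows can be illustrated with a minimal sketch in Python. It is not the author's or the DTA's workflow and has not been validated against the DTABf schema. The function name `letter_to_tei`, the dictionary keys, the use of `note` elements for rank and language variety, and all placeholder values are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def letter_to_tei(letter: dict) -> ET.Element:
    """Build a minimal TEI document for one letter (illustrative sketch only)."""
    # Serialise TEI as the default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", TEI_NS)

    def el(parent, tag, text=None, **attrs):
        # Create a namespaced child element with optional text and attributes.
        child = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}", attrs)
        child.text = text
        return child

    tei = ET.Element(f"{{{TEI_NS}}}TEI")

    # Metadata goes into the header, clearly separated from the transcription.
    header = el(tei, "teiHeader")
    file_desc = el(header, "fileDesc")
    el(el(file_desc, "titleStmt"), "title", letter["title"])
    # The licence is stated explicitly so that reuse is legally unambiguous.
    el(el(file_desc, "publicationStmt"), "p", f"Licence: {letter['licence']}")
    notes = el(file_desc, "notesStmt")
    el(notes, "note", letter["rank"], type="rank")        # assumed encoding
    el(notes, "note", letter["variety"], type="variety")  # assumed encoding
    el(el(file_desc, "sourceDesc"), "p", letter["source"])
    creation = el(el(header, "profileDesc"), "creation")
    el(creation, "date", letter["date"], when=letter["date"])

    # Each paragraph of the transcription becomes its own <p> element.
    body = el(el(tei, "text"), "body")
    for paragraph in letter["paragraphs"]:
        el(body, "p", paragraph)
    return tei


# Hypothetical placeholder record; none of these values come from the corpus.
example = {
    "title": "Letter 1 (placeholder title)",
    "licence": "placeholder for an open licence chosen by the author",
    "rank": "placeholder rank",
    "variety": "placeholder regional variety",
    "source": "placeholder archive reference",
    "date": "1813",
    "paragraphs": ["[first paragraph]", "[second paragraph]"],
}
xml_text = ET.tostring(letter_to_tei(example), encoding="unicode")
```

In such a representation the date, rank and language variety can be queried directly, and the same fields could equally be exported as one CSV row per letter for the metadata formats mentioned above.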

Challenges

Three factors are particularly important and can only be optimised by joint efforts of the scientific and NFDI communities:

The motivation to publish research data, including data generated in the course of qualification theses, must be increased; that is, publishing such data must be recognised as an independent scientific achievement.
An infrastructure that supports every step of the research data cycle, from the creation of data to quality assurance to its reusable publication, archiving and dissemination (as is possible in principle in the DTA extension module of the German Text Archive), must be anchored in the consciousness of the scientific community and be available on a long-term basis.
Training and advice, support and the provision of tools (schemata, templates, etc.) must be guaranteed on an ongoing basis, so that scholars in the disciplines can concentrate as far as possible on their specialist work of creating and evaluating data instead of on ‘technical’ questions. This would ultimately also increase the motivation to publish such data.