Motivation

This user story refers to experiences I have gained from an ongoing cooperation with the German National Library. This cooperation aims to create a comprehensive corpus of all digitally published dime novels and to perform subsequent statistical analyses motivated by research questions from literary studies.
During the last two years, I developed a pipeline to extract plain text from ebooks, sort out unwanted pages (author biographies, ads, etc.), tokenize and lemmatize the text, and annotate Named Entities, POS tags, morphological features, and dependency trees. The resulting information is stored in various data formats together with the corresponding metadata from the National Library's catalog and from publishers. Today the collection contains more than 30,000 novels, a volume that is only possible with access to the National Library's archives.
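
To give a rough impression of what such processing involves, the following is a minimal sketch of a single annotation step. It assumes spaCy with a pretrained German model purely as a stand-in; the model name, the tools, and the output columns are illustrative and do not describe the actual pipeline.

    # Minimal sketch of one annotation step, assuming spaCy with a German model
    # as a stand-in for the actual pipeline; paths and columns are illustrative.
    import spacy

    nlp = spacy.load("de_core_news_sm")  # tokenizer, tagger, lemmatizer, parser, NER

    def annotate(text: str) -> str:
        """Return a CoNLL-U-like table with lemma, POS, morphology, dependencies, NER."""
        doc = nlp(text)
        rows = []
        for sent in doc.sents:
            for i, tok in enumerate(sent, start=1):
                head = 0 if tok.head is tok else tok.head.i - sent.start + 1
                rows.append("\t".join([
                    str(i), tok.text, tok.lemma_, tok.pos_,
                    str(tok.morph) or "_", str(head), tok.dep_,
                    tok.ent_type_ or "O",
                ]))
            rows.append("")  # blank line between sentences
        return "\n".join(rows)

    if __name__ == "__main__":
        print(annotate("Der Held ritt im Morgengrauen nach Berlin."))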

In the relatively young field of computational literary studies, sharing data and code to guarantee reproducible research is essential. The standard procedure is currently two-stage: code is made available in git repositories simultaneously with the publication of new results, while text corpora are released (mostly in TEI) at the end of long-term projects on Zenodo or other persistent hosting services (DTA, TextGrid, etc.).

It would be more practical if code and data were made available at the same time. But this approach stands in contrast to the tradition of textual criticism, which has a strong influence on the field and whose high quality standards for the publication of resources have a deterrent effect. In other cases (dime novels), the impossibility of publishing copyright-protected data prevents good practice in terms of reproducibility.

However, the delayed release of corpora is not the only problem: even when data are made available, the relationship between data and code is not always clear. If we think of a released corpus as the product of a long process, earlier experiments must have been conducted on a subset or on data of lower quality.

In the dime novel project, I work on a long-term basis towards the unattainable ideal of a finished corpus whose requirements continuously change due to new pre-processing methods. Besides this, I perform experiments on unfinished states of this corpus and document everything to ensure reproducibility. It would be desirable to represent such a process, and some of its results, in the infrastructure developed by Text+.

Objectives

Of course, Text+ cannot claim to change the practice of publishing text data in the Digital Humanities, because such impulses need to come out of the community itself. What Text+ can do, and in my opinion this is the prerequisite for evoking such a behavioral change in DH, is to provide a technical infrastructure that overcomes the dichotomy of code and corpus releases. The infrastructure should also represent the creation of a corpus as a process in time, with the option to define checkpoints referring to experiments that can be accessed even after the final release. For every derivative within a project, the infrastructure should require code as well as documentation of the process and of any external code or tools used.

The issue of copyright-protected research data should be a focus of Text+, because this topic cannot be addressed solely by a group of researchers or even a field such as DH, but requires organization at a higher institutional level.

Solution

The Text+ infrastructure should be oriented towards git, but with a few extra features. The main technical issue with git is its lack of support for storing extensive text collections, which consist either of hundreds of small files or of several huge ones. Both cases lead to a massive performance decrease and prevent fluent code development in the same repository. A solution could be a metamodel in which a text collection is maintained in a separate versioning environment, while its metadata is interconnected with the experimental code in git repositories, as sketched below.
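
One way to read this metamodel is that the git repository stores only a small manifest pinning an externally versioned corpus snapshot. The sketch below assumes such a setup; the manifest layout, snapshot labels, and checksum scheme are my own illustrative choices, not an existing Text+ component.

    # Hedged sketch of the metamodel idea: the git repository keeps a small
    # manifest that pins a corpus snapshot held in an external versioned store.
    # Corpus IDs, snapshot labels and the checksum scheme are illustrative.
    import hashlib
    import json
    import pathlib

    MANIFEST = pathlib.Path("corpus.manifest.json")

    def pin_snapshot(corpus_id: str, snapshot: str, files: list[pathlib.Path]) -> None:
        """Record which external corpus snapshot the code in this repo was run against."""
        manifest = {
            "corpus_id": corpus_id,   # e.g. an ID resolvable via the library catalog
            "snapshot": snapshot,     # checkpoint label in the external text store
            "files": {f.name: hashlib.sha256(f.read_bytes()).hexdigest() for f in files},
        }
        MANIFEST.write_text(json.dumps(manifest, indent=2, ensure_ascii=False))

    def verify_snapshot(files: list[pathlib.Path]) -> bool:
        """Check that locally available derivatives still match the pinned snapshot."""
        manifest = json.loads(MANIFEST.read_text())
        return all(
            hashlib.sha256(f.read_bytes()).hexdigest() == manifest["files"].get(f.name)
            for f in files
        )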

Source data and derivatives should always be linked with an ID system, which, as I imagine it, could be connected with the National Library's catalog and automatically import available metadata. Derivative formats should be standardized (e.g. the CoNLL format), and their quality needs to be checked by some kind of validation mechanism.
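
To make the idea of such a validation mechanism more concrete, here is a minimal sketch that checks a CoNLL-U file for the expected ten columns and consecutive token IDs; a real mechanism (e.g. the official CoNLL-U validator) would cover far more than this.

    # Minimal sketch of a structural check for CoNLL-U derivatives; this is an
    # illustration of the idea, not a replacement for a full validator.
    def validate_conllu(path: str) -> list[str]:
        """Collect simple structural errors from a CoNLL-U file."""
        errors = []
        expected_id = 1
        with open(path, encoding="utf-8") as fh:
            for lineno, line in enumerate(fh, start=1):
                line = line.rstrip("\n")
                if not line:
                    expected_id = 1      # a blank line ends the sentence
                    continue
                if line.startswith("#"):
                    continue             # comment / metadata line
                cols = line.split("\t")
                if len(cols) != 10:
                    errors.append(f"line {lineno}: expected 10 columns, got {len(cols)}")
                    continue
                if "-" in cols[0] or "." in cols[0]:
                    continue             # multi-word token or empty node: no ID check
                if cols[0] != str(expected_id):
                    errors.append(f"line {lineno}: expected ID {expected_id}, got {cols[0]}")
                if cols[0].isdigit():
                    expected_id = int(cols[0]) + 1
        return errors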

The publication of copyright-protected material may be realized by providing a set of derivatives without copyright restrictions, regardless of the source material's legal status. There are already promising efforts pointing in this direction^[1]^.
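
As one illustration of what such a derivative could look like, the sketch below turns a text into per-segment token frequency tables from which the original word order, and thus the protected text, cannot be reconstructed. The segment size and tokenization are my own illustrative choices and are not taken from the cited proposals.

    # Hedged sketch of a copyright-safe derived format: per-segment token counts
    # that do not allow reconstruction of the running text.
    import re
    from collections import Counter

    def frequency_derivative(text: str, segment_size: int = 1000) -> list[dict[str, int]]:
        """Split a text into fixed-size token segments and keep only token counts."""
        tokens = re.findall(r"\w+", text.lower())
        segments = [tokens[i:i + segment_size] for i in range(0, len(tokens), segment_size)]
        return [dict(Counter(segment)) for segment in segments]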

Challenges

The main obstacles on the way to the infrastructure described above are, of course, restrictions emerging from new copyright legislation, and the need to keep track of technical changes in the field of NLP during development so that the infrastructure is not already outdated on release.

Review by community

I am willing to give feedback on the development of Text+.

^[1]^ Katharina de la Durantaye, Benjamin Raue: Urheberrecht und Zugang in einer digitalen Welt – Urheberrechtliche Fragestellungen des Zugangs für Gedächtnisinstitutionen und die Digital Humanities, in: RuZ – Recht und Zugang, pp. 83–94.

Christof Schöch et al., Abgeleitete Textformate: Text und Data Mining mit urheberrechtlich geschützten Textbeständen, to appear in ZfdG.