Motivation

Within the CLiGS project at the University of Würzburg (2015-2020), one of the corpora gathered was a medium-sized corpus of Spanish novels. It contains novels by Spanish authors published between 1880 and 1939. One of its characteristics is that a section of the corpus (around a third) is still under copyright, because some of the authors lived as late as the year 2000.

The corpus is encoded in XML-TEI. It contains dozens of metadata fields (in the teiHeader) about the plot, the author and the publication. In addition, it has been linguistically and textually annotated with several tools (narrative, grammatical and semantic information). The original proposal stated that the data would be made available, without specifying what to do with texts that are still under copyright.

Objectives

As a researcher, I need guidance on the legal framework for publishing data extracted from texts still in copyright.

I also need to use repositories for archiving these texts so that other researchers can access my data. The data can be offered in several forms:

  1. Original and complete data, accessible after identification or registration steps for materials that are still protected.
  2. Extracted features over large spans of text (frequencies per volume, chapter or sentence).
  3. Extracted features over shorter spans of text (paragraphs, sentences, verses).
  4. Extracted features based on collocations or n-grams.
  5. Further models to download, such as topic models or word embeddings.
  6. Linguistic annotation of the entire text.
  7. Metadata.
  8. Markup without text (to analyze the structure of the text, such as number of paragraphs, number of verses, etc.).
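
Options 2 and 4 can be illustrated with a minimal Python sketch. The tokenizer, the function names and the two toy "chapters" are assumptions made for illustration only; they are not taken from the corpus or from any CLiGS tool.

```python
import re
from collections import Counter


def tokenize(text):
    # Naive lowercase word tokenizer (an assumption); a real pipeline
    # would rely on the linguistic annotation already in the corpus.
    return re.findall(r"\w+", text.lower())


def frequencies_per_span(spans):
    """Map each span id (e.g. a chapter) to its token counts."""
    return {span_id: Counter(tokenize(text)) for span_id, text in spans.items()}


def ngram_counts(text, n=2):
    """Count n-grams over the whole text, ignoring sentence boundaries."""
    tokens = tokenize(text)
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Made-up toy data, not taken from the corpus.
chapters = {
    "ch1": "La casa era grande. La casa era vieja.",
    "ch2": "El mar era azul.",
}
freqs = frequencies_per_span(chapters)
bigrams = ngram_counts(chapters["ch1"])
```

Tables like `freqs` and `bigrams` no longer contain the running text, which is what makes them candidates for open publication; how much of the original can be reconstructed from them grows as the spans get shorter and the n-grams longer.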

Some of these features should be published openly, without any kind of registration (metadata annotated by me, frequencies of markup). Text+ should allow data that is still in copyright to be archived without being made available. It should be defined in which year each text becomes free to be published, and this should be done automatically or semi-automatically. For example, if the author died in 1955 (a date recorded in a specific TEI element or in other Text+ metadata fields), then under a term of 70 years after the author's death the protection ends in 2025, and the text should be published once it has expired.
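
The automatic check could look roughly like the following Python sketch. It assumes that the death year is stored in a TEI `death` element with a `when` attribute and that the protection term is 70 years after the author's death; both are assumptions, since the actual element and the applicable term depend on the corpus and on the jurisdiction. The function names and the sample header are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical header fragment; the real corpus may record the date elsewhere.
SAMPLE_HEADER = """<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <profileDesc><particDesc><listPerson>
    <person role="author"><death when="1955"/></person>
  </listPerson></particDesc></profileDesc>
</teiHeader>"""


def author_death_year(header_xml):
    """Return the author's death year, or None if it is not recorded."""
    root = ET.fromstring(header_xml)
    death = root.find(".//tei:death", TEI_NS)
    if death is None or not death.get("when"):
        return None
    return int(death.get("when")[:4])


def protection_end_year(death_year, term_years=70):
    # Assumed rule: protection lasts term_years after the death year.
    return death_year + term_years


def may_publish(death_year, today=None, term_years=70):
    # Conservative choice: release only once the end year has fully passed.
    today = today or date.today()
    return today.year > protection_end_year(death_year, term_years)
```

A repository could run such a check periodically and flag texts for release, leaving the final decision to a curator (the semi-automatic variant).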

These requirements relate to the data domain collections.

Solution

The files with the extracted features need to be clearly associated with the original files, even when the original files are not published. One possibility is to publish the teiHeader without the text, with an explanation of why the text is missing. A further possibility is to add the extracted features to the TEI file as well. How exactly this should be done needs to be discussed.
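
The first possibility can be sketched in Python as follows. The function name is hypothetical, and replacing the `text` element with a one-paragraph placeholder (rather than deleting it) is only one possible design, chosen here on the assumption that the published file should keep the usual header-plus-text shape of a TEI document.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI)  # serialize with TEI as the default namespace


def header_only_version(tei_xml, reason):
    """Return the TEI document with its text replaced by an explanation."""
    root = ET.fromstring(tei_xml)
    original_text = root.find(f"{{{TEI}}}text")
    if original_text is not None:
        root.remove(original_text)
    # Placeholder text element that states why the content is missing.
    text = ET.SubElement(root, f"{{{TEI}}}text")
    body = ET.SubElement(text, f"{{{TEI}}}body")
    paragraph = ET.SubElement(body, f"{{{TEI}}}p")
    paragraph.text = reason
    return ET.tostring(root, encoding="unicode")


# Made-up minimal document for illustration.
sample = f"""<TEI xmlns="{TEI}">
  <teiHeader><fileDesc><titleStmt><title>Novela de ejemplo</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>Contenido protegido.</p></body></text>
</TEI>"""
public_version = header_only_version(
    sample, "Text withheld: the work is still under copyright."
)
```

Because the header is untouched, identifiers in the teiHeader can serve as the link between this public file, the archived original and the feature files.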

Some of the steps for the extraction of these features and the creation of the models can be accomplished with the tools in Switchboard.

User registration could rely on the existing AAI registry service.

It is unclear in which formats the extracted features should be accessible. In practice, they are normally stored in CSV files, without any kind of standardization as far as I know. The basic decisions about the CSV layout (the separator, whether values are enclosed in quotation marks) and the column names should be defined. Furthermore, the index (the names of the rows) should be clear. Other formats (JSON, XML, TEI) are also worth discussing.
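
As a basis for that discussion, the following Python sketch fixes one possible set of conventions: comma as separator, quotation marks around every non-numeric value, an explicit header row and a named index column. These choices, the `text_id` column name and the sample identifiers are assumptions for illustration, not an existing Text+ standard.

```python
import csv
import io


def write_feature_table(rows, columns, index_name="text_id"):
    """Serialize {row_id: {column: value}} as CSV with an explicit index column."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=",",                  # separator stated explicitly
        quoting=csv.QUOTE_NONNUMERIC,   # quote strings, leave numbers bare
        lineterminator="\n",
    )
    writer.writerow([index_name] + columns)
    for row_id in sorted(rows):
        # Missing features are written as 0 (an assumption suited to counts).
        writer.writerow([row_id] + [rows[row_id].get(c, 0) for c in columns])
    return buffer.getvalue()


# Made-up identifiers and counts.
table = write_feature_table(
    {"novel_001": {"casa": 12, "mar": 3}, "novel_002": {"casa": 5}},
    ["casa", "mar"],
)
```

Naming the index column and using the same identifiers as in the TEI files would make it possible to join the feature tables with the metadata.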

Challenges

Archiving and publishing texts under copyright present certain legal risks. Academia has been very reluctant to publish any data extracted from works that are still in copyright. However, practices in other sectors, such as industry, are more open. The best-known example is the set of n-grams that Google derived from Google Books.

Other research areas, such as medicine, have experience in allowing researchers to use sensitive data in ways that are anonymized but still useful for research. This idea is present in the FAIR principles. Humanities data should not be a greater problem than health records.

However, the risk of not publishing data from works still in copyright could be greater in the long run: we remain blind to a large part of the 20th century and fail to give answers to current problems.

Besides, without some way of publishing protected data, many research papers are not reproducible.