This user story focuses on research that relies on collections of literary or factual texts which are subject to copyright restrictions and therefore limit what research can be done with them.
In many cases, neither the transparency and reproducibility of the research nor the sustainability and reusability of the research data can be guaranteed, because the underlying data cannot be freely published. Similar concerns apply to research on such data when it is made available by libraries or publishers: researchers often face severe restrictions, for example when only a certain platform or analysis tool may be used or when the research has to take place on the library’s premises.
The aim is therefore to implement a strategy to make copyrighted textual materials (especially entire collections of texts suitable for quantitative methods of analysis) as freely available as possible outside of restrictive licenses and platforms with limited flexibility without infringing copyright law.
The technical solution pursued here does not use the full texts as such, but detailed statistical information derived from linguistically annotated full texts. Various formats are feasible in theory, but a prototypical format would start by providing a token-level linguistic annotation, at least with lemma and part-of-speech. Each text is then split into segments of a fixed length (e.g. 100 words). The order of the segments in each text is maintained, but the order of the tokens within each segment is randomized. Arguably, such derived features are not subject to copyright restrictions, as the original text cannot be read or reconstructed from such a format. Nevertheless, quantitative methods of text analysis such as stylometry, topic modeling or keyness analysis can still be performed with little or no loss in accuracy.
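The prototypical format described above can be illustrated with a minimal Python sketch. It assumes that a text is already available as a list of (word form, lemma, part-of-speech) tuples; the names `derive_segments` and `segment_lemma_counts` and the tuple layout are hypothetical and chosen for illustration only, not taken from an existing tool or from a format specification.

```python
import random
from collections import Counter
from typing import List, Optional, Sequence, Tuple

# Assumed token representation: (word form, lemma, part-of-speech tag).
Token = Tuple[str, str, str]


def derive_segments(
    tokens: Sequence[Token],
    segment_length: int = 100,
    seed: Optional[int] = None,
) -> List[List[Token]]:
    """Split a text into fixed-length segments and shuffle tokens inside each.

    The order of the segments is kept; only the order of the tokens
    within a segment is randomized. The last segment may be shorter.
    """
    # A fixed seed is meant for testing only: anyone who knows the seed
    # could recompute the permutation and undo the shuffling.
    rng = random.Random(seed)
    segments = []
    for start in range(0, len(tokens), segment_length):
        segment = list(tokens[start:start + segment_length])
        rng.shuffle(segment)
        segments.append(segment)
    return segments


def segment_lemma_counts(segments: Sequence[Sequence[Token]]) -> List[Counter]:
    """Count lemma frequencies per segment (a typical bag-of-words feature)."""
    return [Counter(lemma for _, lemma, _ in segment) for segment in segments]
```

Because shuffling changes only the position of tokens within a segment, frequency-based features such as the lemma counts computed by `segment_lemma_counts` are identical for the original and the derived segments. This is the property that frequency-based methods such as topic modeling or keyness analysis depend on.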
What are necessary measures or basic steps for the implementation of such a strategy?
- A thorough legal review and approval of the specific formats to be published (if necessary, in the measure “Community Services: Legal Issues”).
- Designing, implementing, documenting and making available standardized processes for creating the formats to be re-usable by other researchers (possibly a task for the infrastructure group in Text+).
- Transforming selected copyrighted collections accordingly and including them in the Text+ database.
- Organizing the certification of processes and formats to establish their provenance, transparency, reliability and sustainability and to encourage widespread re-use.
Derived textual features require a trade-off between the information richness of the format on the one hand and the legal certainty of its distribution on the other hand. In addition, widespread uptake depends on the format being standardized and on the creation process being trustworthy in terms of reliable provenance, for example through certification of the formats and processes.
Review by community
The positive effects of the described user story on the NFDI can be outlined as follows: it will tap into innovative potential in the area of research data management, because new text resources can be opened up for research. Additionally, a comparatively new solution to a known problem is being tested, which will hopefully have positive effects on efficiency.
The issue is also an eminently infrastructural task, because a consensus of the community on the one hand and a reliable certification of formats and processes on the other hand are necessary, both things that individual researchers cannot achieve without broad coordination.
Finally, this issue provides a concrete opportunity for cooperation beyond European partners, as the HathiTrust in the USA has been experimenting with the principle described here for several years (see HathiTrust “Extracted Features Dataset”).
Christof Schöch, Frédéric Döhl, Achim Rettinger, Evelyn Gius, Peer Trilcke, Peter Leinen, Fotis Jannidis, Maria Hinzmann, Jörg Röpke: „Abgeleitete Textformate: Text und Data Mining mit urheberrechtlich geschützten Textbeständen“, Zeitschrift für digitale Geisteswissenschaften (submitted).