Standards and harmonized components of technical/structural infrastructures for long-term archiving and publishing of complex and heterogeneous data packages

Motivation

Computational Literary Studies, i.e. research on literary texts (DFG subject classification: 105 Literary Studies) supported by methods from computational linguistics and computer science, is an emerging field within the Digital Humanities. Since 2020, the DFG has been funding a priority program in Computational Literary Studies that includes 10 research projects at universities in Germany and Switzerland. Researchers in these projects, though pursuing individual research agendas, naturally share various interests, objectives and obstacles and produce diverse scholarly outcomes. One of our tasks as coordinators of the program is to identify and/or develop comprehensive, demand-oriented solutions for sustainably archiving all scholarly results and outcomes of the individual projects and for making them findable, accessible, interoperable, reusable and reproducible in the long term.

The results of projects within Computational Literary Studies are highly heterogeneous and commonly consist of the texts themselves, annotations, annotation guidelines, software code, lexical resources, diverse metadata and living systems, e.g. web services, tools, demonstrators and interactive websites.

Researchers in this context already use a number of existing infrastructural solutions to archive these results and make them findable, accessible, interoperable, reusable and reproducible in the long term, e.g. various local/regional solutions, the DARIAH-DE repository, Deutsches Textarchiv (DTA), TextGrid, GitHub or Zenodo.

These solutions are either (1) profoundly domain- or object-specific, (2) local/regional or institutionally restricted offerings or (3) national/international solutions that are mostly generic and, in many cases, tied to commercial companies operating under different jurisdictions. In other words, they have different strategic and structural focuses, are hardly aligned with one another either technically or in terms of content, and vary in how well they match the heterogeneous demands of Computational Literary Studies.

Accordingly, it is nearly impossible to define and establish a comprehensive strategy for making complex datasets, living systems and other outcomes of research projects in Computational Literary Studies accessible, reusable and reproducible in a well-formed, interlinked way. In addition, standards and best practices in the field, e.g. for metadata frameworks, data formats, documentation, process descriptions and technology stacks, still need to be expanded.

Objectives

The priority program “Computational Literary Studies” would therefore benefit greatly from a coordinated, comprehensive and sustainable infrastructure (or infrastructure network) that archives the complex outcomes of research projects within the field and makes them highly interconnected, available in the long term and reproducible.

A vital requirement of the priority program “Computational Literary Studies” in this context is the ability to archive different data types, formats and living systems in close connection with each other, so that they form complex and heterogeneous data packages. These packages must support individual access restrictions and legal conditions and must be stored on infrastructure located in the European Union.

In this regard, the priority program would also benefit from the coordinated and continuous development of domain-specific standards, best practices and guidelines, e.g. for data formats, annotation formats, metadata frameworks, theory-independent representation formats, documentation, process descriptions and technology stacks, in order to increase data quality and data connectivity. This also includes standardizing and harmonizing the development of living systems so that they can be archived and kept accessible reliably.

Solution

As a solution, the program would need the organizational and technical harmonization of those existing infrastructural components that are already broadly aligned with the demands of the community and well established within it, in order to foster their interconnectivity.

Where necessary, existing infrastructural solutions should be expanded, or nationally hosted repositories and hosting infrastructure should be established for domain- or object-specific data formats and types as well as for living systems. These should then be integrated into a harmonized infrastructural landscape to ensure that all elements of complex and functional data packages can be archived, published, reused and reproduced.

By establishing a standing expert group consisting of both data management experts and representatives of the community, domain-specific standards, best practices and guidelines can be defined and directly connected to the processes and solutions described above.