Motivation
Starting point: Interest in corpus linguistics has been growing for years. Numerous language corpora already exist or are being created, and their scientific use is constantly increasing. In historical linguistics, this need has been met by the publication of the historical reference corpora. The Reference Corpus Middle Low German/Low Rhenish (1200–1650) (ReN), online since September 2019, comprises a structured selection of Middle Low German and Low Rhenish written records from 1200 to 1650. It is based on 146 transcribed texts with 1,415,362 tokens, annotated for part of speech (PoS), inflectional morphology and lemma (subcorpus “ReN_anno”), and 89 non-annotated texts with 908,682 tokens (subcorpus “ReN_trans”). Both subcorpora are available via the search and visualization tool ANNIS.
Since an overarching infrastructure for historical language corpora is still a desideratum, we have used the local infrastructure. So far, the ReN data have been maintained by the Hamburg Center for Language Corpora (HZSK). The data will be transferred to the Center for Sustainable Research Data Management (FDM) at the University of Hamburg. The FDM guarantees that the data will be preserved for ten years, but not that they will be continuously maintained (e.g. adapted to new tools such as updated versions of ANNIS). An overarching infrastructure would guarantee the maintenance and usability of the data (see below) for a longer period.
Objectives
Users of corpora such as the ReN would like corpora to be easier to use and more accessible, for example by being offered in such a way that queries can be run across several comparable corpora (e.g. the historical corpora of German). This requires harmonized metadata as well as harmonized data and annotation formats. Instead of access to the data through different channels (project websites, several repositories, etc.), central administration and a single point of contact for data access are desirable.
For us as corpus creators and administrators, an important aim is to ensure that the data remain available to the community. In addition to long-term storage (even beyond ten years), this requires continuous access to the corpus. Such access depends on ongoing work that, after the end of the project, will be financed neither by a funding institution nor by the university. For the ReN, for example, adjustments to updated ANNIS versions must be made.
Furthermore, we aim to keep the ReN data up to date with regard to new formats and tools. This means adapting the corpus data to new requirements, e.g. transferring them to new corpus tools for search and visualization as well as for annotation. We are also interested in the addition of new texts to the corpus and in the extension of the annotations by the community. Suggested corrections and feedback from the community (on both content and functionality) should help to improve the ReN.
Often a single person (the project leader) is responsible for a corpus. A central research data infrastructure could prevent a corpus from being “orphaned” after the project leader has left the university.