Motivation

The Acta Pacis Westphalicae (APW, http://pax-westphalica.de/) project is one of the most significant historical initiatives. The first volume was published in 1962 and the project is still ongoing. In 2014, the Acta Pacis Westphalicae edition was made available online as part of a DFG-funded project by the research group for early modern history at Bonn University, in collaboration with the Bavarian state library. The XML files from this edition were created with historical research questions in mind. Using the data for linguistic purposes, and therefore differentiating between the text itself and the metatextual information, has been approved by the relevant institutions (the publishers Aschendorff, the research group for early modern history at Bonn University, and the Bavarian state library), as has the non-commercial publication of the formatted data. The written approval can be made available on request.

On this basis, we created a linguistic corpus, APWCF, based on the XML version of documents from the Peace of Westphalia Congress. A first version is available on the server of the Berlin-Brandenburg Academy of Sciences and Humanities (http://kaskade.dwds.de/dstar/apwcf/), but our goal is to host future versions of the corpus in a repository that enables even more annotations and more complex queries. In historical linguistics, there is a wide range of standards and no clear advice on which data format to use or which repository to upload the data to. Although our corpus is based on an XML version, there are different standards even within this one data format, and it is not clear which option is best. We looked into some of the better-known repositories for data from the humanities, such as Dariah, and found that the interfaces were difficult to navigate; together with various technical problems, this led us to decide against using their services. Ideally, we would like to find a repository that not only stores the data, but also allows users to query the data in a way that is suitable for research questions in the (historical) linguistics domain.

Objectives

The most frustrating aspect is that there seem to have been many attempts already to create a repository for data from the humanities, but very few of them seem to have been developed with the user in mind. It would therefore be great if Text+ could offer a repository that is user-friendly and enables students and researchers to query the data in an intuitive way. My personal experience is that the infrastructures of other data repositories for the humanities are difficult to navigate, and even the language they use is hard to understand for those who are not well versed in research data. The Text+ services should be developed in close contact with the target group, with constant feedback from potential users.

Solution

Once the Text+ services have been developed, they will probably require the data to be formatted in a specific way (e.g. XML, using a certain standard). It would be great if there were clear guidelines specifying how the data should be formatted. I think it would also be very useful to have suggestions for suitable software for editing the data, and potentially also advice on which legal aspects should be considered. There is a wide range of data standards in the humanities, which can be overwhelming for a student or researcher who wants to collect data that can be published and used by others. I believe it would be beneficial to specify just one or two data formats or standards that work with the future Text+ infrastructure, so that it is easier to become a specialist in one data format and to produce publishable data of high quality.

Challenges

Restricting the suitable data formats to just one or two will inevitably reduce the amount of information with which data can be annotated. Each data format has its own flaws, and none can be adapted to all types of data. Nevertheless, I believe that the advantages of having one clear data format substantially outweigh the disadvantages and will ultimately result in more users collecting data and making their data accessible. In the same vein, making a repository user-friendly means that the range of available features may need to be reduced, but again I feel strongly that a repository with basic features that is accessible to all users is worth more than a repository with more advanced features that is too complicated for the average researcher to use.