Motivation

As a researcher in Asian Studies based in China, my goal is to identify topics of common interest to Chinese and Western academics and to make them accessible to both audiences. The history of the “Deutschland-Institut” in Beijing (1932–1950), an institution that played a crucial role in Sino-German academic collaboration, is such a topic. The published and unpublished sources on the history of this institute are in both Chinese and German and have not yet been systematically evaluated. To illuminate the history of the institution and the people involved, a large number of sources will have to be translated and categorized to allow for cross-referencing. Supplying texts in translation, with the corresponding metadata in Chinese and at least one Western language, is thus crucial for this project.

  • Presenting the relevant documents on the Internet requires digital full texts. Since no digital copies exist for most of the relevant manuscripts and typescripts, digitization and optical character recognition (OCR) are necessary to create them. Although many OCR tools, such as ABBYY FineReader, have supported Chinese character recognition for some years, digitizing historical documents, manuscripts in particular, remains a challenge. There are some approaches to improving the optical character recognition of such texts, for instance systems that rely on machine learning algorithms (https://ieeexplore.ieee.org/abstract/document/7783877, Transkribus), but there is still no generally reliable solution.
  • The digitized documents (images) and their transcriptions should be published on a web system that supports metadata in three languages (English, German, Chinese), both for the individual files and for the collections.
  • The presentation layer of the website should also be available in three languages (English, German, Chinese). Ideally, a synoptic (side-by-side) view of each digitized image and its transcription should be available.
  • The web system should be linked to reference management software that supports Chinese. The software should be able to handle various types of bibliographical data and allow citation styles to be adapted manually, for both Chinese and Western systems. Technical interfaces should furthermore allow the import of literature, websites and other types of scholarly publications.
  • Finally, unclear usage rights may give rise to legal issues: for some documents, it may be difficult to determine who holds the copyright, the publication rights or the exploitation rights. To avoid litigation, professional advice is needed to clarify the conditions and licenses under which these documents can be published and reused.

The funders may have three requirements:

  1. It must be ensured that the documents can be published without legal uncertainty.
  2. The documents have to be provided to the users and stored in a sustainable way, for example in accordance with the FAIR data principles.
  3. The project must include a strategy for publishing the data and metadata created in the course of the project and for ensuring their long-term preservation.

Objectives

Text+ could be a valuable partner for the project in three areas:

  1. Text+ could offer consultation and training in order to make the data created in the course of the project FAIR.
  2. Text+ could provide a repository for long-term archiving of the data.
  3. Text+ could offer consulting on potential digital methods for further analysis of the documents in future projects.

The project can be assigned to the data domain “Collections”.

  • The members of the Text+ consortium have years of experience in working with digital textual data. They may suggest interesting scenarios for further development and reuse of the data created in the course of the project.
  • With the TextGrid Repository, Text+ includes a certified repository that was designed to meet the demands of different user communities regarding text data. The TextGrid Repository would thus be an ideal solution for publishing the data created in the course of the project.
  • The Language Resource Switchboard provides access to a variety of tools for the analysis of text and language data and would thus be valuable for further work with the documents.

Solution

  • Training may be needed for storing the files in the TextGrid Repository.
  • Consulting is necessary for creating an appropriate metadata schema and for assigning various kinds of metadata to the files and collections.
  • Both are important for storing and publishing the files according to the FAIR data principles.
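
To make the metadata requirement more concrete, the following is a minimal sketch of how a trilingual metadata record could be modelled and checked for completeness before ingest. It is an illustration only and does not reflect the TextGrid metadata schema: the class names, the field names (identifier, title, description, authority_ids), the language codes and all sample values are assumptions made for this example.

```python
from dataclasses import dataclass, field

# Assumed language codes for the three project languages:
# English, German, Chinese.
LANGUAGES = ("en", "de", "zh")


@dataclass
class MultilingualText:
    """One metadata value given in several languages, keyed by language code."""
    values: dict[str, str]

    def missing_languages(self) -> list[str]:
        # A language counts as missing if it is absent or only whitespace.
        return [lang for lang in LANGUAGES
                if not self.values.get(lang, "").strip()]


@dataclass
class Record:
    """Hypothetical metadata record for a single file or a collection."""
    identifier: str
    title: MultilingualText
    description: MultilingualText
    # Hypothetical field for identifiers taken from authority files.
    authority_ids: list[str] = field(default_factory=list)

    def missing(self) -> dict[str, list[str]]:
        """Return, per multilingual field, the languages still to be supplied."""
        gaps = {}
        for name in ("title", "description"):
            langs = getattr(self, name).missing_languages()
            if langs:
                gaps[name] = langs
        return gaps


# Invented placeholder values, not taken from the project sources.
record = Record(
    identifier="doc-0001",
    title=MultilingualText(
        {"en": "Sample letter", "de": "Beispielbrief", "zh": "示例信件"}
    ),
    description=MultilingualText(
        {"en": "Placeholder description", "de": "Platzhalterbeschreibung"}
    ),
)

print(record.missing())
```

For the sample record, the check reports that the Chinese description is still missing. The same structure could be applied to collections as well as to individual files; the actual schema would have to be worked out in the consulting described above.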

Use of data sources

  • The data, mainly digital texts and images, will be created by retro-digitizing texts.

Use of services/tools/components

  • TextGrid Repository
  • Language Resource Switchboard

Use of standards/methods

  • Digitization
  • Metadata: authority files in German and Chinese