Motivation
In the context of the Akademienvorhaben (academies’ long-term project) “Travelling Humboldt – Science on the Move”, a hybrid documentary edition of Alexander von Humboldt’s travel journals, related correspondence and documents from Humboldt’s vast legacy collection is being prepared. One of its particularities and challenges is the multilinguality of the sources, with a large proportion of German and French texts, but also texts and passages in English, Spanish, Latin and other languages. As such, the edition is of importance not only for historians (102), especially for the history of science, but also for the philologies (104 & 105), including historical and computational linguistics as well as literary studies. Due to the international reception of Humboldt’s works and legacy, the need for a multilingual approach to the constitution and delivery of the edition’s corpus becomes even more apparent. This is especially true for anglo-, hispano- and francophone audiences.
Objectives
Currently, the digital edition’s search interface allows for querying all documents of the multilingual corpus, but only with rather limited, string-based searches and the most basic wildcard options. The search results cover only the language of the query, or rather naively match the query string against text in any language; they do not, for example, include translations of search terms and phrases into the other languages of the corpus. In addition, the historical variance in spelling and meaning within the different languages poses further challenges to a functional search for the edition humboldt digital.
With the DTA::CAB webservice, developed within the Deutsches Textarchiv/CLARIN-D projects, a satisfactory solution exists for lemmatising and orthographically normalising historical German texts. DTA::CAB is used within the edition humboldt digital, primarily for the correspondence included in the corpus. Similar solutions would be desirable for English, Spanish, Latin and the other languages found in the primary sources of the edition; they would have to be just as easy to handle, to integrate and to re-use within the framework of the edition.
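To illustrate how such a normalisation webservice can be consumed from an editorial workflow, the following is a minimal Python sketch. The endpoint URL, the parameter names and the assumed JSON layout are illustrative assumptions and would have to be checked against the DTA::CAB documentation; they are not presented here as the documented API.

```python
import requests

# Hypothetical CAB-style normalisation endpoint; URL, parameters and
# response layout are assumptions for illustration only.
CAB_URL = "https://www.deutschestextarchiv.de/demo/cab/query"

def normalise(text: str, analyser: str = "default") -> list[dict]:
    """Send historical German text to a CAB-style webservice and return
    a list of token analyses (original form, normalised form, lemma)."""
    response = requests.get(
        CAB_URL,
        params={"q": text, "a": analyser, "fmt": "json"},  # assumed parameter names
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()
    tokens = []
    # Assumed layout: one entry per sentence under "body", tokens below that,
    # with the original form in "text" and normalisation/lemma under "moot".
    for sentence in data.get("body", []):
        for tok in sentence.get("tokens", []):
            tokens.append({
                "orig": tok.get("text"),
                "norm": tok.get("moot", {}).get("word"),
                "lemma": tok.get("moot", {}).get("lemma"),
            })
    return tokens

if __name__ == "__main__":
    for tok in normalise("Jch hab' die Stadt gesehn."):
        print(tok)
```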
Ideally, the solutions should be combinable and include a translation and/or mapping service or algorithm for query terms and phrases, which could additionally be combined with a corpus-specific glossary or lexicon of fixed terms and concepts; a minimal sketch of such a mapping is given below. This would make it possible to query the corpus language-independently, while remaining aware of historical variance, language-specific features and the specifics of the corpus of documents edited by the edition humboldt digital. As a further benefit for our work within this edition, but also for users outside this context, the range of documents queried in this manner could be expanded to large text collections such as HathiTrust, Gallica, Google Books, etc., where numerous texts by and related to Alexander von Humboldt can be found in different languages.
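The following Python fragment sketches how a corpus-specific glossary could drive language-independent query expansion. The glossary entries, spellings and the overall structure are invented for illustration and are not actual data from the edition humboldt digital.

```python
# Minimal sketch of language-independent query expansion against a
# corpus-specific glossary. All entries below are invented examples.
GLOSSARY = {
    "vulkan": {
        "de": ["Vulkan", "Vulkane", "Vulcan"],   # incl. a historical spelling
        "fr": ["volcan", "volcans"],
        "es": ["volcán", "volcanes"],
        "en": ["volcano", "volcanoes"],
    },
}

def expand_query(term: str) -> list[str]:
    """Map a query term to all known variants and translations.

    Falls back to the literal term if the glossary has no entry."""
    entry = GLOSSARY.get(term.lower())
    if entry is None:
        return [term]
    variants = {term}
    for forms in entry.values():
        variants.update(forms)
    return sorted(variants)

# expand_query("Vulkan") -> ['Vulcan', 'Vulkan', 'Vulkane', 'volcan', ...]
```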
Most of Humboldt’s monographic works are listed and linked to repositories in the comprehensive table “Humboldt Digital: Die Digitalisate Bibliographie”, currently comprising 245 entries. In 2021, the complete collection of Humboldt’s articles and smaller publications (~3,600 printed works in total) will be made available by the University of Bern as TEI-XML-structured full-text transcriptions, accompanied by a structured bibliographic database (cf. https://humboldt.unibe.ch/text). These bibliographic data and multilingual full-text resources can, on the one hand, be used as additional training data; on the other hand, they should be integrated into the search facilities envisioned here for the edition humboldt digital as optional further collections, to be searched in the manner of a federated content search.
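To sketch the federated content search idea, the following Python fragment fans a query out to several collections in parallel and merges the hits. The endpoint URLs and the response shape are placeholders, not real APIs; an actual integration would use the collections’ own interfaces (e.g. SRU/CQL as in the CLARIN Federated Content Search).

```python
import concurrent.futures
import requests

# Placeholder endpoints for a federated content search; URLs and the
# shape of their responses are assumptions for illustration only.
ENDPOINTS = {
    "edition humboldt digital": "https://example.org/ehd/search",
    "Bern full texts": "https://example.org/unibe-humboldt/search",
}

def search_endpoint(name: str, url: str, query: str) -> list[dict]:
    """Query one collection and tag each hit with its source."""
    response = requests.get(url, params={"q": query}, timeout=30)
    response.raise_for_status()
    return [{"source": name, **hit} for hit in response.json().get("hits", [])]

def federated_search(query: str) -> list[dict]:
    """Fan the query out to all collections in parallel and merge the hits."""
    hits: list[dict] = []
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(search_endpoint, name, url, query)
            for name, url in ENDPOINTS.items()
        ]
        for future in concurrent.futures.as_completed(futures):
            hits.extend(future.result())
    return hits
```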
Solution
We would wish for a well-documented, freely accessible and re-usable set of language analysis tools that enable lemmatisation and orthographic normalisation for the above-mentioned languages. The individual tools would have to be combinable, i.e. usable in a tool chain that can be integrated within the framework of our scholarly edition and to which we can add corpus-specific terms and phrases. At the same time, the toolset could be used by other projects working with similar multilingual historical sources.
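As a rough illustration of what “combinable” could mean from the edition’s side, the following Python sketch chains token-level processing steps, with corpus-specific overrides as one step among others. All names, the token-level interface and the placeholder steps are assumptions for illustration, not an existing toolset.

```python
from typing import Callable, Iterable

# A processing step maps one token to one token; steps are composed
# into a pipeline per language. All names here are illustrative.
Step = Callable[[str], str]

def chain(*steps: Step) -> Step:
    """Compose processing steps into a single token-level pipeline."""
    def run(token: str) -> str:
        for step in steps:
            token = step(token)
        return token
    return run

# Corpus-specific overrides supplied by the edition (invented example).
CORPUS_LEXICON = {"Chimboraço": "Chimborazo"}

def corpus_override(token: str) -> str:
    return CORPUS_LEXICON.get(token, token)

def normalise_es(token: str) -> str:
    # Placeholder for an orthographic normaliser for historical Spanish.
    return token

def lemmatise_es(token: str) -> str:
    # Placeholder for a Spanish lemmatiser.
    return token

spanish_pipeline = chain(corpus_override, normalise_es, lemmatise_es)

def process(tokens: Iterable[str]) -> list[str]:
    return [spanish_pipeline(t) for t in tokens]
```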
Challenges
The envisioned framework will have to be available as a stable, well-documented and re-usable software package compatible with the TEI-XML- and eXist-db-oriented workflow of the edition humboldt digital. The search has to be scalable and perform satisfactorily (on the fly) even with a large number of TEI-XML-encoded documents, some of which feature complex markup and are of considerable size. For the translation and mapping of query terms, a controllable and adjustable machine-learning approach has to be combined with manual intervention by domain experts in a seamless and productive fashion.
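One conceivable way to combine a controllable machine-learning approach with manual intervention is a review queue: automatically suggested term mappings enter the production glossary only after, or together with, expert approval. The following Python sketch is purely illustrative; the data structures, field names and the confidence threshold are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Suggestion:
    term: str
    language: str
    candidate: str
    score: float                   # confidence from the mapping/translation model
    approved: bool | None = None   # None = awaiting expert review

@dataclass
class MappingStore:
    """Human-in-the-loop store for machine-suggested query-term mappings."""
    pending: list[Suggestion] = field(default_factory=list)
    approved: dict[tuple[str, str], str] = field(default_factory=dict)

    def suggest(self, s: Suggestion, auto_threshold: float = 0.95) -> None:
        """High-confidence suggestions may be pre-approved; the rest are queued."""
        if s.score >= auto_threshold:
            s.approved = True
            self.approved[(s.term, s.language)] = s.candidate
        else:
            self.pending.append(s)

    def review(self, s: Suggestion, accept: bool) -> None:
        """Record a domain expert's decision on a queued suggestion."""
        s.approved = accept
        self.pending.remove(s)
        if accept:
            self.approved[(s.term, s.language)] = s.candidate
```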
Review by community
Yes.