Motivation

As a researcher in Modern and Current History (102-03), my task (DB) is to convert digitized and born-digital sources into structured markup according to the TEI Guidelines. As a researcher in American Studies and Digital History (102, 105), my task (JK) is to extract information from digitized texts in order to identify and categorize information fragments.

Objectives

At the German Historical Institute in Washington DC, we use a diverse range of digitized and born-digital sources to study transnational, global, and migration history. To improve the extraction of useful information from our textual collections (a bilingual collection of primary sources on German History, a corpus of migrant letters written from Germany to family members in North America, and historical German-language newspapers in the U.S.), we tag named entities such as persons, organizations, or place names and link them to authority files such as the GND (Integrated Authority File). The objective is not only to discover information about a named entity such as “Kossuth,” but also to categorize it efficiently, for instance in order to distinguish the famous Hungarian revolutionary from a namesake. Data management measures and services as envisioned by Text+ aim to support the creation and enrichment of such richly annotated collections.
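
To illustrate the target format, the following is a minimal sketch, assuming a TEI-encoded source and using Python’s lxml library, of how a recognized person name might be wrapped in a persName element and linked to the GND via its ref attribute. The sentence is invented; the identifier is the one for Albert Einstein cited in the Solution below.

    # Minimal sketch: wrap a recognized person name in a TEI <persName>
    # element and link it to the GND via @ref. The sentence is invented;
    # the GND identifier is Albert Einstein's, cited in the Solution section.
    from lxml import etree

    TEI_NS = "http://www.tei-c.org/ns/1.0"

    p = etree.Element(f"{{{TEI_NS}}}p", nsmap={None: TEI_NS})
    p.text = "A letter from "
    pers = etree.SubElement(
        p, f"{{{TEI_NS}}}persName", ref="https://d-nb.info/gnd/118529579"
    )
    pers.text = "Albert Einstein"
    pers.tail = " reached the family in 1933."

    print(etree.tostring(p, pretty_print=True).decode())

The same mechanism allows two mentions of “Kossuth” to point to different GND records, and thus to distinguish the revolutionary from a namesake.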

Solution

While there is a wealth of freely available command-line tools, libraries, and APIs for entity recognition, such as Stanford NER, spaCy, or WebLicht, that support the languages we work with (German/English) and can easily be integrated into the projects’ workflows, there is a lack of off-the-shelf tools or services for entity linking, i.e., connecting a mention of “Albert Einstein” to https://d-nb.info/gnd/118529579. In our projects, we therefore decided to use the proprietary API https://www.textrazor.com/. While the quality of the results is convincing and the resulting Wikidata identifiers can easily be mapped to corresponding GND identifiers, we would prefer an open solution that a) provides introspection into its routines in order to foster the transparency and reproducibility of results, and b) can be tuned, for instance through custom name/identifier lists, in the case of the migrant letters or the newspapers, where, unlike in our sources featuring generally known public figures, most people do not have corresponding Wikidata identifiers.
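
As a hedged illustration of how such tools slot into a workflow, the recognition step with spaCy (one of the free tools named above) might look as follows; the model name assumes the standard German spaCy model, and the example sentence is invented.

    # Sketch of the recognition step with spaCy; requires
    # `python -m spacy download de_core_news_sm` beforehand.
    import spacy

    nlp = spacy.load("de_core_news_sm")
    doc = nlp("Kossuth sprach 1852 vor deutschen Auswanderern in New York.")
    for ent in doc.ents:
        print(ent.text, ent.label_)  # e.g. "Kossuth PER", "New York LOC"

The mapping from Wikidata to GND mentioned above rests on the Wikidata property P227 (GND ID), which can be queried via the public SPARQL endpoint; the following sketch uses Q937 (Albert Einstein) and reproduces the identifier given above.

    # Sketch: resolve a Wikidata item (as returned by TextRazor) to its
    # GND identifier via Wikidata's SPARQL endpoint.
    # Q937 = Albert Einstein, P227 = GND ID.
    import requests

    query = "SELECT ?gnd WHERE { wd:Q937 wdt:P227 ?gnd }"
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "entity-linking-sketch/0.1"},
    )
    for row in resp.json()["results"]["bindings"]:
        print("https://d-nb.info/gnd/" + row["gnd"]["value"])
    # -> https://d-nb.info/gnd/118529579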

Challenges

High-quality named entity recognition, classification, and linking are crucial for our studies. However, our historical datasets give rise to a number of challenges: they are considerably smaller than “contemporary” corpora, contain different language varieties, and include a high rate of optical character recognition (OCR) errors. Although named entity linking has been actively researched in the open, e.g. in the context of the DBpedia Spotlight project, it is not yet clear whether a Text+ service building upon such research could readily provide results comparable to commercial solutions such as TextRazor.
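
For reference, such an open alternative can be tried directly: DBpedia Spotlight runs a public demo endpoint. The sketch below follows its publicly documented REST interface (the URL, parameters, and response keys are assumptions that may change) and says nothing yet about quality on noisy OCR input.

    # Hedged sketch: annotate a sentence with the public DBpedia Spotlight
    # demo endpoint. Endpoint and JSON keys follow the public documentation
    # and may change; this is an illustration, not production code.
    import requests

    resp = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": "Albert Einstein was born in Ulm.", "confidence": 0.5},
        headers={"Accept": "application/json"},
    )
    for res in resp.json().get("Resources", []):
        print(res["@surfaceForm"], "->", res["@URI"])
    # e.g. Albert Einstein -> http://dbpedia.org/resource/Albert_Einstein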

Review by community

Our studies fill an empirical gap by considering historical (bilingual) datasets. Should Text+ decide to provide an entity linking API, we commit to testing the service and comparing its results with those of our current API.

Daniel Burckhardt (DB), Jana Keck (JK)