Topic Detection and Cluster Labeling with Dornseiff

Motivation

Automatic detection and labeling of topics in any kind of text gives a faster overview and enables (semi-)automatic categorization of words or texts. This matters because the volume of text keeps growing and can no longer be read or processed manually. Many areas and applications that still rely on simple methods, or on no automatic methods at all, can benefit from reliable clustering and labeling, including news aggregation and the summarization of keywords and texts.

In my dissertation and research I focus on automatic clustering and topic labeling for given sets of words or documents. I use the Dornseiff subject groups as a source of topic keywords and descriptions. Combined with current language processing tools such as word and paragraph embeddings, they significantly improve on prior approaches, which mostly rely on simple statistics and have a high error rate. Because embeddings are compact numeric representations of text, they make it possible to exploit semantic similarity and to process text more efficiently in clustering, abstraction and similar tasks. The Dornseiff is of great help as high-quality auxiliary data: it is manually curated, and its subject groups span a wide range of topics, so it can be used for a variety of text genres and sources. There are no reasonable alternatives that would be usable without substantial manual post-processing. Each Dornseiff subject group contains a number of words that belong to it and characterize it. These words and labels can be used on their own to group and describe related words, or to infer new, distinct clusters of words that differ from the known subject groups. By generalizing the traits of the Dornseiff subject group labels, I can generate new labels that follow similar rules but were unknown beforehand. Combining this with keyword extraction methods or more complex language models enables clustering and labeling for sentences or documents.
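The labeling step described above can be illustrated with a minimal sketch. It is not the actual implementation: the three-dimensional vectors stand in for real word embeddings, the two subject groups and their labels are invented for illustration and are not taken from the Dornseiff, and the function name label_cluster and the threshold min_similarity are hypothetical. The sketch compares the centroid of a word cluster with the centroid of each subject group by cosine similarity and returns the best-matching label, or no label if nothing is similar enough, which would mark the cluster as a candidate for a new topic.

```python
import math

# Toy 3-dimensional vectors standing in for real word embeddings (assumption).
TOY_EMBEDDINGS = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.2, 0.0],
    "horse": [0.85, 0.05, 0.1],
    "violin": [0.0, 0.9, 0.1],
    "piano": [0.1, 0.8, 0.1],
    "flute": [0.05, 0.85, 0.0],
    "rocket": [0.0, 0.1, 0.95],
}

# Invented subject groups in the style of a thesaurus; not actual Dornseiff data.
SUBJECT_GROUPS = {
    "Animals": ["dog", "cat"],
    "Music": ["violin", "piano"],
}


def centroid(words):
    """Average the embedding vectors of the given words component-wise."""
    vectors = [TOY_EMBEDDINGS[word] for word in words]
    return [sum(components) / len(vectors) for components in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_cluster(words, min_similarity=0.8):
    """Return (label, score) for the most similar subject group.

    The label is None when the best score is below min_similarity, i.e. the
    cluster does not fit any known group well.
    """
    query = centroid(words)
    scored = (
        (label, cosine(query, centroid(members)))
        for label, members in SUBJECT_GROUPS.items()
    )
    best_label, best_score = max(scored, key=lambda pair: pair[1])
    if best_score < min_similarity:
        return None, best_score
    return best_label, best_score


if __name__ == "__main__":
    for cluster in (["horse", "cat"], ["flute"], ["rocket"]):
        print(cluster, label_cluster(cluster))
```

In this toy setup the words "horse" and "flute" are not members of any group but are still assigned to the nearest one, while "rocket" stays unlabeled. With real embeddings the threshold would have to be tuned on data.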

Objectives

The aim is to provide a tool or service that automatically clusters (categorizes) and labels sets of words or sentences (i.e. documents). It should be adaptable to specific topic subsets, given training data in the form of labelled categories, for texts that require a more granular distinction of topics. It must also be fast enough to be used on demand on large amounts of text, and it should be usable independently of the platform, e.g. remotely as a web service, locally in Docker or as a simple application.

Solution

Use in languages other than German requires digital datasets similar to the Dornseiff that can serve as a basis for topic categories and labels. For English there are some related resources, such as image captioning datasets and the Wikidata graph, that might be usable with enough manual curation. For other languages, and especially for those with few tools and resources, there is a clear need for such datasets.