Topic Modeling Character Speeches Using the DARIAH-DE TopicsExplorer

Motivation

In my doctoral thesis in German literary studies, I investigate character speech: using a small corpus of 19th-century narratives, I examine how speech representation influences the characterization of the speaking character. Although my description of this phenomenon rests mainly on manual analysis, I am very interested in whether and how it can be at least partially automated. I therefore aim to compare the collected direct and indirect speech representations of individual characters. I have excluded other forms of speech representation, such as free indirect speech, narrated speech, and speech report, because in these forms the narrator’s voice overlays the character’s voice too strongly. I have, however, included thoughts in the evaluation, because they reproduce a character’s voice unaltered.

I am interested in comparing the collected speeches of individual characters beyond the individual work. I deliberately want to compare not only the characters of a single author, but also characters from works by different authors. For this reason, my corpus contains only one narrative per author.

As a first step, I would like to compare the content of the speeches, i.e. which topics dominate them.

Objectives

My goal is to find out which characters resemble each other in the topics they discuss. I have three expectations. First, character speeches by the same author may be similar, because the author’s style might dominate the topics discussed. Second, character speeches from the same work may be similar, since the plot of a work is always strongly reflected in the speeches. In this case, characters from different narratives with very similar plots would also resemble each other. Third, I hope that the speeches of the same types of characters will resemble each other, so that, for example, characters in love talk about similar topics, regardless of the work or author.

My second important goal is to find out how automatic tools can take over the work and which character speeches they consider similar. I do not primarily hope to simplify my work, but rather to apply these tools to a much larger corpus in order to show tendencies and similarities in literary history and to generally evaluate whether the conclusions I am trying to draw are plausible.

To answer this question based on my corpus, I used the DARIAH-DE TopicsExplorer, because it offers low-threshold access to the method of topic modeling (e.g. Blei 2012) and requires almost no programming knowledge. It proved very helpful, because I could load the collected speeches of each character as one complete file and compare them in a few steps. The TopicsExplorer also allows many parameters to be changed, which makes the evaluation easier and more varied.
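What it means for characters to "resemble each other" in their topics can be made concrete with a small sketch. It assumes that a topic model has already reduced each character's collected speeches to a vector of topic proportions, and it compares these vectors with cosine similarity. The character names and numbers are invented for illustration and are not results from my corpus; cosine similarity is only one possible measure and is not necessarily the one the TopicsExplorer uses internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equally long topic-proportion vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank_pairs(doc_topics):
    """Return all character pairs, most similar pair first."""
    names = sorted(doc_topics)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = cosine_similarity(doc_topics[first], doc_topics[second])
            pairs.append((first, second, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical topic proportions (three topics per character),
# made up for this example only.
doc_topics = {
    "character_a": [0.70, 0.20, 0.10],
    "character_b": [0.65, 0.25, 0.10],
    "character_c": [0.10, 0.15, 0.75],
}

for first, second, score in rank_pairs(doc_topics):
    print(f"{first} / {second}: {score:.3f}")
```

With real data, the dictionary would hold one entry per character file, and the ranked pairs could then be checked against the three expectations formulated in the objectives.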

The setup beforehand, in contrast, was cumbersome: the TopicsExplorer had to be downloaded and installed locally. The detailed documentation at least helped with this process. With large amounts of data, however, one’s own computer quickly reaches its limits. In this regard, Text+ could help to simplify the process.

It would also be desirable to compare the TopicsExplorer with other topic modeling tools or methods, such as MALLET (McCallum 2002), in order to try other default settings or to evaluate different topic modeling methods. Unfortunately, the technical barriers of other methods are too high for a user trained mainly in literary studies. Here, too, Text+ might come into play.

Solution

With regard to the TopicsExplorer, Text+ could provide a web application that requires nothing more than a simple upload of the texts to be analysed. It would be desirable to support as many file formats as possible, since format conversions are often technically challenging as well. All steps should be designed for easy and comfortable use, especially for users without a background in the digital humanities. A web application would also reduce the need for local computing power, since external computing resources could be used.

In this context, integration into the Language Resource Switchboard (Zinn 2018) would be an appropriate option, since the Switchboard currently offers only a Polish topic modeling tool and the topic modeling integrated in Voyant. It would also be desirable to provide further topic modeling methods and tools beyond the TopicsExplorer. This would benefit not only the analysis of character speeches, but would also enable numerous other evaluations.

Review by Community

I would be happy to test other topic modeling methods and would be interested to know how such an offering would be judged by the community, especially by members without a computer science background.

References

Blei, David M. (2012): Probabilistic Topic Models. Communications of the ACM 55(4), pp. 77-84.

McCallum, Andrew Kachites (2002): MALLET: A Machine Learning for Language Toolkit, https://mallet.cs.umass.edu.

Zinn, Claus (2018): The Language Resource Switchboard. Computational Linguistics 44(4), pp. 631-639.