Motivation

My particular interest is centred around Heinrich von Kleist (1777-1811) and his poems, novels, and dramas. I have put together an extensive collection of his works in TEI P5 format. Recently, I have also developed a keen interest in the works of Andreas Gryphius (1616-1664) and have constructed a similar TEI-based collection. In the near future, I would like to perform a systematic study to compare the theatre plays of Kleist with those of Gryphius, which I would later extend to their entire works.

One research question tackles the nature of the characters both writers introduce in their plays and how the characters’ nature develops throughout each writer’s lifetime. To facilitate such research, it’d be useful to have (and possibly extend) the following capabilities:

  • the capability to easily create virtual collections of, say, theatre plays, to group together works of a given nature
  • the capability to perform a study within the realm of a given virtual collection.

Within a given virtual collection, it’d be useful to have tools to:

  • convert TEI-based plays (or selected parts thereof) into plain text
  • expose plain text to various automatic text processing tasks such as part-of-speech tagging and named entity recognition
  • help create a lexicon where words can be linked to WordNet, and where words can be pre-processed for sentiment analysis
  • help align the characters of a theatre play (or novel) with the emotions they elicit or are exposed to
  • help align characters and their characteristics on time lines to investigate whether authors’ characters develop in time
  • compare character time lines of different authors
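
As a minimal sketch of the first item in this list, the following Python snippet turns the speeches of a TEI-encoded play into plain text per speaker. It assumes that speeches are encoded as <sp who="#id"> elements containing <l> or <p> children in the TEI namespace; the function name speeches_as_text and the sample fragment are hypothetical and only serve as illustration, so real collections may need a different selection of elements.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; elements in a TEI document live in this namespace.
TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}


def speeches_as_text(tei_xml: str) -> list[tuple[str, str]]:
    """Return (speaker id, plain text) pairs, one per <sp> element.

    Hypothetical helper: it keeps the text of <l> and <p> elements and
    ignores <speaker> labels and stage directions that are direct
    children of <sp>. Stage directions nested inside a line are kept.
    """
    root = ET.fromstring(tei_xml)
    wanted = {f"{{{TEI}}}l", f"{{{TEI}}}p"}
    result = []
    for sp in root.iterfind(".//tei:sp", NS):
        who = sp.get("who", "").lstrip("#")
        parts = []
        for element in sp.iter():
            if element.tag in wanted:
                # Join all text inside the element and normalise whitespace.
                parts.append(" ".join("".join(element.itertext()).split()))
        result.append((who, " ".join(parts)))
    return result


# Invented sample fragment (placeholder text, not taken from any play).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<sp who="#figure1"><speaker>Figure 1</speaker><l>First line,</l><l>second line.</l></sp>
<sp who="#figure2"><speaker>Figure 2</speaker><stage>aside</stage><p>Some   prose.</p></sp>
</body></text></TEI>"""

for who, text in speeches_as_text(SAMPLE):
    print(who, "|", text)
```

Plain text of this kind, keyed by speaker, could then be handed to part-of-speech taggers or named entity recognisers, and the speaker ids give a starting point for aligning characters with emotions.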

Objectives

  • Have a tool that helps with the creation of virtual collections (and their proper description with semantically-rich metadata) so that I can operationalise virtual collections (see tool use) and share them with my research fellows.
  • Allow tool use within the context of a given virtual collection; this includes having a tool that helps me browse a virtual collection (which may consist of other virtual collections or TEI-based / textual leaf nodes) and, from there, invoke the tools to process a selected collection node.
  • Computer support for attaching the output of tool use easily to the resource, say, as an additional annotation layer.
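
To make the second objective more concrete, here is a rough sketch of how a nested virtual collection with textual leaf nodes might be modelled, and how a tool could be invoked on a selected node. All names (Node, leaves, apply_tool) and the file paths are assumptions made for illustration; this is not the data model of the CLARIN Virtual Collection Registry.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Iterator


@dataclass
class Node:
    """A collection node: either a sub-collection or a textual leaf."""
    title: str
    metadata: dict[str, str] = field(default_factory=dict)
    text_path: str | None = None  # set for leaf nodes, e.g. a TEI file
    children: list[Node] = field(default_factory=list)  # sub-collections or leaves

    def leaves(self) -> Iterator[Node]:
        """Yield every textual leaf at or below this node, depth first."""
        if self.text_path is not None:
            yield self
        for child in self.children:
            yield from child.leaves()


def apply_tool(node: Node, tool: Callable[[Node], object]) -> dict[str, object]:
    """Run a tool on every leaf below the selected node, keyed by title."""
    return {leaf.title: tool(leaf) for leaf in node.leaves()}


# Hypothetical collection of plays; the file paths are placeholders.
plays = Node("Theatre plays", children=[
    Node("Kleist", children=[
        Node("Der zerbrochne Krug", {"author": "Kleist"}, "kleist/krug.xml"),
        Node("Penthesilea", {"author": "Kleist"}, "kleist/penthesilea.xml"),
    ]),
    Node("Gryphius", children=[
        Node("Leo Armenius", {"author": "Gryphius"}, "gryphius/leo.xml"),
    ]),
])

# A trivial stand-in for a real tool: report the author of each leaf.
print(apply_tool(plays, lambda leaf: leaf.metadata["author"]))
```

Selecting the "Kleist" node instead of the root would restrict the same tool to that sub-collection, which is the kind of node-level invocation I have in mind.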

Solution

  • Have a virtual collection explorer that helps navigate through virtual collections and their parts, using metadata and content-based search.
  • Be able to create private virtual collections (off-line) that can be shared with the community at a later stage (made public and on-line).
  • Have a way to easily invoke text processing tools (browser-based tools or web services) in the context of the virtual collection.
  • Have a way to easily add the output of tools as additional annotation layers to the TEI-based content.

Challenges

Parts of the functionality are already provided by various tools, but these tools need to be integrated to allow for ease of use and a smooth flow between them. The CLARIN Virtual Collection Registry allows users to create virtual collections, but it is hard (impossible?) to generate, browse and operate on highly structured collections. Also, the Virtual Collection Registry does not offer a private mode where I can test-run my studies on my research data before publishing it. Offering such a private mode (after login) would certainly be a challenge.

While the Switchboard helps identify and invoke tools, there is no support to properly attach the output of these tools to existing annotation layers. Here, some “glue” support would help. I understand that WebLicht offers tool pipelines, where each tool in the pipeline attaches its result to the overall pipeline output. It would be nice to have something similar in place for tools outside of WebLicht, say for those offered by the Switchboard. Presumably, tools connected to the Switchboard would need to offer a standard (XML-based?) output such as TCF. This would make it easier to interpret the results of the tools and to incorporate them into existing analyses. Getting tool developers to supply such a common format (or to agree on a common format) is certainly a challenge.

The other tools I mentioned are particular to my project (alignment of characters and emotions on time lines), so presumably I would need to develop them myself. Is there a CLARIN forum in place where I could share and discuss such ideas, possibly to find interested parties that would contribute to such tools?