Creation and enrichment of HTML/text-based research collections

Motivation

Numerous digital resources in the humanities have been implemented as “Thematic Research Collections” in the form of multimedia websites (Unsworth 2000, https://people.virginia.edu/~jmu2m//MLA.00). Although multimedia plays an important role, the formal structure of the web pages is essentially text-based, and their HTML markup is only semi-structured. Furthermore, much research material (e.g. transcripts or translations) is still “processed” with word processing software and thus remains unstructured, although an at least semi-structured creation and representation (e.g. with HTML/XHTML) would be possible here as well.

Objectives

In contrast to more complex web-based systems and virtual work environments, some of these research collections can be hosted as static websites with little effort and, in particular, with fewer services required from commercial providers. The Internet Archive, DNB and BSB also offer archiving options for websites. Text+ could help to make these resources more usable by a) extracting the texts, which are primarily encoded in HTML, so that each becomes identifiable and addressable as a data set within the collection or website, and b) enriching them, e.g. by converting them into TEI or by adapting or suggesting transformation scenarios for them, so that the data set becomes machine-readable at a deeper level. A further question is c) whether such “Thematic Research Collections” could in future also be created outside complex, database-driven web environments, since static collections are simple to create and maintain. This would, however, require suggestions for structuring and formalization. It could also be extended to recommendations for basic structures for text-based source data (transcripts, excerpts, translations) that do not depend on the XML-based markup standards of linguistics and scholarly editing, which do not fit several research settings in the humanities disciplines.
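
Steps a) and b) could look roughly as follows. This is only a minimal sketch, not a proposed Text+ workflow: it uses the Python standard library, assumes that the relevant text sits in HTML <p> elements, and wraps it in a bare TEI skeleton with one addressable xml:id per paragraph. The names ParagraphExtractor and html_to_tei, the example URL and the wording of the header entries are hypothetical, and the output is not validated against a TEI schema.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"


class ParagraphExtractor(HTMLParser):
    """Collect the page title and the plain text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._buffer = None  # text chunks of the currently open <p>

    def _flush(self):
        # Store the open paragraph with whitespace normalised.
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._flush()  # tolerate <p> without a closing tag
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._buffer is not None:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def html_to_tei(html, source_url):
    """Return a minimal TEI document (as a string) for one HTML page."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()

    ET.register_namespace("", TEI_NS)  # serialise TEI as default namespace

    def q(name):
        return f"{{{TEI_NS}}}{name}"

    tei = ET.Element(q("TEI"))
    file_desc = ET.SubElement(ET.SubElement(tei, q("teiHeader")), q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = parser.title.strip()
    publication = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(publication, q("p")).text = "Converted from a static HTML page."
    source_desc = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(source_desc, q("p")).text = source_url

    body = ET.SubElement(ET.SubElement(tei, q("text")), q("body"))
    for number, paragraph in enumerate(parser.paragraphs, start=1):
        p = ET.SubElement(body, q("p"))
        p.set(f"{{{XML_NS}}}id", f"p{number}")  # addressable unit
        p.text = paragraph
    return ET.tostring(tei, encoding="unicode")
```

Calling html_to_tei(page_source, "https://example.org/letters.html") on the HTML of a single page returns the TEI text as a string. Inline markup such as <em> is deliberately flattened to plain text here; a real transformation scenario would have to decide which HTML structures to map to which TEI elements.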