Text+ Newsletter #3

Welcome!

We are pleased to present the third Text+ Newsletter. We warmly invite you to take a look at our project activities. We appreciate your feedback, questions, or suggestions, which can easily reach us via the Office.

The 3rd Text+ Plenary: Fascinating Insights into the World of Large Language Models (LLMs)

On October 10th and 11th, 2024, the historic premises of Mannheim Palace transformed into a center for innovation and exchange during the 3rd Text+ Plenary. Around 200 participants experienced an exciting program focused on “Large Language Models (LLMs) and their applications.”

Prior to the official start, the pre-conference tutorial “Large Language Models: A Practical Introduction,” presented by Jennifer Ecker, Pia Schwarz, and Rebecca Wilm, provided an excellent opportunity to familiarize oneself with the basics and practical applications of LLMs. Held in the lecture hall of the Leibniz Institute for the German Language, the tutorial was praised by participants as informative and inspiring. For those unable to attend, the presentation is available here.

The first day of the Plenary was dedicated to Artificial Intelligence and Machine Learning. Through a series of lectures, the use and development of these technologies in the humanities were demonstrated. Sina Zarrieß from Bielefeld University discussed how scientific questions can be examined even with smaller language models, while Juri Opitz spoke about the role of linguistics in automatic language processing. Maria Becker practically demonstrated how language models can help research the transfer of scientific findings to a broader public. Nils Reiter highlighted the variation in results depending on prompt changes to generative models, and Anne Lauscher pointed out the social components of using large language models. Lalith Manjunath concluded the lecture series with technical aspects, including his experience with openGPT-x. A poster session featuring 19 selected contributions provided additional opportunities to showcase a broad range of expertise and experience with LLMs, from their development to their concrete applications in research. The highlight of the session was the awarding of the Best Poster Award, with a prize of EUR 150. The winners were:

Margret Mundorf (University of Heidelberg) for “Legal Linguistic Memos with Large Language Models: Automated Capture and Classification of Case Descriptions in Family Law”
Steffen Steiner and Frank Krüger (University of Wismar) for “SwineBad: Table Extraction and Information Structuring from the Swinemünder Badeanzeiger”
Eric Dubey, Matteo Lorenzini, Martin Reisacher (University of Basel), and Tim Rüdiger (Central Library Zurich) for “SwissGB4Science - a Full-Text Corpus for Research”

Text+ Plenary 2024, Schloss Mannheim

The day concluded with a casual evening reception, where participants had the chance to exchange ideas in a relaxed atmosphere and make new contacts.

The second day was dedicated to internal project meetings of Text+. Around 100 project members came together to discuss their ongoing projects in working groups and task areas, strengthening collaboration.

A heartfelt thank you to all presenters and participants for their valuable contributions and active exchange! We look forward to the next Text+ Plenary, which will take place on June 16th and 17th in Göttingen, ahead of the European DARIAH Annual Event.

Coordination Committees Elections

The biennial elections for the Text+ Coordination Committees are approaching. The election date is November 6, 2024. Voting will be conducted electronically, with voting open from the election date until November 13, 2024.

The Coordination Committees are the primary decision-making bodies for the Text+ communities. They consist of three Scientific Coordination Committees, each responsible for one of the data domains (Collections, Editions, Lexical Resources), and one Operations Coordination Committee. Their role is to continuously evaluate and expand the portfolio of data, tools, and services. The committees are composed of experts from the respective domains and are elected every two years.

The following committees will be elected for Text+:

Scientific Coordination Committee for the Task Area Collection
Scientific Coordination Committee for the Task Area Lexical Resources
Scientific Coordination Committee for the Task Area Editions
Operations Coordination Committee for the Task Area Infrastructure/Operations

Representatives of eligible organizations and institutions will receive access credentials for the voting system on the election date.

For more information, please visit the Text+ Portal. Any questions can be directed to the election committee at office@text-plus.org.

Blog Highlights

In this section, we present interesting posts from the Text+ blog. The blog provides information about Text+ and supplements the website by offering work in progress or a closer look at individual topics. All posts are DOI-enabled and citable.

Guest contributions on topics of interest to the Text+ community are warmly welcome! Get in touch if you have a topic.

Textbox in the TextGrid Repository

Integrating existing resources is a central goal of the NFDI, and this post explores that topic. Textbox presents a series of corpora from Romance studies integrated into the TextGrid Repository, utilizing a new feature of TextGrid: genre classification via the GND.

The foundation for smooth data integration lies in the appropriate interfaces and workflows, allowing new content to be related to existing resources.

With this aim, the new import workflow, called Fluffy Import, was developed. It simplifies the upload of TEI documents to the TextGrid Repository and improves metadata quality. This workflow was introduced in a workshop, and we look forward to sharing more updates soon.

Textbox includes nine corpora of literary texts in Romance languages—French, Italian, Spanish, and Portuguese (sorted by the number of texts). Created within the CLiGs project under the leadership of Christof Schöch (University of Trier) at Fotis Jannidis’s chair in Würzburg, Textbox served as a test for the subsequent ELTeC corpora, which can also be found in the TextGrid Repository. Like ELTeC, Textbox is already available on GitHub and Zenodo. A related user story for Textbox was submitted in the Text+ User Stories Call.

Please read more in the Text+ Blog.

To cite this blog post:
José Calvo Tello (October 24, 2024). Textbox in TextGrid Repository. Text+ Blog. Accessed on October 28, 2024, from https://doi.org/10.58079/12kbl

Workshop Reports

Redesigned Website for Text+ Data and Competence Centers

The Text+ Centers website has been thoroughly updated and redesigned. The goal was to align the information with the respective partners, ensuring an updated, clear, and uniform presentation.

The data and competence centers support the distributed Text+ infrastructure: Data centers focus on the collection, storage,and provision of research data, while competence centers provide specialized support in data management. The Text+ centers are organized into Collections, Lexical Resources, Editions, and Infrastructure/Operations, working in thematic clusters to meet specific data requirements.

The website was designed so that data and competence centers can be filtered by work areas and clusters. By clicking on “More Info,” users can access detailed information about each center, including available data and services, the ability to accept data from third parties, and the relevant contact persons. The revised structure facilitates targeted navigation, enabling users to quickly find relevant information on the various centers.

Style Guide

The website has been expanded with information and downloads on corporate identity. This page offers guidance and files for easy and stylish use of the Text+ brand, providing orientation for creating project materials of all kinds. Key design elements of the consortium—logo, colors, and font—are covered.

Additionally, there is a suggestion for referencing Text+ funding by the DFG in publications, lectures, or websites.

Correct Use of the Logo

Zotero4NFDI: Cooperation between NFDI4Ing and Text+

At the Text+ Office, Python scripts are regularly developed to automate various tasks, including reading content from tables and databases or analyzing social media usage data. Similar approaches are also pursued within the NFDI consortium NFDI4Ing.

Since numerous scripts are developed and used in both consortia, it makes sense that other NFDI consortia could benefit from using them. Consequently, Text+ and NFDI4Ing are planning to publish their scripts together in umbrella publications.

As the first publication, scripts for extracting individual or all publications from a Zotero library have been uploaded, as both consortia systematically store their publication metadata in their own Zotero bibliographies. The Text+ publications can be found at https://zenodo.org/records/12605448 (DOI: 10.5281/zenodo.12605448), and those of NFDI4Ing at https://zenodo.org/records/12680673 (DOI: 10.5281/zenodo.12680673). All scripts are licensed under CC BY 4.0, and reuse is explicitly encouraged.

More joint publications are planned for the future.

Results of the 2024 tendering process for the promotion of collaboration projects through Text+."

Every year, the NFDI consortium Text+ awards funding for cooperation projects in order to continuously expand the data and services offered by Text+ and make them available to the research community in the long term. To this end, projects are funded for a maximum of one calendar year, the results of which can be integrated into the Text+ infrastructure.

We are pleased to announce that five collaborative projects have been awarded funding in the 2024 call for proposals:

‘Text+-Schnittstellen zu den Interview-Sammlungen in Oral-History.Digital (text+oh.d)’ submitted by Dr. Cord Pagenstecher (University Library of Freie Universität Berlin) for the data domain Collections
‘Graeco-Arabicum – Open Data (GlossGA – OD)’ submitted by Dr. Rüdiger Arnzen (Friedrich-Alexander-Universität Erlangen-Nürnberg) for the data domain Lexical Resources
‘GND-basierte Webservices – Beaconizer & Discoverer (Hagrid)’ submitted by Dr. Frank Grieshaber (Heidelberg Academy of Sciences and Humanities) in the Infrastructure/Operations
‘Aufbau einer offenen digitalen Sammlung historischer musiktheoretischer Texte aus dem deutschsprachigen Raum anhand von Beispielen aus dem 19. Jahrhundert (DigiMusTh)’ submitted by Prof. Dr. Fabian C. Moss (Julius-Maximilians-Universität of Würzburg) for the data domain Collections
‘LOD-Rollen-Modellierungen aus den Registern von Regestenwerken zum Mittelalter (LRM)’ submitted by Prof. Dr. Andreas Kuczera (Academy of Sciences and Literature Mainz) for the data domain Editions

We are excited about the innovative approaches and look forward to a productive and close collaboration in the coming year.

Publications, Services & Information Offers

Text+ maintains its bibliography on Zotero and presents a structured view on its Portal.

Understanding Speech: AI and Spoken Language

The summary of the presentations from the community workshop “Understanding Speech: AI and Spoken Language” has been published:

Draxler, C. (2024). AI and Spoken Language. https://doi.org/10.5281/zenodo.12606959.

The workshop, held on June 27-28, 2024, in Munich, addressed questions about dialect processing by AI, the transcription of large volumes of spoken data, and innovative methods for language acquisition.

On the first day, Johannes Prenninger (BMW Group) delivered a keynote on how modern connected cars generate vast amounts of speech data and how it is processed to develop user-friendly voice control systems. Speech recognition in cars responds only to specific commands, ensuring no private conversations are monitored. Subsequent expert presentations covered various transcription tools and their applications in research.

The second day began with a keynote by Barbara Plank (Ludwig-Maximilians-University Munich), addressing the challenges of AI processing of dialects and non-standard languages often found in social media or citizen science projects. Later presentations showcased projects, including the automatic transcription of popular podcasts like “Fest & Flauschig” using the Dresden High-Performance Computing Cluster. Another highlight was the presentation of a new palatography system using optical sensors, applied in speech therapy to analyze speech patterns.

To conclude, the Bavarian Archive for Speech Signals (BAS) presented its new web services integrating modern speech recognition systems such as UWEBASR and whisperX, offering various options for transcribing speech recordings. Participants learned how these tools can efficiently create raw transcriptions.

Further details about the workshop can be found in the Text+ blog post by Christoph Draxler and Philippe Genêt, “When Artificial Intelligence Meets Spoken Language: The 2024 Collections Community Workshop” (from 25.08.2024, https://textplus.hypotheses.org/11229).

Events & Reports

Digital Humanities Open Garden 2024

On June 13, 2024, the Forum Digital Humanities Leipzig (FDHL) hosted this year’s Digital Humanities Open Garden at the Saxon Academy of Sciences in Leipzig (SAW Leipzig). The goal of the event was to showcase current work in the field of digital humanities in an informal setting and to encourage exchange and networking. The event targets all DH enthusiasts, especially students in relevant fields.

This year, the lecture program featured representatives from two Text+ consortium institutions: the German National Library (DNB) and SAW Leipzig.

Dr. Ramon Voges presented “Chatbots, Watermarks, and Digital Legacies,” current DH projects at the German National Library. As the central archive library, the DNB’s task is to collect, permanently archive, and catalog all works published in Germany, as well as works about Germany and in the German language worldwide. The DNB operates locations in Leipzig and Frankfurt am Main and houses specialized collections such as the German Book and Writing Museum, the German Exile Archive, and the German Music Archive.

DNB projects are conducted within the legal constraints of copyright and archival obligations. However, there are many collaboration opportunities with DNB. This was demonstrated through three examples:

Chatbot: This project develops a chatbot to answer frequently asked questions from DNB users. It uses a retrieval-augmented generation system that can utilize locally stored, copyright-protected content to generate precise and context-based responses.
Watermarks4Point0: A cooperation project aimed at identifying and classifying watermarks in historical papers, using methods such as CycleGANs and K-nearest neighbor algorithms for pattern recognition and classification.
Digital Legacies: This project focuses on designing workflows for securing, curating, and providing digital legacies.

The DNB’s Scientific Service and DNBLab offer extensive support and access to digital holdings and infrastructure. The Scientific Service supports projects working with DNB holdings or data, while the DNBLab serves as a platform for accessing and working with digital objects.

In the second presentation, “Lamento – Latrine – Leyptzigk. Entity-Based Content Search in Distributed Resources” (slides), representatives of the Saxon Academy of Sciences in Leipzig (Thomas Eckart, Felix Helfer, Uwe Kretschmer, Erik Körner) presented the current development within Text+ of expanding Federated Content Search (FCS) with entity-based search methods.

Using real-world examples from dictionaries, corpora, and editions, the significance of central knowledge bases like the Integrated Authority File (GND), GeoNames, Wikidata, and others for modern research questions and networking in modern research data infrastructures was illustrated. Current work at the SAW on entity linking demonstrated the range of possible approaches to resource annotation through semi-automatic entity annotation (including the use of large language models) and also touched on the work of the corresponding Text+ task force.

Afterwards, the issue of utilizing existing annotations in the context of federated search and research services in Text+ was emphasized. This was demonstrated by the currently ongoing expansion of the established Federated Content Search, which serves user-friendly research in research data, through suitable request and presentation options for various forms of normative data for seamless integration into existing user interfaces. Various live demonstrations based on the current state of development rounded off the lecture.

After the lectures, the discussion continued in a relaxed atmosphere with a barbecue in the academy garden.

Further information on the event: https://fdhl.info/opengarden2024/ and https://www.saw-leipzig.de/de/aktuelles/digital-humanities-open-garden-2024

Text+ Interim Report

At the end of September, the Text+ interim report was submitted on time to the DFG, in accordance with their strict guidelines (max. 25 pages, with a specified structure based on key questions). The report includes three parts:

a public part for publishing on the DFG website,
an internal part of 25 pages
and a data sheet with indicators on Text+, which was compiled via a survey of all cooperation partners.

The public part of the report is already available on the DFG website and can be read there (Link to the document).

Dates

All events – both upcoming and past – can also be found in our event calendar on the Text+ portal..

Datum	Event	Ort
30 October 2024	NFDI und Spezialbibliotheken im Gespräch – eine Umfrage des NFDI Konsortiums Text+ zu Katalogdaten von Bibliotheken	virtuell
31 October 2024	Text+ Research Rendezvous	virtuell
06 November 2024	IO-Lecture: Migration von RocketChat zu Matrix	virtuell
12 November 2024	Text+ Research Rendezvous	virtuell
14 November 2024	Erschließen, Forschen, Analysieren	virtuell
18/19 November 2024	Digitale Wörterwelten: Einblick in die Text+ Infrastruktur	Berlin
20/21 November 2024	1st Base4NFDI User Conference (UC4B2024)	Berlin
27 November 2024	5. FID / Text+ Jour Fixe - Verzeichnen und Ablegen	SUB Göttingen
28 November 2024	Text+ Research Rendezvous	virtuell
04 December 2024	IO-Lecture: Wie kommt mein Dienst ins Portal?	virtuell
04 December 2024	Verknüpfung und Kontextualisierung: Die Gemeinsame Normdatei als ein PID-System für Kulturelle Objekte in GLAM-Institutionen	virtuell
10 December 2024	Text+ Research Rendezvous	virtuell
10 December 2024	GND-Forum NFDI & Co.	virtuell