Large Language Models (LLMs) and Artificial Intelligence

Against the backdrop of the rapid development of Large Language Models (LLMs), Text+ aims to demonstrate possible applications of generative language models, artificial intelligence and transformer models in research. To this end, the consortium intends to draw both on its extensive holdings of language and text data and on the high-performance computing centres at its partner institutions.

The aim of Text+ is to provide applications and services for scientific communities that make use of LLMs. In addition, the Text+ centres want to curate their language and text resources to a quality suitable for training language models. Text+ will provide models adapted to specific tasks, through fine-tuning of pre-trained models or retrieval-augmented generation (RAG), as well as resources (data and computing power) for researchers to fine-tune models themselves. Text+ also wants to explore how material with (copyright) access restrictions can be integrated into LLMs, whether and how LLMs can be trained on derived text formats, and for which research questions LLMs are suitable.
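
To make the RAG approach mentioned above concrete, the following is a minimal sketch assuming the sentence-transformers library; the corpus passages, the model name and the generate() placeholder are illustrative stand-ins for Text+ holdings and whichever LLM is actually deployed, not a description of an existing Text+ service.

```python
# Minimal RAG sketch: retrieve the most relevant passages for a question and
# prepend them to the LLM prompt. All names below are illustrative only.
from sentence_transformers import SentenceTransformer, util

corpus = [  # stands in for passages from Text+ language and text resources
    "GermaNet is a lexical-semantic network for German.",
    "The Federated Content Search queries distributed text collections.",
    "Derived text formats make restricted material analysable.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder
corpus_emb = encoder.encode(corpus, convert_to_tensor=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the question."""
    hits = util.semantic_search(
        encoder.encode(question, convert_to_tensor=True), corpus_emb, top_k=k
    )[0]
    return [corpus[hit["corpus_id"]] for hit in hits]

def generate(prompt: str) -> str:
    # Placeholder for a call to whatever LLM is actually deployed.
    return f"[LLM completion for: {prompt[:60]}...]"

def answer(question: str) -> str:
    # Ground the LLM by prepending the retrieved passages to the prompt.
    context = "\n".join(retrieve(question))
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

print(answer("What does the Federated Content Search do?"))
```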

As examples, the following specific use cases are to be implemented in the short term:

  • Data preprocessing using the example of Named Entity Recognition (NER): LLMs support data preprocessing ahead of applying a specially trained NER model (see the pre-annotation sketch after this list).
  • Runtime environment for NLP tools: Classifiers (e.g. from MONAPipe in Text+) are served from containers via an API and backed by GPU nodes so that deep learning models can be used effectively (see the API sketch after this list).
  • Generation of example sentences and context: LLMs are to support the enrichment of entries in the GermaNet lexical-semantic word network (see the example-sentence sketch after this list).
  • Query generation for search support in the Federated Content Search (FCS) of Text+: An LLM-based chatbot will support exploration of the FCS and help translate natural-language questions into syntactically correct search queries for the FCS (see the query-translation sketch after this list).
  • Entity Linking: LLMs support the linking of named entities in full texts to authority files such as the GND or to knowledge bases such as Wikidata (see the candidate-lookup sketch after this list).
  • Historical normalisation: LLMs fine-tuned on data from historical collections normalise divergent spellings from different eras (see the normalisation sketch after this list).
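
Pre-annotation sketch for the NER use case: a hedged illustration of how an LLM could pre-annotate raw text before a dedicated NER model is trained or applied. The prompt wording and the complete() stub are assumptions, not the actual Text+ pipeline.

```python
# Sketch: ask an LLM to pre-annotate entities as JSON, then parse the reply
# into (text, label) records that a downstream NER model could learn from.
import json

def build_prompt(sentence: str) -> str:
    return (
        "Mark all person, place and organisation names in the sentence below. "
        'Reply with JSON only, e.g. [{"text": "Goethe", "label": "PER"}]\n'
        "Sentence: " + sentence
    )

def complete(prompt: str) -> str:
    # Placeholder for an actual LLM call; returns a canned reply here.
    return '[{"text": "Weimar", "label": "LOC"}]'

def pre_annotate(sentence: str) -> list[dict]:
    reply = complete(build_prompt(sentence))
    try:
        return json.loads(reply)  # LLM output must be validated, never trusted
    except json.JSONDecodeError:
        return []  # discard malformed replies

print(pre_annotate("Goethe lebte lange in Weimar."))
```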
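
API sketch for the runtime environment: a minimal, containerisable web service exposing a classifier over HTTP, assuming FastAPI. The endpoint name and the dummy classifier are illustrative; MONAPipe itself is not shown.

```python
# Sketch: a web service exposing a classifier via POST /classify.
# Run with: uvicorn service:app  (in practice inside a container on a GPU node)
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ClassifyRequest(BaseModel):
    text: str

@app.post("/classify")
def classify(req: ClassifyRequest) -> dict:
    # A real deployment would call a GPU-backed deep learning model here.
    label = "long" if len(req.text.split()) > 20 else "short"
    return {"text": req.text, "label": label}
```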
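
Example-sentence sketch for the GermaNet use case: prompting an LLM for example sentences given a lemma and a sense gloss. The prompt and the complete() stub are assumptions; in practice generated sentences would pass through editorial review before enriching any entry.

```python
# Sketch: request example sentences for a GermaNet-style entry (lemma + gloss).
def complete(prompt: str) -> str:
    # Placeholder for an actual LLM call.
    return "1. Die Bank am Flussufer war frisch gestrichen."

def example_sentences(lemma: str, gloss: str, n: int = 3) -> str:
    prompt = (
        f"Write {n} German example sentences using the word '{lemma}' "
        f"in the sense: {gloss}. Number them."
    )
    return complete(prompt)

print(example_sentences("Bank", "Sitzgelegenheit für mehrere Personen"))
```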
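
Query-translation sketch for the FCS chatbot: few-shot prompting that maps a natural-language request to a CQL-style query. The example requests and the complete() stub are assumptions about how such a translation step might look, not the actual FCS interface.

```python
# Sketch: few-shot prompt mapping a natural-language request to a CQL query.
FEW_SHOT = """Translate the request into a CQL query for the Federated Content Search.
Request: sentences containing the word Haus
Query: "Haus"
Request: texts containing both Haus and Garten
Query: "Haus" and "Garten"
Request: {request}
Query:"""

def complete(prompt: str) -> str:
    # Placeholder for an actual LLM call.
    return '"Haus"'

def to_query(request: str) -> str:
    # LLM output is not guaranteed to be well-formed, so the query should be
    # validated against the FCS endpoint before execution.
    return complete(FEW_SHOT.format(request=request)).strip()

print(to_query("sentences containing the word Haus"))
```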
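
Candidate-lookup sketch for entity linking: retrieving candidate Wikidata entries for a recognised name via the public wbsearchentities API (a real endpoint). Choosing among the candidates, for instance by asking an LLM to disambiguate against the surrounding text, is only indicated in a comment.

```python
# Sketch: fetch Wikidata candidates for a mention; an LLM (or another ranker)
# would then pick one candidate using the mention's textual context.
import requests

def wikidata_candidates(mention: str, limit: int = 5) -> list[dict]:
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": mention,
            "language": "de",
            "format": "json",
            "limit": limit,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return [
        {"id": hit["id"], "description": hit.get("description", "")}
        for hit in resp.json().get("search", [])
    ]

print(wikidata_candidates("Goethe"))
```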
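
Normalisation sketch for the last use case: the kind of parallel pairs (historical spelling, modern form) a fine-tuned model would learn from, with a stubbed inference call. The pairs are illustrative, not drawn from an actual Text+ collection.

```python
# Sketch: toy parallel data for historical normalisation plus a stubbed
# normalise() call; a fine-tuned seq2seq model would replace the lookup.
TRAIN_PAIRS = [  # illustrative (historical, modern) pairs only
    ("vnnd", "und"),
    ("theyl", "Teil"),
    ("seyn", "sein"),
]

def normalise(token: str) -> str:
    # Placeholder: a fine-tuned model would generate the modern form;
    # here we just look the token up in the toy data.
    return dict(TRAIN_PAIRS).get(token, token)

print([normalise(t) for t in ["vnnd", "seyn", "Haus"]])
```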