LLM Service

The text- and language-based humanities have many use cases for applying large language models (LLMs) to their research data. Text+ expands access to such research data via the Registry, the Federated Content Search (FCS), and the data centers and repositories of contributing partners.

A web service is now offered for the open-source LLMs LLaMA (Meta), Mixtral, Qwen, and Codestral, as well as for OpenAI’s ChatGPT. This is made possible by the GWDG, a national high-performance computing and AI center that supports the development and testing of research-related use cases.

Who Can Use the LLM Service and How?

Initially, the service is available to everyone directly involved in the project after logging in via Academic Cloud. An expansion of the user base is planned but is initially limited by licensing constraints. Except for the two externally integrated ChatGPT models, all LLMs are hosted on GWDG servers, so no user-related data leaves the infrastructure.

Benefits

  • Free use of various open-source models
  • Free use of OpenAI GPT-4
  • AI chat without server-side storage of chat history (except ChatGPT)
  • Managed hosting of your own language models
  • Fine-tuning of LLMs on your own data
  • Retrieval-Augmented Generation (RAG) on your own documents
  • Compliance with legislative requirements and especially the privacy interests of users
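Hosted LLM services of this kind typically expose an OpenAI-compatible chat-completions endpoint. The sketch below builds and sends such a request; note that the endpoint URL, model name, and authentication scheme are placeholder assumptions, not the actual service configuration — consult the service documentation after logging in for the real values.

```python
import json
import urllib.request

# Hypothetical endpoint and model name -- substitute the values from the
# service documentation; these are NOT the real service configuration.
API_URL = "https://example.academiccloud.example/v1/chat/completions"
MODEL = "meta-llama-3.1-8b-instruct"

def build_chat_request(prompt, model=MODEL, temperature=0.2):
    """Build the JSON body for an OpenAI-compatible chat-completions call."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful research assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": temperature,
    }

def send_chat_request(body, api_key):
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

Because the request body is built separately from the network call, prompts can be inspected or logged before anything is sent — useful when no chat history is stored server-side.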

Start LLM service

Feedback Wanted

Despite the advantages described above, this integration of LLMs is a first offering, to be extended and improved over time both in functionality and in accessibility. Feedback on the current version is therefore explicitly welcome! Please send it to us via the contact form.

Application in Text+

With a focus on the Text+ data domains (editions, collections, lexical resources), work is currently underway on the following application scenarios:

  • Data preprocessing using Named Entity Recognition (NER): tagging various named entities, or preprocessing for NER in order to apply or train specially trained NER models on the data.
  • Runtime environment: a GPU-supported runtime environment for Docker containers, with the option to open ports externally for providing APIs.
  • Entity linking: Retrieval-Augmented Generation (RAG) with contextual knowledge from Wikidata, the GND, and other sources. A graph database (e.g., Blazegraph) or Elasticsearch will likely serve as an additional database / search engine for contexts.
  • Query generation: Federated Content Search, i.e., query formulation from natural-language descriptions (possibly with a chat conversation to refine the requirement, followed by the option to transfer the generated query into the FCS for the actual search).
  • Generation of example sentences / context: improving GermaNet entries, e.g., generating example sentences for a lexical entry.
  • Generation of historical normalizations: seq2seq transformer models for historical normalization (possibly also offered as a web service).
  • MONAPipe, APIs for components: providing neural models (e.g., for speech representation or event detection) as APIs.
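As a concrete illustration of the entity-linking scenario above, the following minimal Python sketch mimics the retrieval step of a RAG pipeline. A toy word-overlap score stands in for a real graph database or Elasticsearch index, and all names and data are illustrative, not part of the actual service.

```python
# Minimal RAG retrieval sketch: score a small set of context snippets
# (in practice fetched from Wikidata, the GND, or a search engine such as
# Elasticsearch) against the mention to be linked, then build a prompt
# that hands the best matches to the LLM. All data here is illustrative.

def score(query: str, document: str) -> int:
    """Toy relevance score: number of shared lowercase word tokens."""
    return len(set(query.lower().split()) & set(document.lower().split()))

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents with the highest overlap score."""
    return sorted(documents, key=lambda d: score(query, d), reverse=True)[:k]

def build_linking_prompt(mention: str, contexts: list[str]) -> str:
    """Assemble an entity-linking prompt from retrieved context knowledge."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        f"Candidate descriptions:\n{context_block}\n\n"
        f"Which candidate does the mention '{mention}' refer to?"
    )

snippets = [
    "Goethe, Johann Wolfgang von: German poet and statesman",
    "Goethe-Institut: German cultural association",
    "Frankfurt am Main: city in Hesse, Germany",
]
top = retrieve("the poet Goethe", snippets)
prompt = build_linking_prompt("Goethe", top)
```

In a production pipeline the overlap score would be replaced by a proper search-engine or graph-database query, but the overall shape — retrieve context, then ground the LLM's answer in it — stays the same.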

General Application Scenarios in the Context of Language- and Text-based Research Data

In the following, some use cases are outlined that show how LLMs can generally be used in the text- and language-based humanities as powerful tools to support research and gain new insights. Ethical aspects and data protection regulations must be taken into account, especially when dealing with sensitive or copyrighted data.

  • Text analysis and text mining: LLMs can be used for content analysis to systematically analyze large amounts of text and identify topics, motifs, or stylistic features. For example, literary works, historical documents, or philosophical texts can be automatically examined for recurring themes, sentiments, or linguistic patterns. In extensive textual datasets, LLMs can trace historical changes in concepts and ideas, or analyze discourses across different periods and cultures.
  • Automatic annotation and categorization: LLMs can be used to automatically enrich texts with relevant metadata, such as keywords, abstracts, or classifications. This facilitates the indexing and reuse of texts in repositories. They can help sort and organize data in large repositories according to thematic, geographical, or temporal criteria, making the data more accessible and useful for researchers.
  • Language identification and translation: For interdisciplinary and international research projects, LLMs can be used to machine-translate texts from various languages, which significantly facilitates access to international research findings and sources. LLMs can also be used to identify and analyze regional language variants, historical language stages, or dialectal differences.
  • Generation of research hypotheses: LLMs can be used to distill central research questions and hypotheses from large bodies of scientific literature, which can serve as the basis for new studies (automated literature reviews). By analyzing existing research literature, LLMs can identify areas that have received little attention so far and thus suggest new research topics.
  • Analysis of multimodal data: In cases of a close connection between texts and visual materials (e.g., manuscripts), LLMs can be used in combination with image analysis models to analyze such multimodal data, e.g., by analyzing image descriptions or linking texts with corresponding visual representations.
  • Historical research: LLMs can help to transcribe, annotate, and convert historical texts into digital formats, facilitating the creation and analysis of digital editions. By analyzing historical text corpora, LLMs can help trace discourse histories and explore the development of ideas and terms in history.
  • Support in the creation of research papers: LLMs can be used to generate first drafts or summaries of scientific papers, which can be particularly helpful in overcoming writer’s block or processing large amounts of data. They can be used to suggest suitable quotations and literature sources based on the text context, making the writing process more efficient.
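The annotation and categorization use case above can be sketched in code: the model is instructed to return keywords as a JSON array, and its reply is validated before being attached to a repository record. The model reply below is simulated (a real one would come from a chat-completions call), and all function names are illustrative.

```python
import json

# Sketch of LLM-assisted metadata annotation. The prompt constrains the
# model to machine-readable output; the parser rejects anything else, so
# malformed replies never reach the repository record.

def annotation_prompt(text: str, max_keywords: int = 5) -> str:
    """Build a prompt that asks for keywords as a strict JSON array."""
    return (
        f"Extract at most {max_keywords} topical keywords from the text below. "
        f"Answer only with a JSON array of strings.\n\nText:\n{text}"
    )

def parse_keywords(reply: str, max_keywords: int = 5) -> list[str]:
    """Validate the model reply; raise ValueError on malformed output."""
    keywords = json.loads(reply)
    if not isinstance(keywords, list) or not all(isinstance(k, str) for k in keywords):
        raise ValueError("model did not return a JSON array of strings")
    return keywords[:max_keywords]

# Simulated model reply for a passage about a historical edition:
reply = '["critical edition", "18th century", "correspondence"]'
metadata = {"keywords": parse_keywords(reply)}
```

Validating structured output in this way is a common guardrail when LLM results feed automated pipelines rather than a human reader.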

Changelog

  • October 2024: initial deployment of the service, including the connection to Academic Cloud as well as the backend infrastructure