University of Tübingen (UniTü)

Text+ center: Tübingen Archive of Language Resources

Type of center: data center/competence center

The Department of Linguistics at the Eberhard Karls University of Tübingen (UniTü) works on the theoretical foundations of computational linguistics and cognitive science, the text-technological foundations of corpus linguistics, and applications of natural language processing. Its research focuses on grammatical formalisms, discourse semantics, the development of German language resources for morphology, syntax and semantics, information retrieval, dialectometry, and machine learning methods for natural language.

For Text+, UniTü makes its data resources available through the Tübingen Archive of Language Resources (TALAR). These resources expand the Text+ portfolio in two domains: collections and lexical resources. In particular, TALAR archives corpora of spoken and written language that are annotated at the morphological, syntactic and semantic levels, for example the TüBa treebanks for German, English and Japanese. In addition to these linguistically annotated corpora, UniTü offers data services in the form of vector space word representations and associated software tools.

Highlights of provided data and services

Collections:

WebLicht: Environment for the automatic annotation of text corpora. Linguistic tools such as tokenizers, part-of-speech taggers and parsers are encapsulated as web services that users can combine into custom processing chains.
Tübingen Treebank Collection: Building on the Tübingen treebanks (including TüBa-D/Z) and the Universal Dependencies treebanks, the data center offers Tündra, a search platform for syntactically annotated corpora.

Lexical Resources:

GermaNet: Lexical-semantic network that relates German nouns, verbs and adjectives to one another.
GermaNet-Rover: Search tool for the data in GermaNet.

Acceptance of third-party data

UniTü accepts a wide range of data from third parties to expand its profile in the areas of collections and lexical resources. One focus is on syntactically annotated corpora (treebanks) and lexical data in GermaNet format. Other data types that match the existing holdings, e.g. word embeddings and experimental data, are accepted on request.

Contact

Contact for Text+: data-steward@semsprach.uni-tuebingen.de