Linguistics in non-European Languages

Motivation

We are active in the humanities and social sciences / linguistics / subject: general and comparative linguistics, typology, non-European languages.

The languages of the world differ in the number and formal characteristics of their word types (word classes, or parts of speech). Some languages, for example, have adjectives or adverbs, while others do not. In some languages nouns are case-marked, in others not. This is largely undisputed. A long-running controversy, however, concerns the question of whether there are languages without a noun/verb distinction.

What has not yet been investigated is whether this variability manifests itself in the formal differentiation of word types at the level of discourse. We approach this question from an information-theoretic perspective. Our starting hypothesis is that the density of information in sentences is uniform across languages, but that there are language-specific patterns in how information is distributed among the constituents of a sentence.

Tagalog is one of the languages in which the syntactic differentiation of nouns and verbs is very weak or absent altogether. It is therefore worth comparing with German, a language with a very clear noun/verb distinction, and with Indonesian, which is very similar to Tagalog in many respects but makes a much clearer syntactic distinction between nouns and verbs.

Our research project combines linguistics and computer science: it investigates the noun/verb distinction in these three languages using an information-theoretic approach.

Objectives

Our key question is: Is linguistic information distributed differently in languages with little formal noun/verb distinction than in languages with clear syntactic and morphological differences between the two? We investigate this question primarily using corpus data from Tagalog. The Tagalog data are compared with similarly obtained data from German and Indonesian.

Solution

The central task of this project is to determine the information content of target words (content words, especially object and action words) and their co-occurrences, in order to automatically generate information maps of sentences in Tagalog, Indonesian and German. These maps are the basis for answering the core question formulated above: whether information is distributed differently in languages with little formal noun/verb distinction than in languages with clear syntactic and morphological differences.
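To make the notion of information content concrete, the following is a minimal sketch that assumes it is measured as unigram surprisal, -log2 p(w), in bits. The project text does not specify the estimator, so this choice, the function names surprisal_table and information_map, and the toy English corpus are all illustrative assumptions rather than the project's actual method or data.

```python
import math
from collections import Counter


def surprisal_table(tokens):
    """Map each word type to its unigram surprisal -log2 p(w), in bits."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: -math.log2(count / total) for word, count in counts.items()}


def information_map(sentence, table):
    """Pair every token of a sentence with its surprisal from the table."""
    return [(word, table[word]) for word in sentence]


# Toy corpus (hypothetical, for illustration only).
corpus = ["the", "dog", "sees", "the", "cat", "the", "cat", "sleeps"]
table = surprisal_table(corpus)

# Information map of one sentence: rarer words carry more bits.
info_map = information_map(["the", "dog", "sleeps"], table)
for word, bits in info_map:
    print(f"{word}\t{bits:.3f}")
```

A realistic estimate would condition on context, for example on co-occurring words or on the dependency structure, and would need smoothing for words that do not occur in the training corpus.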

For German, dependency parsers that can be used to create an information map already exist. For Indonesian, a parser is already available, which we can try out with 5,000 sentences from the Universal Dependencies Treebank. As an alternative for Indonesian, about 10,000 to 15,000 sentences will be manually annotated to serve as a parser model. For Tagalog, no dependency parser model is available, so about 15,000 sentences from the standard-size corpora ("Normgrößenkorpora") are to be manually annotated as a model. This manually annotated model is then used to train a parser for Tagalog. Finally, the Leipzig corpora are automatically annotated with the parsers for the three languages to form information maps.
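Universal Dependencies treebanks are distributed in the CoNLL-U format, with one token per line and ten tab-separated columns. As a rough sketch of how annotated sentences could be read for further processing, the reader below uses only the Python standard library. The function name read_conllu, the selection of columns that are kept, and the two-token English sample are assumptions made for illustration; they are not taken from the project.

```python
def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        current.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    if current:
        sentences.append(current)
    return sentences


# Hypothetical two-token sample, built with explicit tab separators.
rows = [
    ["1", "Dogs", "dog", "NOUN", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "bark", "bark", "VERB", "_", "_", "0", "root", "_", "_"],
]
sample = "# text = Dogs bark\n" + "\n".join("\t".join(r) for r in rows) + "\n"

sentences = read_conllu(sample)
content_words = [t["form"] for t in sentences[0] if t["upos"] in {"NOUN", "VERB"}]
print(content_words)
```

Filtering on the UPOS column is one simple way to pick out object and action words. Whether universal part-of-speech tags are adequate for a language with a weak noun/verb distinction is part of what the project investigates.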

On the basis of these information maps, word types are classified into default and non-default forms, and the distances or divergences between the information maps of the three languages are determined with measures such as cosine distance, Euclidean distance and Kullback-Leibler divergence.
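The three measures named above can be sketched in a few lines of plain Python. The sketch assumes that each information map has been reduced to a vector of equal length, and for the Kullback-Leibler divergence to a probability distribution. The two example vectors are made-up numbers, not project results, and the function names are chosen for illustration.

```python
import math


def cosine_distance(p, q):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


def euclidean_distance(p, q):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kl_divergence(p, q):
    """D_KL(p || q) in bits; requires q[i] > 0 wherever p[i] > 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


# Hypothetical shares of sentence information carried by two word classes
# in two languages (illustrative numbers only).
lang_a = [0.5, 0.5]
lang_b = [0.25, 0.75]

print(cosine_distance(lang_a, lang_b))
print(euclidean_distance(lang_a, lang_b))
print(kl_divergence(lang_a, lang_b))
```

Note that the Kullback-Leibler divergence is asymmetric, so D_KL(p || q) and D_KL(q || p) generally differ, and it is undefined when q assigns zero probability to an outcome that p allows. Cosine and Euclidean distance are symmetric.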

Challenges

The project is high-risk in that it may not yield any evidence for the hypothesis that the distribution of information differs among languages with varying degrees of word type differentiation. Even such a negative result would be a gain in knowledge. In that case, the project would conclude with considerations as to whether and, if so, how the fundamental question can be investigated more promisingly.

Review by community

We are ready to evaluate the services offered by Text+.