Derived Text Formats

While many of the language and text resources from the Text+ centres are freely available to researchers, others can only be used for research to a limited extent due to legal restrictions. This applies in particular to works that are protected by copyright, by data protection or personality rights, or that are subject to licensing restrictions.

Text+ aims to make such protected works accessible and usable for science. Derived text formats (DTF, see Schöch et al., 2020) offer one way of doing this.

What are DTFs?

DTFs are created by first enriching texts and then reducing their information content. They can be produced in such a way that the result still allows at least one research question to be answered while no longer infringing, for example, the rights of the copyright holder.

One of the prerequisites for this is that there is no possibility of reconstructing the original text. Such DTFs can therefore be published freely. When generating more than one DTF from a document or a corpus, it must be ensured that no reconstruction is possible by combining them. Not all DTFs are automatically copyright-free. If it is possible to reconstruct the original text without major effort, the DTF is still subject to copyright.

How are DTFs generated?

DTFs are created from the original text by applying a series of changes. In a first step, the text is enriched with annotations (e.g. part-of-speech (POS) tagging, links from named entities to standardised data, or statistical analyses of the original text). This is followed by targeted information reduction, which typically takes place automatically. The result depends essentially on which operations are applied and at which level of granularity.

Four operations are available for information reduction:

  • Delete
  • Retain
  • Replace
  • Swap

These can take effect at different levels of granularity (e.g. at token, sentence or paragraph level) and with respect to different scopes (e.g. per document, per work or per corpus).
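The four operations can be illustrated with a minimal sketch at token level. The Token type, the function names and the "<MASK>" placeholder are assumptions made for this illustration and are not taken from Schöch et al. (2020); in particular, "Swap" is interpreted here as shuffling the token order within fixed-length segments.

```python
import random
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # word form as it appears in the text
    pos: str    # part-of-speech tag
    lemma: str  # base form


def delete(tokens, pos_tags):
    """Delete: drop every token whose POS tag is in pos_tags."""
    return [t for t in tokens if t.pos not in pos_tags]


def retain(tokens, pos_tags):
    """Retain: keep only tokens whose POS tag is in pos_tags."""
    return [t for t in tokens if t.pos in pos_tags]


def replace(tokens, pos_tags, placeholder="<MASK>"):
    """Replace: mask form and lemma of matching tokens, keep the POS tag."""
    return [
        t._replace(form=placeholder, lemma=placeholder) if t.pos in pos_tags else t
        for t in tokens
    ]


def swap(tokens, segment_length, seed=None):
    """Swap (assumed reading): shuffle token order inside consecutive segments."""
    rng = random.Random(seed)
    result = []
    for start in range(0, len(tokens), segment_length):
        segment = list(tokens[start:start + segment_length])
        rng.shuffle(segment)
        result.extend(segment)
    return result


# Four annotated tokens in the style of the Effi Briest example.
tokens = [
    Token("breiten", "ADJA", "breit"),
    Token("Schatten", "NN", "Schatten"),
    Token("auf", "APPR", "auf"),
    Token("einen", "ART", "eine"),
]
print(retain(tokens, {"NN"}))
print(replace(tokens, {"NN", "ADJA"}))
```

Which operation and which granularity are appropriate depends on the research question and on the legal assessment; the sketch only shows the mechanics.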

Common forms of DTFs are, for example, term-document matrices, N-grams, texts with masked tokens or word embeddings.
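As an illustration of the first of these forms, a term-document matrix could be built roughly as follows. This is a sketch using only the Python standard library; the function name term_document_matrix and the toy documents are hypothetical, and real corpora would need a proper tokeniser.

```python
from collections import Counter


def term_document_matrix(documents):
    """Build a term-document matrix from {document name: list of tokens}.

    Returns the sorted vocabulary and, per document, a row of term counts
    in vocabulary order. Word order within each document is discarded.
    """
    counts = {name: Counter(tokens) for name, tokens in documents.items()}
    vocabulary = sorted(set().union(*counts.values()))
    matrix = {
        name: [doc_counts[term] for term in vocabulary]
        for name, doc_counts in counts.items()
    }
    return vocabulary, matrix


# Toy input (hypothetical): two very short "documents".
documents = {
    "doc_a": "gott sei dank gott".split(),
    "doc_b": "ja gnädigste frau".split(),
}
vocabulary, matrix = term_document_matrix(documents)
print(vocabulary)
print(matrix)
```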

Current Status

Text+ is currently working on a proposal for a DIN standard on DTFs as well as a publication on the legal aspects of these formats, in order to provide the necessary expertise to support both the scientific communities that use these data and the institutions that wish to make such data available.

At the same time, Text+ is driving forward research in this area. Several analyses have recently been published, e.g. on the suitability of various DTFs for authorship attribution and on the fine-tuning of language models with DTFs:

  • Understanding the impact of three derived text formats on authorship classification with Delta: https://doi.org/10.5281/zenodo.7715299
  • Shifting Sentiments? What happens to BERT-based Sentiment Classification when derived text formats are used for fine-tuning: (link to be added)

Examples

Rank  N-gram               Frequency
1     gott sei dank        43
2     ja gnädigste frau    17
3     auch heute wieder    13
4     doch auch wieder     11
5     ist doch auch        11
6     ist immer so         10
7     gnädigste frau ist   10
8     war so war           10
9     nein gnädigste frau  9
10    wird ja wohl         9
11    ist doch recht       9
12    doch immer noch      9

Frequencies of 3-grams across multiple texts, with a minimum frequency of 5. Example data based on five narrative texts by Theodor Fontane. [Schöch et al. 2020]
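A frequency list of this kind could be computed along the following lines. This is a minimal sketch, not the procedure used by Schöch et al. (2020): the function name ngram_frequencies is hypothetical, the input is assumed to be already tokenised and lower-cased, and n-grams are not counted across text boundaries.

```python
from collections import Counter


def ngram_frequencies(texts, n=3, min_freq=5):
    """Count n-grams over several tokenised texts.

    texts is a list of token lists. Returns (n-gram, frequency) pairs,
    most frequent first, keeping only n-grams seen at least min_freq times.
    """
    counts = Counter()
    for tokens in texts:
        # zip over n shifted copies of the token list yields all n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return [
        (" ".join(ngram), freq)
        for ngram, freq in counts.most_common()
        if freq >= min_freq
    ]


# Toy input (hypothetical): two tiny "texts".
texts = [
    "gott sei dank gott sei dank".split(),
    "gott sei dank".split(),
]
print(ngram_frequencies(texts, n=3, min_freq=2))
```

Raising min_freq removes rare n-grams, which are the ones most likely to help in reconstructing longer passages of the original text.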

von_APPR_von Hohen-Cremmen_NN_Hohen-Cremmen Georg_NE_Georg zu_APPR_zu heller_ADJA_hell des_ART_die fiel_VVFIN_fallen schon_ADV_schon bewohnten_ADJA_bewohnt In_APPR_in der_ART_die <SEG> Mittagsstille_ADJA_Mittagsstille Gartenseite_NN_Gartenseite und_KON_und erst_ADV_erst Park-_TRUNC_Park- Dorfstraße_NN_Dorfstraße ,_PUN_, Seitenflügel_NN_Seitenflügel breiten_ADJA_breit die_ART_die hin_ADV_hin während_KOUS_während angebauter_ADJA_angebaut der_ART_die nach_APPR_nach ein_ART_eine Schatten_NN_Schatten auf_APPR_auf einen_ART_eine rechtwinklig_ADJD_rechtwinklig <SEG> großes_ADJA_groß ,_PUN_, auf_APPR_auf mit_APPR_mit in_APPR_in ein_ART_eine weiß_ADJD_weiß und_KON_und über_APPR_über quadrierten_ADJA_quadrierten und_KON_und diesen_PDAT_dies auf_APPR_auf Mitte_NN_Mitte seiner_PPOSAT_sein dann_ADV_dann Fliesengang_NN_Fliesengang hinaus_ADV_hinaus einen_ART_eine grün_ADJD_grün <SEG>

Excerpt from the list of annotated tokens for the beginning of Fontane’s Effi Briest, with the sequence information removed segment by segment. Here on a unigram basis, with word form, part-of-speech tag and lemma for each token, and a segment length of 20 tokens. Note that the segment boundaries are marked with <SEG> after every 20 tokens. [Schöch et al. 2020]
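A DTF in this form remains machine-readable. The sketch below shows one way it might be parsed back into segments and used for a simple lemma count. It assumes that items are separated by whitespace, that each item has the shape form_POS_lemma with no underscore inside the form or lemma, and that "<SEG>" closes a segment; the function names are hypothetical.

```python
from collections import Counter


def parse_segments(dtf_text, boundary="<SEG>"):
    """Split a whitespace-separated DTF string into segments of
    (form, pos, lemma) triples. Token order inside a segment is arbitrary."""
    segments, current = [], []
    for item in dtf_text.split():
        if item == boundary:
            segments.append(current)
            current = []
            continue
        # Assumption: exactly two underscores per item (form_POS_lemma).
        form, pos, lemma = item.split("_")
        current.append((form, pos, lemma))
    if current:  # trailing tokens without a closing boundary marker
        segments.append(current)
    return segments


def lemma_counts(segments):
    """Count lemmas over all segments; this does not need the word order."""
    return Counter(lemma for segment in segments for _, _, lemma in segment)


sample = "auf_APPR_auf ,_PUN_, der_ART_die <SEG> Schatten_NN_Schatten des_ART_die <SEG>"
segments = parse_segments(sample)
print(len(segments))
print(lemma_counts(segments).most_common(1))
```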