Derived Text Formats

While many of the language and text resources from the Text+ centres are freely available to researchers, others can be used for research only to a limited extent because of legal restrictions. This applies in particular to works that are protected by copyright, by data protection and personal rights, or that are subject to licensing restrictions.

Text+ aims to make such protected works accessible and usable for science. Derived text formats (DTF, see Schöch et al., 2020) offer one way of doing this.

What are DTFs?

DTFs are created by reducing the information content of texts, following an initial enrichment phase. They can be produced in such a way that the result still allows at least one research question to be answered while what remains no longer infringes, for example, the rights of the copyright holder.

One of the prerequisites for this is that there is no possibility of reconstructing the original text. Such DTFs can therefore be published freely. When generating more than one DTF from a document or a corpus, it must be ensured that no reconstruction is possible by combining them. Not all DTFs are automatically copyright-free. If it is possible to reconstruct the original text without major effort, the DTF is still subject to copyright.

How to generate DTFs?

DTFs are created from the original text by applying a series of changes. In a first step, the text is enriched with annotations (e.g. part-of-speech (POS) tagging, links from named entities to standardised data, or statistical analyses of the original text). This is followed by targeted information reduction. The reduction depends, on the one hand, on the operations applied, which typically run automatically, and, on the other hand, on the decision as to the granularity at which these operations take effect.

Four operations are available for information reduction:

Delete
Retain
Replace
Swap

These can take effect at different levels of granularity (e.g. at token, sentence or paragraph level) and with respect to different reference units (e.g. per document, per work or per corpus).
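As a minimal sketch of how three of these operations could act at token level, assume the text has already been tokenised and POS-tagged. The function below deletes punctuation, retains function words and replaces every other word form with its POS tag. The name `mask_tokens`, the two tag sets and the sample tokens are illustrative assumptions and not part of any Text+ tooling; swapping is not shown here.

```python
# Illustrative sketch only: the tag sets below are assumptions chosen for the demo.
FUNCTION_WORD_TAGS = {"ART", "APPR", "KON"}  # tokens with these tags are retained
PUNCTUATION_TAGS = {"PUN"}                   # tokens with these tags are deleted


def mask_tokens(tagged_tokens):
    """Apply delete / retain / replace to a list of (word, pos) pairs."""
    result = []
    for word, pos in tagged_tokens:
        if pos in PUNCTUATION_TAGS:
            continue                  # delete: drop the token entirely
        if pos in FUNCTION_WORD_TAGS:
            result.append(word)       # retain: keep the word form unchanged
        else:
            result.append(pos)        # replace: substitute the POS tag for the word
    return result


sample = [
    ("ein", "ART"), ("großes", "ADJA"), (",", "PUN"),
    ("Fliesengang", "NN"), ("und", "KON"), ("Schatten", "NN"),
]
masked = mask_tokens(sample)
print(" ".join(masked))
```

Which tags are retained, replaced or deleted is a design decision that depends on the research question and on how much information may legally remain in the result.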

Common forms of DTFs are, for example, term-document matrices, N-grams, texts with masked tokens or word embeddings.
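To illustrate the first of these forms, the following sketch builds a small term-document matrix using only the Python standard library. It assumes lowercased, whitespace-separated tokens, which is a simplification; the function name `term_document_matrix` and the two sample documents are hypothetical.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    # Simplifying assumption: tokens are separated by whitespace.
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


docs = ["gott sei dank", "gott sei dank ja"]
vocab, tdm = term_document_matrix(docs)
print(vocab)
for row in tdm:
    print(row)
```

The matrix keeps only how often each term occurs in each document; the word order of the original texts is discarded.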

Current Status

Text+ is currently working on a proposal for a DIN standard on DTFs and on a publication on the legal aspects of these formats, in order to provide the expertise needed to support both the research communities that use such data and the institutions that wish to make them available.

At the same time, Text+ is driving forward research in this area. Several analyses have recently been published, e.g. on the suitability of various DTFs for authorship attribution and on the fine-tuning of language models with DTFs (see ‘Further links’).

Further links

The following lists link to existing DTFs that are ready to use and to research results that either examine the properties of various DTFs or were produced using DTFs.

Links to existing DTFs

Sample DTFs for the publication ‘Derived text formats: Text and data mining with copyrighted text assets’ (Schöch et al., 2020) and the program code used to generate them: Link
HTRC Extracted Features (DTFs of more than 17 million volumes): Link
Google N-Grams (n-grams from a corpus of about 3.5 million English-language books): Link
Collection of American drama texts focusing on the structural markup: Link
DTFs from Spanish-language novels in the corpus CoNSSA (Corpus of Novels of the Spanish Silver Age): Link
Document-term matrix (bag of words) of the Spanish-language novel ‘Don Quijote de la Mancha’ by Miguel de Cervantes: Link

Links to research results

Classification of Genres through 500 Years of Spanish Literature in CORDE (Calvo Tello, 2024) Link
Shifting Sentiments? What happens to BERT-based Sentiment Classification when derived text formats are used for fine-tuning (Du and Schöch, 2024) Link
InvBERT: Reconstructing Text from Contextualized Word Embeddings by inverting the BERT pipeline (Kugler et al., 2023) Link
Understanding the impact of three derived text formats on authorship classification with Delta (Du, 2023) Link
Full text vs. derived text format: Systematic evaluation of the performance of topic modelling with different text formats using Python (German) (Kocula, 2022) Link
Access to large text corpora of the 20th and 21st centuries with the help of derived text formats (German) (Raue and Schöch, 2020) Link
Masking Treebanks for the Free Distribution of Linguistic Resources and Other Applications (Rehm et al., 2007) Link
Corpus Masking: Legally Bypassing Licensing Restrictions for the Free Distribution of Text Collections (Rehm et al., 2007) Link

Examples

Rank	N-gram	Frequency
1	gott sei dank	43
2	ja gnädigste frau	17
3	auch heute wieder	13
4	doch auch wieder	11
5	ist doch auch	11
6	ist immer so	10
7	gnädigste frau ist	10
8	war so war	10
9	nein gnädigste frau	9
10	wird ja wohl	9
11	ist doch recht	9
12	doch immer noch	9

Frequencies of 3-grams across multiple texts, with a minimum frequency of 5. Example data based on five narrative texts by Theodor Fontane (Schöch et al., 2020).
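A table of this kind could be produced along the following lines. This is a sketch under stated assumptions, not the code used by Schöch et al. (2020): it tokenises on whitespace, lowercases the text and ignores punctuation handling, and the name `ngram_frequencies` and the toy input are hypothetical.

```python
from collections import Counter


def ngram_frequencies(texts, n=3, min_freq=5):
    """Count n-grams over all texts and keep those occurring at least min_freq times."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()  # simplifying assumption: whitespace tokens
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    frequent = [(gram, freq) for gram, freq in counts.items() if freq >= min_freq]
    # Sort by descending frequency, then alphabetically for a stable ranking.
    return sorted(frequent, key=lambda item: (-item[1], item[0]))


texts = ["gott sei dank es ist gut", "gott sei dank"]
table = ngram_frequencies(texts, n=3, min_freq=2)
for rank, (gram, freq) in enumerate(table, start=1):
    print(rank, gram, freq, sep="\t")
```

The minimum frequency acts as the information-reducing step: n-grams below the threshold never appear in the output, which removes rare sequences from the published data.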

von_APPR_von Hohen-Cremmen_NN_Hohen-Cremmen Georg_NE_Georg zu_APPR_zu heller_ADJA_hell des_ART_die fiel_VVFIN_fallen schon_ADV_schon bewohnten_ADJA_bewohnt In_APPR_in der_ART_die <SEG> Mittagsstille_ADJA_Mittagsstille Gartenseite_NN_Gartenseite und_KON_und erst_ADV_erst Park-_TRUNC_Park- Dorfstraße_NN_Dorfstraße ,_PUN_, Seitenflügel_NN_Seitenflügel breiten_ADJA_breit die_ART_die hin_ADV_hin während_KOUS_während angebauter_ADJA_angebaut der_ART_die nach_APPR_nach ein_ART_eine Schatten_NN_Schatten auf_APPR_auf einen_ART_eine rechtwinklig_ADJD_rechtwinklig <SEG> großes_ADJA_groß ,_PUN_, auf_APPR_auf mit_APPR_mit in_APPR_in ein_ART_eine weiß_ADJD_weiß und_KON_und über_APPR_über quadrierten_ADJA_quadrierten und_KON_und diesen_PDAT_dies auf_APPR_auf Mitte_NN_Mitte seiner_PPOSAT_sein dann_ADV_dann Fliesengang_NN_Fliesengang hinaus_ADV_hinaus einen_ART_eine grün_ADJD_grün <SEG>

Excerpt from the annotated token list for the beginning of Fontane’s Effi Briest, with the sequence information removed segment by segment. The example is unigram-based, gives word form, part of speech and lemma for each token, and uses a segment length of 20 tokens. Note the <SEG> markers that delimit the segments after every 20 tokens (Schöch et al., 2020).
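The segment-wise removal of sequence information shown above can be sketched as follows. The sketch assumes that the annotated tokens are already available as strings of the form word_POS_lemma; the function name `shuffle_segments` and the placeholder tokens are hypothetical, and this is not the original implementation.

```python
import random


def shuffle_segments(annotated_tokens, segment_length=20, seed=None):
    """Shuffle tokens within fixed-length segments and mark each segment end with <SEG>."""
    rng = random.Random(seed)
    output = []
    for start in range(0, len(annotated_tokens), segment_length):
        segment = annotated_tokens[start:start + segment_length]  # slicing makes a copy
        rng.shuffle(segment)      # discard the token order inside this segment
        output.extend(segment)
        output.append("<SEG>")    # segment boundary marker
    return output


# Placeholder tokens in word_POS_lemma form (assumed input format).
tokens = [f"w{i}_NN_w{i}" for i in range(45)]
shuffled = shuffle_segments(tokens, segment_length=20, seed=0)
print(" ".join(shuffled[:21]))
```

The seed is used here only to make the demonstration reproducible. Anyone who knows both the seed and the procedure can invert the permutation, so a seed should not be published together with a DTF produced in this way.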
