Why the training of Large Language Models falls within the scope of the Text and Data Mining exceptions
The question of whether AI training falls within the scope of the Text and Data Mining (TDM) exceptions has been a topic of intense debate. For some time, legal scholars and stakeholders have argued both for and against the applicability of these exceptions to the training of large language models (LLMs). The introduction of the AI Act seemed to have quieted much of this discussion, as the Act explicitly acknowledges the relevance of the TDM exceptions for AI training. However, a recent study has reignited the debate by presenting a technology-driven argument for why AI training should not fall under the TDM framework.

Statutory exceptions for Text and Data Mining were harmonised in Articles 3 and 4 of the 2019 EU Directive on Copyright in the Digital Single Market (2019/790). The Directive defines TDM very broadly, as any automated analytical technique aimed at analysing text and data in digital form in order to generate information, which includes but is not limited to patterns, trends and correlations.
An LLM is a large probabilistic model of natural language containing information about patterns, trends and correlations between words and expressions in natural language. As such, training an LLM, whether it is static or dynamic (generative), corresponds perfectly to the definition of TDM in European law. An LLM is specifically not a ‘repository’ of training data, from which the training data could be retrieved in an unmodified form. Although some LLMs have been reported to “regurgitate” portions of the data used in their training, this is a rare occurrence resulting from coincidence or deliberate hostile prompting.
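To make the point concrete, the following deliberately simplified sketch "trains" a toy bigram model on two invented sentences. It is not how production LLMs are built, but it illustrates the relevant property: the automated analysis yields statistics about patterns and correlations between words (here, conditional word frequencies), and the resulting model does not retain the source texts in retrievable, unmodified form.

    from collections import Counter, defaultdict

    # Two invented example sentences standing in for a training corpus.
    corpus = [
        "text and data mining generates information about patterns",
        "data mining generates information about trends and correlations",
    ]

    # Automated analysis: count which word follows which (a pattern/correlation).
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for current_word, next_word in zip(words, words[1:]):
            following[current_word][next_word] += 1

    # "Training": normalise the counts into conditional probabilities P(next | current).
    model = {
        word: {nxt: count / sum(nexts.values()) for nxt, count in nexts.items()}
        for word, nexts in following.items()
    }

    print(model["mining"])  # {'generates': 1.0} -- a statistical pattern, not a quotation
    # No sentence from the corpus survives verbatim inside the model.
    print(any(sentence in str(model) for sentence in corpus))  # False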
The study delves significantly deeper into the technical functioning of AI training and, based on these technical details, argues that AI training cannot be covered by the TDM exceptions. However, this line of argumentation is not convincing: the decisive factor is not the technical mechanics of training but the intent of the legislator.
It may well be true that the EU legislator did not anticipate the use of the TDM exceptions for LLM training: the DSM Directive was proposed in 2016 and adopted in 2019, when the question of LLM training attracted little attention from the general public. Nevertheless, the definition of TDM in the Directive was deliberately broad in order to be future-proof. Ultimately, recent legislation clearly demonstrates the legislator’s intention to include AI training within the scope of the TDM exceptions:
Article 53(1)(c) of the AI Act explicitly requires providers of general-purpose AI models to put in place a policy to identify and comply with reservations of rights expressed by rightholders. This provision refers directly to Article 4 of the DSM Directive, which allows rightholders to opt out of TDM by expressing their reservation in an appropriate manner, such as machine-readable means for content made publicly available online. By requiring compliance with this opt-out mechanism, the AI Act recognises the TDM exceptions as a potential legal foundation for AI training.
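What such a policy might look like in practice can be sketched in a few lines of code. The example below is a hedged illustration, not an implementation of any mandated mechanism: it checks two candidate machine-readable signals, the long-established robots.txt convention and the "tdm-reservation" HTTP response header proposed in the W3C TDM Reservation Protocol (TDMRep) community specification. The crawler name and the URL are invented for the example, and a real compliance policy would need to cover further signals (for instance site-level metadata files and HTML meta tags).

    from urllib import request, robotparser
    from urllib.parse import urlsplit

    USER_AGENT = "ExampleTDMBot/0.1"  # hypothetical crawler name, for illustration only


    def tdm_allowed(url: str) -> bool:
        """Return False if the rightholder appears to have reserved TDM rights for this URL."""
        parts = urlsplit(url)

        # Signal 1: a robots.txt disallow rule addressed to this crawler.
        robots = robotparser.RobotFileParser()
        robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        try:
            robots.read()
            if not robots.can_fetch(USER_AGENT, url):
                return False
        except OSError:
            pass  # robots.txt unreachable: no signal either way

        # Signal 2: a TDMRep-style HTTP response header ("tdm-reservation: 1" = rights reserved).
        try:
            req = request.Request(url, method="HEAD", headers={"User-Agent": USER_AGENT})
            with request.urlopen(req, timeout=10) as response:
                if response.headers.get("tdm-reservation") == "1":
                    return False
        except OSError:
            pass  # page unreachable: no signal either way

        return True  # no reservation detected by these two signals


    print(tdm_allowed("https://example.com/article.html"))  # illustrative URL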
Further support for this interpretation can be found in a public policy questionnaire by the Council of the European Union on the relationship between generative Artificial Intelligence and copyright and related rights. In its opening statement, the Council emphasised that the AI Act confirms the applicability of the TDM exceptions in the context of AI training, including the possibility for rightholders to opt out.
The reignited debate is therefore likely to prove moot: whatever the technical characterisation of AI training, the legislator’s intention to bring it within the TDM framework is clear.
Keep in mind that both TDM exceptions (Articles 3 and 4 of the DSM Directive) come with specific requirements. DISCLAIMER: Nothing in this statement is intended to constitute or should be interpreted as legal advice.