Motivation

The DraCor platform offers not only a series of corpora (described in another User Story) but also an API with several types of queries. These queries manage and analyze the texts of the corpora, either single texts or entire corpora. The project is available at https://dracor.org and the API is documented at https://dracor.org/documentation/api. Anyone can use the API to easily obtain the text contained in specific TEI elements, for example the text of the body, the stage directions, or the metadata. The API generally returns plain text, although specific queries can also return CSV, JSON or XML-TEI for certain fields. Most API queries return the text of a single document rather than that of an entire corpus or of DraCor as a whole.

The DraCor corpora are encoded in TEI. DraCor envisages two different user groups within the DH community:

  1. Users who prefer graphical interfaces and do not program.
  2. Users who prefer to interact with APIs through scripts (in programming languages such as Java, R, Python or JavaScript).

DraCor has managed to create a platform that satisfies both groups to a certain degree: programmers can call the API from their scripts, while other users can obtain the data through the browser without programming.

Objectives

TextGrid, as one resource within the Text+ consortium, does not offer an API to query or manage the texts of its corpora. For example, a user who wants the plain text of all poems in TextGrid would need to download the whole of TextGrid, iterate through thousands of files, open each text as an XML-TEI file, check whether it contains poems (element <lg>) and, if so, retrieve them through XPath expressions. To facilitate this kind of task, users should be able to query the entire TextGridRep in a way similar to what DraCor offers. This API could operate on specific texts, a selection of texts based on the shelf function, all the texts of a collection, or the entire TextGridRep. Users should be able to choose the format in which they want the results:

  1. XML-TEI
  2. TXT
  3. JSON
  4. CSV (for metadata fields)
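
The manual workflow described above (open a TEI file, look for <lg> elements, extract their verse lines) and two of the output formats can be sketched in a few lines of Python. This is only an illustration of what users currently have to write themselves, not a proposal for the API itself. The function names `extract_verse_lines` and `render`, the sample document and the JSON layout are assumptions made for this example.

```python
import json
import xml.etree.ElementTree as ET

# The TEI namespace; TEI P5 documents declare it on the root element.
TEI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI}

# Hypothetical miniature TEI document used only for this illustration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <lg type="poem">
      <lg type="stanza">
        <l>First line</l>
        <l>Second   line</l>
      </lg>
    </lg>
    <p>Prose that is ignored.</p>
  </body></text>
</TEI>"""


def extract_verse_lines(tei_xml: str) -> list[str]:
    """Return the text of every <l> that is a direct child of an <lg>."""
    root = ET.fromstring(tei_xml)
    lines = []
    # iter() visits nested <lg> elements too; taking only direct <l>
    # children of each <lg> ensures every verse line is collected once.
    for line_group in root.iter(f"{{{TEI}}}lg"):
        for verse in line_group.findall("tei:l", TEI_NS):
            # Join nested text nodes and normalize whitespace.
            lines.append(" ".join("".join(verse.itertext()).split()))
    return lines


def render(lines: list[str], fmt: str) -> str:
    """Serialize verse lines as plain text or JSON (layout is an assumption)."""
    if fmt == "txt":
        return "\n".join(lines)
    if fmt == "json":
        return json.dumps({"lines": lines}, ensure_ascii=False)
    raise ValueError(f"unsupported format: {fmt}")


if __name__ == "__main__":
    print(render(extract_verse_lines(SAMPLE), "txt"))
```

With an API of the kind proposed here, this logic would run on the server, and the user would only choose the element and the output format.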

If the user retrieves data from more than one text, the data from each text should be saved in a separate file, and all files should be bundled in a zip archive.
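
The bundling requirement could be met along the following lines. This is a minimal sketch using Python's standard `zipfile` module; the function name `bundle_results`, the text identifiers and the file-naming scheme are hypothetical and not part of any existing TextGrid service.

```python
import io
import zipfile


def bundle_results(results: dict[str, str], extension: str = "txt") -> bytes:
    """Write one file per text into an in-memory zip archive.

    `results` maps a text identifier (hypothetical here) to the data
    retrieved for that text; `extension` reflects the chosen format.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for text_id, content in results.items():
            # writestr() encodes str content as UTF-8 before storing it.
            archive.writestr(f"{text_id}.{extension}", content)
    return buffer.getvalue()


if __name__ == "__main__":
    data = bundle_results({"poem-a": "Roses", "poem-b": "Violets"})
    with open("results.zip", "wb") as handle:
        handle.write(data)
```

Building the archive in memory keeps the sketch independent of the file system until the final write, which is also how a web service would typically stream the result back to the user.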

These requirements would fall under the area of collections within Text+.

Solution

Retrieving texts through APIs has so far been rare in DH, with some exceptions such as DraCor or the Folger Digital Text collection.

TextGridRep and TextGridLab are well-suited environments for implementing this kind of API, since TextGridRep is one of the largest open literary corpora and XML-TEI is its native format.

TextGrid already offers a series of functions and APIs, but none of them covers what the DraCor API provides for its corpora. Among others, the following features are implemented in TextGrid:

  1. “Shelf” function (selection of texts by the user)
  2. TextGrid Sade (to create TextGrid portals based on a specific project)
  3. TextGrid Search (to query specific elements of TextGrid, but without retrieving the text)
  4. TextGrid PID (an API about the persistent identifiers)
  5. TextGrid Publish (to import large numbers of texts)

Voyant Tools offers a similar possibility: users can specify an XPath expression for the text that should be analyzed. However, this only works for corpora that users upload to Voyant. It would be interesting to place these APIs as an intermediate step between TextGridRep as a text archive and the analytical tools already connected to it, such as Switchboard or Voyant Tools.