Motivation

I am a computational linguist working as a lecturer, and my work relies heavily on digital tools. For discovery, in addition to the usual Google Scholar search, I use L’Année Philologique (The Philological Year), an index of scholarly work in fields related to the language, literature, history and culture of ancient Greece and Rome. I also often check LinguistList.org to stay up to date, as it sends daily emails about jobs, books, calls for papers, questions, dissertations, etc. To organise relevant articles, I use Zotero, storing them in folders and using keywords and hashtags to tag and search for items. For making connections and getting an overview, I like to draw charts that give me a visual map of the information, and I prefer to do this in the traditional way (on paper), as I am not yet satisfied with the existing digital tools. I would like to have a file system on my computer that works like an online repository: one in which I can add metadata to every object it contains and then share those objects automatically with the university repository.

In short, my main concerns are that corpora are often unannotated and require manual intervention to become usable; that formats are frequently incompatible and mapping from one metadata schema to another is often problematic; and that paywalls prevent access to relevant documents.

Objectives

I would like to be able to use a super-registry of annotated corpora and to choose among various export formats, so that I can process the data further with different tools (e.g. Python or R). In addition, I would like to work with colleagues from other disciplines on the basis of a shared understanding of the project requirements. My work usually involves collaborating with, for example, historians or traditional theologians, which can be difficult because of differing expectations and terminology.

Solution

For a collaborative project with a traditional historian and a Cultural Heritage social enterprise, I am aiming to create a super-registry that aggregates annotated data from smaller registries. We are also trying to improve the OCR (Optical Character Recognition) used in the analysis. Two things are necessary for this:

  1. a Discovery Service that makes data from different disciplines searchable;

  2. a tool that allows users to aggregate, store, annotate and make collections of linguistic data available to other users.

Ad 1) The OPERAS Research Infrastructure is currently developing such a Discovery Service, the TRIPLE platform, which could help reduce the obstacles to taking this project further. Because the platform aggregates many individual research catalogues, it will save me a lot of time and effort in searching for new data and datasets.

Ad 2) Text+ and OPERAS could provide this tool, allowing me to create new corpora by aggregating, harmonising and annotating data from different sources and to share them with my colleagues by uploading them in different formats. The platform should recognise the kind of dataset and tag it accordingly, displaying the tag whenever the dataset appears in a list of my saved materials. I should also be able to add my own tags to help with later retrieval and to let my colleagues see the contents at a glance. We should be able to keep the corpora private and visible only to us at first; once our work is finished and the corpora are complete, we should be able to make them public for others to use.
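To make the requirements under Ad 2) more concrete, here is a minimal Python sketch of how a saved item might carry an automatically assigned kind tag, my own tags and a private/public flag, and how a list of such items might be exported in two formats. Everything in it is an assumption for illustration only: the names `CorpusEntry`, `KIND_BY_EXTENSION`, `export_json`, `export_csv` and `visible_to_others` are hypothetical and do not describe any existing Text+, OPERAS or TRIPLE interface, and guessing the kind from the file extension is just a stand-in for whatever recognition the platform would really perform.

```python
import csv
import io
import json
from dataclasses import dataclass, field

# Hypothetical mapping from file extension to dataset kind. A real platform
# would presumably inspect the content rather than the file name.
KIND_BY_EXTENSION = {
    ".conllu": "treebank",
    ".xml": "tei-text",
    ".txt": "plain-text",
}


@dataclass
class CorpusEntry:
    """One saved item in my (hypothetical) list of materials."""
    title: str
    filename: str
    source_registry: str
    user_tags: list = field(default_factory=list)
    public: bool = False  # private by default, made public later

    @property
    def kind(self) -> str:
        """Automatically assigned tag, guessed from the file extension."""
        name = self.filename.lower()
        for extension, kind in KIND_BY_EXTENSION.items():
            if name.endswith(extension):
                return kind
        return "unknown"

    def to_record(self) -> dict:
        """Flat description of the entry, shared by all export formats."""
        return {
            "title": self.title,
            "kind": self.kind,
            "tags": list(self.user_tags),
            "source_registry": self.source_registry,
            "public": self.public,
        }


def export_json(entries) -> str:
    """Export the entries as a JSON array, e.g. for further work in Python."""
    return json.dumps([e.to_record() for e in entries], ensure_ascii=False, indent=2)


def export_csv(entries) -> str:
    """Export the entries as CSV, e.g. for loading into R."""
    columns = ["title", "kind", "tags", "source_registry", "public"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for entry in entries:
        record = entry.to_record()
        record["tags"] = ";".join(record["tags"])  # CSV cells hold plain strings
        writer.writerow(record)
    return buffer.getvalue()


def visible_to_others(entries) -> list:
    """Only entries that have been made public are shown to other users."""
    return [e for e in entries if e.public]
```

In this sketch an entry stays private until its `public` flag is set, which mirrors the wish to keep corpora visible only to the project team until the work is finished.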