Data Archiving Support

Motivation

I am a linguist with many decades of experience and have accumulated rich collections of research data. I am also aware of the need for proper management of research data. Consequently, I have tried to put some order into the many resources I have created and worked with on a project-by-project basis. Prior to my retirement, I would like to commit all research data to an archive to preserve it for the years to come and to make it available to a wider audience. Unfortunately, I don’t see my own department (or my university) as fit to tackle the archiving task; rather, I am looking for an archive at the regional or national level that is well versed in Digital Humanities data, methodologies, and technologies, and that caters for data similar to mine. Ideally, I am looking for a computer-supported workflow that helps me to create archiving packages and to submit them to the archive that best fits the research data involved.

Objectives

The archiving tool should help me gather the following information to describe my research data:

A plain text description of the project in which the research data was collected, the project’s goals, hypotheses, and research questions.
An enumeration of all researchers contributing to the project, including their affiliations and contact information.
A formalized way to describe the nature of the research data, say, by working through a list of predefined types of research data (e.g., corpus, lexicon, experimental stimuli).
A mechanism to link to my publications (and those of my colleagues) that resulted from the analysis of the research data.
An easy way to select the licence(s) under which my research data will be available to interested parties. Here, it should be possible to assign licences on a file-by-file basis rather than on a project-by-project basis only.
A form where I can easily collect and upload my research data into a “data package”. If I upload file directories, then the tool should preserve the hierarchical structure of all research data. Once I upload the data from my hard drive to the tool, the tool should perform some basic analysis of each file to determine properties such as its name, size, media type, and language. The tool should allow me to complement or correct information that the automatic analysis failed to identify correctly.
Given the information provided so far, the tool then presents me with a list of archives most suitable for the research data I am preparing. I can select an archive (or multiple archives) to see the terms and conditions they offer for the preservation of my research data. Once an archive is selected, the entire package is sent to the archive of my choice (assuming the archive agrees to receive and archive my data). The tool ensures that all data is transferred correctly and completely (for instance, by computing a checksum for each file prior to transferal so that the archive can verify it on receipt).
As a result of the transferal, the archive confirms to have obtained my material in a correct and complete manner. The manager of the archive might have additional questions that I need to answer, and I might need to sign some sort of contract to fulfil legal obligations.
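
The automatic file analysis, the per-file licence assignment, and the checksums mentioned above could be prototyped roughly as follows. This is only a minimal sketch under stated assumptions: the names `FileRecord`, `build_inventory`, and `sha256_of` are hypothetical, the licence identifiers are placeholders, the media type is merely guessed from the file extension, and language detection is left out because it would be delegated to an external service.

```python
import hashlib
import mimetypes
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Optional


@dataclass
class FileRecord:
    """One entry of the data package inventory (hypothetical structure)."""
    relative_path: str         # POSIX-style path relative to the upload root
    size_bytes: int
    media_type: Optional[str]  # guessed from the extension; None if unknown
    sha256: str
    licence: str


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(
    root,
    default_licence: str,
    licence_overrides: Optional[Dict[str, str]] = None,
) -> List[FileRecord]:
    """Walk an uploaded directory and describe every file in it.

    Relative paths are kept, so the directory hierarchy is preserved.
    The project-wide licence applies unless a file has its own override.
    """
    root = Path(root)
    overrides = licence_overrides or {}
    records = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        media_type, _encoding = mimetypes.guess_type(path.name)
        records.append(
            FileRecord(
                relative_path=rel,
                size_bytes=path.stat().st_size,
                media_type=media_type,
                sha256=sha256_of(path),
                licence=overrides.get(rel, default_licence),
            )
        )
    return records
```

A call such as `build_inventory("my_project", "CC-BY-4.0", {"corpus/interviews.txt": "CC-BY-NC-4.0"})` would yield one record per file. Each record could then be shown to the researcher for correction before the package is submitted.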

In general, I would like to highlight the following aspects:

Getting researchers to deposit their valuable research data in a long-term archive is hard, so we need to lower the barrier to archiving and publishing research data.
Research data management is central to Text+, and while there are many tools available to analyse resources for the sake of scientific enquiry, there is no tool available that helps researchers during the archiving task.
Automated methods should be in place to help generate semantically rich CMDI-based metadata without referring users to hard-to-use XML-based editors.

Solution

Have a tool developed that implements the functionality mentioned above.
Implement semi-automatic steps in the tool so that an analysis of all input helps with the automatic provision of metadata about the research data.
Develop the tool with usability in mind. Gather information automatically whenever possible, and ask for information intelligently when necessary. Avoid at all costs that users are confronted with, or need to edit, metadata in XML.
Have the tool make use of existing CMDI-based metadata profiles (CLARIN Component Registry) to describe metadata in rich detail.
Make use of existing technology (such as the Switchboard Profiler) to detect a resource’s language or media type.
Have archives volunteer as end-points to such a tool, improving on the existing “Find your archive” service.
Make use of the BagIt File Packaging Format (used by the library world for the transferal of data to archives).
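
To illustrate the BagIt idea, the following sketch builds a minimal bag by hand using only the Python standard library: a `data/` payload directory, a `bagit.txt` declaration, and a `manifest-sha256.txt` listing one checksum per payload file. The function names `make_bag` and `verify_bag` are hypothetical, and the sketch makes simplifying assumptions (files are read fully into memory, no optional `bag-info.txt` is written, and file names containing line breaks or percent signs are not specially encoded). A real tool would more likely rely on an established BagIt library.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def make_bag(source_dir, bag_dir) -> Path:
    """Copy source_dir into bag_dir/data and write the required tag files."""
    source = Path(source_dir)
    bag = Path(bag_dir)
    payload = bag / "data"
    # copytree keeps the directory hierarchy; it fails if data/ already exists.
    shutil.copytree(source, payload)

    # Bag declaration required by the BagIt specification.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: "<checksum>  <path relative to the bag root>".
    lines = []
    for path in sorted(payload.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(bag).as_posix()}\n")
    (bag / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag


def verify_bag(bag_dir) -> List[str]:
    """Return the payload paths that are missing or fail their checksum."""
    bag = Path(bag_dir)
    problems = []
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        target = bag / rel_path
        if not target.is_file():
            problems.append(rel_path)
            continue
        actual = hashlib.sha256(target.read_bytes()).hexdigest()
        if actual != expected:
            problems.append(rel_path)
    return problems
```

In this sketch the submitting tool would call `make_bag` before transferal, and the receiving archive would call `verify_bag` afterwards; an empty result means that every payload file arrived intact.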

Challenges

Competitor: docuteam packer, which offers basic functionality but is not adapted to the Text+ community.

Challenges: Limited uptake by the community; but if the tool is well made, users can be nudged into using it.