Motivation
In the field of Media and Communication Studies (DFG subject area 111-03), much of the research is qualitative and analogue. For the quantitative branch of the field, only a few current, specific, and location-independent corpora are available to the research community free of charge. The project described below aims to change this situation by using (and partly providing) corpora based on text from current IT-blogs and newspaper articles for text data mining research.
Objectives
Following the assumption that IT-blogs and websites represent a group of technologically and politically interested experts, I investigate these blogs’ and websites’ impact on public discussion of matters situated at the intersection of technology and society. My research particularly examines the discourse about freedom of expression and the regulation of hate speech online by means of the German Network Enforcement Act (NetzDG) and compares the discourse about these topics on German IT-blogs and websites with that in major German newspapers. My goal is to identify the most influential stakeholders in each communicative subfield (blogs, websites, journalists, politicians, firms, advocacy groups, etc.), their communication strategies, and the arguments they bring forward. Ultimately, I want to determine whether IT-blogs and websites are able to identify and address the above-mentioned political questions at the intersection of technology and society early on, and whether they are able to translate them for discussion within broader swaths of the population.
Solution
In cooperation with colleagues at the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW), we have already compiled a corpus consisting of text from German-language IT-blogs and websites that is made available through the homepage of the Digitales Wörterbuch der deutschen Sprache (DWDS). This corpus can be searched by keyword, and the resulting hits can be downloaded and processed for further analysis. For my project, we have compiled a subcorpus “NetzDG” based on specific search terms, and I am currently examining it using text data mining methods as well as traditional close reading.
Furthermore, in cooperation with the German National Library (DNB), we have generated a corpus that comprises the articles on the NetzDG from nine major German newspapers, based on the same keywords. This newspaper corpus will allow me to compare the discourse on IT-blogs and websites with that in the traditional media and therefore make it possible to examine the potential impact of these blogs and websites on newspaper coverage and the broader public discussion.
Challenges
IT-Blog Corpus: In order to conduct this kind of research, the IT-blog corpus provided by the DWDS first needs to be searched for specific search terms, and the resulting subcorpus “NetzDG” must be compiled and downloaded with a particular corpus building tool (https://trafilatura.readthedocs.io/en/latest/tutorial-dwds.html). The output data (XML files) then needs to be converted into other data formats to allow for explorative and statistical data analyses with R, mixed-methods software, etc., or for more comfortable close reading. These processes are time-consuming and can be error-prone, particularly for less experienced users and DH researchers. A more unified system as well as support along the way, particularly when errors occur, would be very beneficial to the user. Furthermore, it would be very helpful if opportunities for training and for acquiring new skills or improving existing ones were made available to the community.
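The conversion step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the workflow actually used in the project: it assumes that each XML file has a root element carrying the metadata as attributes (here sitename, title, author, date, url) and a "main" element holding the article text. These names, the sample document, and the helper functions xml_to_row and rows_to_csv are hypothetical and would have to be adapted to the real output files.

```python
import csv
import io
import xml.etree.ElementTree as ET

# ASSUMPTION: metadata is stored as attributes on the root element.
# The attribute names below are illustrative, not taken from the project.
FIELDS = ["sitename", "title", "author", "date", "url"]


def xml_to_row(xml_text):
    """Turn one XML document into a flat dict: metadata plus plain text."""
    root = ET.fromstring(xml_text)
    row = {name: root.get(name, "") for name in FIELDS}
    # ASSUMPTION: the article body sits in a <main> element;
    # fall back to the whole document if it is missing.
    main = root.find("main")
    body = main if main is not None else root
    # Join all text nodes and normalise whitespace.
    row["text"] = " ".join(" ".join(body.itertext()).split())
    return row


def rows_to_csv(rows):
    """Serialise the rows as CSV text, e.g. for loading into R."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS + ["text"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample document, for demonstration only.
sample = (
    '<doc sitename="example-blog" title="NetzDG in der Kritik" author="" '
    'date="2018-01-05" url="https://example.org/netzdg">'
    "<main><p>Erster Absatz.</p><p>Zweiter Absatz.</p></main>"
    "</doc>"
)

row = xml_to_row(sample)
csv_text = rows_to_csv([row])
print(csv_text)
```

A table with one row per document and one column per metadata field is a format that R and most mixed-methods tools can import directly, which is why it is used here; other target formats (plain text files for close reading, for instance) would follow the same pattern.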
Newspaper Corpus: With regard to the newspaper corpus compiled at the DNB, the e-paper versions of the respective newspapers first have to be searched for the keywords, and the hits then need to be compiled into a corpus. The output of this initial search process is provided in PDF format and yields the complete newspaper page on which the search term occurs. Therefore, the individual article that contains the search term needs to be separated from the rest of the page and must then be converted into a text format for further processing. Additional metadata (publication date, author, pages, newspaper, etc.) must also be extracted, automatically where possible and manually otherwise. At present, no reliable, fully automated solution for this kind of article separation is available, which is why part of this process needs to be done manually and/or by specially trained personnel.
Due to the particular restrictions that apply to copyrighted materials such as current newspaper articles, this corpus has to be analysed on-site using offline machines provided by the DNB. This generates travel costs and can be time-consuming. Remote access to the data would be an ideal solution.
Both corpora are dynamic in the sense that additional crawls and improvements to the underlying algorithms (IT-blogs) will alter the content of the corpus over time and change the results. The same is true for the newspaper corpus, to which additional components might be added. The subcorpora created for (my) analysis must therefore be documented in order to ensure reproducibility and the archival storage of the results. This documentation should be available to the research community in order to enable peer review and critical engagement with the research.
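One possible way to document a subcorpus at the time of analysis is a small manifest that records the source, the search terms, the retrieval date, and a checksum for every document. The following Python sketch is an assumption about how such documentation could look, not an existing Text+, DWDS, or DNB service; the function names, the manifest fields, and the sample values are all hypothetical.

```python
import hashlib
import json


def build_manifest(documents, search_terms, retrieved_on, source):
    """Describe a subcorpus snapshot.

    documents: mapping of file name -> file content as bytes.
    All field names in the result are illustrative.
    """
    entries = [
        {
            "file": name,
            # The SHA-256 digest changes whenever the content changes.
            "sha256": hashlib.sha256(content).hexdigest(),
            "bytes": len(content),
        }
        for name, content in sorted(documents.items())
    ]
    return {
        "source": source,
        "search_terms": sorted(search_terms),
        "retrieved_on": retrieved_on,
        "n_documents": len(entries),
        "documents": entries,
    }


def manifest_to_json(manifest):
    """Serialise the manifest in a stable, human-readable form."""
    return json.dumps(manifest, indent=2, ensure_ascii=False, sort_keys=True)


# Hypothetical sample data, for demonstration only.
docs = {
    "blog_0001.xml": b"<doc>Erster Beitrag</doc>",
    "blog_0002.xml": b"<doc>Zweiter Beitrag</doc>",
}
manifest = build_manifest(
    docs,
    search_terms=["NetzDG"],
    retrieved_on="2020-01-01",
    source="IT-blog corpus (example)",
)
print(manifest_to_json(manifest))
```

Because the checksums depend only on the file contents, a later crawl that changes or adds documents produces a different manifest, so two analyses can be compared by comparing their manifests. The manifest itself contains no article text and could therefore be shared even where the underlying material is copyrighted.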
Additionally, it would be tremendously helpful to integrate as many digital resources and corpora as possible into a common environment for federated content search, e.g. (thematically related) content from Twitter, Facebook, more (international) newspapers, newsfeeds, legal documents, administrative and governmental debates, etc. This would allow researchers to identify relevant materials for their projects. Processing the data according to linked-open-data standards would be a further asset.
Review by community
Yes. I am more than happy to review the services provided by Text+ during the possible funding period.