Motivation
For my master’s thesis I wanted to combine methods from social science and computer science to investigate how hate speech on social media can be detected automatically with machine learning algorithms. Since Twitter has become one of the leading social networks, with an average of 300 million active users per month, it is often used to spread dehumanizing remarks based on origin, ethnicity, gender identity, sexual orientation, disability, serious diseases, age, and religious and/or political affiliation. For my project I needed not only a large corpus, but also a dataset whose tweets had already been classified into the categories ‘hate speech’ or ‘non-hate-speech’. At that time (December 2018) I found only a few particularly interesting and suitable Twitter corpora by reading state-of-the-art research papers on hate speech detection. The freely accessible corpora that had classified English tweets into the relevant categories were provided by Davidson et al. (https://github.com/t-davidson/hate-speech-and-offensive-language), Waseem and Hovy (https://github.com/ZeerakW/hatespeech), and ElSherief et al. (https://github.com/mayelsherif/hate_speech_icwsm18).
Objectives
While creating the corpus for my project I encountered two problems:
1. The projects I gathered the data from had used two different methods for collecting tweets. Some extracted tweets by predefined keywords, others extracted tweets based on the topics they discussed. To handle possible outlier tweets and/or topic imbalances, I had to consider both methods and their drawbacks, and I also had to take the individual sets of words/topics into consideration to unify the datasets.
2. Twitter does not allow the redistribution of “hydrated” content, meaning that you are allowed to publicly share tweet IDs only. Many tweet IDs could no longer be resolved, probably because the corresponding tweets had been deleted in the meantime.
Therefore, assembling data for my master’s thesis was very challenging. In the meantime, a few more annotated Twitter datasets suitable for investigating hate speech have been made public (https://hatespeechdata.com), but those hate speech corpora are still difficult to merge into a larger dataset. Twitter allows scraping tweets with a limit of 18,000 tweets per 15 minutes. Since dehumanizing remarks often get deleted quickly, creating a dataset yourself is really time-consuming.
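To put the rate limit into perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes the figure of 18,000 tweets per 15 minutes quoted above applies without interruption, and it ignores deleted tweets and failed lookups. The function names are hypothetical and not part of any Twitter library.

```python
# Rough estimate based on the limit quoted in the text (an assumption,
# not an authoritative statement about Twitter's current API limits).
TWEETS_PER_WINDOW = 18_000
WINDOW_MINUTES = 15


def windows_needed(n_tweets: int) -> int:
    """Number of 15-minute rate-limit windows needed to fetch n_tweets."""
    # Integer ceiling division: a partly used window still counts as one.
    return -(-n_tweets // TWEETS_PER_WINDOW)


def max_tweets_per_day() -> int:
    """Upper bound on tweets per day if every window is fully used."""
    windows_per_day = (24 * 60) // WINDOW_MINUTES
    return windows_per_day * TWEETS_PER_WINDOW


if __name__ == "__main__":
    print(windows_needed(1_000_000), "windows for one million tweets")
    print(max_tweets_per_day(), "tweets per day at most")
```

The bound is optimistic, because in practice many tweets are already gone by the time they are requested.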
Twitter content is helpful not only for hate speech detection but can also be used for other tasks, such as sentiment analysis or the detection of spam, sarcasm or suicidality.
Solution
An initiative which collects and curates user-generated content from Twitter/social media would not only be tremendously useful for further investigation of hate speech but could also provide data for content, social, network or semantic analysis. Therefore, the methods used to create general Twitter datasets need to be transparent and reusable.
Each tweet should be enriched with detailed metadata. Besides the two categories ‘hate speech’ or ‘non-hate-speech’, it would be interesting for further investigation if each tweet carried information on the keyword or topic which was responsible for the tweet being extracted from the Twitter stream. Such additional metadata would enable users to extract subsets that fit their needs from the general Twitter dataset collection. The German Twitter Titling Corpus, for example, specifically investigates the tweets of 24 German politicians with a doctoral degree (https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/AOSUY6). My master’s thesis would have greatly benefitted from a hate speech corpus with balanced topics for both classes.
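As an illustration of what such enriched records could look like, here is a minimal sketch. The field names, the helper functions and the placeholder values are my own assumptions and do not come from any existing corpus. Only tweet IDs are stored, in line with the redistribution restriction mentioned above.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class TweetRecord:
    """Hypothetical metadata record; no tweet text is stored, only the ID."""
    tweet_id: str
    label: str         # 'hate speech' or 'non-hate-speech'
    trigger_type: str  # 'keyword' or 'topic' (how the tweet was collected)
    trigger: str       # the keyword or topic that matched


def select(records: Iterable[TweetRecord],
           label: Optional[str] = None,
           trigger: Optional[str] = None) -> List[TweetRecord]:
    """Return the subset matching the given label and/or trigger."""
    return [r for r in records
            if (label is None or r.label == label)
            and (trigger is None or r.trigger == trigger)]


def class_balance(records: Iterable[TweetRecord]) -> Dict[str, Counter]:
    """Count labels per trigger to reveal topic imbalances between classes."""
    balance: Dict[str, Counter] = defaultdict(Counter)
    for r in records:
        balance[r.trigger][r.label] += 1
    return dict(balance)


# Placeholder data for illustration only; IDs and triggers are made up.
records = [
    TweetRecord("id-001", "hate speech", "keyword", "kw-a"),
    TweetRecord("id-002", "non-hate-speech", "keyword", "kw-a"),
    TweetRecord("id-003", "hate speech", "topic", "topic-x"),
    TweetRecord("id-004", "non-hate-speech", "topic", "topic-x"),
    TweetRecord("id-005", "non-hate-speech", "topic", "topic-x"),
]

if __name__ == "__main__":
    print(select(records, label="hate speech"))
    print(class_balance(records))
```

With records like these, a user could check per keyword or topic how many tweets fall into each class before deciding which subset to use.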
For research projects it is mandatory that datasets remain accessible to guarantee reproducible results. An infrastructure like Text+ which ensures the accessibility of referable datasets, especially for short-lived data such as user-generated content, would improve interdisciplinary research by enabling the combination of methods from different disciplines/research areas.
Challenges
The demand for guaranteed long-term accessibility of tweets would mean that either Twitter would have to change its terms and conditions or that hate speech on Twitter would have to remain undeleted. Still, it would be tremendously helpful for further research if, somewhere between these options, there lay a feasible way to curate problematic tweets. In any case, I would argue that improved access to any user-generated online content, regardless of topic or platform of origin, would benefit many researchers working in the social and political sciences.