Overcoming the computational hurdle and exploiting opportunities for humanities students in linguistic research for the Digital Humanities (DH) with ChatGPT

funded by the DaTaPin Project at Saarland University

PI: Stefania Degaetano-Ortlieb

Research Assistants: Isabell Landwehr (Summer Term 2023), Sergei Bagdasarov (Winter Term 2023/24)

This project aims to assess the potential of interactive AI systems in helping humanities students overcome their lack of computational skills, while exploring opportunities for students in linguistics doing research in the Digital Humanities to move from reproduction tasks towards enhancing their Future Skills (2021).

AIMS

Our specific project aims are threefold: (1) to assess how interactive AI systems (here: ChatGPT) can help overcome the computational hurdle in the humanities, (2) to assess the opportunities such systems may offer for research questions in linguistics, focusing on language variation, and (3) to evaluate the use of AI systems in teaching humanities students. On a more global level, these aims will allow us to work towards (a) explainable AI, as we are going to establish processes for the critical analysis of AI output (cf. European Union Agency for Fundamental Rights 2022 on bias in AI), and (b) enhancing students’ Future Skills (2021), such as classical competences (e.g. critical thinking) as well as technical, digital and transformative competences.

MOTIVATION

The demands of the digital era have prompted efforts to equip humanities students with computational skills. This has led to the inclusion of programming courses and soft AI applications in many BA and MA programs. While this is a step in the right direction, transferring knowledge from generic courses to specific research contexts is not always straightforward. In fact, these courses are continuously adapted in order to match humanities students’ needs. In my own teaching experience (more than 10 years of teaching corpus-linguistic and data-driven methods), humanities students in these study programs are very good at asking relevant research questions for the Digital Humanities, but still lack the computational skills to analyze big data quantitatively and empirically. Up to now, the teaching of computational skills in subject-specific courses has therefore been limited to basic programming skills (for example in R or Python). Restricting the range of packages and lines of code was essential to avoid overburdening the students and to maximize their learning outcomes. Teaching humanities students to search for and customize code using help pages on coding platforms or forums was not an option: the abundance of web content that students would need to sift through leads to frustration and negatively impacts their learning outcomes. Thus, the computational hurdle still, more often than not, prevents humanities students from realizing their full research potential. At the same time, humanities students are very good at analyzing the output of data-driven approaches qualitatively and at reflecting thoughtfully on the societal impact of their findings. However, the lack of appropriate data restricts the range of questions students can approach from a Digital Humanities perspective.

The use of ChatGPT is a tremendous opportunity to change this, with two aims:

Overcoming the computational hurdle for humanities students: ChatGPT generates code (in several programming languages) based on content-related prompts, i.e. prompts can be targeted at specific research problems. Besides generating the code, ChatGPT offers several explanatory functionalities. For example, it can explain the given code in detail, generate input and output data for us to evaluate the suggested code, give instructions on how to approach error messages, and change the code accordingly. This has clear didactic advantages that we should exploit in teaching (consider e.g. the impact on self-paced learning). However, processes of critical analysis have to be established so that students can reasonably evaluate the produced output.
Generating and analyzing AI-generated output given specific parameters of variation (e.g. age, gender, text type, language): The huge language models behind ChatGPT allow it to generate a rather broad spectrum of language variation: from spoken colloquial to written formal text, text in different languages, and text for different audiences (e.g. experts, non-experts, age-specific), to mention just a few. This opens up new research tasks that students were previously unable to pursue due to a lack of available data. More importantly, analyzing ChatGPT-generated output in depth from a (socio-)linguistic perspective allows us to better understand AI systems and work towards more explainable AI. For example, AI-generated output often comes with strong societal biases (cf. European Union Agency for Fundamental Rights 2022). These may also differ strongly across languages.

To take advantage of this opportunity, we must adapt our teaching in various ways (see 1-2 below). We also need to build on previous work on evaluating the impact of technological innovation in order to assess how its application helps humanities students acquire computational and Future Skills (see 3 below). For this, we will integrate introduction, application and evaluation phases within our courses (6 courses in total, see list below) in the following two study programs.

COURSE MODULES

We have created several teaching modules, which we distribute as Sways in an MS Teams environment, as MS Teams is our main teaching platform. However, the Sways can also be used separately. We’re happy to provide access to the modules: just write us an email and we’ll let you join the MS Teams environment or distribute the modules in other formats.

The following teaching modules have been designed and used during the Winter Term 2023/24 at Saarland University in the MA Empirical Linguistics and Translatology and in the BA Language Science.

From ideas to research questions and hypotheses
Register variation

Data manipulation with R
Data visualization with R

Statistics in linguistic research
Logistic models
Simple and multiple linear models
Mixed-effects linear models
