Predicting Sustainable Development Goals Using Course Descriptions — from LLMs to Conventional Foundation Models

We present our work on predicting United Nations Sustainable Development Goals (SDGs) for university courses. We use an LLM, PaLM 2, to generate training data from noisy human-authored course descriptions given as input. We use this data to train several smaller language models to predict SDGs for university courses. This work contributes to better university-level adaptation of the SDGs. The best-performing model in our experiments was BART, with an F1-score of 0.786.
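
The abstract names the pipeline only at a high level. As a hedged illustration, here is a minimal sketch of how a BART classifier for the 17 SDGs could be fine-tuned with the HuggingFace transformers API; the checkpoint, hyperparameters, and toy data are assumptions, not the paper's actual setup.

```python
# Sketch: fine-tuning BART as a multi-label SDG classifier on
# LLM-generated course-description data (all specifics are assumed).
from datasets import Dataset
from transformers import (BartForSequenceClassification, BartTokenizer,
                          Trainer, TrainingArguments)

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForSequenceClassification.from_pretrained(
    "facebook/bart-base",
    num_labels=17,  # one label per UN Sustainable Development Goal
    problem_type="multi_label_classification",
)

# Toy stand-in for the PaLM 2-generated training data described above.
train_ds = Dataset.from_dict({
    "description": ["A course on renewable energy systems and grid integration."],
    "labels": [[0.0] * 6 + [1.0] + [0.0] * 10],  # SDG 7: affordable and clean energy
})

def encode(batch):
    # Tokenize course descriptions into fixed-length model inputs.
    return tokenizer(batch["description"], truncation=True, padding="max_length")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sdg-bart", num_train_epochs=3),
    train_dataset=train_ds.map(encode, batched=True),
)
trainer.train()
```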

Normalization of Arabic Dialects into Modern Standard Arabic using BERT and GPT-2

We present an encoder-decoder model for normalizing Arabic dialects into Modern Standard Arabic (MSA), built from BERT- and GPT-2-based components. Arabic comprises many dialects that differ from MSA not only in pronunciation but also in morphology, grammar, and lexical choice. This diversity can be troublesome even for a native Arabic speaker, let alone a computer. Several NLP tools work well for MSA and some of the main dialects but fail to cover the Arabic language as a whole. Based on our manual evaluation, our model normalizes sentences entirely correctly 46% of the time and almost correctly 26% of the time.
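
As a sketch of what such an architecture can look like in practice, the transformers library supports gluing a pretrained BERT encoder to a pretrained GPT-2 decoder; the Arabic checkpoints below are illustrative assumptions, not necessarily the ones used in the paper.

```python
# Sketch: BERT-encoder / GPT-2-decoder model for dialect -> MSA normalization
# (checkpoints and generation settings are assumptions).
from transformers import AutoTokenizer, EncoderDecoderModel

enc_name = "aubmindlab/bert-base-arabertv02"  # assumed Arabic BERT checkpoint
dec_name = "aubmindlab/aragpt2-base"          # assumed Arabic GPT-2 checkpoint

enc_tok = AutoTokenizer.from_pretrained(enc_name)
dec_tok = AutoTokenizer.from_pretrained(dec_name)
model = EncoderDecoderModel.from_encoder_decoder_pretrained(enc_name, dec_name)

# The cross-attention weights are freshly initialized, so the model must be
# fine-tuned on dialect -> MSA sentence pairs before its output is meaningful.
model.config.decoder_start_token_id = dec_tok.bos_token_id
model.config.pad_token_id = enc_tok.pad_token_id

dialect_sentence = "..."  # a dialectal Arabic input sentence
inputs = enc_tok(dialect_sentence, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(dec_tok.decode(output_ids[0], skip_special_tokens=True))
```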

Perplexity Games: Maoism vs. Literature through the Lens of Cognitive Stylometry

The arrival of large language models (LLMs) has provoked an urgent search for stylistic markers that could differentiate machine text from human text. Yet while the human-like appearance of machine text has captivated public attention, the reverse phenomenon, human text becoming machine-like, has raised much less concern. This conceptual lag is surprising given the ample historical evidence of state-backed attempts to regulate human thought. The present article proposes a new comparative framework, Perplexity Games, to leverage the predictive power of LLMs and compare the statistical properties of Maospeak, a language style that emerged during the Mao Zedong era in China (1949-1976), with the style of canonical modern Chinese writers such as Eileen Chang (1920-1995) and Mo Yan (1955-). The low perplexity of Maospeak, as computed across different GPT models, suggests that the impact of ideologies on language can be compared to likelihood-maximization text-generation techniques, which reduce the scope of valid sequence continuations. These findings have cognitive implications: whereas engineered languages such as Maospeak hijack the predictive mechanisms of human cognition by narrowing the space of linguistic possibilities, literature resists such cognitive constraints by dispersing the probability mass over multiple, equally valid paths. Exposure to diverse language data counters the influence of ideologies on our linguistically mediated perceptions of the world and increases the perplexity of our imaginations.
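
The comparison rests on a standard per-text perplexity computation under a causal language model. A minimal sketch follows, using the English `gpt2` checkpoint as a stand-in; the paper works with Chinese GPT models, so the checkpoint here is an assumption.

```python
# Sketch: perplexity of a text under a GPT model, the core quantity
# behind the Perplexity Games comparison (checkpoint is a stand-in).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model return the mean token
        # negative log-likelihood (labels are shifted internally).
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# Lower perplexity = the model finds the text more predictable.
print(perplexity("The quick brown fox jumps over the lazy dog."))
```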

Towards efficient and reliable utilization of automated data collection: Media scrapers applied to news on climate change

Automated data collection offers tempting opportunities for social sciences and humanities studies. The abundant data accumulating in various digital archives allows more comprehensive, timely, and cost-efficient ways of harvesting and processing information. While easing or even removing some key problems, such as laborious and time-consuming data collection, errors and biases introduced by subjective coding of materials, and distortions caused by a focus on small samples, automated methods also bring new risks, such as poor understanding of the data's context or failure to recognize underlying systematic errors and missing information. Results from testing different methods of collecting data on newspaper coverage of climate change in Finland emphasize that fully relying on automatable tools such as media scrapers has its limitations: document acquisition can be extensive yet still incomplete. Many of these limitations can, however, be addressed, and not all of the remedies require manual control.
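
To make the failure mode concrete, here is a hypothetical minimal scraper of the kind the paper evaluates; the URL, CSS selectors, and keyword filter are illustrative assumptions, not the paper's actual tooling.

```python
# Hypothetical minimal news scraper (all specifics are illustrative).
import requests
from bs4 import BeautifulSoup

def scrape_headlines(url: str, query: str = "ilmastonmuutos") -> list[str]:
    """Fetch a page and keep headlines mentioning the query term."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    headlines = [h.get_text(strip=True) for h in soup.select("h2, h3")]
    # The keyword filter is exactly where silent incompleteness creeps in:
    # articles that paraphrase the topic without the query term are lost.
    return [h for h in headlines if query.lower() in h.lower()]

# Example (hypothetical URL):
# print(scrape_headlines("https://example.com/uutiset"))
```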

Perceptions of 21st-century digital skills and agency among design sprint participants in Laurea UAS, Finland

This explorative study investigated students’ (N=16) perceptions before and after the study unit Digital Analytics and Consumer Insights. The unit was conducted as an intensive, hybrid five-day design sprint, a variant of project- and problem-based learning. An online questionnaire with a 5-point Likert scale was used for data collection. The findings indicate that the intervention improved perceptions of most of the studied digital “hard skills” (8 of 11 claims). Among the twelve 21st-century “soft skills” claims, perceptions were high initially and improved significantly for the critical-thinking and systematic problem-solving claims during the design sprint. Agency scores showed a slight improvement but no significant difference. Face-to-face groups were more willing than online groups to recommend the sprint method to peers. In an era of global turbulence and artificial intelligence, employers demand not only hard skills but also soft skills such as communication, teamwork, problem-solving, and project management; according to LinkedIn data from February 2024, adaptability is the most in-demand skill. In addition to teaching traditional subjects, pedagogical methods in higher education should better support the development of 21st-century skills.
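
For readers wanting to replicate this kind of pre/post significance testing, a paired nonparametric test on Likert responses is a common choice; the sketch below uses scipy with made-up data (the study's raw responses are not reproduced here, and the test itself is an assumption about the analysis).

```python
# Sketch: paired Wilcoxon signed-rank test on 5-point Likert responses
# for one claim, before and after the design sprint (data are invented).
from scipy.stats import wilcoxon

pre  = [3, 2, 4, 3, 3, 2, 4, 3, 2, 3, 4, 3, 2, 3, 3, 2]  # N=16 students
post = [4, 3, 4, 4, 3, 3, 5, 4, 3, 4, 4, 4, 3, 4, 4, 3]

stat, p = wilcoxon(pre, post)
print(f"W={stat:.1f}, p={p:.4f}")  # p < 0.05 would indicate a significant shift
```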

Ainu–Japanese Bi-directional Neural Machine Translation: A Step Towards Linguistic Preservation of Ainu, An Under-Resourced Indigenous Language in Japan

This study presents a groundbreaking approach to preserving the Ainu language, recognized as critically endangered by UNESCO, by developing a bi-directional neural machine translation (MT) system between Ainu and Japanese. Utilizing the Marian MT framework, known for its effectiveness with resource-scarce languages, the research aims to overcome the linguistic complexities inherent in Ainu's polysynthetic structure. The paper delineates a comprehensive methodology encompassing data collection from diverse Ainu text sources, meticulous preprocessing, and the deployment of neural MT models, culminating in the achievement of significant SacreBLEU scores that underscore the models' translation accuracy. The findings illustrate the potential of advanced MT technology to facilitate linguistic preservation and educational endeavors, advocating for integrating such technologies in safeguarding endangered languages. This research not only underscores the critical role of MT in bridging language divides but also sets a precedent for employing computational linguistics to preserve cultural and linguistic heritage.
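
The SacreBLEU scoring step is straightforward to reproduce; the sketch below shows the library call with placeholder strings (the paper's actual hypotheses and references are not shown, and the character-level tokenizer is an assumption suitable for Japanese output).

```python
# Sketch: scoring MT output with SacreBLEU (placeholder strings only).
import sacrebleu

hyps = ["こんにちは。"]    # system outputs, one string per segment
refs = [["こんにちは。"]]  # one list per reference stream

# "char" avoids whitespace tokenization, which fits unsegmented Japanese;
# "ja-mecab" is an alternative if MeCab is installed.
bleu = sacrebleu.corpus_bleu(hyps, refs, tokenize="char")
print(f"SacreBLEU: {bleu.score:.1f}")
```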

Historical Documents and Automatic Text Recognition: Introduction

With this special issue of the Journal of Data Mining and Digital Humanities (JDMDH), we bring together in one single volume several experiments, projects and reflections related to automatic text recognition applied to historical documents.

More and more research projects now include automatic text acquisition in their data-processing chain, and this is true not only for projects focussed on Digital or Computational Humanities but increasingly also for those that are simply using existing digital tools as a means to an end. The increasing use of this technology has led to an automation of tasks that affects the role of the researcher in the textual production process. This new data-intensive practice makes it urgent to collect and harmonise the corpora necessary for the constitution of training sets, but also to make them available for exploitation. This special issue is therefore an opportunity to present articles combining philological and technical questions in order to make a scientific assessment of the use of automatic text recognition for ancient documents, its results, its contributions and the new practices induced by its use in the process of editing and exploring texts. We hope that practical aspects will be questioned on this occasion, while raising methodological challenges and the technology's impact on research data.

This special issue on Automatic Text Recognition (ATR) is therefore dedicated to providing a comprehensive overview of the use of ATR in the humanities, particularly concerning historical documents in the early 2020s. It presents a fusion of engineering and philological aspects, catering to both beginners and experienced users interested in launching projects with ATR. The collection encompasses a diverse array of approaches, covering topics such as data creation and collection for training generic models, reaching specific objectives, technical and HTR machine architecture, segmentation methods, and image processing.

Exploring Data Provenance in Handwritten Text Recognition Infrastructure: Sharing and Reusing Ground Truth Data, Referencing Models, and Acknowledging Contributions. Starting the Conversation on How We Could Get It Done

This paper discusses best practices for sharing and reusing Ground Truth in Handwritten Text Recognition infrastructures, as well as ways to reference and acknowledge contributions to the creation and enrichment of data within these systems. We discuss how one can place Ground Truth data in a repository and, subsequently, inform others through HTR-United. Furthermore, we want to suggest appropriate citation methods for ATR data, models, and contributions made by volunteers. Moreover, when using digitised sources (digital facsimiles), it becomes increasingly important to distinguish between the physical object and the digital collection. These topics all relate to the proper acknowledgement of labour put into digitising, transcribing, and sharing Ground Truth HTR data. This also points to broader issues surrounding the use of machine learning in archival and library contexts, and how the community should begin to acknowledge and record both contributions and data provenance.
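
As a conversation starter of the kind the authors call for, such a provenance record could be made machine-readable along the following lines; the field names are illustrative only, loosely inspired by the metadata HTR-United catalogues, and do not reproduce any published schema.

```python
# Sketch: a hypothetical machine-readable provenance record for a
# Ground Truth dataset (field names are invented for illustration).
import json

record = {
    "title": "Example 19th-century letters Ground Truth",
    "transcription-guidelines": "https://example.org/guidelines",
    "contributors": [
        {"name": "Jane Doe", "role": "transcriber"},
        {"name": "Volunteer group X", "role": "correction"},
    ],
    "source": {
        # Distinguishing the physical object from its digital facsimile,
        # as the paper recommends.
        "physical-object": "Archive Y, shelfmark Z",
        "digital-facsimile": "https://example.org/iiif/manifest.json",
    },
    "license": "CC BY 4.0",
    "models-trained": ["https://example.org/models/letters-v1"],
}

print(json.dumps(record, indent=2, ensure_ascii=False))
```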