Modelling late positivities by large language models

Poster Session C, Friday, October 25, 4:30 - 6:00 pm, Great Hall 3 and 4

Yiming Lu, University of California, Irvine

Background: Large language models (LLMs) have increasingly been applied to model event-related potentials (ERPs) in sentence processing. Researchers have extensively used LLMs to predict the human N400, but the later components, the P600 and the post-N400 frontal positivity (PNP), have received much less attention. This study assesses the predictive power of two information-theoretic measures estimated from LLMs, surprisal and entropy, for these two components. Previous ERP experiments showed that the N400 is modulated by word predictability, whereas the later components may be more sensitive to sentence constraint. We therefore anticipated that LLM surprisal would predict the N400 well and that LLM entropy would predict the later components well.

Methods: To test these hypotheses, we used LLMs to model a recently published large-scale EEG experiment (Stone et al., 2023). Stone and colleagues examined the influence of sentence constraint on the frontal PNP and P600: 64 participants read 56 sets of sentences differing in sentence constraint and final-word predictability. In the present work, we used LLMs to calculate final-word surprisal and context entropy (masking the final critical word). Specifically, we used pre-trained GPT-2 models accessed from the Hugging Face transformers library (Wolf et al., 2020): GPT-2 base (124M parameters), GPT-2 medium (355M), GPT-2 large (774M), and GPT-2 XL (1.5B). These models differ in the number of parameters but share the same training data (about 8 million web pages) and vocabulary size (50,257 tokens). Linear mixed-effects models with by-subject and by-item random intercepts were constructed separately for the N400, P600, and frontal PNP, and the Akaike Information Criterion (AIC) was extracted from each model for comparison.

Results: We found that surprisal was a good predictor of the N400, replicating previous studies. Surprisal had no predictive power for the PNP, and only the smaller models showed marginal significance for the P600. For entropy, the small models could not predict the N400, but the larger GPT-2 models could, contrary to previous results. Most surprisingly, entropy from no model predicted the P600 or the PNP. Even so, AIC decreased for the entropy models as parameter count increased, meaning that larger models provide a better trade-off between model fit and complexity.

In conclusion, human entropy is markedly different from LLM entropy. Why might this be? Several factors could be at play. First, the LLM lexicon is likely much larger than that of human participants, and the candidates humans activate in next-word prediction are arguably far fewer than those over which an LLM distributes probability; hence, GPT-2 entropy could be much larger than human entropy. Second, GPT-2 models are autoregressive models with perfect memory over the entire context window, which is unlikely to hold for human comprehenders. Third, not every word in the context is necessarily relevant: human readers can selectively attend to the pertinent words to activate the right situation model, whereas LLM entropy is calculated from every word in the preceding context. Whether LLMs could develop selective attention to filter out irrelevant context words remains an open question.
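To make the two measures concrete, the snippet below is a minimal sketch of how final-word surprisal and next-word entropy might be computed with GPT-2 via the Hugging Face transformers library. The example sentence, the variable names, and the single-token simplification are illustrative assumptions, not the study's actual materials or code.

import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Any of the four sizes could be loaded here: gpt2, gpt2-medium, gpt2-large, gpt2-xl.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = "The children went outside to"  # sentence with the critical final word removed
final_word = " play"                      # critical word; leading space matters for BPE

with torch.no_grad():
    # Next-word distribution at the end of the context.
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    logits = model(ctx_ids).logits[0, -1]        # scores over the 50,257-token vocabulary
    log_probs = torch.log_softmax(logits, dim=-1)

    # Surprisal of the final word: -log2 P(word | context).
    # For simplicity this assumes the word maps to one BPE token;
    # multi-token words would need their token log-probs summed.
    word_id = tokenizer(final_word).input_ids[0]
    surprisal = -log_probs[word_id].item() / math.log(2)

    # Shannon entropy of the next-word distribution, in bits.
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum().item() / math.log(2)

print(f"surprisal = {surprisal:.2f} bits, entropy = {entropy:.2f} bits")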
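The model comparison could then be set up as follows. This is a hedged sketch assuming Python's statsmodels for mixed models with crossed by-subject and by-item random intercepts (the abstract does not name the statistical software; lme4 in R would be equally plausible), with synthetic data standing in for the single-trial ERP amplitudes.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for single-trial amplitudes; real analyses would load the EEG data.
rng = np.random.default_rng(0)
df = pd.DataFrame([{"subject": s, "item": i} for s in range(8) for i in range(10)])
df["surprisal"] = rng.normal(10, 3, len(df))
df["entropy"] = rng.normal(7, 2, len(df))
df["n400"] = -0.5 * df["surprisal"] + rng.normal(0, 2, len(df))
df["p600"] = rng.normal(0, 2, len(df))
df["pnp"] = rng.normal(0, 2, len(df))

def fit_and_aic(formula, data):
    """Fit a mixed model with crossed subject/item intercepts; return its AIC."""
    model = smf.mixedlm(
        formula,
        data,
        groups=np.ones(len(data)),  # one dummy group, so both factors enter as crossed
        vc_formula={"subject": "0 + C(subject)", "item": "0 + C(item)"},
    )
    result = model.fit(reml=False)  # ML fit so AICs are comparable across formulas
    k = len(result.params)          # approximate count: fixed effects + variance terms
    return 2 * k - 2 * result.llf   # AIC = 2k - 2 * log-likelihood

# Lower AIC indicates a better trade-off between model fit and complexity.
for predictor in ["surprisal", "entropy"]:
    for component in ["n400", "p600", "pnp"]:
        print(f"{component} ~ {predictor}: AIC = {fit_and_aic(f'{component} ~ {predictor}', df):.1f}")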

Topic Areas: Computational Approaches, Reading
