Presentation


Context and Attention Shape Electrophysiological Correlates of Speech-to-Language Transformation

Poster A76 in Poster Session A, Tuesday, October 24, 10:15 am - 12:00 pm CEST, Espace Vieux-Port

Andrew Anderson1,2, Chris Davis3, Ed Lalor1; 1Medical College of Wisconsin, 2University of Rochester, 3Western Sydney University

To understand speech, the human brain must accommodate variability in intonation, speech rate, volume, accents and so on to transform sounds into words. A promising approach to explaining this process has been to model and predict electroencephalogram (EEG) recordings of brain responses to speech. Contemporary models typically invoke hand-crafted speech categories (e.g. phonemes) as an intermediary representational stage between sounds and words. However, such brain models are incomplete because the sounds are categorized externally to the model, which also means they cannot predict the neural computations that putatively underpin categorization. By providing end-to-end accounts of speech-to-language transformation, new deep-learning systems could enable more complete brain models. Here we reference EEG recordings of audiobook comprehension to one such model, Whisper, which transforms speech into a time-aligned linguistic representation across a series of feedforward layers. Notably, this transformation is achieved by encoding prior speech context, which may supply linguistic cues that help disambiguate periods of noisy speech (e.g. “President Joe <noise>”). Such contextual information is not present in purely categorical speech models. We first reanalyzed a dataset of publicly available EEG recordings taken from 19 subjects as they listened to ~1 hour of an audiobook. We hypothesized that the contextualized internal representations of Whisper would predict EEG responses more accurately than models of concurrent speech acoustics, because Whisper, like the human brain, is adapted to transform speech to language. To test this hypothesis, we ran a series of cross-validated multiple regression analyses that mapped different layers of a Whisper model of the speech stimulus to the EEG data. We observed that Whisper’s deepest linguistic layers dominated prediction in bilateral temporal scalp electrodes that are traditionally linked to acoustic speech responses. These deep layers proved to be more accurate predictors than models of speech acoustics or linguistic prediction (derived from GPT-2). Furthermore, constraining Whisper’s access to prior speech context affected EEG prediction accuracy, though contrary to our expectation the relationship was concave: access to ~10 s of context was more beneficial than either shorter or longer contexts (e.g. 1 s or 30 s). To consolidate evidence that the new EEG correlates reflected the linguistic transformation of speech, we examined a second publicly available “cocktail party” dataset (27 subjects). Here listeners heard two concurrent speech streams but attended to only one. We hypothesized and found that Whisper’s predictive advantage over the acoustic models would selectively dwindle when the unattended speech stream was modeled and predicted, in line with listeners’ inability to accurately report the content of unattended speech. The current study helps advance understanding of the neurophysiological processes underpinning speech comprehension by identifying a self-contained EEG model of speech-to-language transformation that is relatively accurate, sensitive to listener attention, and potentially revealing of how the brain could exploit prior speech context in comprehension. We hope that this approach may lead toward user-friendly methods for indexing the linguistic depth of speech processing in developmental and disordered populations.
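
To illustrate the kind of analysis described above, the sketch below shows a cross-validated ridge regression (a common choice for such encoding models; the abstract specifies only “multiple regression”) that maps time-lagged activations from a single Whisper layer to multichannel EEG and scores prediction accuracy per channel. All array shapes, the sampling rate, the lag window, the ridge penalty, and the random placeholder data are assumptions for illustration only, not the authors’ pipeline.

```python
# Illustrative sketch (not the authors' code): cross-validated ridge regression
# mapping time-lagged Whisper layer activations to EEG channels.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Placeholder data standing in for real, time-aligned features and EEG
# (shapes and sampling rate are illustrative assumptions).
n_samples, n_features, n_channels = 6000, 512, 64             # e.g. 60 s at 100 Hz
whisper_layer = rng.standard_normal((n_samples, n_features))  # one Whisper layer, resampled to the EEG rate
eeg = rng.standard_normal((n_samples, n_channels))            # EEG at the same rate

def add_lags(X, lags):
    """Stack time-lagged copies of X so the regression spans a response window."""
    return np.concatenate([np.roll(X, lag, axis=0) for lag in lags], axis=1)

X = add_lags(whisper_layer, lags=range(0, 40, 8))  # 0 to 320 ms of lags at 100 Hz (assumed)

# 5-fold cross-validation; score = correlation between predicted and recorded
# EEG in the held-out folds, averaged over folds, per channel.
n_folds = 5
scores = np.zeros(n_channels)
for train, test in KFold(n_splits=n_folds).split(X):
    model = Ridge(alpha=1e3).fit(X[train], eeg[train])
    pred = model.predict(X[test])
    for ch in range(n_channels):
        scores[ch] += np.corrcoef(pred[:, ch], eeg[test, ch])[0, 1] / n_folds

print("mean prediction correlation across channels:", scores.mean())
```

In a real analysis, the same procedure would be repeated for each Whisper layer and for competing acoustic or GPT-2-derived feature sets, with per-channel correlations compared across models and scalp locations.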

Topic Areas: Speech Perception, Computational Approaches
