Presentation


Understanding the shared coding of speech and language between deep neural network models and the human brain

Poster Session C, Friday, October 25, 4:30 - 6:00 pm, Great Hall 3 and 4

Yuanning Li¹, Peili Chen¹, Shiji Xiang¹, Edward Chang²; ¹ShanghaiTech University, ²University of California San Francisco

Recent advances in computational cognitive neuroscience highlight the parallel between deep neural network (DNN) processing of speech and text and human brain activity. However, most studies have examined how single-modality DNN models (speech or text) correspond with activity in particular brain networks, and the critical factors driving these correlations remain unknown, especially whether DNNs of different modalities share these factors. It is also unclear how these driving factors evolve in space and time across the brain's language network. To address these questions, we analyzed the representational similarity between self-supervised learning (SSL) models for speech (Wav2Vec2) and language (GPT-2) and neural responses to naturalistic speech recorded with high-density electrocorticography in 16 participants. We developed two neural encoding models, a time-invariant sentence-level model and a time-dependent single-word-level model, which delineate the overall correspondence and the fine-grained temporal dynamics of brain-DNN alignment, respectively. Both types of SSL models predicted neural activity accurately before and after word onsets. We observed distinct spatiotemporal dynamics: both models showed high encoding accuracy 40 ms before word onset (-40 ms), especially in the middle superior temporal gyrus (mSTG), while Wav2Vec2 also peaked 200 ms after word onset (+200 ms). Clustering the timecourses of the SSL models' word-level encoding scores revealed two distinct clusters, corresponding mainly to mSTG and posterior STG (pSTG). The mSTG cluster contributed to the -40 ms peak, while the pSTG cluster contributed to the +200 ms peak. Using canonical correlation analysis, we found that shared components between Wav2Vec2 and GPT-2 explain a substantial portion of the SSL-brain similarity. Further decomposition of the DNN representations indicated that contextual information encoded in the SSL models contributed more to brain alignment in mSTG and before word onset (-40 ms), whereas acoustic-phonetic and static semantic information contributed more to brain alignment in pSTG and after word onset (+200 ms). In summary, we demonstrate that speech and language DNNs share neural correlates driven by contextual and acoustic-phonetic cues, aligning with distinct neural activity patterns over space and time. Our findings suggest that key aspects of the neural coding of speech are captured by self-supervised DNNs of different modalities, reflecting a convergence of artificial and biological information-processing systems.
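The abstract itself contains no code; the following is a minimal sketch of the kind of time-dependent, word-level encoding analysis it describes, assuming ridge regression as the linear mapping and synthetic arrays in place of the real Wav2Vec2/GPT-2 embeddings and ECoG recordings. All names and shapes are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Hypothetical shapes: one DNN embedding per word, and neural high-gamma
# activity sampled at a fixed lag relative to each word onset
# (e.g., -40 ms or +200 ms).
rng = np.random.default_rng(0)
n_words, emb_dim, n_electrodes = 1000, 768, 64
X = rng.standard_normal((n_words, emb_dim))       # DNN word embeddings
Y = rng.standard_normal((n_words, n_electrodes))  # neural response at one lag

def encoding_score(X, Y, alpha=100.0, n_splits=5):
    """Cross-validated encoding accuracy: Pearson r between held-out
    neural activity and its ridge prediction, per electrode."""
    scores = np.zeros((n_splits, Y.shape[1]))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for i, (train, test) in enumerate(kf.split(X)):
        pred = Ridge(alpha=alpha).fit(X[train], Y[train]).predict(X[test])
        for e in range(Y.shape[1]):  # correlate electrode by electrode
            scores[i, e] = np.corrcoef(pred[:, e], Y[test, e])[0, 1]
    return scores.mean(axis=0)

print(f"mean encoding r: {encoding_score(X, Y).mean():.3f}")
```

Repeating this fit over a sweep of lags around word onset yields, per electrode, the encoding-score timecourse that the abstract then clusters into the mSTG and pSTG groups (e.g., with k-means over electrodes).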
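Likewise, a sketch of how shared components between the two models could be extracted with canonical correlation analysis, as the abstract describes, again with synthetic stand-ins for the word-aligned Wav2Vec2 and GPT-2 embeddings:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hypothetical word-aligned representations: rows are words,
# columns are embedding dimensions from some layer of each model.
rng = np.random.default_rng(1)
n_words = 1000
speech_emb = rng.standard_normal((n_words, 768))  # stand-in for Wav2Vec2
text_emb = rng.standard_normal((n_words, 768))    # stand-in for GPT-2

# Project both models onto their maximally correlated (shared) subspace.
cca = CCA(n_components=20, max_iter=1000)
speech_shared, text_shared = cca.fit_transform(speech_emb, text_emb)

# Canonical correlation per shared component (on the fitted data).
canon_r = [np.corrcoef(speech_shared[:, k], text_shared[:, k])[0, 1]
           for k in range(speech_shared.shape[1])]
print("top canonical correlations:", np.round(canon_r[:5], 3))
```

The shared scores (e.g., speech_shared) can be passed to the same encoding model as the full embeddings to ask how much of the SSL-brain similarity the shared subspace accounts for; the residual after removing the shared subspace plays the role of each model's unique, modality-specific information.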

Topic Areas: Computational Approaches, Speech Perception
