
A computational investigation of the transformation from talker-specific detail to talker-invariant lexical representations

Poster Session D, Saturday, October 26, 10:30 am - 12:00 pm, Great Hall 3 and 4

Sahil Luthra1, Kevin Brown2, Jay G. Rueckl3, James S. Magnuson3,4,5; 1Carnegie Mellon University, 2Oregon State University, 3University of Connecticut, 4BCBL: Basque Center on Cognition, Brain and Language, 5Ikerbasque: Basque Foundation for Science

A long-standing theoretical question in the field of speech perception is the extent to which representations of speech sounds include talker-specific detail or might be conditioned on talker (e.g., Goldinger, 1998; Kleinschmidt, 2019; Magnuson & Nusbaum, 2007). Neurobiological data have provided some insight into this question, with evidence that phonetic processing and talker processing are supported by separate but overlapping neural systems (for review, see Luthra, 2021). The goal of the current work is to test whether computational constraints may create pressure for a neurobiological system to adopt such a dual-stream architecture (Avcu et al., 2023). Although most computational models of spoken word recognition operate on abstract phonetic features (and thereby sidestep the issue of talker-related acoustic-phonetic variability), a notable exception is the EARSHOT model (Magnuson et al., 2020). Because it operates on spectrogram-based inputs, EARSHOT offers a tool for investigating how talker-to-talker acoustic variability might be processed. Here, we use Representational Similarity Analysis (RSA; Kriegeskorte et al., 2008) to assay how talker-specific and talker-invariant information are represented over time in the model-internal hidden states of three EARSHOT model variants. First, we trained a standard “lexical-objective” model to map acoustic inputs to lexical-semantic outputs. Second, we trained a talker-objective variant, the only goal of which was to identify talkers. Third, we trained a dual-objective variant to map speech to lexical and talker outputs simultaneously. RSA indicated that talker-specific details quickly dissipate from hidden-unit activity in the standard model but drive hidden-unit activity more strongly in the talker-objective model. Our results provide insight into how talker-specific surface details are mapped to abstract, talker-invariant lexical representations depending on training targets. They may also generate novel hypotheses about similar transformations in the brain, a possibility we will test by comparing models to cortical responses to speech (cf. Brodbeck et al., 2024; Mesgarani et al., 2020).
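To make the three training objectives concrete, the following minimal Python sketch shows a recurrent model with separate lexical and talker output heads; dropping either loss term recovers the single-objective variants. All names, layer sizes, and loss choices here are illustrative assumptions, not the actual EARSHOT implementation.

    import torch
    import torch.nn as nn

    class DualObjectiveSketch(nn.Module):
        """Toy EARSHOT-style recurrent model: spectrogram frames in,
        simultaneous lexical-semantic and talker-identity outputs."""
        def __init__(self, n_freq=256, n_hidden=512, n_semantic=300, n_talkers=10):
            super().__init__()
            self.rnn = nn.LSTM(n_freq, n_hidden, batch_first=True)
            self.lexical_head = nn.Linear(n_hidden, n_semantic)  # lexical objective
            self.talker_head = nn.Linear(n_hidden, n_talkers)    # talker objective

        def forward(self, spectrogram):            # (batch, time, n_freq)
            hidden, _ = self.rnn(spectrogram)      # (batch, time, n_hidden)
            return self.lexical_head(hidden), self.talker_head(hidden), hidden

    model = DualObjectiveSketch()
    x = torch.randn(8, 50, 256)                    # batch of toy spectrograms
    lex_out, talker_out, hidden = model(x)
    # Dual objective: sum of a lexical loss and a talker loss (targets are
    # random placeholders here); dropping either term yields the
    # lexical-objective or talker-objective variant.
    loss = (nn.functional.mse_loss(lex_out, torch.randn_like(lex_out))
            + nn.functional.cross_entropy(talker_out.reshape(-1, 10),
                                          torch.randint(10, (8 * 50,))))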
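The time-resolved RSA can likewise be sketched with synthetic data: build binary hypothesis RDMs for talker identity and word identity, then correlate each with the neural RDM computed from the hidden states at every timestep. Array shapes and variable names below are assumptions for illustration, not the authors' analysis code.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.stats import spearmanr

    # Items are word-by-talker tokens; hidden_states stands in for the
    # model-internal activity of a trained network: (items, time, units).
    n_talkers, n_words, n_timesteps = 10, 20, 50
    rng = np.random.default_rng(0)
    hidden_states = rng.normal(size=(n_talkers * n_words, n_timesteps, 512))
    talker_ids = np.repeat(np.arange(n_talkers), n_words)
    word_ids = np.tile(np.arange(n_words), n_talkers)

    # Hypothesis RDMs: dissimilarity 1 where two items differ on a label.
    talker_rdm = squareform(
        (talker_ids[:, None] != talker_ids[None, :]).astype(float), checks=False)
    word_rdm = squareform(
        (word_ids[:, None] != word_ids[None, :]).astype(float), checks=False)

    # At each timestep, correlate the neural RDM (correlation distance
    # between hidden-state vectors) with each hypothesis RDM.
    talker_fit, word_fit = [], []
    for t in range(n_timesteps):
        neural_rdm = pdist(hidden_states[:, t, :], metric="correlation")
        talker_fit.append(spearmanr(neural_rdm, talker_rdm)[0])
        word_fit.append(spearmanr(neural_rdm, word_rdm)[0])
    # talker_fit vs. word_fit traces how strongly talker identity versus
    # lexical identity structures hidden-unit activity over time.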

Topic Areas: Speech Perception, Computational Approaches
