Presentation

Automated measures of syntactic complexity for spontaneous speech

Poster E3 in Poster Session E, Saturday, October 8, 3:15 - 5:00 pm EDT, Millennium Hall
This poster is part of the Sandbox Series.

Galit Agmon¹, Sameer Pradhan¹, Sharon Ash¹, Naomi Nevler¹, Mark Liberman¹, Murray Grossman¹, Sunghye Cho¹; ¹University of Pennsylvania

Sentences with more complex syntactic structures impose a greater burden on cognitive processing. This is reflected in elevated reaction times, increased brain activity, or specific deficits in aphasia. This effect of syntactic complexity has typically been studied in the controlled framework of minimal pairs, where grammatical constructions like passives, object relative clauses, or central embeddings are found to be more taxing than actives, subject relative clauses, or right branching, respectively. In recent years, a growing number of studies have quantified syntactic complexity for uncontrolled stimuli such as speech (e.g., Agmon et al. in SNL 2021). However, currently used methods are heterogeneous, both in terms of their quantification metrics and the language models they rely on (specifically, dependency grammar vs. phrase-structure grammar). In addition, none of these methods has been validated against manually annotated grammatical constructions. In this study, we compared two measures of syntactic complexity, computed with automated tools, and validated their usefulness in capturing different aspects of syntactic complexity. We examined Mean Dependency Distance (MDD) and Mean Node Count (MNC). MDD is derived from dependency grammar by calculating each word's distance from its head and averaging per sentence. MNC is derived from phrase-structure grammar by counting the number of phrases closed by each word and averaging per sentence. We calculated MDD and MNC for different types of corpora: sentences of minimal pairs (e.g., passive/active, object/subject relative clauses, etc.), sentences extracted from written text (Alice in Wonderland), and spontaneous speech of younger and older healthy speakers. We examined the correlational structure of these measures relative to each other and validated them against a manual count of grammatical constructions.
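To make the two definitions concrete, the following is a minimal sketch of how MDD and MNC could be computed for a single, already-parsed sentence. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the input formats (a list of 1-based head positions and a Penn-Treebank-style bracketed string), the exclusion of the root word from the MDD average, and the exclusion of part-of-speech (preterminal) nodes from the MNC count are all choices made here for illustration. The example parses are hand-written rather than produced by an automated parser.

```python
import re


def mean_dependency_distance(heads):
    """Average linear distance between each word and its head.

    heads[i] is the 1-based position of the head of word i+1; 0 marks the root.
    Assumption: the root has no head, so it is left out of the average.
    """
    distances = [
        abs(position - head)
        for position, head in enumerate(heads, start=1)
        if head != 0
    ]
    return sum(distances) / len(distances) if distances else 0.0


def mean_node_count(bracketed):
    """Average number of phrases closed by each word in a bracketed parse.

    Assumption: the first closing bracket after a word closes its
    part-of-speech node, which is not counted as a phrase.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    counts = []
    for i, token in enumerate(tokens):
        # A word is a non-bracket token that does not directly follow "("
        # (tokens directly after "(" are node labels such as NP or VBD).
        if token in ("(", ")") or i == 0 or tokens[i - 1] == "(":
            continue
        closes = 0
        j = i + 1
        while j < len(tokens) and tokens[j] == ")":
            closes += 1
            j += 1
        counts.append(max(closes - 1, 0))  # drop the part-of-speech node
    return sum(counts) / len(counts) if counts else 0.0


# Hand-annotated toy sentence: "the dog barked"
toy_heads = [2, 3, 0]  # the -> dog, dog -> barked, barked = root
toy_tree = "(S (NP (DT the) (NN dog)) (VP (VBD barked)))"

print(mean_dependency_distance(toy_heads))
print(mean_node_count(toy_tree))
```

In this sketch the two measures are computed from different parses of the same sentence, which mirrors the point that MDD depends on a dependency analysis while MNC depends on a phrase-structure analysis.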
In all corpora, MDD and MNC were not correlated with each other, suggesting that they capture independent aspects of syntax (written text: ρ=0.3; younger speech: ρ=-0.1; older speech: ρ=-0.2). Though MDD was correlated with sentence length (r>0.7, p<0.001 in all corpora), it was higher for the more complex sentences in minimal pairs of equal length. MNC was positively associated with the number of embedded clauses in both age groups (younger: p<0.001; older: p=0.01). MDD of older speakers was dramatically lower than that of younger speakers (p<0.001), controlling for sentence length. MNC of older speakers was slightly higher than that of younger speakers (p=0.05), reflecting their more frequent use of embedded constructions. To conclude, automated measures can be useful for quantifying syntactic complexity in uncontrolled corpora, such as spontaneous speech. We showed that MDD and MNC capture complementary aspects of syntactic complexity. MDD is greater for constructions that are known to be complex, such as passives, object relative clauses and central embeddings. MDD may reflect linguistic working memory, which could explain why it is lower for older speakers. MNC, on the other hand, is greater for deeper phrase-structure trees, and may reflect the linguistic process of merging phrases to build up the syntactic representation. In future work, we will validate these measures with brain signals and longitudinal analyses, and investigate the trajectory of syntactic complexity during neurodegenerative disease.

Topic Areas: Methods, Syntax