
Complex lexico-semantic networks: cross-linguistic comparison of embedding cosine similarity and human free word associations


Poster A91 in Poster Session A, Tuesday, October 24, 10:15 am - 12:00 pm CEST, Espace Vieux-Port

Abigail E. Licata (1), Sarah Saneei (1), Simon De Deyne (2), Valentina Borghesani (1); (1) University of Geneva, (2) University of Melbourne

Numerous achievements in natural language processing over the past decade have led to the development of novel analytical approaches for cognitive neuroscience. For instance, word embeddings can effectively represent word relations and perform a variety of tasks, including semantic analysis (Khurana et al., 2023). An attractive feature of embeddings is that they can be trained on amounts of data that would be infeasible for human data collection efforts. However, while language models are continuously improving, the correspondence between their underlying representations and those of humans remains uncertain (Stevenson & Merlo, 2022). While word vectors have been shown to capture some semantic and syntactic relations between words, they do not reliably capture the complexity of human semantic spaces, which are shaped by language, culture, and experience. Given the paucity of cross-lingual research in this area, we set out to compare word associations between humans and embeddings across three languages. Free word association datasets in three languages (English, Dutch, French) were obtained from the Small World of Words (SWoW) project (www.smallworldofwords.org). Word networks for a set of 231 cue words were constructed based on forward association frequency. For each language, we used monolingual MUSE embeddings trained on a Wikipedia corpus, ran a k-nearest-neighbors (KNN) analysis to identify the top 5 most similar words for each cue, and then calculated the mean average precision at K (MAP@K) and mean normalized discounted cumulative gain (MNDCG) using the SWoW dataset as ground truth. All responses were preprocessed, including lemmatization and removal of NLTK stopwords and proper nouns. Polysemous words were checked within each language and confirmed by a native speaker; polysemous cues were removed for a given language if a mismatch occurred between the SWoW and MUSE output. All embeddings performed relatively poorly on the KNN search: there was low correspondence between the words retrieved from the embeddings and the most frequently produced responses for each cue word in the SWoW dataset. Among the three languages, the English embeddings performed best (MAP@K = 0.10, MNDCG = 0.17), followed by French (0.08, 0.15) and Dutch (0.05, 0.09). We compared KNN performance from word embeddings against human free word associations to understand how well language models can approximate the lexico-semantic space within and across languages. We found that embedding performance only partially reflected human free word associations, and that this pattern was consistent across languages. The English embeddings showed the highest correspondence, probably owing to the greater number of English Wikipedia pages compared to French and Dutch. Future work will use random walks across knowledge graphs to probe the capacity of these models to capture semantic similarity and relatedness across languages.
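To make the retrieval step concrete, below is a minimal sketch of a top-5 KNN query over pretrained word vectors. The use of gensim and the file name "wiki.en.vec" are illustrative assumptions, not the authors' stated tooling; the abstract specifies only that Wikipedia-trained monolingual (MUSE) embeddings were queried for each cue's five nearest neighbors by cosine similarity.

```python
# Minimal sketch of the KNN step over pretrained vectors (illustrative only).
# Assumes vectors in word2vec text format, such as the MUSE Wikipedia
# embeddings; the file name "wiki.en.vec" is a placeholder.
from gensim.models import KeyedVectors

K = 5
vectors = KeyedVectors.load_word2vec_format("wiki.en.vec", binary=False)

def knn_for_cue(cue, k=K):
    """Return the top-k most similar words to a cue by cosine similarity."""
    if cue not in vectors:
        return []
    return [word for word, _ in vectors.most_similar(cue, topn=k)]

print(knn_for_cue("dog"))  # e.g. a list of 5 neighboring word forms
```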
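The two evaluation scores can be computed per cue and then averaged over all 231 cues. The sketch below assumes binary relevance, i.e., a retrieved neighbor counts as a hit if it appears among the cue's SWoW responses; whether the authors instead weighted hits by association frequency is not stated in the abstract.

```python
import math

def average_precision_at_k(retrieved, relevant, k=5):
    """AP@K: precision at each rank where a hit occurs, normalized by
    the best achievable number of hits within the top k."""
    hits, score = 0, 0.0
    for rank, word in enumerate(retrieved[:k], start=1):
        if word in relevant:
            hits += 1
            score += hits / rank
    denom = min(k, len(relevant))
    return score / denom if denom else 0.0

def ndcg_at_k(retrieved, relevant, k=5):
    """NDCG@K with binary relevance: discounted gain of the hits divided
    by the gain of an ideal ranking of the relevant items."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, word in enumerate(retrieved[:k], start=1)
              if word in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0

# MAP@K and MNDCG are the means of these per-cue scores over all cues.
```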
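The preprocessing described (lemmatization, NLTK stopword removal, proper-noun filtering) might look like the following for English; each language would need its own stopword list and lemmatizer, and the capitalization-based proper-noun filter is a guess, since the abstract does not say how proper nouns were identified.

```python
# Sketch of the preprocessing step (English shown; hypothetical details).
# Requires the NLTK "stopwords" and "wordnet" data packages.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(responses):
    """Lemmatize responses, dropping stopwords and capitalized tokens
    (a crude proper-noun filter; the authors' exact criterion is unstated)."""
    cleaned = []
    for word in responses:
        if word[:1].isupper() or word.lower() in stop_words:
            continue
        cleaned.append(lemmatizer.lemmatize(word.lower()))
    return cleaned
```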

Topic Areas: Computational Approaches, Meaning: Lexical Semantics
