COMPUTATIONAL MODELS OF SPEECH PERCEPTION AND CATEGORIZATION


GMM learning speech sound categories over development.

Languages vary in how acoustic cues are mapped onto speech sound categories. For example, Japanese has a single category on the R/L dimension, while English has two. English has only two categories for voicing (e.g., B and P), while Thai has four. How do listeners learn to map acoustic differences in speech onto phonological categories, and what information in the speech signal do they use to do this?

Our lab uses several computational approaches to study these questions. First, consider the problem that infants face when learning how many phonological categories there are in their native language for a given acoustic dimension. One way they can do this is by tracking the distributional statistics of cues in the sound signal. We have modeled this process using a type of computational model called a Gaussian mixture model (GMM).

The movie on this page shows one of these models learning the voicing categories of English. Initially, the model starts out with many potential categories (since, like infants, it does not know how many categories the language will have). Over time, the model eliminates unnecessary categories to arrive at the correct two-category solution, adjusting the parameters of the mixture components (black lines) to capture the distributional statistics of the input (red line). This simulation demonstrates that statistical learning and competition are sufficiently powerful mechanisms for acquiring speech categories from distributional statistics, and they allow us to examine the process of speech development over time.
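To make the idea of competition among candidate categories concrete, here is a minimal sketch of a one-dimensional mixture model that starts with more components than it needs and removes the ones that lose the competition for data. It is not the model shown in the movie: it swaps in batch expectation-maximization with a simple count penalty, and the function name, the `penalty` parameter, and the voice onset time values are all assumptions made for illustration. How many components survive depends on those settings.

```python
import math
import random


def fit_competitive_gmm(x, k=8, n_iter=100, penalty=5.0):
    """Fit a 1-D Gaussian mixture by EM while components compete for data.

    `penalty` (a hypothetical setting) is subtracted from each component's
    effective data count before the mixing weights are renormalized;
    components whose count drops to zero are removed.
    """
    lo, hi = min(x), max(x)
    # Start with k evenly spaced candidate categories (requires k >= 2).
    mu = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    sigma = [max((hi - lo) / k, 1e-3)] * k
    pi = [1.0 / k] * k
    for _ in range(n_iter):
        m = len(mu)
        nk, sx, sxx = [0.0] * m, [0.0] * m, [0.0] * m
        # E-step: share each data point among the components.
        # (The constant sqrt(2*pi) cancels in the ratio, so it is omitted.)
        for xi in x:
            dens = [pi[j] / sigma[j]
                    * math.exp(-0.5 * ((xi - mu[j]) / sigma[j]) ** 2)
                    for j in range(m)]
            total = sum(dens)
            if total == 0.0:
                continue  # point is too far from every component to assign
            for j in range(m):
                r = dens[j] / total
                nk[j] += r
                sx[j] += r * xi
                sxx[j] += r * xi * xi
        # Competition: penalize every component; weak ones drop out.
        strength = [max(n - penalty, 0.0) for n in nk]
        norm = sum(strength)
        if norm == 0.0:
            break  # penalty too strong for this data; keep last estimate
        keep = [j for j in range(m) if strength[j] > 0.0]
        # M-step: update the surviving components.
        mu = [sx[j] / nk[j] for j in keep]
        sigma = [math.sqrt(max(sxx[j] / nk[j] - (sx[j] / nk[j]) ** 2, 0.0))
                 + 1e-3 for j in keep]
        pi = [strength[j] / norm for j in keep]
    return mu, sigma, pi


# Illustrative input: two voicing categories along voice onset time (ms).
# These means and standard deviations are made up for the example.
random.seed(0)
vot = ([random.gauss(0, 8) for _ in range(200)]
       + [random.gauss(50, 12) for _ in range(200)])
mu, sigma, pi = fit_competitive_gmm(vot)
print(len(mu), [round(m, 1) for m in mu])
```

Raising `penalty` makes the competition harsher, so fewer components survive; setting it to zero gives ordinary EM, which keeps every component that still accounts for some data.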

We can also use this approach to study how listeners learn to weight and combine multiple cues in speech. We have developed a GMM that weights cues based on their reliability, and we have shown that this model can explain both the integration of multiple acoustic cues and changes in audio-visual cue weights over development. For instance, infants tend to weight auditory cues more heavily than adults do. The model shows the same developmental timecourse, initially weighting auditory and visual cues similarly to infants and eventually weighting them more like adults once it learns which cues are most reliable.
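One simple way to picture reliability-based weighting is sketched below. It assumes that a cue's reliability can be summarized by how well separated the two categories are along that cue (their distance in pooled standard deviation units), and that cues are combined as a weighted sum after rescaling. Both choices, and all names and numbers, are illustrative assumptions rather than the metric used in the model described above.

```python
import math


def cue_reliability(mu_a, sd_a, mu_b, sd_b):
    """Separation of categories A and B along one cue, in pooled-SD units."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return abs(mu_b - mu_a) / pooled_sd


def cue_weights(categories):
    """Normalize reliabilities so that the weights sum to 1.

    `categories` maps a cue name to (mu_a, sd_a, mu_b, sd_b).
    """
    reliability = {cue: cue_reliability(*p) for cue, p in categories.items()}
    total = sum(reliability.values())
    return {cue: r / total for cue, r in reliability.items()}


def combine_cues(values, categories, weights):
    """Weighted sum of cue values, each rescaled so A is 0 and B is 1."""
    score = 0.0
    for cue, value in values.items():
        mu_a, _, mu_b, _ = categories[cue]
        score += weights[cue] * (value - mu_a) / (mu_b - mu_a)
    return score


# Hypothetical category estimates: the auditory cue separates the two
# categories more cleanly than the visual cue does.
categories = {
    "auditory": (0.0, 1.0, 4.0, 1.0),
    "visual": (0.0, 2.0, 4.0, 2.0),
}
weights = cue_weights(categories)
# A conflicting token: the auditory cue points to B, the visual cue to A.
score = combine_cues({"auditory": 4.0, "visual": 0.0}, categories, weights)
print(weights, score)
```

Because the weights are recomputed from the current category estimates, they shift as those estimates change during learning, which is the kind of developmental change in cue weighting described above.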

GMM categorization of speech sounds along a /b/-/d/ continuum, showing a larger influence of visual cues over time (Getz et al., 2017).

A related set of questions concerns how we identify which acoustic cues are important for speech perception given the large set of possible cues in the signal and the large amount of variability between talkers in how they use those cues to indicate phonological distinctions. To address this, we used techniques from graph theory to identify networks (Steiner trees) connecting individual acoustic cues and individual talkers in a way that captured the subsets of cues needed to accurately classify the sounds. This allows us to find a balance between models that are too complex (e.g., including all possible cues) and models that are too simple (i.e., not including enough cues to account for differences between talkers).
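For readers unfamiliar with Steiner trees, the sketch below finds one by brute force on a toy graph. A Steiner tree is the cheapest tree that connects a required set of nodes (here, talkers) while using only as many optional nodes (here, cues) as it needs. The graph, the edge costs, and the names `cue_A`, `cue_B`, and `cue_C` are hypothetical, and how the published analysis built its graph and assigned costs is not reproduced here. Brute force is only practical for a handful of optional nodes.

```python
from itertools import combinations


def spanning_tree(nodes, edges):
    """Kruskal's algorithm on the subgraph induced by `nodes`.

    Returns (total_cost, tree_edges), or None if `nodes` are not connected.
    """
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    total, tree = 0.0, []
    for cost, u, v in sorted(edges):
        if u in parent and v in parent:
            root_u, root_v = find(u), find(v)
            if root_u != root_v:
                parent[root_u] = root_v
                total += cost
                tree.append((u, v))
    if len(tree) != len(nodes) - 1:
        return None
    return total, tree


def steiner_tree(required, optional, edges):
    """Try every subset of optional nodes; keep the cheapest spanning tree."""
    best = None
    for size in range(len(optional) + 1):
        for extra in combinations(optional, size):
            result = spanning_tree(set(required) | set(extra), edges)
            if result is not None and (best is None or result[0] < best[0]):
                best = result
    return best


# Hypothetical costs: a low cost means the cue is useful for that talker.
talkers = ["T1", "T2", "T3"]
cues = ["cue_A", "cue_B", "cue_C"]
edges = [
    (1, "T1", "cue_A"), (1, "T2", "cue_A"), (4, "T3", "cue_A"),
    (1, "T2", "cue_B"), (1, "T3", "cue_B"),
    (5, "T1", "cue_C"), (5, "T2", "cue_C"), (5, "T3", "cue_C"),
]
cost, tree = steiner_tree(talkers, cues, edges)
used_cues = sorted({node for edge in tree for node in edge if node in cues})
print(cost, used_cues)
```

In this toy graph no single cue serves all three talkers cheaply, so the cheapest tree keeps two cues and leaves out the third, which mirrors the balance between including too many cues and too few.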

Together, these models help us better understand both the information necessary for recognizing speech and the processes by which listeners acquire and adjust their speech sound categories.

The most informative cues for recognizing fricatives identified by Steiner tree networks in Crinnion et al. (2020).

More information: