Effectiveness of automatic phoneme recognition on a pseudoword production task after oral or oropharyngeal cancer - IALP 2025 Innovation and Inspiration in Communication Sciences and Disorders

Background Anatomical changes after oral or oropharyngeal cancer make it necessary to examine speech production at the segmental level, in order to analyze the links between oropharyngeal anatomy and phoneme recognition. Automatic acoustic-phonetic decoding systems make this analysis possible, but their recognition effectiveness on pathological speech must first be studied.

Objective To study the effectiveness of automatic phoneme recognition on a pseudoword production task after oral or oropharyngeal cancer.

Methods Eighty subjects treated for oral or oropharyngeal cancer were recorded producing a phonetically balanced list of 52 pseudowords (mean length: 4.6 phonemes). Each pseudoword was transcribed by three automatic systems: a TDNN-HMM and two CTC Transformers (with and without fine-tuning). A clinician also transcribed the productions in real time and perceptually rated the severity of the speech impairment (0: major impairment, 10: no impairment). Recognition effectiveness was assessed with two measures: a phone error rate (PER: the sum of recognition errors, i.e. deletions, insertions and substitutions, divided by the number of target phonemes) and a perceived phonological deviation score (PPD: the average number of phonological features per phoneme that differ between the recognized and the target phoneme).
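The PER definition above can be illustrated with a minimal Python sketch. It assumes that errors are counted on a minimum-edit (Levenshtein) alignment between target and recognized phoneme sequences, which the abstract does not specify. The function names and the example phoneme sequences are hypothetical and are not taken from the study.

```python
from typing import List, Sequence, Tuple


def count_errors(target: Sequence[str], recognized: Sequence[str]) -> Tuple[int, int, int]:
    """Return (deletions, insertions, substitutions) for one minimum-edit alignment.

    Assumption: unit cost for every error type (standard Levenshtein distance).
    """
    n, m = len(target), len(recognized)
    # Each cell holds (total_errors, deletions, insertions, substitutions).
    cost = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = (i, i, 0, 0)  # every target phoneme deleted
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, j, 0)  # every recognized phoneme inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, d, ins, s = cost[i - 1][j - 1]
            if target[i - 1] == recognized[j - 1]:
                diagonal = (t, d, ins, s)  # match, no error
            else:
                diagonal = (t + 1, d, ins, s + 1)  # substitution
            t, d, ins, s = cost[i - 1][j]
            deletion = (t + 1, d + 1, ins, s)
            t, d, ins, s = cost[i][j - 1]
            insertion = (t + 1, d, ins + 1, s)
            # Tuples compare on total_errors first, so the total stays minimal.
            cost[i][j] = min(diagonal, deletion, insertion)
    _, d, ins, s = cost[n][m]
    return d, ins, s


def phone_error_rate(pairs: List[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """PER = (deletions + insertions + substitutions) / number of target phonemes."""
    errors = 0
    targets = 0
    for target, recognized in pairs:
        errors += sum(count_errors(target, recognized))
        targets += len(target)
    if targets == 0:
        raise ValueError("PER is undefined without target phonemes")
    return errors / targets


# Hypothetical pseudoword: one substitution (t -> d) and one insertion (k).
example = [(["p", "a", "t", "i"], ["p", "a", "d", "i", "k"])]
print(phone_error_rate(example))
```

Because insertions are counted as errors but do not add to the denominator, a system that inserts many phonemes can reach a PER above 1.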

Results A total of 4,107 pseudowords were analyzed (exclusions were due to recording problems or missing data). Human transcription offers the best recognition quality, with the lowest PER and PPD (PER=0.24, PPD=0.42). Among the automatic systems, the fine-tuned CTC Transformer has the lowest PER, significantly so (PER=0.52), but the TDNN-HMM has the lowest PPD (PPD=1.54). For both the human and the automatic systems, errors increase as the disorder becomes more severe. The automatic systems come closer to human performance (more favorable ratio) when only the less severely impaired subjects are analyzed.

Conclusion Automatic acoustic-phonetic decoding systems applied to post-cancer speech are still less effective than human experts. Their errors depend on system architecture, possibly because of frequent insertions (CTC Transformer: 0.80 phonemes/pseudoword). Sound events in the signal (mouth noises, etc.) can disrupt automatic recognition, whereas they do not disrupt human recognition. Analysis of the human/automatic convergence raises questions about the effects of human phonemic restoration as soon as speech is degraded. Further work is needed to investigate the link between these errors and the analytical deficits, to shed light on production mechanisms and to optimize rehabilitation strategies.