Speech Emotion Recognition from Bone-Conducted Speech Using Wav2Vec2 Transformer Model

Published in 7th International Conference on Sustainable Technologies for Industry 5.0, Narayanganj, Bangladesh, 2025

Speech emotion recognition (SER) is an essential technology for enhancing human-computer interaction (HCI). While most SER research uses air-conducted (AC) speech, bone-conducted (BC) speech offers a resilient alternative, especially in noisy environments. This paper introduces an end-to-end SER system based on the Wav2Vec2.0 transformer model, fine-tuned on the EmoBone dataset, a comprehensive, multinational BC speech dataset covering eight emotion categories collected from 29 speakers in 10 countries. Our method relies on self-supervised pre-training to bypass manual feature extraction, learning detailed contextual features directly from raw audio waveforms. The system combines the pre-trained Wav2Vec2.0 encoder with a custom classification head for efficient emotion prediction. On the EmoBone dataset, our approach attains an overall accuracy of 93% and a weighted-average F1-score of 93%, markedly surpassing earlier best-performing techniques. The model separates emotions with distinct acoustic features well but has difficulty differentiating acoustically similar pairs such as neutral-sad and fear-disgust. Additional ablation studies evaluate channel effects (AC-only, BC-only, and AC+BC fusion) and pretraining configurations (from-scratch, linear probe, full fine-tune). This study is among the first to demonstrate the efficacy of self-supervised transformers for bone-conducted emotion recognition, and its results underscore the promise of transformer-based architectures and set a benchmark for future SER research.
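For readers unfamiliar with the two metrics reported in the abstract, the following is a minimal sketch of how overall accuracy and a support-weighted average F1-score are commonly computed. It is not the paper's evaluation code: the function names and the toy labels are hypothetical and chosen only for illustration, and the numbers it produces have no relation to the EmoBone results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted emotion matches the label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)  # number of true samples per emotion
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n / total  # weight each class by its share of samples
    return score


# Toy labels (illustrative only, not drawn from EmoBone): one "neutral"
# utterance is misclassified as "sad", an acoustically similar pair.
y_true = ["neutral", "neutral", "sad", "sad", "happy", "happy"]
y_pred = ["neutral", "sad", "sad", "sad", "happy", "happy"]

print(f"accuracy    = {accuracy(y_true, y_pred):.3f}")
print(f"weighted F1 = {weighted_f1(y_true, y_pred):.3f}")
```

Weighting by support means frequent emotion classes contribute more to the final score, so on a roughly balanced dataset the weighted F1 tends to sit close to the overall accuracy.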

Recommended citation: M. K. Saha, M. S. Hosain, M. R. Hossen, S. K. Ray, L. C. Paul, and M. S. Uddin, "Speech Emotion Recognition from Bone-Conducted Speech Using Wav2Vec2 Transformer Model," 2025 IEEE 7th International Conference on Sustainable Technologies for Industry 5.0, Narayanganj, Bangladesh, 2025.