Multimodal Emotion Regression with Multi-Objective Optimization and VAD-Aware Audio Modeling for the 10th ABAW EMI Track
arXiv:2603.13760v1 Announce Type: new Abstract: We participated in the 10th ABAW Challenge, focusing on the Emotional Mimicry Intensity (EMI) Estimation track on the Hume-Vidmimic2 dataset. This task aims to predict six continuous emotion dimensions: Admiration, Amusement, Determination, Empathic Pain, Excitement, and Joy. Through systematic multimodal exploration of pretrained high-level features, we found that, under our pretrained feature setting, direct feature concatenation outperformed the more complex fusion strategies we tested. This empirical finding motivated us to design a systematic approach built upon three core principles: (i) preserving modality-specific attributes through feature-level concatenation; (ii) improving training stability and metric alignment via multi-objective optimization; and (iii) enriching acoustic representations with a VAD-inspired latent prior. Our final framework integrates concatenation-based multimodal fusion, a shared six-dimensional regression head, multi-objective optimization with MSE, Pearson-correlation, and auxiliary branch supervision, EMA for parameter stabilization, and a VAD-inspired latent prior for the acoustic branch. On the official validation set, the proposed scheme achieved our best mean Pearson Correlation Coefficient of 0.478567.
Executive Summary
This article presents a multimodal emotion regression framework for the 10th ABAW EMI Track, built on concatenation-based multimodal fusion, multi-objective optimization, and VAD-aware audio modeling. The framework combines feature-level concatenation of pretrained modality features, a shared six-dimensional regression head, and a VAD-inspired latent prior for the acoustic branch. On the official validation set, it achieves a mean Pearson Correlation Coefficient of 0.478567. While the approach improves emotion mimicry intensity estimation, further research is needed to establish its generalizability and robustness in real-world applications.
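The fusion-plus-shared-head design can be illustrated with a minimal sketch. The embedding sizes and random weights below are hypothetical (the paper uses pretrained high-level features and trained parameters); the sketch only shows the data flow: per-modality vectors are concatenated unchanged, then a single linear head predicts all six emotion dimensions jointly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-clip embedding sizes for three modalities
# (illustrative numbers only, not the paper's actual dimensions).
audio_feat = rng.standard_normal(512)   # pooled acoustic embedding
video_feat = rng.standard_normal(768)   # pooled visual embedding
text_feat = rng.standard_normal(384)    # pooled text embedding

# Feature-level concatenation preserves each modality's attributes
# instead of mixing them through a learned fusion module.
fused = np.concatenate([audio_feat, video_feat, text_feat])  # shape (1664,)

# One shared linear head maps the fused vector to all six dimensions
# (Admiration, Amusement, Determination, Empathic Pain, Excitement, Joy).
W = rng.standard_normal((6, fused.shape[0])) * 0.01  # placeholder weights
b = np.zeros(6)
pred = W @ fused + b  # six continuous intensity estimates
```

Sharing one head across all six targets lets the outputs draw on a common fused representation rather than training six disjoint regressors.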
Key Points
- ▸ The proposed framework integrates concatenation-based multimodal fusion and a shared six-dimensional regression head.
- ▸ Multi-objective optimization with MSE, Pearson-correlation, and auxiliary branch supervision is used to improve training stability and metric alignment.
- ▸ A VAD-inspired latent prior is incorporated into the acoustic branch to enrich acoustic representations.
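The multi-objective loss in the second point can be sketched as follows. This is a hedged reconstruction, not the authors' exact formulation: the weights `w_mse`, `w_corr`, `w_aux` and the single-dimension Pearson term are illustrative assumptions, and the auxiliary branch is represented only as an extra MSE term.

```python
import numpy as np


def mse_loss(pred, target):
    """Standard mean-squared error."""
    return float(np.mean((pred - target) ** 2))


def pearson_loss(pred, target, eps=1e-8):
    """1 - r, so that minimizing the loss maximizes the Pearson
    correlation, aligning training with the challenge metric."""
    p = pred - pred.mean()
    t = target - target.mean()
    r = (p @ t) / (np.sqrt((p @ p) * (t @ t)) + eps)
    return float(1.0 - r)


def multi_objective_loss(pred, target, aux_pred=None, aux_target=None,
                         w_mse=1.0, w_corr=1.0, w_aux=0.5):
    """Weighted sum of MSE, Pearson-correlation loss, and an optional
    auxiliary-branch supervision term (weights are illustrative)."""
    loss = w_mse * mse_loss(pred, target) + w_corr * pearson_loss(pred, target)
    if aux_pred is not None:
        loss += w_aux * mse_loss(aux_pred, aux_target)
    return loss
```

When predictions match targets exactly, both the MSE and correlation terms vanish; the correlation term keeps training focused on the ranking structure that the mean-PCC metric rewards, while the MSE term anchors the absolute scale.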
Merits
Strength in Systematic Approach
The framework's systematic approach to multimodal fusion, multi-objective optimization, and VAD-aware audio modeling demonstrates a clear and effective method for addressing the complexities of emotion estimation.
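One way to picture the VAD-aware component is as an auxiliary projection of the acoustic latent onto valence-arousal-dominance coordinates, regularized toward a prior. The sketch below is an assumption-laden illustration: the latent size, the linear VAD head, the `tanh` bounding, and the MSE prior term are all hypothetical stand-ins for the paper's unspecified formulation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical batch of acoustic latents (8 clips, 256-dim each).
audio_latent = rng.standard_normal((8, 256))

# Hypothetical linear projection to 3 VAD (valence, arousal,
# dominance) coordinates, bounded to [-1, 1] via tanh.
W_vad = rng.standard_normal((256, 3)) * 0.01
vad_pred = np.tanh(audio_latent @ W_vad)

# Stand-in VAD prior targets (e.g. pseudo-labels from a VAD model).
vad_target = rng.uniform(-1.0, 1.0, size=(8, 3))

# An auxiliary MSE term pulls the acoustic latent toward the VAD
# prior, enriching it with affect-relevant structure.
prior_loss = float(np.mean((vad_pred - vad_target) ** 2))
```

The intuition is that constraining the acoustic branch to be predictive of coarse affect dimensions injects emotion-relevant inductive bias before the six-dimensional regression is applied.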
Demerits
Limited Generalizability
The framework's performance is demonstrated on a single dataset (Hume-Vidmimic2), and further research is needed to explore its generalizability and robustness in real-world applications.
Expert Commentary
While the proposed framework improves emotion mimicry intensity estimation, its generalizability and robustness in real-world applications remain open questions, since results are reported on a single dataset. Future research should evaluate the framework on diverse datasets and in different scenarios, such as varying environmental conditions and complex social interactions. Incorporating domain knowledge and human insight into the framework's development could further enhance its effectiveness and adaptability. Ultimately, the framework's potential to contribute to emotionally intelligent systems makes it a promising line of research.
Recommendations
- ✓ Future research should focus on exploring the framework's generalizability and robustness in real-world applications.
- ✓ The incorporation of domain knowledge and human insights into the framework's development could enhance its effectiveness and adaptability.