arXiv:2409.18971

Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better

Published on Sep 12, 2024

Abstract

In this paper, we present our solutions for emotion recognition in the sub-challenges of the Multimodal Emotion Recognition Challenge (MER2024). To mitigate the modal competition between audio and text, we adopt an early fusion strategy based on a large language model, in which audio and text are first trained jointly; the resulting joint audio-text feature is then late-fused with other unimodal features. To address data insufficiency and class imbalance, we use multiple rounds of multi-model voting for data mining. Moreover, to improve the quality of the audio features, we apply speech source separation as a preprocessing step. Our model ranks 2nd in both MER2024-SEMI and MER2024-NOISE, validating the effectiveness of our method.

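The abstract describes a two-stage fusion scheme: audio and text are encoded jointly first (early fusion), and the resulting joint feature is combined with other unimodal features before classification (late fusion). The sketch below is not the authors' code; it is a minimal PyTorch illustration of that idea, with all module names, dimensions, and the choice of a transformer joint encoder being assumptions made here for clarity.

# Minimal sketch (not the authors' implementation) of early audio-text fusion
# followed by late fusion with another unimodal feature. All names and
# dimensions are hypothetical.
import torch
import torch.nn as nn


class EarlyAudioTextFusion(nn.Module):
    """Jointly encodes audio and text token features (early fusion)."""

    def __init__(self, audio_dim=512, text_dim=768, joint_dim=512, num_layers=2):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=joint_dim, nhead=8, batch_first=True
        )
        self.joint_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, audio_tokens, text_tokens):
        # Project both modalities into a shared space, concatenate the token
        # sequences, and encode them jointly so the two modalities interact early.
        joint = torch.cat(
            [self.audio_proj(audio_tokens), self.text_proj(text_tokens)], dim=1
        )
        joint = self.joint_encoder(joint)
        # Mean-pool over tokens to obtain a single joint audio-text feature.
        return joint.mean(dim=1)


class LateFusionClassifier(nn.Module):
    """Late-fuses the joint audio-text feature with another unimodal feature."""

    def __init__(self, joint_dim=512, visual_dim=256, num_classes=6):
        super().__init__()
        self.early = EarlyAudioTextFusion(joint_dim=joint_dim)
        self.classifier = nn.Sequential(
            nn.Linear(joint_dim + visual_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, audio_tokens, text_tokens, visual_feat):
        joint_feat = self.early(audio_tokens, text_tokens)
        # Late fusion: concatenate the joint audio-text feature with a pooled
        # unimodal (e.g. visual) feature before emotion classification.
        fused = torch.cat([joint_feat, visual_feat], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = LateFusionClassifier()
    audio = torch.randn(2, 50, 512)   # batch of audio token features
    text = torch.randn(2, 30, 768)    # batch of text token features
    visual = torch.randn(2, 256)      # pooled unimodal (visual) feature
    logits = model(audio, text, visual)
    print(logits.shape)               # torch.Size([2, 6])

In the paper's setup, the joint audio-text encoder is built on a large language model rather than the small transformer used here; the sketch only shows where the early and late fusion steps sit relative to each other.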