EmotionCLIP Model
Project Overview
EmotionCLIP is an open-domain multimodal emotion perception model built on CLIP. It aims to perform broad emotion recognition from inputs such as faces, scenes, and photographs, supporting the analysis of emotional attributes in images, scene layouts, and even artworks.
Datasets
The model is trained using the following datasets:
EmoSet:
- Citation:
@inproceedings{yang2023emoset,
  title={EmoSet: A Large-Scale Visual Emotion Dataset with Rich Attributes},
  author={Yang, Jingyuan and Huang, Qirui and Ding, Tingting and Lischinski, Dani and Cohen-Or, Danny and Huang, Hui},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={20383--20394},
  year={2023}
}
- This dataset contains rich emotional labels and visual features, providing a foundation for emotion perception. In this model, we use EmoSet 118K.
Open Human Facial Emotion Recognition Dataset:
- Contains nearly 10,000 emotion-labeled face images gathered from in-the-wild scenes, used to enhance the model's capability in facial emotion recognition.
Training Method
Prefix-Tuning
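The training script itself is not included in this card. As a rough illustration only, the sketch below shows the general idea of prefix-tuning with a toy stand-in text encoder: the backbone is frozen and only a short sequence of prefix embeddings, prepended to every token sequence, is trained. The class name, dimensions, prefix length, and pooling are assumptions, not the actual EmotionCLIP training code.

import torch
import torch.nn as nn

class PrefixTunedTextEncoder(nn.Module):
    """Toy prefix-tuning sketch: frozen encoder + learnable prefix embeddings."""
    def __init__(self, vocab_size=49408, embed_dim=512, prefix_len=8):
        super().__init__()
        # Stand-ins for CLIP's token embedding and text transformer (both frozen)
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # The only trainable parameters: a short sequence of prefix embeddings
        self.prefix = nn.Parameter(torch.randn(prefix_len, embed_dim) * 0.02)
        for p in self.token_embed.parameters():
            p.requires_grad = False
        for p in self.encoder.parameters():
            p.requires_grad = False

    def forward(self, input_ids):
        tok = self.token_embed(input_ids)                              # (B, L, D)
        prefix = self.prefix.unsqueeze(0).expand(tok.size(0), -1, -1)  # (B, P, D)
        x = torch.cat([prefix, tok], dim=1)                            # prepend the learnable prefix
        return self.encoder(x).mean(dim=1)                             # pooled text feature

encoder = PrefixTunedTextEncoder()
print([n for n, p in encoder.named_parameters() if p.requires_grad])   # only 'prefix' is trainable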
Fine-tuning Weights
This repository provides two fine-tuned weights:
EmotionCLIP Weights
- Fine-tuned on the EmoSet 118K dataset, without additional training specifically for facial emotion recognition.
- Final evaluation results:
- Loss: 1.5687
- Accuracy: 0.8037
- Recall: 0.8037
- F1: 0.8033
MixCLIP Weights
- Integrates the 10,000 face images and enhances the data for the neutral category, which is not included in EmoSet.
- Due to the small number of samples in this category, the model's ability to recognize it remains limited.
- Final evaluation results:
- Loss: 1.5680
- Accuracy: 0.8042
- Recall: 0.8042
- F1: 0.8057
Usage Instructions
git clone https://huggingface.co/jiangchengchengNLP/EmotionCLIP
cd EmotionCLIP
# Create your own test folder to store images ending in .jpg, or organize images from the repository for testing
# By default, the MixCLIP weights are used. Run the following Python script in the current folder.
from EmotionCLIP import model, preprocess, tokenizer
from PIL import Image
import torch
import matplotlib.pyplot as plt
import os
from torch.nn import functional as F
# Image folder path
image_folder = r'./test'
image_files = [os.path.join(image_folder, f) for f in os.listdir(image_folder) if f.endswith('.jpg')]
# Emotion label mapping
consist_json = {
    'amusement': 0,
    'anger': 1,
    'awe': 2,
    'contentment': 3,
    'disgust': 4,
    'excitement': 5,
    'fear': 6,
    'sadness': 7,
    #'neutral': 8
}
reversal_json = {v: k for k, v in consist_json.items()}
text_list = [f"This picture conveys a sense of {key}" for key in consist_json.keys()]
text_input = tokenizer(text_list)
# Create subplots
num_images = len(image_files)
cols = 3  # 3 columns
rows = (num_images + cols - 1) // cols  # enough rows to fit every image
fig, axes = plt.subplots(rows, cols, figsize=(15, 10))  # adjust the canvas size as needed
axes = axes.flatten()  # flatten the subplot grid to a 1D array
title_fontsize = 20
# Iterate through each image
for idx, img_path in enumerate(image_files):
    # Load and preprocess the image
    img = Image.open(img_path)
    img_input = preprocess(img)
    # Predict the emotion
    with torch.no_grad():
        logits_per_image, _ = model(
            img_input.unsqueeze(0).to(device=model.device, dtype=model.dtype),
            text_input.to(device=model.device),
        )
        softmax_logits_per_image = F.softmax(logits_per_image, dim=-1)
        top_k_values, top_k_indexes = torch.topk(softmax_logits_per_image, k=1, dim=-1)
    predicted_emotion = reversal_json[top_k_indexes.item()]
    # Display the image and the prediction result
    ax = axes[idx]
    ax.imshow(img)
    ax.set_title(f"Predicted: {predicted_emotion}", fontsize=title_fontsize)
    ax.axis('off')
# Hide any unused subplots
for idx in range(num_images, rows * cols):
    axes[idx].axis('off')
plt.tight_layout()
plt.show()
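For quick checks on a single image, a compact variant of the same inference loop (reusing the model, preprocess, text_input, and reversal_json defined above) might look like the following; the file path is only a placeholder.

def predict_emotion(path):
    # Preprocess one image and score it against the emotion prompts
    img = preprocess(Image.open(path)).unsqueeze(0)
    with torch.no_grad():
        logits, _ = model(
            img.to(device=model.device, dtype=model.dtype),
            text_input.to(device=model.device),
        )
        probs = F.softmax(logits, dim=-1).squeeze(0)
    return reversal_json[int(probs.argmax())], probs.max().item()

label, confidence = predict_emotion('./test/example.jpg')  # placeholder path
print(label, confidence)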
Result Display
The best evaluation results of the model are shown below:
| Metric   | EmotionCLIP | MixCLIP |
|----------|-------------|---------|
| Loss     | 1.5687      | 1.5680  |
| Accuracy | 0.8037      | 0.8042  |
| Recall   | 0.8037      | 0.8042  |
| F1       | 0.8033      | 0.8057  |
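The card does not state which averaging scheme produced the recall and F1 numbers; accuracy and recall being identical is consistent with a weighted (or micro) average over the eight classes. Below is a minimal sketch of how such metrics can be computed with scikit-learn, assuming weighted averaging and placeholder labels.

from sklearn.metrics import accuracy_score, f1_score, recall_score

# Placeholder labels for illustration -- a real evaluation would use the held-out EmoSet split.
# Weighted averaging is an assumption; the card does not specify the scheme used.
y_true = [0, 1, 2, 3, 4, 5, 6, 7, 3, 4]
y_pred = [0, 1, 2, 3, 4, 5, 6, 7, 4, 4]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Recall:  ", recall_score(y_true, y_pred, average="weighted"))
print("F1:      ", f1_score(y_true, y_pred, average="weighted"))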
Existing Issues
When recognizing fine-grained human emotions and broad emotional attributes, the model faces significant challenges. It must simultaneously capture body language and subtle facial changes while maintaining an overall perception of scenes and photographic subjects, and these two objectives can compete with each other.
Specifically, the model often misclassifies the "disgust" category as sadness or anger, partly because human expressions of disgust tend to be ambiguous.
Moreover, the dataset’s "disgust" category contains mainly non-human images, causing the model to favor global recognition, which hinders its ability to capture the subtle differences in disgust.
In this experiment, we extended the emotion recognition task to an emotion perception task, requiring the model not only to perceive changes in human emotion but also to infer the emotions evoked by the physical world. Although this goal is exciting, we found that the model's emotional judgments remain prone to hallucination, making it difficult to achieve stable, common-sense-based understanding.
Summary
We explored the broad field of emotion perception using CLIP on EmoSet and a partial facial dataset, providing two sets of fine-tuned weights (EmotionCLIP and MixCLIP). However, many challenges remain in expanding from facial emotion recognition to broad-field emotion perception, including the conflict between fine-grained emotion capture and global emotion perception, as well as data imbalance.
Base Model
openai/clip-vit-base-patch32