	---
	license: mit
	language:
	- en
	base_model:
	- google-bert/bert-base-uncased
	- microsoft/resnet-34
	tags:
	- Social Bias
	- Fairness
	- Fake News
	metrics:
	- f1(0.698616087436676)
	- precision(0.6369158625602722)
	- recall(0.7735527753829956)
	- accuracy(0.6247606873512268)
	datasets:
	- vector-institute/newsmediabias-plus
	library_name: transformers
	co2_eq_emissions:
	  emissions: 8
	  source: Code Carbon
	  training_type: fine-tuning
	  geographical_location: Albany, New York
	  hardware_used: T4
	---

	# Multimodal Bias Classifier

	This model is a multimodal classifier that combines text and image inputs to detect potential bias in content. It uses a BERT-based text encoder and a ResNet-34 image encoder, and fuses their projected features for classification. During training, a contrastive objective used CLIP embeddings as guidance to align the text and image representations.

	## Resources

	- This model is based on [FND-CLIP](https://arxiv.org/pdf/2205.14304), proposed by Zhou et al. 2022.
	- It was trained on the [News Media Bias Plus dataset](https://huggingface.co/datasets/vector-institute/newsmediabias-plus); see the official [dataset docs](https://vectorinstitute.github.io/Newsmediabias-plus/).
	- [Model training ipynb](https://github.com/VectorInstitute/news-media-bias-plus/blob/main/benchmarking/multi-modal-classifiers/baselines-and-notebooks/training-notebooks/slm/fnd-clip-bias-training.ipynb)

	## Model Details

	- Text Encoder: BERT (`bert-base-uncased`)
	- Image Encoder: ResNet-34 (`microsoft/resnet-34`)
	- Projection Dimensionality: 768
	- Fusion Method: Concatenation (default), Alignment, or Cosine Similarity
	- Loss Functions: Binary Cross-Entropy for classification, Cosine Embedding Loss for contrastive learning
	- Purpose: Detecting bias in multimodal content (text + image)

	## Training

	The model was trained on a multimodal dataset with instances labeled as biased or unbiased. Training combined a classification loss with a contrastive loss that aligns the text and image representations in a shared latent space.

	### Training Losses
	- Classification Loss: Binary Cross-Entropy (BCEWithLogitsLoss) to classify content as biased or unbiased.
	- Contrastive Loss: CosineEmbeddingLoss, which uses CLIP text and image embeddings as ground truth guidance to align text and image features.
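
The two losses listed above could be combined roughly as follows. This is a minimal sketch, not the training notebook's code: the function name `combined_loss`, the `contrastive_weight` parameter, and the assumption that the CLIP embeddings already have the same dimensionality as the projected features are all illustrative.

```python
import torch
from torch import nn


def combined_loss(
    logits: torch.Tensor,          # (batch, 1) raw classifier outputs
    labels: torch.Tensor,          # (batch,) floats, 1.0 = biased, 0.0 = unbiased
    text_features: torch.Tensor,   # (batch, dim) projected text features
    image_features: torch.Tensor,  # (batch, dim) projected image features
    clip_text: torch.Tensor,       # (batch, dim) frozen CLIP text embeddings (assumed same dim)
    clip_image: torch.Tensor,      # (batch, dim) frozen CLIP image embeddings (assumed same dim)
    contrastive_weight: float = 1.0,  # hypothetical weighting between the two terms
) -> torch.Tensor:
    bce = nn.BCEWithLogitsLoss()
    cosine = nn.CosineEmbeddingLoss()

    # Classification term: binary cross-entropy on the raw logits.
    classification = bce(logits.squeeze(-1), labels)

    # Alignment term: a target of +1 asks each feature vector to point in the
    # same direction as its CLIP counterpart (loss = 1 - cosine similarity).
    target = torch.ones(text_features.size(0), device=text_features.device)
    alignment = cosine(text_features, clip_text, target) + cosine(
        image_features, clip_image, target
    )

    return classification + contrastive_weight * alignment


# Toy usage with random tensors standing in for real features.
torch.manual_seed(0)
text = torch.randn(4, 8)
image = torch.randn(4, 8)
loss = combined_loss(
    logits=torch.zeros(4, 1),
    labels=torch.tensor([1.0, 0.0, 1.0, 0.0]),
    text_features=text,
    image_features=image,
    clip_text=text,    # identical to the features, so the alignment term is ~0
    clip_image=image,
)
```

In the toy call the features equal their guidance embeddings, so only the classification term contributes to the loss.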

	### Excluding CLIP
	While the CLIP model was used during training to guide the alignment of the image and text embeddings, the final model does not retain CLIP weights, as it is designed to function independently once training is complete.
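
One way to end up with a CLIP-free checkpoint is to save only the classifier's parameters rather than those of the whole training setup. The sketch below illustrates the idea with toy modules; `TrainingWrapper` and its `guide` attribute are hypothetical stand-ins and are not taken from the training notebook.

```python
import torch
from torch import nn


class TrainingWrapper(nn.Module):
    """Hypothetical training-time container: a classifier plus a frozen guide model."""

    def __init__(self, classifier: nn.Module, guide: nn.Module) -> None:
        super().__init__()
        self.classifier = classifier
        self.guide = guide  # stands in for CLIP in this sketch
        for param in self.guide.parameters():
            param.requires_grad = False  # the guide is never updated


wrapper = TrainingWrapper(classifier=nn.Linear(8, 1), guide=nn.Linear(8, 8))

# The full state dict contains the guide's weights...
full_state = wrapper.state_dict()

# ...while the classifier's own state dict does not, so saving this one
# yields a checkpoint that works without the guide.
inference_state = wrapper.classifier.state_dict()
```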

	## How to Load the Model

	You can load this model for bias classification by following the code below. The model accepts text input and an image input, processing them through BERT and ResNet-34 encoders, respectively. The final prediction indicates whether the content is likely biased or unbiased.

	```python
	import torch
	from torch import nn
	from transformers import AutoModel
	from huggingface_hub import hf_hub_download
	from typing import Literal
	import json

	class MultimodalClassifier(nn.Module):
	    def __init__(
	        self,
	        text_encoder_id_or_path: str,
	        image_encoder_id_or_path: str,
	        projection_dim: int,
	        fusion_method: Literal["concat", "align", "cosine_similarity"] = "concat",
	        proj_dropout: float = 0.1,
	        fusion_dropout: float = 0.1,
	        num_classes: int = 1,
	    ) -> None:
	        super().__init__()

	        self.fusion_method = fusion_method
	        self.projection_dim = projection_dim
	        self.num_classes = num_classes

	        ##### Text Encoder
	        self.text_encoder = AutoModel.from_pretrained(text_encoder_id_or_path)
	        self.text_projection = nn.Sequential(
	            nn.Linear(self.text_encoder.config.hidden_size, self.projection_dim),
	            nn.Dropout(proj_dropout),
	        )

	        ##### Image Encoder (ResNet-34 loaded through AutoModel)
	        self.image_encoder = AutoModel.from_pretrained(image_encoder_id_or_path, trust_remote_code=True)
	        self.image_encoder.classifier = nn.Identity()  # remove the classification head
	        self.image_projection = nn.Sequential(
	            nn.Linear(512, self.projection_dim),  # 512 = channels in ResNet-34's final stage
	            nn.Dropout(proj_dropout),
	        )

	        ##### Fusion Layer
	        fusion_input_dim = self.projection_dim * 2 if fusion_method == "concat" else self.projection_dim
	        self.fusion_layer = nn.Sequential(
	            nn.Dropout(fusion_dropout),
	            nn.Linear(fusion_input_dim, self.projection_dim),
	            nn.GELU(),
	            nn.Dropout(fusion_dropout),
	        )

	        ##### Classification Layer
	        self.classifier = nn.Linear(self.projection_dim, self.num_classes)

	    def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
	        ##### Text Encoder Projection #####
	        full_text_features = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask, return_dict=True).last_hidden_state
	        full_text_features = full_text_features[:, 0, :]  # use the [CLS] token
	        full_text_features = self.text_projection(full_text_features)

	        ##### Image Encoder Projection #####
	        resnet_image_features = self.image_encoder(pixel_values=pixel_values).last_hidden_state

	        # global average pooling of the ResNet features over H and W
	        # (the author flags this as a possible source of dimension problems)
	        resnet_image_features = resnet_image_features.mean(dim=[-2, -1])
	        resnet_image_features = self.image_projection(resnet_image_features)

	        ##### Fusion and Classification #####
	        if self.fusion_method == "concat":
	            fused_features = torch.cat([full_text_features, resnet_image_features], dim=-1)
	        else:
	            # element-wise product; the author notes this path may not work yet (should be a dot product)
	            fused_features = full_text_features * resnet_image_features

	        # fusion and classifier layers
	        fused_features = self.fusion_layer(fused_features)
	        classification_output = self.classifier(fused_features)

	        return classification_output

	def load_model():
	    config_path = hf_hub_download(repo_id="maximuspowers/multimodal-bias-classifier", filename="config.json")
	    with open(config_path, "r") as f:
	        config = json.load(f)

	    model = MultimodalClassifier(
	        text_encoder_id_or_path=config["text_encoder_id_or_path"],
	        image_encoder_id_or_path="microsoft/resnet-34",
	        projection_dim=config["projection_dim"],
	        fusion_method=config["fusion_method"],
	        proj_dropout=config["proj_dropout"],
	        fusion_dropout=config["fusion_dropout"],
	        num_classes=config["num_classes"],
	    )

	    model_weights_path = hf_hub_download(repo_id="maximuspowers/multimodal-bias-classifier", filename="model_weights.pth")
	    checkpoint = torch.load(model_weights_path, map_location=torch.device('cpu'))
	    model.load_state_dict(checkpoint, strict=False)

	    return model
	```
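
Because the loader calls `load_state_dict` with `strict=False`, any parameter names that do not match are skipped silently. The sketch below, which uses a toy model rather than the real checkpoint, shows how the return value of `load_state_dict` can be inspected to see what was skipped; the helper name `report_state_dict_mismatches` is hypothetical.

```python
from typing import Dict, List, Tuple

import torch
from torch import nn


def report_state_dict_mismatches(
    model: nn.Module, state_dict: Dict[str, torch.Tensor]
) -> Tuple[List[str], List[str]]:
    """Load non-strictly and return (missing_keys, unexpected_keys)."""
    result = model.load_state_dict(state_dict, strict=False)
    return list(result.missing_keys), list(result.unexpected_keys)


# Toy model and a deliberately incomplete checkpoint.
toy_model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 1))
partial_checkpoint = {
    "0.weight": torch.zeros(4, 4),
    "0.bias": torch.zeros(4),
    "extra.weight": torch.zeros(1),  # not a parameter of toy_model
}

missing, unexpected = report_state_dict_mismatches(toy_model, partial_checkpoint)
print("missing:", missing)
print("unexpected:", unexpected)
```

An empty `missing` list after loading the real checkpoint would indicate that every parameter of the model was restored.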

	## How to Run the Model

	```python
	import torch
	from transformers import AutoTokenizer
	from PIL import Image
	from torchvision import transforms

	# load_model() is defined in the previous section
	model = load_model()
	model.eval()

	# text input
	text_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
	sample_text = "This is a sample sentence for bias classification."
	text_inputs = text_tokenizer(
	    sample_text,
	    return_tensors="pt",
	    padding="max_length",
	    truncation=True,
	    max_length=512,
	)

	# image input
	image = Image.open("./random_image.jpg").convert("RGB")
	image_transform = transforms.Compose([
	    transforms.Resize((224, 224)),
	    transforms.ToTensor(),
	    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
	])
	image_input = image_transform(image).unsqueeze(0)  # add batch dim

	# run
	with torch.no_grad():
	    classification_output = model(
	        pixel_values=image_input,
	        input_ids=text_inputs["input_ids"],
	        attention_mask=text_inputs["attention_mask"],
	    )
	    # the model returns a single logit; sigmoid maps it to a probability of "Biased"
	    predicted_class = torch.sigmoid(classification_output).round().item()
	    print("Predicted class:", "Biased" if predicted_class == 1 else "Unbiased")
	```
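
The snippet above rounds the sigmoid output to a hard label. When a confidence score or a different decision threshold is useful, the logits can be post-processed as in the sketch below. The helper name `logits_to_predictions` and the `threshold` parameter are illustrative assumptions, and the example runs on made-up logits rather than real model output.

```python
from typing import List, Tuple

import torch


def logits_to_predictions(
    logits: torch.Tensor, threshold: float = 0.5
) -> Tuple[torch.Tensor, List[str]]:
    """Convert (batch, 1) logits to probabilities and string labels."""
    probs = torch.sigmoid(logits).squeeze(-1)  # (batch,)
    labels = ["Biased" if p > threshold else "Unbiased" for p in probs.tolist()]
    return probs, labels


# Made-up logits for a batch of two examples.
example_logits = torch.tensor([[2.0], [-1.0]])
probs, labels = logits_to_predictions(example_logits)
print(labels)
```

Raising `threshold` trades recall for precision, which may be worth exploring given the reported metrics.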