---
library_name: transformers
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- generated_from_trainer
- text-classification
- news-classification
- english
- modernbert
metrics:
- f1
model-index:
- name: ModernBERT-NewsClassifier-EN-small
results: []
---
# ModernBERT-NewsClassifier-EN-small
This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on an English **News Category** dataset covering 15 distinct topics (e.g., **Politics**, **Sports**, **Business**, etc.). It achieves the following results on the evaluation set:
- **Validation Loss**: `3.1201`
- **Weighted F1 Score**: `0.5475`
---
## Model Description
**Architecture**: This model is based on [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), an advanced Transformer architecture featuring Rotary Position Embeddings (RoPE), Flash Attention, and a native long context window (up to 8,192 tokens). For the classification task, a linear classification head is added on top of the BERT encoder outputs.
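As a quick sanity check, the classification head and context window can be read straight from the checkpoint's configuration. This is a minimal sketch, assuming the checkpoint ships standard `config.json` metadata:

```python
from transformers import AutoConfig

# Load only the configuration (no model weights needed).
config = AutoConfig.from_pretrained("Sengil/ModernBERT-NewsClassifier-EN-small")

print(config.num_labels)               # expected: 15, one per news category
print(config.max_position_embeddings)  # ModernBERT's native long context (8,192)
print(config.id2label)                 # class index -> category name mapping
```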
**Task**: **Multi-class News Classification**
- The model classifies English news headlines or short texts into one of 15 categories.
**Use Cases**:
- Automatically tagging news headlines with appropriate categories in editorial pipelines.
- Classifying short text blurbs for social media or aggregator systems.
- Building a quick filter for content-based recommendation engines.
---
## Intended Uses & Limitations
- **Intended for**: Users who need to categorize short English news texts into broad topics.
- **Language**: Trained primarily on **English** texts. Performance on non-English text is not guaranteed.
- **Limitations**:
  - Certain categories (e.g., `BLACK VOICES`, `QUEER VOICES`) involve nuanced language that can be misclassified when context is limited or the text is ambiguous.
  - Overall accuracy is moderate (weighted F1 ≈ 0.55), so predictions should be reviewed before use in high-stakes settings.
---
## Training and Evaluation Data
- **Dataset**: Curated from an English news-category dataset with 15 labels (e.g., `POLITICS`, `ENTERTAINMENT`, `SPORTS`, `BUSINESS`, etc.).
- **Data Size**: ~30,000 samples in total, balanced at 2,000 samples per category.
- **Split**: 90% training (27,000 samples) and 10% testing (3,000 samples); a reproduction sketch follows the category list below.
### Categories
1. POLITICS
2. WELLNESS
3. ENTERTAINMENT
4. TRAVEL
5. STYLE & BEAUTY
6. PARENTING
7. HEALTHY LIVING
8. QUEER VOICES
9. FOOD & DRINK
10. BUSINESS
11. COMEDY
12. SPORTS
13. BLACK VOICES
14. HOME & LIVING
15. PARENTS
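The exact preprocessing script is not published with this card. The sketch below is a hypothetical reconstruction of the balanced split described above, assuming the source data is available as JSON-lines with `text` and `category` columns (the file path is a placeholder):

```python
import pandas as pd
from datasets import Dataset

# Placeholder path: the original source file is not specified in this card.
df = pd.read_json("news_category.jsonl", lines=True)

CATEGORIES = [
    "POLITICS", "WELLNESS", "ENTERTAINMENT", "TRAVEL", "STYLE & BEAUTY",
    "PARENTING", "HEALTHY LIVING", "QUEER VOICES", "FOOD & DRINK", "BUSINESS",
    "COMEDY", "SPORTS", "BLACK VOICES", "HOME & LIVING", "PARENTS",
]

# 2,000 samples per category -> ~30,000 rows total, as described above.
balanced = (
    df[df["category"].isin(CATEGORIES)]
    .groupby("category")
    .sample(n=2000, random_state=42)
)

# 90/10 train/test split (27,000 / 3,000 samples).
ds = Dataset.from_pandas(balanced, preserve_index=False)
splits = ds.train_test_split(test_size=0.1, seed=42)
```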
---
## Training Procedure
### Hyperparameters
| Hyperparameter | Value |
|------------------------------:|:-----------------------|
| **learning_rate** | 5e-05 |
| **train_batch_size** | 8 |
| **eval_batch_size** | 4 |
| **seed** | 42 |
| **gradient_accumulation_steps** | 2 |
| **total_train_batch_size** | 16 (8 x 2) |
| **optimizer** | `adamw_torch_fused` (betas=(0.9,0.999), epsilon=1e-08) |
| **lr_scheduler_type** | linear |
| **lr_scheduler_warmup_steps**| 100 |
| **num_epochs** | 5 |
**Optimizer**: `AdamW` with fused kernels (`adamw_torch_fused`) for efficiency.
**Loss Function**: Cross-entropy, with weighted F1 as the evaluation metric.
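The full training script is not included; the following is a minimal sketch of `TrainingArguments` consistent with the table above (model and dataset setup omitted, and `output_dir` is an arbitrary name):

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import TrainingArguments

def compute_metrics(eval_pred):
    # Weighted F1, matching the metric reported in this card.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="weighted")}

args = TrainingArguments(
    output_dir="modernbert-news-classifier",  # arbitrary
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,  # effective train batch size 16 (8 x 2)
    num_train_epochs=5,
    lr_scheduler_type="linear",
    warmup_steps=100,
    optim="adamw_torch_fused",      # fused AdamW; betas/epsilon at defaults
    seed=42,
    eval_strategy="epoch",
    metric_for_best_model="f1",
)

# `args` and `compute_metrics` would then be passed to transformers.Trainer
# together with the model, tokenizer, and the train/test splits.
```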
---
## Training Results
| Training Loss | Epoch | Step | Validation Loss | F1 (Weighted) |
|:-------------:|:------:|:----:|:---------------:|:-------------:|
| 2.6251 | 1.0 | 1688 | 1.3810 | 0.5543 |
| 1.9267 | 2.0 | 3376 | 1.4378 | 0.5588 |
| 0.6349 | 3.0 | 5064 | 2.1705 | 0.5415 |
| 0.1273 | 4.0 | 6752 | 2.9007 | 0.5402 |
| 0.0288 | 4.9973 | 8435 | 3.1201 | 0.5475 |
- The best **weighted F1 (~0.56)** is reached at epoch 2; validation loss rises in later epochs (a sign of overfitting) while F1 settles around **0.55**.
---
## Inference Example
Below are two ways to use this model: via a **pipeline** and by using the **model & tokenizer** directly.
### 1) Quick Start with `pipeline`
```python
from transformers import pipeline

# Instantiate the text-classification pipeline
classifier = pipeline(
    "text-classification",
    model="Sengil/ModernBERT-NewsClassifier-EN-small",
)

# Sample text
text = "The President pledges new infrastructure initiatives amid economic concerns."

# By default the pipeline returns only the top label;
# pass top_k=None to get scores for all 15 categories.
outputs = classifier(text)
print(outputs)  # e.g. [{'label': 'POLITICS', 'score': 0.95}]
```
### 2) Direct Model Usage
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "Sengil/ModernBERT-NewsClassifier-EN-small"

# Load model & tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

sample_text = "Local authorities call for better healthcare policies."
inputs = tokenizer(sample_text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities
probs = F.softmax(logits, dim=-1)[0]
predicted_label_id = torch.argmax(probs).item()

# Map the class index back to its label string
predicted_label = model.config.id2label[predicted_label_id]
confidence_score = probs[predicted_label_id].item()

print(f"Predicted Label: {predicted_label} | Score: {confidence_score:.4f}")
```
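The example above truncates inputs at 512 tokens, which is ample for headlines and short blurbs; since ModernBERT natively supports contexts up to 8,192 tokens, `max_length` can be raised when classifying full-length articles.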
---
## Additional Information
- **Framework Versions**:
- **Transformers**: 4.49.0.dev0
- **PyTorch**: 2.5.1+cu121
- **Datasets**: 3.2.0
- **Tokenizers**: 0.21.0
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Intellectual Property**: The original ModernBERT base model is provided by [answerdotai](https://huggingface.co/answerdotai). This fine-tuned checkpoint inherits the same license.
---
**Citation**: If you use or extend this model in your research or applications, please consider citing it:
```bibtex
@misc{ModernBERTNewsClassifierENsmall,
title={ModernBERT-NewsClassifier-EN-small},
author={Mert Sengil},
year={2025},
howpublished={\url{https://huggingface.co/Sengil/ModernBERT-NewsClassifier-EN-small}},
}
```