---
library_name: transformers
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- generated_from_trainer
- text-classification
- news-classification
- english
- modernbert
metrics:
- f1
model-index:
- name: ModernBERT-NewsClassifier-EN-small
  results: []
---

# ModernBERT-NewsClassifier-EN-small

This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on an English **News Category** dataset covering 15 distinct topics (e.g., **Politics**, **Sports**, **Business**). It achieves the following results on the evaluation set:

- **Validation Loss**: `3.1201`
- **Weighted F1 Score**: `0.5475`

---

## Model Description

**Architecture**: This model is based on [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), an advanced Transformer architecture featuring Rotary Position Embeddings (RoPE), Flash Attention, and a native long context window of up to 8,192 tokens. For the classification task, a linear classification head is added on top of the encoder outputs.
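
The label mapping and context window can be read straight from the checkpoint's configuration; a minimal sketch using the standard `transformers` config interface (the attribute names are the usual ones, not specific to this card):

```python
from transformers import AutoConfig

# Inspect the classification setup shipped with the checkpoint
config = AutoConfig.from_pretrained("Sengil/ModernBERT-NewsClassifier-EN-small")

print(config.num_labels)               # 15 news categories
print(config.id2label)                 # mapping from class id to category name
print(config.max_position_embeddings)  # context window inherited from ModernBERT-base
```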

**Task**: **Multi-class News Classification**
- The model classifies English news headlines or short texts into one of 15 categories.

**Use Cases**:
- Automatically tagging news headlines with appropriate categories in editorial pipelines.
- Classifying short text blurbs for social media or aggregator systems.
- Building a quick filter for content-based recommendation engines.

---

## Intended Uses & Limitations

- **Intended for**: Users who need to categorize short English news texts into broad topics.
- **Language**: Trained primarily on **English** texts. Performance on non-English text is not guaranteed.
- **Limitations**:
  - Certain categories (e.g., `BLACK VOICES`, `QUEER VOICES`) contain nuanced language and may be misclassified when the input is short, ambiguous, or lacking context.

---

## Training and Evaluation Data

- **Dataset**: Curated from an English news-category dataset with 15 labels (e.g., `POLITICS`, `ENTERTAINMENT`, `SPORTS`, `BUSINESS`).
- **Data Size**: ~30,000 samples in total, balanced at 2,000 samples per category.
- **Split**: 90% training (27,000 samples) and 10% testing (3,000 samples); a sketch of how such a balanced split can be built follows the category list below.

### Categories

1. POLITICS
2. WELLNESS
3. ENTERTAINMENT
4. TRAVEL
5. STYLE & BEAUTY
6. PARENTING
7. HEALTHY LIVING
8. QUEER VOICES
9. FOOD & DRINK
10. BUSINESS
11. COMEDY
12. SPORTS
13. BLACK VOICES
14. HOME & LIVING
15. PARENTS
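
A minimal sketch of how such a balanced, 2,000-samples-per-category subset and 90/10 split can be rebuilt with the `datasets` library; the dataset identifier and the `category` column name below are placeholders/assumptions, not the exact source used for this checkpoint:

```python
from datasets import load_dataset, concatenate_datasets

# NOTE: placeholder id; substitute an English News Category dataset with a `category` column
raw = load_dataset("<english-news-category-dataset>", split="train")

categories = [
    "POLITICS", "WELLNESS", "ENTERTAINMENT", "TRAVEL", "STYLE & BEAUTY",
    "PARENTING", "HEALTHY LIVING", "QUEER VOICES", "FOOD & DRINK", "BUSINESS",
    "COMEDY", "SPORTS", "BLACK VOICES", "HOME & LIVING", "PARENTS",
]

# Take 2,000 shuffled samples from each of the 15 categories (~30,000 total)
per_category = [
    raw.filter(lambda ex, c=cat: ex["category"] == c).shuffle(seed=42).select(range(2000))
    for cat in categories
]
balanced = concatenate_datasets(per_category).shuffle(seed=42)

# 90% train / 10% test, matching the split described above
split = balanced.train_test_split(test_size=0.1, seed=42)
print(split["train"].num_rows, split["test"].num_rows)  # 27,000 / 3,000
```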

---

## Training Procedure

### Hyperparameters

| Hyperparameter | Value |
|------------------------------:|:-----------------------|
| **learning_rate** | 5e-05 |
| **train_batch_size** | 8 |
| **eval_batch_size** | 4 |
| **seed** | 42 |
| **gradient_accumulation_steps** | 2 |
| **total_train_batch_size** | 16 (8 x 2) |
| **optimizer** | `adamw_torch_fused` (betas=(0.9, 0.999), epsilon=1e-08) |
| **lr_scheduler_type** | linear |
| **lr_scheduler_warmup_steps** | 100 |
| **num_epochs** | 5 |

**Optimizer**: `AdamW` with fused kernels (`adamw_torch_fused`) for efficiency.

**Loss Function**: Cross-entropy; weighted F1 is used as the evaluation metric.
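
For orientation, a hedged sketch of `TrainingArguments` that mirrors the table above; the output directory, evaluation schedule, and metric wiring are illustrative assumptions, not a record of the original training script:

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import TrainingArguments

# Mirrors the hyperparameter table above; paths and schedules are illustrative.
training_args = TrainingArguments(
    output_dir="modernbert-news-classifier",   # assumption
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,             # effective batch size 16
    num_train_epochs=5,
    lr_scheduler_type="linear",
    warmup_steps=100,
    optim="adamw_torch_fused",
    seed=42,
    eval_strategy="epoch",                     # assumption: evaluate once per epoch
)

def compute_metrics(eval_pred):
    """Weighted F1, the metric reported in this card."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="weighted")}
```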

---

## Training Results

| Training Loss | Epoch | Step | Validation Loss | F1 (Weighted) |
|:-------------:|:------:|:----:|:---------------:|:-------------:|
| 2.6251 | 1.0 | 1688 | 1.3810 | 0.5543 |
| 1.9267 | 2.0 | 3376 | 1.4378 | 0.5588 |
| 0.6349 | 3.0 | 5064 | 2.1705 | 0.5415 |
| 0.1273 | 4.0 | 6752 | 2.9007 | 0.5402 |
| 0.0288 | 4.9973 | 8435 | 3.1201 | 0.5475 |

- **Best Weighted F1**: **0.5588** at epoch 2; the final checkpoint reaches **0.5475**, while validation loss increases in later epochs, indicating overfitting to the training data.

---

## Inference Example

Below are two ways to use this model: via a **pipeline** and by using the **model & tokenizer** directly.

### 1) Quick Start with `pipeline`

```python
from transformers import pipeline

# Instantiate the text-classification pipeline with this checkpoint
classifier = pipeline(
    "text-classification",
    model="Sengil/ModernBERT-NewsClassifier-EN-small"
)

# Sample text
text = "The President pledges new infrastructure initiatives amid economic concerns."
outputs = classifier(text)

# Example output: [{'label': 'POLITICS', 'score': 0.95}]
print(outputs)
```
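
Continuing with the `classifier` created above, the pipeline also accepts a list of texts and a `top_k` argument, which is convenient for batch tagging; the sample headlines below are illustrative:

```python
headlines = [
    "Stocks rally as tech earnings beat expectations",
    "Five easy weeknight dinners for busy parents",
]

# Score a batch of headlines and keep the three most likely categories for each
results = classifier(headlines, top_k=3)
for headline, candidates in zip(headlines, results):
    print(headline)
    for candidate in candidates:
        print(f"  {candidate['label']}: {candidate['score']:.3f}")
```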

### 2) Direct Model Usage

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "Sengil/ModernBERT-NewsClassifier-EN-small"

# Load model & tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

sample_text = "Local authorities call for better healthcare policies."
inputs = tokenizer(sample_text, return_tensors="pt", truncation=True, max_length=512)

# Forward pass without gradient tracking
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities
probs = F.softmax(logits, dim=1)[0]
predicted_label_id = torch.argmax(probs).item()

# Map the predicted id back to its label string
id2label = model.config.id2label
predicted_label = id2label[predicted_label_id]
confidence_score = probs[predicted_label_id].item()

print(f"Predicted Label: {predicted_label} | Score: {confidence_score:.4f}")
```
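
Because the head covers all 15 categories, the same forward pass can also give a ranked view of the full distribution; a short continuation of the snippet above:

```python
# Continuing from the variables above: rank all 15 categories by probability
ranked = sorted(
    ((id2label[i], p.item()) for i, p in enumerate(probs)),
    key=lambda pair: pair[1],
    reverse=True,
)
for label, score in ranked:
    print(f"{label:>15s}  {score:.4f}")
```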

---

## Additional Information

- **Framework Versions**:
  - **Transformers**: 4.49.0.dev0
  - **PyTorch**: 2.5.1+cu121
  - **Datasets**: 3.2.0
  - **Tokenizers**: 0.21.0
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Intellectual Property**: The original ModernBERT base model is provided by [answerdotai](https://huggingface.co/answerdotai). This fine-tuned checkpoint inherits the same license.

---

**Citation**: If you use or extend this model in your research or applications, please consider citing it:

```
@misc{ModernBERTNewsClassifierENsmall,
  title={ModernBERT-NewsClassifier-EN-small},
  author={Mert Sengil},
  year={2025},
  howpublished={\url{https://huggingface.co/Sengil/ModernBERT-NewsClassifier-EN-small}},
}
```