IndoBERTweet-HateSpeech
Model Description
IndoBERTweet fine-tuned on the IndoToxic2024 dataset. The model reaches an accuracy of 0.89 and a macro-F1 of 0.78, both measured with stratified 10-fold cross-validation.
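To make the two reported metrics concrete, here is a minimal pure-Python sketch of how accuracy and macro-F1 are computed from true and predicted labels. It is not the authors' evaluation script, and the toy labels below are made up for illustration only. Macro-F1 averages the per-class F1 scores without weighting by class size, so it can sit well below accuracy when one class is rarer and harder to predict.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of examples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is never zero
        # because every label occurs at least once in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Hypothetical toy labels (not taken from IndoToxic2024).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")  # accuracy = 0.667
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")  # macro-F1 = 0.625
```

In the toy data, class 0 has an F1 of 0.75 and class 1 an F1 of 0.50, which average to 0.625.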
Supported Tokenizer
- indolem/indobertweet-base-uncased
Example Code
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Specify the model and tokenizer name
model_name = "Exqrch/IndoBERTweet-HateSpeech"
tokenizer_name = "indolem/indobertweet-base-uncased"
# Load the pre-trained model
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
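# Tokenize a sample sentence ("good morning, everyone!")
text = "selamat pagi semua!"
inputs = tokenizer(text, return_tensors="pt")
# Run inference without tracking gradients
with torch.no_grad():
    output = model(**inputs)
logits = output.logits
# Get the predicted class label
predicted_class = torch.argmax(logits, dim=-1).item()
print(predicted_class)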
--- Output ---
> 0
--- End of Output ---
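The example above classifies a single sentence. For several texts at once, a batched variant along the following lines may be more convenient. This is a sketch under assumptions, not part of the original card: `batched` and `predict_labels` are hypothetical helper names, and the batch size of 16 and the 128-token cap are illustrative values to adjust as needed.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def predict_labels(texts, model, tokenizer, batch_size=16, max_length=128):
    """Return one predicted class index per input text.

    `model` and `tokenizer` are expected to be loaded as in the example above.
    """
    import torch  # imported here so `batched` stays usable without torch

    predictions = []
    model.eval()  # make sure dropout is disabled
    for batch in batched(texts, batch_size):
        # Pad to the longest text in the batch; truncate over-long inputs.
        # max_length=128 is an assumed cap, not a value stated in the card.
        encoded = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length,
        )
        with torch.no_grad():  # inference only, no gradients needed
            logits = model(**encoded).logits
        predictions.extend(torch.argmax(logits, dim=-1).tolist())
    return predictions
```

With `model` and `tokenizer` loaded as above, calling `predict_labels(["Kenapa sih mereka berantem terus?", "Orang gila emang elu!"], model, tokenizer)` returns a list of class indices, one per text. The card does not state what each index means, so check the label mapping before interpreting the numbers.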
Limitations
The model was trained only on Indonesian text. Its performance on code-switched text is unknown.
Sample Output
Model name: Exqrch/IndoBERTweet-HateSpeech
Text 1: Kenapa sih mereka berantem terus?
Prediction: 0
Text 2: Orang gila emang elu!
Prediction: 1
Citation
If you use this model, please cite:
@article{susanto2024indotoxic2024,
title={IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language},
author={Lucky Susanto and Musa Izzanardi Wijanarko and Prasetia Anugrah Pratama and Traci Hong and Ika Idris and Alham Fikri Aji and Derry Wijaya},
year={2024},
eprint={2406.19349},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.19349},
}