TheRealM4rtin committed · verified
Commit a6f967e · 1 Parent(s): 16f1b8e

Update README.md

Files changed (1): README.md (+152 −3)
---
language: en
tags:
- toxicity
- text-classification
- roberta
- jigsaw
license: mit
datasets:
- jigsaw-toxic-comment-classification-challenge
base_model:
- FacebookAI/roberta-base
---

# Model Card for RoBERTa Toxicity Classifier

This model is a fine-tuned version of RoBERTa-base for toxicity classification, capable of identifying six different types of toxic content in text.

## Model Details

### Model Description

This model is a fine-tuned version of RoBERTa-base, trained to identify toxic content across multiple categories. It was developed to help identify and moderate harmful content in text data.

- **Developed by:** Bonnavaud Laura, Cousseau Martin, Laborde Stanislas, Rady Othmane, Satouri Amani
- **Model type:** RoBERTa-based text classification
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** FacebookAI/roberta-base

## Uses

### Direct Use

The model can be used directly for:
- Content moderation
- Toxic comment detection
- Online safety monitoring
- Comment filtering systems

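A minimal inference sketch, assuming the `transformers` library is installed; the checkpoint argument is a placeholder for this repository's model ID, and the label order is assumed to follow the Jigsaw categories listed below:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed label order (Jigsaw categories); check the model config's id2label.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def load_model(checkpoint: str):
    """checkpoint: local path or Hub ID for this model (placeholder here)."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    model.eval()
    return tokenizer, model

def scores_from_logits(logits: torch.Tensor) -> dict[str, float]:
    # Multi-label: an independent sigmoid per label, not a softmax over labels.
    probs = torch.sigmoid(logits).squeeze(0)
    return {label: p.item() for label, p in zip(LABELS, probs)}

def classify(text: str, tokenizer, model) -> dict[str, float]:
    """Return one probability per toxicity category for a single text."""
    inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return scores_from_logits(logits)
```

Because this is a multi-label classifier, the six scores are independent and do not sum to 1; a comment can trigger several categories at once.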
### Out-of-Scope Use

This model should not be used for:
- Legal decision making
- Automated content removal without human review
- Processing non-English content
- Making definitive judgments about individuals or groups

## Bias, Risks, and Limitations

- The model may reflect biases present in the training data
- Performance may vary across different demographics and contexts
- False positives/negatives can occur and should be considered in deployment
- Not suitable for high-stakes decisions without human oversight

### Recommendations

Users should:
- Implement human review processes alongside model predictions
- Monitor model performance across different demographic groups
- Use confidence thresholds appropriate for their use case
- Be transparent about the use of automated toxicity detection

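As a sketch of the thresholding recommendation: the per-label cutoffs below are hypothetical values, not ones shipped with the model — tune them on your own validation data for the precision/recall trade-off your use case needs.

```python
# Hypothetical per-label decision thresholds; 0.5 is rarely optimal for every
# label, especially the rarer ones (threat, identity_hate).
THRESHOLDS = {
    "toxic": 0.50,
    "severe_toxic": 0.40,
    "obscene": 0.50,
    "threat": 0.30,
    "insult": 0.50,
    "identity_hate": 0.35,
}

def flag_labels(scores: dict[str, float]) -> list[str]:
    """Return the labels whose score clears its threshold.

    An empty list means the comment passes; a non-empty list should be routed
    to human review rather than triggering automated removal.
    """
    return [label for label, score in scores.items() if score >= THRESHOLDS[label]]
```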
## Training Details

### Training Data

The model was trained on the Jigsaw Toxic Comment Classification Challenge dataset, which includes comments labeled for toxic content across six categories:
- Toxic
- Severe Toxic
- Obscene
- Threat
- Insult
- Identity Hate

The dataset was split into training and testing sets with a 90-10 ratio, using stratified sampling based on the sum of toxic labels to ensure a balanced distribution. Empty comments were filled with empty strings, and all texts were cleaned and tokenized in batches of 48 samples.

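The split described above can be sketched as follows; this assumes pandas and scikit-learn, the column names of the Jigsaw CSVs, and an arbitrary random seed (not taken from the original training code):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

LABEL_COLS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def split_jigsaw(df: pd.DataFrame, seed: int = 42):
    """90-10 train/test split, stratified on the per-row sum of toxic labels."""
    df = df.copy()
    # Empty comments are filled with empty strings, as described above.
    df["comment_text"] = df["comment_text"].fillna("")
    strata = df[LABEL_COLS].sum(axis=1)  # 0..6 toxic labels per comment
    return train_test_split(df, test_size=0.1, stratify=strata, random_state=seed)
```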
### Training Procedure

#### Training Hyperparameters

- **Training regime:** FP16 mixed precision
- **Optimizer:** AdamW
- **Learning rate:** 2e-5
- **Batch size:** 320
- **Epochs:** Up to 40 with early stopping (patience=15)
- **Max sequence length:** 128
- **Warmup ratio:** 0.1
- **Weight decay:** 0.1
- **Gradient accumulation steps:** 2
- **Scheduler:** Linear
- **DataLoader workers:** 2

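One way these hyperparameters could map onto Hugging Face `TrainingArguments` (a sketch, not the original training script; argument names follow recent `transformers` versions, the output path is a placeholder, and the per-device batch size is an assumption about how the global batch of 320 decomposes across the 4 GPUs with accumulation):

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="out",                # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=40,  # assumed: 40 x 4 GPUs x 2 accumulation = 320 global
    gradient_accumulation_steps=2,
    num_train_epochs=40,
    warmup_ratio=0.1,
    weight_decay=0.1,
    lr_scheduler_type="linear",
    fp16=True,                       # mixed-precision training
    dataloader_num_workers=2,
    dataloader_pin_memory=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",      # best-model selection on F1 (see Evaluation)
)
# Early stopping with patience 15, passed to the Trainer's callbacks.
early_stopping = EarlyStoppingCallback(early_stopping_patience=15)
```

`metric_for_best_model="f1"` assumes the `compute_metrics` function reports a key named `f1`.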
### Evaluation

#### Testing Data, Factors & Metrics

The model was evaluated on a held-out test set from the Jigsaw dataset.

#### Metrics

The model was evaluated using comprehensive metrics for multi-label classification.

Per-class metrics:
- Accuracy
- Precision
- Recall
- F1 Score

Aggregate metrics:
- Overall accuracy
- Macro-averaged metrics:
  - Macro Precision
  - Macro Recall
  - Macro F1
- Micro-averaged metrics:
  - Micro Precision
  - Micro Recall
  - Micro F1

Best model selection was based on F1 score during training.

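The aggregate metrics can be computed with scikit-learn as sketched below. Note one assumption: `accuracy_score` on multi-label arrays gives exact-match (subset) accuracy; if the card's "overall accuracy" instead means per-label accuracy, average each column separately.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def multilabel_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict[str, float]:
    """Aggregate metrics for binary label matrices of shape (n_samples, 6)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),  # exact-match accuracy
        "macro_precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "macro_recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "micro_precision": precision_score(y_true, y_pred, average="micro", zero_division=0),
        "micro_recall": recall_score(y_true, y_pred, average="micro", zero_division=0),
        "micro_f1": f1_score(y_true, y_pred, average="micro", zero_division=0),
    }
```

Macro averaging weights all six labels equally (so rare labels like "threat" count as much as "toxic"), while micro averaging pools all label decisions, so frequent labels dominate.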
## Environmental Impact

- **Hardware Type:** 4x NVIDIA A10 24GB
- **Training time:** 20 minutes
- **Cloud Provider:** ESIEA Cluster

## Technical Specifications

### Model Architecture and Technical Details

- Base model: RoBERTa-base
- Problem type: Multi-label classification
- Number of labels: 6
- Output layer: Linear classification head for multi-label prediction
- Number of parameters: ~125M
- Training optimizations:
  - Distributed Data Parallel (DDP) support with the NCCL backend
  - FP16 mixed precision training
  - Memory optimizations:
    - Gradient accumulation (steps=2)
    - DataLoader pinned memory
    - Efficient batch processing
  - Caching system for tokenized data to improve training efficiency

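The head described above corresponds to the standard `transformers` sequence-classification setup; a sketch of how it would be constructed from the base checkpoint (the label index order is an assumption following the Jigsaw categories):

```python
from transformers import AutoModelForSequenceClassification

ID2LABEL = {0: "toxic", 1: "severe_toxic", 2: "obscene",
            3: "threat", 4: "insult", 5: "identity_hate"}
LABEL2ID = {v: k for k, v in ID2LABEL.items()}

def build_classifier(base: str = "FacebookAI/roberta-base"):
    """Re-create the described architecture from the base checkpoint (downloads it)."""
    return AutoModelForSequenceClassification.from_pretrained(
        base,
        num_labels=6,
        # problem_type selects BCEWithLogitsLoss, i.e. independent per-label sigmoids.
        problem_type="multi_label_classification",
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
```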
### Hardware Requirements

Minimum requirements for inference:
- RAM: 4GB
- CPU: Modern processor supporting AVX instructions
- GPU: Optional, but recommended for batch processing