File size: 2,252 Bytes
587e1d5
 
224e5f2
587e1d5
 
224e5f2
587e1d5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
# tweet-topic-21-single

This is a RoBERTa-base model trained on ~124M tweets from January 2018 to December 2021 (see [here](https://huggingface.co/cardiffnlp/twitter-roberta-base-2021-124m)), and finetuned for single-label topic classification on a corpus of 6,997  [tweets](https://huggingface.co/datasets/cardiffnlp/tweet_topic_single).
The original roBERTa-base model can be found [here](https://huggingface.co/cardiffnlp/twitter-roberta-base-2021-124m) and the original reference paper is [TweetEval](https://github.com/cardiffnlp/tweeteval). This model is suitable for English. 

- Reference Papers: [TimeLMs paper](https://arxiv.org/abs/2202.03829), [TweetTopic](https://arxiv.org/abs/2209.09824). 
- Git Repo: [TimeLMs official repository](https://github.com/cardiffnlp/timelms).

<b>Labels</b>: 
- 0 -> arts_&_culture;
- 1 -> business_&_entrepreneurs;
- 2 -> pop_culture;
- 3 -> daily_life;
- 4 -> sports_&_gaming;
- 5 -> science_&_technology


## Full classification example

```python
from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax

    
MODEL = f"cardiffnlp/tweet-topic-21-single"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
class_mapping = model.config.id2label

text = "Tesla stock is on the rise!"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

scores = output[0][0].detach().numpy()
scores = softmax(scores)

# TF
#model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
#class_mapping = model.config.id2label
#text = "Tesla stock is on the rise!"
#encoded_input = tokenizer(text, return_tensors='tf')
#output = model(**encoded_input)
#scores = output[0][0]
#scores = softmax(scores)


ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = class_mapping[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")

```

Output: 

```
1) business_&_entrepreneurs 0.8361
2) science_&_technology 0.0904
3) pop_culture 0.0288
4) daily_life 0.0178
5) arts_&_culture 0.0137
6) sports_&_gaming 0.0133
```