---
license: apache-2.0
tags:
- ESG
- finance
language:
- en

---
![esgify](ESGify.png)
# About ESGify
<img src="ESGify_logo.jpeg" alt="image" width="20%" height="auto">
**ESGify** is a model for multilabel classification of news with respect to ESG risks. Our custom methodology includes 46 ESG classes and 1 class for texts not relevant to ESG, resulting in 47 classes in total:

![esgify_classes](ESGify_classes.jpg)

# Usage 

ESGify is based on the MPNet architecture with a custom classification head. The ESGify class is defined as follows.

```python
from collections import OrderedDict
from transformers import MPNetPreTrainedModel, MPNetModel, AutoTokenizer
import torch

# Mean pooling: average token embeddings, using the attention mask to ignore padding
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output  # last hidden state: one embedding per token
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# ESGify is defined as a custom class because it uses a sentence-transformers-style
# mean pooling function and a custom classifier head
class ESGify(MPNetPreTrainedModel):
    """Model for Classification ESG risks from text."""

    def __init__(self,config): #tuning only the head
        """
        """
        super().__init__(config)
        # Instantiate Parts of model
        self.mpnet = MPNetModel(config,add_pooling_layer=False)
        self.id2label =  config.id2label
        self.label2id =  config.label2id
        self.classifier = torch.nn.Sequential(OrderedDict([('norm',torch.nn.BatchNorm1d(768)),
                                                ('linear',torch.nn.Linear(768,512)),
                                                ('act',torch.nn.ReLU()),
                                                ('batch_n',torch.nn.BatchNorm1d(512)),
                                                ('drop_class', torch.nn.Dropout(0.2)),
                                                ('class_l',torch.nn.Linear(512 ,47))]))


    def forward(self, input_ids, attention_mask):
        # Feed input to the MPNet encoder
        outputs = self.mpnet(input_ids=input_ids,
                             attention_mask=attention_mask)

        # Mean-pool the token embeddings and feed them to the classifier to compute logits
        logits = self.classifier(mean_pooling(outputs['last_hidden_state'], attention_mask))

        # Apply sigmoid to turn logits into independent per-class scores in [0, 1]
        scores = torch.sigmoid(logits)
        return scores
```
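The masked averaging performed by `mean_pooling` can be illustrated without PyTorch. The following is a minimal sketch on plain Python lists, assuming a single sequence with 2-dimensional token embeddings; the helper name `masked_mean` and the toy values are hypothetical and only mirror the arithmetic of the function above.

```python
def masked_mean(token_embeddings, attention_mask, eps=1e-9):
    """Average token vectors, ignoring positions where the mask is 0."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        for j in range(dim):
            sums[j] += vector[j] * keep
    # Same role as torch.clamp(..., min=1e-9): avoid division by zero
    count = max(sum(attention_mask), eps)
    return [s / count for s in sums]


# Two real tokens followed by one padding token (mask = 0)
embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(embeddings, mask)
print(pooled)  # [2.0, 3.0]: the padding vector does not affect the result
```

Without the mask, the padding vector would dominate the average, which is why the attention mask is passed alongside the hidden states.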

After defining the model class, we initialize ESGify and the tokenizer with the pre-trained weights:

```python
model = ESGify.from_pretrained('ai-lab/ESGify')
tokenizer = AutoTokenizer.from_pretrained('ai-lab/ESGify')
```

Getting results from the model:

```python
texts = ['text1','text2']
to_model = tokenizer.batch_encode_plus(
                  texts,
                  add_special_tokens=True,
                  max_length=512,
                  return_token_type_ids=False,
                  padding="max_length",
                  truncation=True,
                  return_attention_mask=True,
                  return_tensors='pt',
                )
results = model(**to_model)
```

To get the top-3 classes by relevance, with their scores, for the first text:

```python
# Top-3 classes for the first text in the batch (built-in round avoids a NumPy import)
for i in torch.topk(results, k=3).indices.tolist()[0]:
    print(f"{model.id2label[i]}: {round(results[0, i].item(), 3)}")
```

For example, for the news text "She faced employment rejection because of her gender", we get the following top-3 labels:
```
Discrimination: 0.944
Strategy Implementation: 0.82
Indigenous People: 0.499
```
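Because the model applies a sigmoid to each class independently, the scores do not sum to one, and a text can be assigned any number of labels. As an alternative to a fixed top-3, labels can be selected with a score cutoff. The sketch below is illustrative only: the helper `select_labels` is hypothetical, and the threshold of 0.5 is an assumption rather than a value recommended for ESGify. It reuses the example scores shown above.

```python
def select_labels(scores, threshold=0.5):
    """Return (label, score) pairs with score >= threshold, highest first."""
    kept = [(label, score) for label, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Scores taken from the example output above
example_scores = {
    "Discrimination": 0.944,
    "Strategy Implementation": 0.82,
    "Indigenous People": 0.499,
}

selected = select_labels(example_scores)
print(selected)  # "Indigenous People" falls just below the assumed cutoff
```

In practice, a suitable threshold would need to be chosen on validation data for the task at hand.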

Before training our model, we masked words related to Organisation, Date, Country, and Person to prevent false associations between these entities and risks. Hence, we recommend processing text with the FLAIR NER model before inference.
An example of such preprocessing is given in [this Colab notebook](https://colab.research.google.com/drive/15YcTW9KPSWesZ6_L4BUayqW_omzars0l?usp=sharing).
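The masking step itself is simple string replacement once entity spans are known. The sketch below assumes the spans have already been produced by a NER model; here they are written by hand. The helper `mask_entities`, the `<LABEL>` placeholder format, and the label names are hypothetical and may differ from the masking used to train ESGify, so the linked notebook remains the reference for the exact preprocessing.

```python
def mask_entities(text, spans):
    """Replace each (start, end, label) character span with a <LABEL> placeholder."""
    masked = text
    # Replace from the end of the text so earlier offsets stay valid
    for start, end, label in sorted(spans, reverse=True):
        masked = masked[:start] + f"<{label}>" + masked[end:]
    return masked


text = "Acme Corp fired John Smith in Germany."
# Hand-written spans standing in for NER output (character offsets, end exclusive)
spans = [(0, 9, "ORG"), (16, 26, "PER"), (30, 37, "LOC")]

masked_text = mask_entities(text, spans)
print(masked_text)  # <ORG> fired <PER> in <LOC>.
```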


# Training procedure

We start from the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model.
Next, we perform domain adaptation via masked language modeling on texts from ESG reports.
Finally, we fine-tune the model on 2000 texts manually annotated by ESG specialists.