MoritzLaurer (HF staff) committed on
Commit 74cedb0 · verified · 1 parent: b978fbd

Update README.md

Files changed (1): README.md +48 −53

README.md CHANGED
@@ -52,6 +52,8 @@ output = zeroshot_classifier(text, classes_verbalised, hypothesis_template=hypot
 print(output)
 ```
 
+`multi_label=False` forces the model to decide on only one class. `multi_label=True` enables the model to choose multiple classes.
+
 ### Details on data and training
 
 Reproduction code is available here, in the `v2_synthetic_data` directory: https://github.com/MoritzLaurer/zeroshot-classifier/tree/main
@@ -59,61 +61,56 @@ Reproduction code is available here, in the `v2_synthetic_data` directory: https
 
 ## Metrics
 
-Balanced accuracy is reported for all datasets.
-`deberta-v3-large-zeroshot-v1.1-all-33` was trained on all datasets, with only maximum 500 texts per class to avoid overfitting.
-The metrics on these datasets are therefore not strictly zeroshot, as the model has seen some data for each task during training.
-`deberta-v3-large-zeroshot-v1.1-heldout` indicates zeroshot performance on the respective dataset.
-To calculate these zeroshot metrics, the pipeline was run 28 times, each time with one dataset held out from training to simulate a zeroshot setup.
-
-![figure_large_v1.1](https://raw.githubusercontent.com/MoritzLaurer/zeroshot-classifier/main/results/fig_large_v1.1.png)
-
-
-|                            | deberta-v3-large-mnli-fever-anli-ling-wanli-binary | deberta-v3-large-zeroshot-v1.1-heldout | deberta-v3-large-zeroshot-v1.1-all-33 |
-|:---------------------------|----------------------------:|-----------------------------------------:|----------------------------------------:|
-| datasets mean (w/o nli)    | 64.1 | 73.4 | 85.2 |
-| amazonpolarity (2)         | 94.7 | 96.6 | 96.8 |
-| imdb (2)                   | 90.3 | 95.2 | 95.5 |
-| appreviews (2)             | 93.6 | 94.3 | 94.7 |
-| yelpreviews (2)            | 98.5 | 98.4 | 98.9 |
-| rottentomatoes (2)         | 83.9 | 90.5 | 90.8 |
-| emotiondair (6)            | 49.2 | 42.1 | 72.1 |
-| emocontext (4)             | 57   | 69.3 | 82.4 |
-| empathetic (32)            | 42   | 34.4 | 58   |
-| financialphrasebank (3)    | 77.4 | 77.5 | 91.9 |
-| banking77 (72)             | 29.1 | 52.8 | 72.2 |
-| massive (59)               | 47.3 | 64.7 | 77.3 |
-| wikitoxic_toxicaggreg (2)  | 81.6 | 86.6 | 91   |
-| wikitoxic_obscene (2)      | 85.9 | 91.9 | 93.1 |
-| wikitoxic_threat (2)       | 77.9 | 93.7 | 97.6 |
-| wikitoxic_insult (2)       | 77.8 | 91.1 | 92.3 |
-| wikitoxic_identityhate (2) | 86.4 | 89.8 | 95.7 |
-| hateoffensive (3)          | 62.8 | 66.5 | 88.4 |
-| hatexplain (3)             | 46.9 | 61   | 76.9 |
-| biasframes_offensive (2)   | 62.5 | 86.6 | 89   |
-| biasframes_sex (2)         | 87.6 | 89.6 | 92.6 |
-| biasframes_intent (2)      | 54.8 | 88.6 | 89.9 |
-| agnews (4)                 | 81.9 | 82.8 | 90.9 |
-| yahootopics (10)           | 37.7 | 65.6 | 74.3 |
-| trueteacher (2)            | 51.2 | 54.9 | 86.6 |
-| spam (2)                   | 52.6 | 51.8 | 97.1 |
-| wellformedquery (2)        | 49.9 | 40.4 | 82.7 |
-| manifesto (56)             | 10.6 | 29.4 | 44.1 |
-| capsotu (21)               | 23.2 | 69.4 | 74   |
-| mnli_m (2)                 | 93.1 | nan  | 93.1 |
-| mnli_mm (2)                | 93.2 | nan  | 93.2 |
-| fevernli (2)               | 89.3 | nan  | 89.5 |
-| anli_r1 (2)                | 87.9 | nan  | 87.3 |
-| anli_r2 (2)                | 76.3 | nan  | 78   |
-| anli_r3 (2)                | 73.6 | nan  | 74.1 |
-| wanli (2)                  | 82.8 | nan  | 82.7 |
-| lingnli (2)                | 90.2 | nan  | 89.6 |
+The model was evaluated on 28 different text classification tasks with the [balanced_accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html) metric.
+The main reference point is `facebook/bart-large-mnli`, which is, at the time of writing (27.03.24), the most used commercially-friendly 0-shot classifier.
+The different `...zeroshot-v2.0` models were all trained on the same data; the only difference is the underlying foundation model.
+
+Note that my `...zeroshot-v1.1` models (e.g. [deberta-v3-base-zeroshot-v1.1-all-33](https://huggingface.co/MoritzLaurer/deberta-v3-base-zeroshot-v1.1-all-33))
+perform better on these 28 datasets, but they are trained on several datasets with non-commercial licenses.
+I therefore recommend the v2.0 models for commercial users, while non-commercial users might get better performance with the v1.1 models.
+
+![results_aggreg_v2.0](https://raw.githubusercontent.com/MoritzLaurer/zeroshot-classifier/e859471dd183ad44b705c047130433301386aab8/v2_synthetic_data/results/zeroshot-v2.0-aggreg.png)
+
+|                            | facebook/bart-large-mnli | roberta-base-zeroshot-v2.0 | roberta-large-zeroshot-v2.0 | deberta-v3-base-zeroshot-v2.0 | deberta-v3-large-zeroshot-v2.0 |
+|:---------------------------|---------------------------:|-----------------------------:|------------------------------:|--------------------------------:|---------------------------------:|
+| all datasets mean          | 0.566 | 0.612 | 0.65  | 0.647 | 0.697 |
+| amazonpolarity (2)         | 0.937 | 0.924 | 0.951 | 0.937 | 0.952 |
+| imdb (2)                   | 0.892 | 0.871 | 0.904 | 0.893 | 0.923 |
+| appreviews (2)             | 0.934 | 0.913 | 0.937 | 0.938 | 0.943 |
+| yelpreviews (2)            | 0.948 | 0.953 | 0.977 | 0.979 | 0.988 |
+| rottentomatoes (2)         | 0.831 | 0.803 | 0.841 | 0.841 | 0.87  |
+| emotiondair (6)            | 0.495 | 0.523 | 0.514 | 0.487 | 0.495 |
+| emocontext (4)             | 0.605 | 0.535 | 0.609 | 0.566 | 0.687 |
+| empathetic (32)            | 0.366 | 0.386 | 0.417 | 0.388 | 0.455 |
+| financialphrasebank (3)    | 0.673 | 0.521 | 0.445 | 0.678 | 0.656 |
+| banking77 (72)             | 0.327 | 0.138 | 0.297 | 0.433 | 0.542 |
+| massive (59)               | 0.454 | 0.481 | 0.599 | 0.533 | 0.599 |
+| wikitoxic_toxicaggreg (2)  | 0.609 | 0.752 | 0.768 | 0.752 | 0.751 |
+| wikitoxic_obscene (2)      | 0.728 | 0.818 | 0.854 | 0.853 | 0.884 |
+| wikitoxic_threat (2)       | 0.531 | 0.796 | 0.874 | 0.861 | 0.876 |
+| wikitoxic_insult (2)       | 0.514 | 0.738 | 0.802 | 0.768 | 0.778 |
+| wikitoxic_identityhate (2) | 0.567 | 0.776 | 0.801 | 0.774 | 0.801 |
+| hateoffensive (3)          | 0.41  | 0.497 | 0.484 | 0.539 | 0.634 |
+| hatexplain (3)             | 0.373 | 0.423 | 0.385 | 0.441 | 0.446 |
+| biasframes_offensive (2)   | 0.499 | 0.571 | 0.587 | 0.546 | 0.648 |
+| biasframes_sex (2)         | 0.503 | 0.703 | 0.845 | 0.794 | 0.877 |
+| biasframes_intent (2)      | 0.635 | 0.541 | 0.635 | 0.562 | 0.696 |
+| agnews (4)                 | 0.722 | 0.765 | 0.764 | 0.694 | 0.824 |
+| yahootopics (10)           | 0.303 | 0.55  | 0.621 | 0.575 | 0.605 |
+| trueteacher (2)            | 0.492 | 0.488 | 0.501 | 0.505 | 0.515 |
+| spam (2)                   | 0.523 | 0.537 | 0.528 | 0.531 | 0.698 |
+| wellformedquery (2)        | 0.528 | 0.5   | 0.5   | 0.5   | 0.476 |
+| manifesto (56)             | 0.088 | 0.111 | 0.206 | 0.198 | 0.277 |
+| capsotu (21)               | 0.375 | 0.525 | 0.558 | 0.543 | 0.631 |
+
 
 
 
 ## Limitations and bias
 The model can only do text classification tasks.
 
-Please consult the original DeBERTa paper and the papers for the different datasets for potential biases.
+Biases can come from the underlying foundation model, the human NLI training data, and the synthetic data generated by Mixtral.
+
 
 
 ## License
@@ -147,9 +144,7 @@ If you have questions or ideas for cooperation, contact me at moritz{at}huggingf
 
 
 
-### Hypotheses used for classification
-The hypotheses in the tables below were used to fine-tune the model.
-Inspecting them can help users get a feeling for which type of hypotheses and tasks the model was trained on.
+### Flexible usage and "prompting"
 You can formulate your own hypotheses by changing the `hypothesis_template` of the zeroshot pipeline. For example:
 
 ```python
@@ -157,7 +152,7 @@ from transformers import pipeline
 text = "Angela Merkel is a politician in Germany and leader of the CDU"
 hypothesis_template = "Merkel is the leader of the party: {}"
 classes_verbalized = ["CDU", "SPD", "Greens"]
-zeroshot_classifier = pipeline("zero-shot-classification", model="MoritzLaurer/deberta-v3-large-zeroshot-v1.1-all-33")
+zeroshot_classifier = pipeline("zero-shot-classification", model="MoritzLaurer/deberta-v3-base-zeroshot-v2.0")
 output = zeroshot_classifier(text, classes_verbalized, hypothesis_template=hypothesis_template, multi_label=False)
 print(output)
 ```
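The balanced accuracy metric used in the Metrics section is the macro-average of per-class recall, which keeps large classes from drowning out small ones. A minimal pure-Python sketch of the computation (the helper name and toy labels are illustrative, not from the model card):

```python
def balanced_accuracy(y_true, y_pred):
    # Mean of per-class recall: every class counts equally,
    # regardless of how many examples it has.
    recalls = []
    for c in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Imbalanced toy data: plain accuracy would be 5/6 ~ 0.83,
# but the minority class is only half-recalled.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

This matches `sklearn.metrics.balanced_accuracy_score` (linked in the diff) on the same inputs.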
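The `multi_label` switch noted in the diff changes how the NLI entailment scores are normalized. With `multi_label=False`, the pipeline softmaxes the entailment logits across all candidate labels, so the scores sum to 1 and the model effectively picks one class; with `multi_label=True`, each label is normalized independently against its own contradiction logit, so several labels can score high at once. A rough sketch of that normalization, using made-up logits rather than the pipeline's actual internals:

```python
import math

def softmax(xs):
    z = [math.exp(x) for x in xs]
    s = sum(z)
    return [v / s for v in z]

# Hypothetical (entailment, contradiction) logits for three candidate labels.
labels = ["CDU", "SPD", "Greens"]
ent = [2.0, 0.5, -1.0]
con = [-1.0, 0.2, 1.5]

# multi_label=False: one softmax across all entailment logits -> scores sum to 1.
single = softmax(ent)

# multi_label=True: each label normalized independently against its own
# contradiction logit -> labels are scored as separate yes/no decisions.
multi = [softmax([e, c])[0] for e, c in zip(ent, con)]

print(dict(zip(labels, single)))
print(dict(zip(labels, multi)))
```

The key difference to notice: the `single` scores always sum to 1, while each `multi` score is an independent probability in (0, 1).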