javier-ab-bsc committed on
Commit
22e6062
·
verified ·
1 Parent(s): 791d00a

Added evaluation section

Files changed (1)
  1. README.md +339 -1
README.md CHANGED
@@ -544,7 +544,345 @@ The dataset does not allow for external contributions.
544
 
545
  ## Evaluation
546
 
547
- <span style="color:red">TODO</span>
548
 
549
  ## Ethical Considerations and Limitations
550
 
 
544
 
545
  ## Evaluation
546
 
547
+ ### Gold-standard benchmarks
548
+
549
+ Evaluation is done using the Language Model Evaluation Harness (Gao et al., 2024). We evaluate on a set of tasks taken from [SpanishBench](https://github.com/EleutherAI/lm-evaluation-harness/pull/2157), [CatalanBench](https://github.com/EleutherAI/lm-evaluation-harness/pull/2154), [BasqueBench](https://github.com/EleutherAI/lm-evaluation-harness/pull/2153) and [GalicianBench](https://github.com/EleutherAI/lm-evaluation-harness/pull/2155). These benchmarks include both new and existing tasks and datasets. The tables below report results on a selection of evaluation datasets that represent the model's performance across a variety of tasks within these benchmarks.
550
+
551
+ We only use tasks that are human-generated, human-translated, or produced with strong human-in-the-loop involvement (i.e., machine translation followed by professional revision, or machine generation followed by human revision and annotation). This is why the number of tasks reported varies across languages. As more tasks that fulfill these requirements are published, we will update the presented results. We also intend to expand the evaluation to other languages, as long as the datasets meet our quality standards.
552
+
553
+ While implementing the evaluation, we observed several issues worth considering when replicating and interpreting the results presented. These include performance variations of ≈1.5% on some tasks depending on the version of the `transformers` library used and on whether tensor parallelism is used when loading a model. When implementing existing tasks, we carry out a comprehensive quality evaluation of the dataset, the Harness task itself, and the input that models see during evaluation. Our implementation (see links above) addresses multiple existing problems, such as errors in datasets and prompts and a lack of pre-processing. Consequently, results will vary when using other Harness implementations and may vary slightly depending on the replication setup.
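When replicating, one way to account for the setup-dependent variation described above is to compare a replicated score against the reported one with a tolerance. The sketch below is an illustration only and not part of the evaluation pipeline: the helper name `within_tolerance` is hypothetical, and reading the ≈1.5% figure as 1.5 absolute percentage points is an assumption.

```python
def within_tolerance(reported: float, replicated: float, tolerance: float = 1.5) -> bool:
    """Return True if two scores (in percentage points) differ by at most `tolerance`."""
    # Assumption: the tolerance is an absolute difference in percentage points.
    return abs(reported - replicated) <= tolerance


# 74.06 is the reported xstorycloze_es accuracy from the Spanish table.
# The replicated values are made-up inputs used only to exercise the helper.
reported_score = 74.06
print(within_tolerance(reported_score, 73.10))  # difference of 0.96 points
print(within_tolerance(reported_score, 71.00))  # difference of 3.06 points
```

A difference larger than the tolerance does not by itself indicate an error; it may also come from using a different Harness implementation of the task.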
554
+
555
+ Note that these results are subject to the drawbacks of all current gold-standard evaluations, and that the figures do not fully represent the model's capabilities and potential. We therefore advise caution when reading and interpreting the results.
556
+
557
+ A full list of results compared with other baselines, a discussion of the model's performance across tasks and its implications, and details on the problems solved during task implementation will soon be available in the technical report.
558
+
559
+ All results reported below are in a 5-shot setting.
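A 5-shot run of this kind can be launched through the Harness command-line interface. The sketch below only assembles such a command as a string and does not execute it. It assumes the `lm_eval` entry point with its `--model`, `--model_args`, `--tasks`, `--num_fewshot` and `--batch_size` options; the function name `build_eval_command` and the model identifier `your-org/your-model` are placeholders, not the setup used for the reported results. The task names are taken from the Spanish table, and whether they resolve depends on the Harness version or fork installed.

```python
import shlex


def build_eval_command(model_id: str, tasks: list[str], num_fewshot: int = 5) -> str:
    """Assemble an lm_eval command line for the given model and tasks (does not run it)."""
    args = [
        "lm_eval",
        "--model", "hf",                          # Hugging Face transformers backend
        "--model_args", f"pretrained={model_id}",
        "--tasks", ",".join(tasks),               # comma-separated task names
        "--num_fewshot", str(num_fewshot),        # 5-shot by default, as in the tables
        "--batch_size", "auto",
    ]
    # shlex.join quotes arguments where needed so the string is safe to paste in a shell.
    return shlex.join(args)


# Placeholder model identifier; replace it with the checkpoint to evaluate.
command = build_eval_command("your-org/your-model", ["xstorycloze_es", "paws_es"])
print(command)
```

Building the command in one place makes it easier to keep the few-shot setting identical across languages, which matters when comparing against the tables.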
560
+
561
+ #### Spanish
562
+
563
+ <table><thead>
564
+ <tr>
565
+ <th>Category</th>
566
+ <th>Task</th>
567
+ <th>Metric</th>
568
+ <th>Result</th>
569
+ </tr></thead>
570
+ <tbody>
571
+ <tr>
572
+ <td>Commonsense Reasoning</td>
573
+ <td>xstorycloze_es</td>
574
+ <td>acc</td>
575
+ <td>74.06</td>
576
+ </tr>
577
+ <tr>
578
+ <td>Math</td>
579
+ <td>mgsm_direct_es</td>
580
+ <td>em</td>
581
+ <td>4</td>
582
+ </tr>
583
+ <tr>
584
+ <td rowspan="2">NLI</td>
585
+ <td>wnli_es</td>
586
+ <td>acc</td>
587
+ <td>46.48</td>
588
+ </tr>
589
+ <tr>
590
+ <td>xnli_es</td>
591
+ <td>acc</td>
592
+ <td>46.47</td>
593
+ </tr>
594
+ <tr>
595
+ <td>Paraphrasing</td>
596
+ <td>paws_es</td>
597
+ <td>acc</td>
598
+ <td>57.65</td>
599
+ </tr>
600
+ <tr>
601
+ <td>QA</td>
602
+ <td>xquad_es</td>
603
+ <td>acc</td>
604
+ <td>71.48</td>
605
+ </tr>
606
+ <tr>
607
+ <td>Translation</td>
608
+ <td>flores_es</td>
609
+ <td>bleu</td>
610
+ <td>23.56</td>
611
+ </tr>
612
+ </tbody>
613
+ </table>
614
+
615
+ #### Catalan
616
+
617
+ <table><thead>
618
+ <tr>
619
+ <th>Category</th>
620
+ <th>Task</th>
621
+ <th>Metric</th>
622
+ <th>Result</th>
623
+ </tr></thead>
624
+ <tbody>
625
+ <tr>
626
+ <td rowspan="2">Commonsense Reasoning</td>
627
+ <td>copa_ca</td>
628
+ <td>acc</td>
629
+ <td>80.8</td>
630
+ </tr>
631
+ <tr>
632
+ <td>xstorycloze_ca</td>
633
+ <td>acc</td>
634
+ <td>73.73</td>
635
+ </tr>
636
+ <tr>
637
+ <td>Math</td>
638
+ <td>mgsm_direct_ca</td>
639
+ <td>em</td>
640
+ <td>6</td>
641
+ </tr>
642
+ <tr>
643
+ <td rowspan="2">NLI</td>
644
+ <td>wnli_ca</td>
645
+ <td>acc</td>
646
+ <td>56.34</td>
647
+ </tr>
648
+ <tr>
649
+ <td>xnli_ca</td>
650
+ <td>acc</td>
651
+ <td>49.4</td>
652
+ </tr>
653
+ <tr>
654
+ <td rowspan="2">Paraphrasing</td>
655
+ <td>parafraseja</td>
656
+ <td>acc</td>
657
+ <td>64.88</td>
658
+ </tr>
659
+ <tr>
660
+ <td>paws_ca</td>
661
+ <td>acc</td>
662
+ <td>61.5</td>
663
+ </tr>
664
+ <tr>
665
+ <td rowspan="5">QA</td>
666
+ <td>arc_ca_easy</td>
667
+ <td>acc</td>
668
+ <td>69.23</td>
669
+ </tr>
670
+ <tr>
671
+ <td>arc_ca_challenge</td>
672
+ <td>acc</td>
673
+ <td>44.54</td>
674
+ </tr>
675
+ <tr>
676
+ <td>openbookqa_ca</td>
677
+ <td>acc</td>
678
+ <td>36.8</td>
679
+ </tr>
680
+ <tr>
681
+ <td>piqa_ca</td>
682
+ <td>acc</td>
683
+ <td>70.35</td>
684
+ </tr>
685
+ <tr>
686
+ <td>siqa_ca</td>
687
+ <td>acc</td>
688
+ <td>48.26</td>
689
+ </tr>
690
+ <tr>
691
+ <td>Translation</td>
692
+ <td>flores_ca</td>
693
+ <td>bleu</td>
694
+ <td>30.34</td>
695
+ </tr>
696
+ </tbody></table>
697
+
698
+ #### Basque
699
+
700
+ <table><thead>
701
+ <tr>
702
+ <th>Category</th>
703
+ <th>Task</th>
704
+ <th>Metric</th>
705
+ <th>Result</th>
706
+ </tr></thead>
707
+ <tbody>
708
+ <tr>
709
+ <td rowspan="2">Commonsense Reasoning</td>
710
+ <td>xcopa_eu</td>
711
+ <td>acc</td>
712
+ <td>68</td>
713
+ </tr>
714
+ <tr>
715
+ <td>xstorycloze_eu</td>
716
+ <td>acc</td>
717
+ <td>64.79</td>
718
+ </tr>
719
+ <tr>
720
+ <td>Math</td>
721
+ <td>mgsm_direct_eu</td>
722
+ <td>em</td>
723
+ <td>6</td>
724
+ </tr>
725
+ <tr>
726
+ <td rowspan="2">NLI</td>
727
+ <td>wnli_eu</td>
728
+ <td>acc</td>
729
+ <td>38.03</td>
730
+ </tr>
731
+ <tr>
732
+ <td>xnli_eu</td>
733
+ <td>acc</td>
734
+ <td>42.85</td>
735
+ </tr>
736
+ <tr>
737
+ <td rowspan="3">QA</td>
738
+ <td>eus_exams</td>
739
+ <td>acc</td>
740
+ <td>38.41</td>
741
+ </tr>
742
+ <tr>
743
+ <td>eus_proficiency</td>
744
+ <td>acc</td>
745
+ <td>31.13</td>
746
+ </tr>
747
+ <tr>
748
+ <td>eus_trivia</td>
749
+ <td>acc</td>
750
+ <td>45.36</td>
751
+ </tr>
752
+ <tr>
753
+ <td>Reading Comprehension</td>
754
+ <td>eus_reading</td>
755
+ <td>acc</td>
756
+ <td>33.24</td>
757
+ </tr>
758
+ <tr>
759
+ <td>Translation</td>
760
+ <td>flores_eu</td>
761
+ <td>bleu</td>
762
+ <td>16.29</td>
763
+ </tr>
764
+ </tbody></table>
765
+
766
+ #### Galician
767
+
768
+ <table><thead>
769
+ <tr>
770
+ <th>Category</th>
771
+ <th>Task</th>
772
+ <th>Metric</th>
773
+ <th>Result</th>
774
+ </tr></thead>
775
+ <tbody>
776
+ <tr>
777
+ <td>Math</td>
778
+ <td>mgsm_direct_gl</td>
779
+ <td>em</td>
780
+ <td>48</td>
781
+ </tr>
782
+ <tr>
783
+ <td rowspan="2">Paraphrasing</td>
784
+ <td>parafrases_gl</td>
785
+ <td>acc</td>
786
+ <td>58.84</td>
787
+ </tr>
788
+ <tr>
789
+ <td>paws_gl</td>
790
+ <td>acc</td>
791
+ <td>60.85</td>
792
+ </tr>
793
+ <tr>
794
+ <td>QA</td>
795
+ <td>openbookqa_gl</td>
796
+ <td>acc</td>
797
+ <td>34.6</td>
798
+ </tr>
799
+ <tr>
800
+ <td>Translation</td>
801
+ <td>flores_gl</td>
802
+ <td>bleu</td>
803
+ <td>27.98</td>
804
+ </tr>
805
+ </tbody>
806
+ </table>
807
+
808
+ #### English
809
+
810
+ <table><thead>
811
+ <tr>
812
+ <th>Category</th>
813
+ <th>Task</th>
814
+ <th>Metric</th>
815
+ <th>Result</th>
816
+ </tr></thead>
817
+ <tbody>
818
+ <tr>
819
+ <td rowspan="2">Commonsense Reasoning</td>
820
+ <td>copa</td>
821
+ <td>acc</td>
822
+ <td>90</td>
823
+ </tr>
824
+ <tr>
825
+ <td>xstorycloze_en</td>
826
+ <td>acc</td>
827
+ <td>79.22</td>
828
+ </tr>
829
+ <tr>
830
+ <td>Math</td>
831
+ <td>mgsm_direct_en *</td>
832
+ <td>em</td>
833
+ <td>8</td>
834
+ </tr>
835
+ <tr>
836
+ <td rowspan="2">NLI</td>
837
+ <td>wnli</td>
838
+ <td>acc</td>
839
+ <td>52.11</td>
840
+ </tr>
841
+ <tr>
842
+ <td>xnli_en</td>
843
+ <td>acc</td>
844
+ <td>47.27</td>
845
+ </tr>
846
+ <tr>
847
+ <td>Paraphrasing</td>
848
+ <td>paws *</td>
849
+ <td>acc</td>
850
+ <td>59.6</td>
851
+ </tr>
852
+ <tr>
853
+ <td rowspan="6">QA</td>
854
+ <td>arc_easy</td>
855
+ <td>acc</td>
856
+ <td>81.36</td>
857
+ </tr>
858
+ <tr>
859
+ <td>arc_challenge</td>
860
+ <td>acc</td>
861
+ <td>50.6</td>
862
+ </tr>
863
+ <tr>
864
+ <td>openbookqa</td>
865
+ <td>acc</td>
866
+ <td>34.4</td>
867
+ </tr>
868
+ <tr>
869
+ <td>piqa</td>
870
+ <td>acc</td>
871
+ <td>78.78</td>
872
+ </tr>
873
+ <tr>
874
+ <td>social_iqa</td>
875
+ <td>acc</td>
876
+ <td>50.15</td>
877
+ </tr>
878
+ <tr>
879
+ <td>squad_en</td>
880
+ <td>acc</td>
881
+ <td>78.06</td>
882
+ </tr>
883
+ </tbody></table>
884
+
885
+ \* The current LM Evaluation Harness implementation lacks correct pre-processing. These results were obtained with adequate pre-processing.
886
 
887
  ## Ethical Considerations and Limitations
888