LLM360/de-arena

yzabc007 committed on Oct 9, 2024

Commit a56e40b

1 Parent(s): 2f5cc84

Update space

Files changed (1)

app.py +28 -8

@@ -103,7 +103,8 @@ def init_leaderboard(dataframe):
 # model_result_path = "./src/results/models_2024-10-08-03:10:26.811832.jsonl"
 # model_result_path = "./src/results/models_2024-10-08-03:25:44.801310.jsonl"
 # model_result_path = "./src/results/models_2024-10-08-17:39:21.001582.jsonl"
-model_result_path = "./src/results/models_2024-10-09-05:17:38.810960.json"
 # model_leaderboard_df = get_model_leaderboard_df(model_result_path)
@@ -156,7 +157,7 @@ with demo:
                         AutoEvalColumn.rank_math_probability.name,
                         AutoEvalColumn.rank_reason_logical.name,
                         AutoEvalColumn.rank_reason_social.name,
-                        # AutoEvalColumn.rank_chemistry.name,
                         ],
                     rank_col=[],
                 )
@@ -274,6 +275,7 @@ with demo:
             [SocialIQA](https://arxiv.org/abs/1904.09728),
             [NormBank](https://arxiv.org/abs/2305.17008), covering challenging social reasoning tasks,
             such as social commonsense reasoning, social normative reasoning, Theory of Mind (ToM) reasoning, etc.
             """
             gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")
@@ -314,9 +316,10 @@ with demo:
         with gr.TabItem("🔬 Science", elem_id="science-table", id=4):
             CURRENT_TEXT = """
-            Sicnece domain is a critical area for evaluating LLMs.
-            We are working on adding several tasks on scientific domains to the leaderboard. The forthcoming ones are biology, chemistry, and physics.
-            We have diversely and aggressively collected recent science datasets, including but not limited to
             [GPQA](https://arxiv.org/abs/2311.12022),
             [JEEBench](https://aclanthology.org/2023.emnlp-main.468/),
             [MMLU-Pro](https://arxiv.org/abs/2406.01574),
@@ -359,8 +362,7 @@ with demo:
         with gr.TabItem("</> Coding", elem_id="coding-table", id=5):
             CURRENT_TEXT = """
-            # Coming soon!
-            We are working on adding more tasks in coding domains to the leaderboard.
             The forthcoming ones focus on Python, Java, and C++, with plans to expand to more languages.
             We collect a variety of recent coding datasets, including
             [HumanEval](https://huggingface.co/datasets/openai/openai_humaneval),
@@ -371,6 +373,24 @@ with demo:
             Our efforts also include synthesizing new code-related queries to ensure diversity!
             """
             gr.Markdown(CURRENT_TEXT, elem_classes="markdown-text")
@@ -386,7 +406,7 @@ with demo:
             ## Team members
             Yanbin Yin, [Zhen Wang](https://zhenwang9102.github.io/), [Kun Zhou](https://lancelot39.github.io/), Xiangdong Zhang,
-            [Shibo Hao](https://ber666.github.io/), [Yi Gu](https://www.yigu.page/), Jieyuan Liu, Somanshu Singla, [Tianyang Liu](https://leolty.github.io/),
             [Eric P. Xing](https://www.cs.cmu.edu/~epxing/), [Zhengzhong Liu](https://hunterhector.github.io/), [Haojian Jin](https://www.haojianj.in/),
             [Zhiting Hu](https://zhiting.ucsd.edu/)

 # model_result_path = "./src/results/models_2024-10-08-03:10:26.811832.jsonl"
 # model_result_path = "./src/results/models_2024-10-08-03:25:44.801310.jsonl"
 # model_result_path = "./src/results/models_2024-10-08-17:39:21.001582.jsonl"
+# model_result_path = "./src/results/models_2024-10-09-05:17:38.810960.json"
+model_result_path = "./src/results/models_2024-10-09-06:22:21.122422.json"
 # model_leaderboard_df = get_model_leaderboard_df(model_result_path)
                         AutoEvalColumn.rank_math_probability.name,
                         AutoEvalColumn.rank_reason_logical.name,
                         AutoEvalColumn.rank_reason_social.name,
+                        AutoEvalColumn.rank_chemistry.name,
                         ],
                     rank_col=[],
                 )
             [SocialIQA](https://arxiv.org/abs/1904.09728),
             [NormBank](https://arxiv.org/abs/2305.17008), covering challenging social reasoning tasks,
             such as social commonsense reasoning, social normative reasoning, Theory of Mind (ToM) reasoning, etc.
+            More fine-grained types of reasoning, such as symbolic, analogical, counterfactual reasoning, are planned to be added in the future.
             """
             gr.Markdown(DESCRIPTION_TEXT, elem_classes="markdown-text")
         with gr.TabItem("🔬 Science", elem_id="science-table", id=4):
             CURRENT_TEXT = """
+            Scientific tasks are crucial for evaluating LLMs, requiring both domain-specific knowledge and reasoning capabilities.
+            We are adding several fine-grained scientific domains to the leaderboard. The forthcoming ones are biology, chemistry, and physics.
+            We have diversely and aggressively collected recent scientific datasets, including but not limited to
             [GPQA](https://arxiv.org/abs/2311.12022),
             [JEEBench](https://aclanthology.org/2023.emnlp-main.468/),
             [MMLU-Pro](https://arxiv.org/abs/2406.01574),
         with gr.TabItem("</> Coding", elem_id="coding-table", id=5):
             CURRENT_TEXT = """
+            We are working on adding more fine-grained tasks in coding domains to the leaderboard.
             The forthcoming ones focus on Python, Java, and C++, with plans to expand to more languages.
             We collect a variety of recent coding datasets, including
             [HumanEval](https://huggingface.co/datasets/openai/openai_humaneval),
             Our efforts also include synthesizing new code-related queries to ensure diversity!
             """
             gr.Markdown(CURRENT_TEXT, elem_classes="markdown-text")
+            with gr.TabItem("🐍 Python", elem_id="python_subtab", id=0, elem_classes="subtab"):
+                CURRENT_TEXT = """
+                # Coming soon!
+                """
+                gr.Markdown(CURRENT_TEXT, elem_classes="markdown-text")
+            with gr.TabItem("☕ Java", elem_id="java_subtab", id=1, elem_classes="subtab"):
+                CURRENT_TEXT = """
+                # Coming soon!
+                """
+                gr.Markdown(CURRENT_TEXT, elem_classes="markdown-text")
+            with gr.TabItem("➕ C++", elem_id="cpp_subtab", id=2, elem_classes="subtab"):
+                CURRENT_TEXT = """
+                # Coming soon!
+                """
+                gr.Markdown(CURRENT_TEXT, elem_classes="markdown-text")
             ## Team members
             Yanbin Yin, [Zhen Wang](https://zhenwang9102.github.io/), [Kun Zhou](https://lancelot39.github.io/), Xiangdong Zhang,
+            [Shibo Hao](https://ber666.github.io/), [Yi Gu](https://www.yigu.page/), [Jieyuan Liu](https://www.linkedin.com/in/jieyuan-liu/), [Somanshu Singla](https://www.linkedin.com/in/somanshu-singla-105636214/), [Tianyang Liu](https://leolty.github.io/),
             [Eric P. Xing](https://www.cs.cmu.edu/~epxing/), [Zhengzhong Liu](https://hunterhector.github.io/), [Haojian Jin](https://www.haojianj.in/),
             [Zhiting Hu](https://zhiting.ucsd.edu/)
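
Review note on the `model_result_path` change: this commit comments out one timestamped path and hard-codes the next, following the same pattern as the commented-out lines above it. A minimal sketch of an alternative, assuming the results files keep the `models_<timestamp>.json` / `.jsonl` naming seen in this diff, is to pick the newest file by parsing the timestamp from its name. The helper names `parse_stamp`, `latest_result_name`, and `latest_result_path` are hypothetical and are not part of `app.py`.

```python
import re
from datetime import datetime
from pathlib import Path
from typing import Iterable, Optional

# Assumed naming scheme, taken from the paths in this diff, e.g.
# "models_2024-10-09-06:22:21.122422.json" (older files end in ".jsonl").
_STAMP = re.compile(
    r"^models_(\d{4}-\d{2}-\d{2}-\d{2}:\d{2}:\d{2}\.\d{6})\.jsonl?$"
)


def parse_stamp(filename: str) -> Optional[datetime]:
    """Return the timestamp embedded in a results filename, or None."""
    match = _STAMP.match(filename)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d-%H:%M:%S.%f")


def latest_result_name(filenames: Iterable[str]) -> Optional[str]:
    """Return the filename with the most recent embedded timestamp."""
    stamped = []
    for name in filenames:
        stamp = parse_stamp(name)
        if stamp is not None:  # skip files that do not follow the scheme
            stamped.append((stamp, name))
    if not stamped:
        return None
    return max(stamped, key=lambda pair: pair[0])[1]


def latest_result_path(results_dir: str) -> Optional[str]:
    """Return the path of the newest results file in results_dir, if any."""
    directory = Path(results_dir)
    newest = latest_result_name(p.name for p in directory.iterdir())
    return None if newest is None else str(directory / newest)
```

Under these assumptions, the hard-coded assignment could become `model_result_path = latest_result_path("./src/results")`, with an explicit path kept only when a specific snapshot has to be pinned.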