Shane committed
Commit: c96dbc6
Parent: 557b080

updated citations

Files changed (2):
  1. app.py +5 -5
  2. src/md.py +0 -2
app.py CHANGED

@@ -167,11 +167,11 @@ with gr.Blocks(css=custom_css) as app:
     with gr.Row():
         with gr.Accordion("📚 Citation", open=False):
             citation_button = gr.Textbox(
-                value=r"""@misc{RewardBench,
-title={RewardBench: Evaluating Reward Models for Language Modeling},
-author={Lambert, Nathan and Pyatkin, Valentina and Morrison, Jacob and Miranda, LJ and Lin, Bill Yuchen and Chandu, Khyathi and Dziri, Nouha and Kumar, Sachin and Zick, Tom and Choi, Yejin and Smith, Noah A. and Hajishirzi, Hannaneh},
-year={2024},
-howpublished={\url{https://huggingface.co/spaces/allenai/reward-bench}
+                value=r"""@article{lyu2024href,
+title={HREF: Human Response-Guided Evaluation of Instruction Following in Language Models},
+author={Xinxi Lyu and Yizhong Wang and Hannaneh Hajishirzi and Pradeep Dasigi},
+journal={arXiv preprint arXiv:2412.15524},
+year={2024}
 }""",
                 lines=7,
                 label="Copy the following to cite these results.",
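For context, the updated citation block can be exercised standalone. This is a minimal sketch assuming only the layout visible in the diff's context lines (Blocks > Row > Accordion > Textbox); the real app.py has more surrounding UI and custom CSS:

import gradio as gr

# Minimal sketch of the citation accordion from the diff above; the rest of
# the leaderboard UI is omitted. The layout is taken from the diff context,
# not from the full file.
citation = r"""@article{lyu2024href,
    title={HREF: Human Response-Guided Evaluation of Instruction Following in Language Models},
    author={Xinxi Lyu and Yizhong Wang and Hannaneh Hajishirzi and Pradeep Dasigi},
    journal={arXiv preprint arXiv:2412.15524},
    year={2024}
}"""

with gr.Blocks() as app:
    with gr.Row():
        with gr.Accordion("📚 Citation", open=False):  # collapsed by default
            citation_button = gr.Textbox(
                value=citation,
                lines=7,
                label="Copy the following to cite these results.",
            )

if __name__ == "__main__":
    app.launch()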
src/md.py CHANGED

@@ -23,8 +23,6 @@ For reproductability, we use greedy decoding for all model generation as default
 - **Large**: HREF has the largest evaluation size among similar benchmarks, making its evaluation more reliable.
 - **Contamination-resistant**: HREF's evaluation set is hidden and uses public models for both the baseline model and judge model, which makes it completely free of contamination.
 - **Task Oriented**: Instead of naturally collected instructions from the user, HREF contains instructions that are written specifically targetting 8 distinct categories that are used in instruction tuning, which allows it to provide more insights about how to improve language models.
-## Contact Us
-TODO
 """

 # Get Pacific time zone (handles PST/PDT automatically)
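The unchanged context line above references the Pacific-time handling in src/md.py. That implementation isn't shown in this diff; a minimal sketch, assuming the standard-library approach, of getting a timestamp that switches between PST and PDT automatically:

from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# Assumed sketch: ZoneInfo applies the correct PST/PDT offset for the date
# itself, so no manual daylight-saving logic is needed.
pacific = ZoneInfo("America/Los_Angeles")
now = datetime.now(pacific)
print(now.strftime("%Y-%m-%d %H:%M %Z"))  # e.g. "2024-12-20 09:30 PST"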