Avijit Ghosh committed · 87e696f
Parent(s): 2ee7c27

Added more fields
- app.py +26 -3
- configs/crowspairs.yaml +1 -0
- configs/honest.yaml +1 -0
- configs/ieat.yaml +1 -0
- configs/imagedataleak.yaml +1 -0
- configs/measuringforgetting.yaml +2 -1
- configs/notmyvoice.yaml +1 -0
- configs/palms.yaml +2 -1
- configs/safelatentdiff.yaml +2 -1
- configs/stablebias.yaml +1 -0
- configs/tango.yaml +2 -1
- configs/videodiversemisinfo.yaml +1 -0
- configs/weat.yaml +8 -22
app.py
CHANGED
@@ -70,11 +70,15 @@ def showmodal(evt: gr.SelectData):
     modal = Modal(visible=False)
     titlemd = gr.Markdown("",visible=False)
     authormd = gr.Markdown("",visible=False)
+    affiliationmd = gr.Markdown("",visible=False)
     tagsmd = gr.Markdown("",visible=False)
     abstractmd = gr.Markdown("",visible=False)
+    whatisbeingmd = gr.Markdown("",visible=False)
+    methodmd = gr.Markdown("",visible=False)
     considerationsmd = gr.Markdown("",visible=False)
     modelsmd = gr.Markdown("",visible=False)
     datasetmd = gr.Markdown("",visible=False)
+    metricsmd = gr.Markdown("",visible=False)
     gallery = gr.Gallery([],visible=False)
     if evt.index[1] == 4:
         modal = Modal(visible=True)
@@ -92,26 +96,42 @@ def showmodal(evt: gr.SelectData):
         modelstr = '### Applicable Models: '+''.join(['<span class="tag">'+model+'</span> ' for model in models])
         modelsmd = gr.Markdown(modelstr, visible=True)

+
         titlemd = gr.Markdown('# ['+itemdic['Link']+']('+itemdic['URL']+')',visible=True)

         if pd.notnull(itemdic['Authors']):
             authormd = gr.Markdown('## '+itemdic['Authors'],visible=True)
+
+        if pd.notnull(itemdic['Affiliations']):
+            affiliationmd = gr.Markdown('<strong>Affiliations: </strong>'+ itemdic['Affiliations'],visible=True)

         if pd.notnull(itemdic['Abstract']):
             abstractmd = gr.Markdown(itemdic['Abstract'],visible=True)

+        if pd.notnull(itemdic['What it is evaluating']):
+            whatisbeingmd = gr.Markdown('<strong>Concept being evaluated: </strong>'+ itemdic['What it is evaluating'],visible=True)
+
+        if pd.notnull(itemdic['Methodology']):
+            methodmd = gr.Markdown('<strong>Method of Evaluation: </strong>'+ itemdic['Methodology'],visible=True)
+
         if pd.notnull(itemdic['Considerations']):
             considerationsmd = gr.Markdown('<strong>Considerations: </strong>'+ itemdic['Considerations'],visible=True)

         if pd.notnull(itemdic['Datasets']):
             datasetmd = gr.Markdown('#### [Dataset]('+itemdic['Datasets']+')',visible=True)

+        metrics = itemdic['Metrics']
+        if isinstance(metrics, list):
+            if len(metrics) > 0:
+                metricstr = '### Metrics: '+''.join(['<span class="tag">'+metric+'</span> ' for metric in metrics])
+                metricsmd = gr.Markdown(metricstr, visible=True)
+
         screenshots = itemdic['Screenshots']
         if isinstance(screenshots, list):
             if len(screenshots) > 0:
                 gallery = gr.Gallery(screenshots, visible=True, height=500, object_fit="scale-down", interactive=False, show_share_button=False)

-    return [modal, titlemd, authormd, tagsmd, abstractmd, considerationsmd, modelsmd, datasetmd, gallery]
+    return [modal, titlemd, authormd, affiliationmd, tagsmd, abstractmd, whatisbeingmd, methodmd, considerationsmd, modelsmd, datasetmd, metricsmd, gallery]

 with gr.Blocks(title = "Social Impact Measurement V2", css=custom_css, theme=gr.themes.Base()) as demo: #theme=gr.themes.Soft(),
     # create tabs for the app, moving the current table to one titled "rewardbench" and the benchmark_text to a tab called "About"
@@ -160,11 +180,13 @@ The following categories are high-level, non-exhaustive, and present a synthesis
     with Modal(visible=False) as modal:
         titlemd = gr.Markdown(visible=False)
         authormd = gr.Markdown(visible=False)
+        affiliationmd = gr.Markdown(visible=False)
         tagsmd = gr.Markdown(visible=False)
         abstractmd = gr.Markdown(visible=False)
         gr.Markdown("""## Construct Validity<br>
 ### How well it measures the concept it was designed to evaluate""", visible=True)
-
+        whatisbeingmd = gr.Markdown(visible=False)
+        methodmd = gr.Markdown(visible=False)
         considerationsmd = gr.Markdown(visible=False)
         gr.Markdown("""## Resources<br>
 ### What you need to do this evaluation""", visible=True)
@@ -172,8 +194,9 @@ The following categories are high-level, non-exhaustive, and present a synthesis
         datasetmd = gr.Markdown(visible=False)
         gr.Markdown("""## Results<br>
 ### Available evaluation results""", visible=True)
+        metricsmd = gr.Markdown(visible=False)
         gallery = gr.Gallery(visible=False)
-    table_filtered.select(showmodal, None, [modal, titlemd, authormd, tagsmd, abstractmd, considerationsmd, modelsmd, datasetmd, gallery])
+    table_filtered.select(showmodal, None, [modal, titlemd, authormd, affiliationmd, tagsmd, abstractmd, whatisbeingmd, methodmd, considerationsmd, modelsmd, datasetmd, metricsmd, gallery])
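Both hunks follow one pattern: each new modal field starts as a hidden gr.Markdown placeholder and is populated and made visible only when the selected row carries that field. Scalar fields are guarded with pd.notnull, list fields (Metrics, Screenshots) with isinstance/len, since pd.notnull applied to a list returns an elementwise array rather than a single boolean. A minimal sketch of the two guard styles, with hypothetical itemdic values used purely for illustration:

```python
import pandas as pd

# Hypothetical row; real rows come from the parsed config YAMLs.
itemdic = {
    "Affiliations": "Princeton University, University of Bath",  # scalar field
    "Methodology": float("nan"),                                 # missing scalar
    "Metrics": ["Cosine Similarity", "Effect Size"],             # list field
}

# Scalar guard: pd.notnull is False for NaN, so missing fields stay hidden.
if pd.notnull(itemdic["Affiliations"]):
    print("<strong>Affiliations: </strong>" + itemdic["Affiliations"])

# List guard: pd.notnull on a list would return an elementwise array,
# so list-valued fields are checked with isinstance/len instead.
metrics = itemdic["Metrics"]
if isinstance(metrics, list) and len(metrics) > 0:
    print("### Metrics: " + "".join('<span class="tag">' + m + "</span> " for m in metrics))
```

Note that the list returned by showmodal and the outputs list passed to table_filtered.select must stay in the same order, which is why the two lines change in lockstep in this commit.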
configs/crowspairs.yaml
CHANGED
@@ -17,3 +17,4 @@ Suggested Evaluation: Crow-S Pairs
 Level: Dataset
 URL: https://arxiv.org/abs/2010.00133
 What it is evaluating: Protected class stereotypes
+Metrics: .nan
configs/honest.yaml
CHANGED
@@ -14,3 +14,4 @@ Suggested Evaluation: 'HONEST: Measuring Hurtful Sentence Completion in Language
 Level: Output
 URL: https://aclanthology.org/2021.naacl-main.191.pdf
 What it is evaluating: Protected class stereotypes and hurtful language
+Metrics: .nan
configs/ieat.yaml
CHANGED
@@ -15,3 +15,4 @@ Suggested Evaluation: Image Embedding Association Test (iEAT)
 Level: Model
 URL: https://dl.acm.org/doi/abs/10.1145/3442188.3445932
 What it is evaluating: Embedding associations
+Metrics: .nan
configs/imagedataleak.yaml
CHANGED
@@ -13,3 +13,4 @@ Suggested Evaluation: Dataset leakage and model leakage
 Level: Dataset
 URL: https://arxiv.org/abs/1811.08489
 What it is evaluating: Gender and label bias
+Metrics: .nan
configs/measuringforgetting.yaml
CHANGED
@@ -16,4 +16,5 @@ Screenshots:
 Suggested Evaluation: Measuring forgetting of training examples
 Level: Model
 URL: https://arxiv.org/pdf/2207.00099.pdf
-What it is evaluating: Measure whether models forget training examples over time, over different types of models (image, audio, text) and how order of training affects privacy attacks
+What it is evaluating: Measure whether models forget training examples over time, over different types of models (image, audio, text) and how order of training affects privacy attacks
+Metrics: .nan
configs/notmyvoice.yaml
CHANGED
@@ -14,3 +14,4 @@ Suggested Evaluation: Not My Voice! A Taxonomy of Ethical and Safety Harms of Sp
 Level: Taxonomy
 URL: https://arxiv.org/pdf/2402.01708.pdf
 What it is evaluating: Lists harms of audio/speech generators
+Metrics: .nan
configs/palms.yaml
CHANGED
@@ -11,4 +11,5 @@ Screenshots: .nan
 Suggested Evaluation: Human and Toxicity Evals of Cultural Value Categories
 Level: Output
 URL: http://arxiv.org/abs/2106.10328
-What it is evaluating: Adherence to defined norms for a set of cultural categories
+What it is evaluating: Adherence to defined norms for a set of cultural categories
+Metrics: .nan
configs/safelatentdiff.yaml
CHANGED
@@ -14,4 +14,5 @@ Screenshots:
 Suggested Evaluation: Evaluating text-to-image models for safety
 Level: Output
 URL: https://arxiv.org/pdf/2211.05105.pdf
-What it is evaluating: Generating images for diverse set of prompts (novel I2P benchmark) and investigating how often e.g. violent/nude images will be generated. There is a distinction between implicit and explicit safety, i.e. unsafe results with “normal” prompts.
+What it is evaluating: Generating images for diverse set of prompts (novel I2P benchmark) and investigating how often e.g. violent/nude images will be generated. There is a distinction between implicit and explicit safety, i.e. unsafe results with “normal” prompts.
+Metrics: .nan
configs/stablebias.yaml
CHANGED
@@ -12,3 +12,4 @@ Suggested Evaluation: Characterizing the variation in generated images
 Level: Output
 URL: https://arxiv.org/abs/2303.11408
 What it is evaluating: .nan
+Metrics: .nan
configs/tango.yaml
CHANGED
@@ -16,4 +16,5 @@ Screenshots:
 Suggested Evaluation: Human and Toxicity Evals of Cultural Value Categories
 Level: Output
 URL: http://arxiv.org/abs/2106.10328
-What it is evaluating: Bias measurement for trans and nonbinary community via measuring gender non-affirmative language, specifically 1) misgendering 2), negative responses to gender disclosure
+What it is evaluating: Bias measurement for trans and nonbinary community via measuring gender non-affirmative language, specifically 1) misgendering 2), negative responses to gender disclosure
+Metrics: .nan
configs/videodiversemisinfo.yaml
CHANGED
@@ -14,3 +14,4 @@ Level: Output
 URL: https://arxiv.org/abs/2210.10026
 What it is evaluating: Human led evaluations of deepfakes to understand susceptibility
 and representational harms (including political violence)
+Metrics: .nan
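Every config gains the same placeholder. The plain scalar .nan is YAML's spelling of a float NaN, so the pd.notnull guard in app.py treats the field as absent while the column still exists for every row. A quick check of this behavior, assuming PyYAML:

```python
import math
import yaml

# Mirrors the tail of configs/crowspairs.yaml after this commit.
doc = yaml.safe_load("""
Level: Dataset
URL: https://arxiv.org/abs/2010.00133
What it is evaluating: Protected class stereotypes
Metrics: .nan
""")

# .nan parses as a float NaN, not the string ".nan", so pd.notnull()
# (and math.isnan here) report the field as missing.
assert isinstance(doc["Metrics"], float) and math.isnan(doc["Metrics"])
```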
configs/weat.yaml
CHANGED
@@ -1,25 +1,6 @@
-Abstract: "Artificial intelligence and machine learning are in a
-
-
-  characterizes many human institutions. Here we show for the first time that human-like\
-  \ semantic biases result from the\napplication of standard machine learning to ordinary\
-  \ language\u2014the same sort of language humans are exposed to every\nday. We replicate\
-  \ a spectrum of standard human biases as exposed by the Implicit Association Test\
-  \ and other well-known\npsychological studies. We replicate these using a widely\
-  \ used, purely statistical machine-learning model\u2014namely, the GloVe\nword embedding\u2014\
-  trained on a corpus of text from the Web. Our results indicate that language itself\
-  \ contains recoverable and\naccurate imprints of our historic biases, whether these\
-  \ are morally neutral as towards insects or flowers, problematic as towards\nrace\
-  \ or gender, or even simply veridical, reflecting the status quo for the distribution\
-  \ of gender with respect to careers or first\nnames. These regularities are captured\
-  \ by machine learning along with the rest of semantics. In addition to our empirical\n\
-  findings concerning language, we also contribute new methods for evaluating bias\
-  \ in text, the Word Embedding Association\nTest (WEAT) and the Word Embedding Factual\
-  \ Association Test (WEFAT). Our results have implications not only for AI and\n\
-  machine learning, but also for the fields of psychology, sociology, and human ethics,\
-  \ since they raise the possibility that mere\nexposure to everyday language can\
-  \ account for the biases we replicate here."
-Applicable Models: .nan
+Abstract: "Artificial intelligence and machine learning are currently undergoing rapid growth. Concerns persist regarding their potential to perpetuate biases inherent in human language. This study demonstrates that standard machine learning applied to everyday language reproduces a range of human biases, from implicit associations to societal norms. Using the GloVe word embedding model trained on web text, the research reveals that language itself contains historical biases, whether neutral or problematic. New evaluation methods, WEAT and WEFAT, are introduced. These findings have broad implications for AI, psychology, sociology, and ethics, suggesting that biases may stem from everyday linguistic exposure."
+Applicable Models:
+- GloVe (Opensource access)
 Authors: Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan
 Considerations: Although based in human associations, general societal attitudes do
 not always represent subgroups of people and cultures.
@@ -40,3 +21,8 @@ Level: Model
 URL: https://researchportal.bath.ac.uk/en/publications/semantics-derived-automatically-from-language-corpora-necessarily
 What it is evaluating: Associations and word embeddings based on Implicit Associations
 Test (IAT)
+Metrics:
+- Cosine Similarity
+- Effect Size
+Affiliations: Princeton University, University of Bath
+Methodology: Effect sizes between two sets of target words (e.g., programmer, engineer, scientist, ... and nurse, teacher, librarian, ...) and two sets of attribute words (e.g., man, male, ... and woman, female ...) are calculated using cosine similarity of the embeddings, with the null hypothesis that an unbaised model would have no difference betwewen the sets.