GUIDELINES = """
# Contribution Guidelines  
  
The 💨Data Contamination Report is a community-driven project and we welcome contributions from everyone. The objective of this project is to provide a comprehensive list of data contamination cases, for both models and datasets. We aim to give the community a tool for avoiding the evaluation of
models on contaminated datasets. We also expect to generate a dataset that will help researchers
develop algorithms to automatically detect contaminated datasets in the future.
  
If you wish to contribute to the project by reporting a data contamination case, please open a pull request
in the [✋Community Tab](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report/discussions). Your pull request should edit the [contamination_report.csv](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report/blob/main/contamination_report.csv)
file and add a new row with the details of the contamination case. Please fill in the following template with the details of the contamination case. ***Pull requests that do not follow the template won't be accepted.***

# Template for reporting data contamination  
  
```markdown
## What are you reporting:  
- [ ] Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile) 
- [ ] Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)  

**Evaluation dataset(s)**: Name(s) of the evaluation dataset(s). If available in the HuggingFace Hub please write the path  (e.g. `uonlp/CulturaX`), otherwise provide a link to a paper, GitHub or dataset-card.  
  
**Contaminated model(s)**: Name of the model(s) (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please list the corresponding paths (e.g. `allenai/OLMo-7B`).  
  
**Contaminated corpora**: Name of the corpora used to pretrain models (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please write the path (e.g. `CohereForAI/aya_dataset`).

**Contaminated split(s)**: If the dataset has Train, Development and/or Test splits please report the contaminated split(s). You can report a percentage of the dataset contaminated. 

  
## Briefly describe your method to detect data contamination  
  
 - [ ] Data-based approach
 - [ ] Model-based approach  

Description of your method, 3-4 sentences. Evidence of data contamination (Read below):

#### Data-based approaches  
Data-based approaches identify evidence of data contamination in a pre-training corpus by directly examining the dataset for instances of the evaluation data. This method involves algorithmically searching through a large pre-training dataset to find occurrences of the evaluation data. You should provide evidence of data contamination in the form: "dataset X appears in line N of corpus Y," "dataset X appears N times in corpus Y," or "N examples from dataset X appear in corpus Y."   
  
#### Model-based approaches  
  
Model-based approaches, on the other hand, use heuristic algorithms to infer the presence of data contamination in a pre-trained model. These methods do not directly analyze the data but instead assess the model's behavior to predict data contamination. Examples include prompting the model to reproduce elements of an evaluation dataset to demonstrate memorization (e.g. https://hitz-zentroa.github.io/lm-contamination/blog/), or using perplexity measures to estimate data contamination. You should provide evidence of data contamination in the form of evaluation results of the algorithm from research papers, screenshots of model outputs that demonstrate memorization of a pre-training dataset, or any other form of evaluation that substantiates the method's effectiveness in detecting data contamination. You can provide a confidence score for your predictions.

## Citation  
  
Is there a paper that reports the data contamination or describes the method used to detect data contamination?  
  
URL: `https://aclanthology.org/2023.findings-emnlp.722/`
Citation: `@inproceedings{...`
```
---
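
As an illustration of the data-based approach described in the template above, here is a minimal sketch that counts how many evaluation examples appear verbatim in a corpus. The function names (`normalize`, `count_contaminated_examples`) and the lowercase and whitespace normalization are assumptions made for this example, not a method prescribed by this project. Real pre-training corpora are usually too large for an in-memory scan like this one.

```python
def normalize(text):
    # Lowercase and collapse whitespace so that formatting differences
    # do not hide an otherwise exact match (an illustrative choice).
    return " ".join(text.lower().split())


def count_contaminated_examples(eval_examples, corpus_documents):
    # Count the evaluation examples that occur verbatim, after
    # normalization, in at least one corpus document.
    corpus = [normalize(doc) for doc in corpus_documents]
    hits = 0
    for example in eval_examples:
        needle = normalize(example)
        # Skip empty examples: an empty string matches every document.
        if needle and any(needle in doc for doc in corpus):
            hits += 1
    return hits
```

The returned count maps directly onto evidence of the form "N examples from dataset X appear in corpus Y".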

### How to update the contamination_report.csv file

The [contamination_report.csv](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report/blob/main/contamination_report.csv) file is a CSV file with `;` delimiters. You will need to update the following columns:
- Evaluation Dataset: Name of the contaminated evaluation dataset. If available in the HuggingFace Hub please write the path (e.g. `uonlp/CulturaX`), otherwise provide the name of the dataset.
- Contaminated Source: Name of the model that has been trained with the evaluation dataset, or name of the pre-training corpus that contains the evaluation dataset. If available in the HuggingFace Hub please write the path (e.g. `allenai/OLMo-7B`), otherwise provide the name of the model/dataset.
- Train split: Percentage of the train split contaminated. 0 means no contamination. 1 means that the dataset has been fully contaminated. If the dataset doesn't have splits, you can treat the full dataset as a train or test split.
- Development split: Percentage of the development split contaminated. 0 means no contamination. 1 means that the dataset has been fully contaminated.
- Test split: Percentage of the test split contaminated. 0 means no contamination. 1 means that the dataset has been fully contaminated. If the dataset doesn't have splits, you can treat the full dataset as a train or test split.
- Approach: data-based or model-based approach. See above for more information.
- Citation: If there is a paper or any other resource describing how you detected this contamination case, provide the URL.
- PR Link: Leave it blank, we will update it after you create the Pull Request. 
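
If you prefer to add the row with a script instead of editing the file by hand, the following is a minimal sketch using Python's standard `csv` module. The helper name `add_contamination_row` is hypothetical. The sketch assumes the keys of `new_row` match the header of the real file exactly, so check the header before using it.

```python
import csv


def add_contamination_row(csv_path, new_row):
    # Read the existing report; the file uses ';' as its delimiter.
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter=";")
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)

    # Catch misspelled column names before touching the file.
    unknown = sorted(set(new_row) - set(fieldnames))
    if unknown:
        raise ValueError(f"Columns not in the report header: {unknown}")

    # Columns that are not supplied (such as the PR link) stay blank.
    rows.append({name: new_row.get(name, "") for name in fieldnames})

    # Rewrite the whole file with the same header and delimiter.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter=";")
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

A call might look like `add_contamination_row("contamination_report.csv", {"Evaluation Dataset": "org/eval-dataset", "Contaminated Source": "org/model", "Approach": "data-based"})`, where the dataset and model paths are placeholders and the column names are assumed to match the file's header.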
""".strip()


PANEL_MARKDOWN = """
# Data Contamination Report
The 💨Data Contamination Report aims to track instances of data contamination in pre-trained models and corpora. 
This effort is part of [The 1st Workshop on Data Contamination (CONDA)](https://conda-workshop.github.io/) that will be held at ACL 2024.
""".strip()