Spaces: evaluate-measurement/text_duplicates

lvwerra (HF staff) committed on May 27, 2022

Commit

ac8143c

1 Parent(s): 73d6ee6

Update Space (evaluate main: 1a95c8c2)

Files changed (4)

README.md +71 -6
app.py +6 -0
requirements.txt +2 -0
text_duplicates.py +82 -0

README.md CHANGED

@@ -1,12 +1,77 @@
 ---
-title: Text_duplicates
-emoji: 📈
-colorFrom: red
-colorTo: green
 sdk: gradio
-sdk_version: 3.0.6
 app_file: app.py
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference

 ---
+title: Text Duplicates
+emoji: 🤗
+colorFrom: green
+colorTo: purple
 sdk: gradio
+sdk_version: 3.0.2
 app_file: app.py
 pinned: false
+tags:
+- evaluate
+- measurement
 ---
+# Measurement Card for Text Duplicates
+## Measurement Description
+The `text_duplicates` measurement returns the fraction of duplicated strings in the input data.
+## How to Use
+This measurement requires a list of strings as input:
+```python
+>>> data = ["hello sun","hello moon", "hello sun"]
+>>> duplicates = evaluate.load("text_duplicates")
+>>> results = duplicates.compute(data=data)
+```
+### Inputs
+- **data** (list of `str`): The input list of strings for which the duplicates are calculated.
+### Output Values
+- **duplicate_fraction** (`float`): the fraction of duplicates in the input string(s).
+- **duplicates_list** (`dict`): (optional) a dictionary mapping each duplicated string to the number of times it appears.
+By default, this measurement outputs a dictionary containing the fraction of duplicates in the input string(s) (`duplicate_fraction`):
+```python
+{'duplicate_fraction': 0.33333333333333337}
+```
+With the `list_duplicates=True` option, this measurement also outputs a dictionary mapping each duplicated string to its count.
+```python
+{'duplicate_fraction': 0.33333333333333337, 'duplicates_list': {'hello sun': 2}}
+```
+Warning: the `list_duplicates=True` option can be memory-intensive for large datasets.
+### Examples
+Example with no duplicates:
+```python
+>>> data = ["foo", "bar", "foobar"]
+>>> duplicates = evaluate.load("text_duplicates")
+>>> results = duplicates.compute(data=data)
+>>> print(results)
+{'duplicate_fraction': 0.0}
+```
+Example with multiple duplicates and `list_duplicates=True`:
+```python
+>>> data = ["hello sun", "goodbye moon", "hello sun", "foo bar", "foo bar"]
+>>> duplicates = evaluate.load("text_duplicates")
+>>> results = duplicates.compute(data=data, list_duplicates=True)
+>>> print(results)
+{'duplicate_fraction': 0.4, 'duplicates_list': {'hello sun': 2, 'foo bar': 2}}
+```
+## Citation(s)
+## Further References
+- [`hashlib` library](https://docs.python.org/3/library/hashlib.html)
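
Reviewer note: the fraction in the README examples can be reproduced without loading the module. The following is a minimal sketch, assuming the same approach as `text_duplicates.py` in this commit (an MD5 hash of each stripped string, then one minus the share of distinct hashes). `duplicate_fraction` here is a hypothetical standalone helper, not part of the `evaluate` API.

```python
import hashlib


def duplicate_fraction(data):
    """Fraction of strings that repeat another string (hypothetical helper)."""
    # Hash the stripped string, mirroring get_hash() in text_duplicates.py.
    hashes = {hashlib.md5(s.strip().encode("utf-8")).hexdigest() for s in data}
    # Distinct hashes over total strings is the unique share; the rest are repeats.
    return 1 - len(hashes) / len(data)


# Three strings, two distinct: 1 - 2/3
print(duplicate_fraction(["hello sun", "hello moon", "hello sun"]))
# -> 0.33333333333333337
```

The trailing `7` is ordinary floating-point rounding of `1 - 2/3`, which is why the README shows `0.33333333333333337` rather than `0.3333333333333333`.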

app.py ADDED

	@@ -0,0 +1,6 @@

+import evaluate
+from evaluate.utils import launch_gradio_widget
+module = evaluate.load("text_duplicates", type="measurement")
+launch_gradio_widget(module)

requirements.txt ADDED

	@@ -0,0 +1,2 @@


+git+https://github.com/huggingface/evaluate.git@main
+datasets~=2.0

text_duplicates.py ADDED

	@@ -0,0 +1,82 @@

+# Copyright 2022 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import evaluate
+import datasets
+from collections import Counter
+import hashlib
+logger = evaluate.logging.get_logger(__name__)
+_DESCRIPTION = """
+Returns the fraction of duplicate strings in the input and, optionally, the duplicate strings with their counts.
+"""
+_KWARGS_DESCRIPTION = """
+Args:
+    `data`: a list of `str` to be checked for duplicates.
+Returns:
+    `duplicate_fraction` (`float`) : the fraction of strings that are duplicated.
+    `duplicates_list` (`dict`) (optional) : a dictionary mapping each duplicated string to the number of times it appears.
+Examples:
+    >>> data = ["hello sun","hello moon", "hello sun"]
+    >>> duplicates = evaluate.load("text_duplicates")
+    >>> results = duplicates.compute(data=data)
+    >>> print(results)
+    {'duplicate_fraction': 0.33333333333333337}
+    >>> data = ["hello sun","hello moon", "hello sun"]
+    >>> duplicates = evaluate.load("text_duplicates")
+    >>> results = duplicates.compute(data=data, list_duplicates=True)
+    >>> print(results)
+    {'duplicate_fraction': 0.33333333333333337, 'duplicates_list': {'hello sun': 2}}
+"""
+# TODO: Add BibTeX citation
+_CITATION = ""
+def get_hash(example):
+    """Get the hash of a string"""
+    return hashlib.md5(example.strip().encode("utf-8")).hexdigest()
+@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+class TextDuplicates(evaluate.EvaluationModule):
+    """This measurement returns the duplicate strings contained in the input(s)."""
+    def _info(self):
+        # TODO: Specifies the evaluate.EvaluationModuleInfo object
+        return evaluate.EvaluationModuleInfo(
+            # This is the description that will appear on the modules page.
+            module_type="measurement",
+            description=_DESCRIPTION,
+            citation=_CITATION,
+            inputs_description=_KWARGS_DESCRIPTION,
+            # This defines the format of each input example
+            features=datasets.Features({
+                'data': datasets.Value('string'),
+            })
+        )
+    def _compute(self, data, list_duplicates=False):
+        """Returns the fraction of duplicated strings and, optionally, the duplicates with their counts."""
+        # Count distinct strings by the MD5 hash of their stripped text.
+        n_dedup = len({get_hash(d) for d in data})
+        results = {"duplicate_fraction": 1 - (n_dedup / len(data))}
+        if list_duplicates:
+            logger.warning("This functionality can be memory-intensive for large datasets!")
+            # Keep only the strings that occur more than once.
+            results["duplicates_list"] = {k: v for k, v in Counter(data).items() if v > 1}
+        return results
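
Reviewer note: in `_compute` above, `duplicate_fraction` is computed over stripped strings (through `get_hash`), while `Counter(data)` counts the raw strings. Inputs that differ only in surrounding whitespace therefore raise the fraction without appearing in `duplicates_list`, and an empty `data` list raises `ZeroDivisionError`. The following is a sketch of one possible variant that uses a single normalization for both outputs. `count_duplicates` is a hypothetical standalone function, and returning `0.0` for empty input is an assumption, not behavior from this commit.

```python
from collections import Counter


def count_duplicates(data, list_duplicates=False):
    """Hypothetical variant of _compute with one normalization for both outputs."""
    if not data:
        # Assumption: an empty input has no duplicates. The committed code
        # would raise ZeroDivisionError here.
        result = {"duplicate_fraction": 0.0}
        if list_duplicates:
            result["duplicates_list"] = {}
        return result
    # Strip once, so the fraction and the listing agree on what "same string" means.
    counts = Counter(s.strip() for s in data)
    result = {"duplicate_fraction": 1 - len(counts) / len(data)}
    if list_duplicates:
        result["duplicates_list"] = {k: v for k, v in counts.items() if v > 1}
    return result


print(count_duplicates(["hello sun", " hello sun", "hello moon"], list_duplicates=True))
# -> {'duplicate_fraction': 0.33333333333333337, 'duplicates_list': {'hello sun': 2}}
```

The trade-off is memory: this sketch always holds the stripped strings in a `Counter`, whereas the committed default path only keeps a set of fixed-size MD5 digests.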