---
title: codebleu
tags:
- evaluate
- metric
- code
- codebleu
description: "Unofficial `CodeBLEU` implementation that supports Linux, MacOS and Windows."
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
---
# Metric Card for codebleu
This repository contains an unofficial `CodeBLEU` implementation that supports `Linux`, `macOS`, and `Windows`. It is available through `PyPI` and the `evaluate` library.
Available for: `Python`, `C`, `C#`, `C++`, `Java`, `JavaScript`, `PHP`, `Go`, `Ruby`.
---
The code is based on the original [CodeXGLUE/CodeBLEU](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans/evaluator/CodeBLEU) implementation and the updated version from [XLCoST/CodeBLEU](https://github.com/reddy-lab-code-research/XLCoST/tree/main/code/translation/evaluator/CodeBLEU). It has been refactored, tested, built for macOS and Windows, and improved in several ways to enhance usability.
## Metric Description
> An ideal evaluation metric should consider the grammatical correctness and the logic correctness.
> We propose weighted n-gram match and syntactic AST match to measure grammatical correctness, and introduce semantic data-flow match to calculate logic correctness.
> ![CodeBLEU](CodeBLEU.jpg)
(from the [CodeXGLUE](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans/evaluator/CodeBLEU) repository)
In a nutshell, `CodeBLEU` is a weighted combination of `n-gram match (BLEU)`, `weighted n-gram match (BLEU-weighted)`, `AST match` and `data-flow match` scores.
The metric has shown higher correlation with human evaluation than `BLEU` and `accuracy` metrics.
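With the default weights, the final score is simply the average of the four component scores. A minimal sketch of the weighted combination, using the component scores from the pip example further below:
```python
# Weighted combination of the four component scores (default weights);
# the component values are taken from the example output further below.
weights = (0.25, 0.25, 0.25, 0.25)
scores = (0.1041, 0.1109, 1.0, 1.0)  # ngram, weighted ngram, AST, data-flow
codebleu_score = sum(w * s for w, s in zip(weights, scores))
print(codebleu_score)  # ~0.5537
```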
## How to Use
### Inputs
- `references` (`list[str]` or `list[list[str]]`): reference code (one or several references per prediction)
- `predictions` (`list[str]`): predicted code
- `lang` (`str`): code language; see `codebleu.AVAILABLE_LANGS` for the supported languages (`python`, `c_sharp`, `c`, `cpp`, `javascript`, `java`, `php`, `go`, and `ruby` at the moment)
- `weights` (`tuple[float,float,float,float]`): weights of the `ngram_match`, `weighted_ngram_match`, `syntax_match`, and `dataflow_match` scores, respectively; defaults to `(0.25, 0.25, 0.25, 0.25)`
- `tokenizer` (`callable`): splits a code string into tokens; defaults to `s.split()` (see the sketch after this list)
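As a rough sketch of the non-default parameters (the `whitespace_tokenizer` below is an illustrative stand-in for any custom tokenizer, and the weight values are arbitrary, not a recommendation):
```python
from codebleu import AVAILABLE_LANGS, calc_codebleu

print(AVAILABLE_LANGS)  # list of supported language identifiers

# Hypothetical tokenizer that simply splits on whitespace; any callable
# returning a list of tokens can be passed instead.
def whitespace_tokenizer(code):
    return code.split()

# Non-default weights emphasizing syntax and data-flow matches.
result = calc_codebleu(
    ["def sum ( a , b ) :\n return a + b"],
    ["def add ( a , b ) :\n return a + b"],
    lang="python",
    weights=(0.1, 0.1, 0.4, 0.4),
    tokenizer=whitespace_tokenizer,
)
```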
### Output Values
The metric outputs a `dict[str, float]` with the following fields:
- `codebleu`: the final `CodeBLEU` score
- `ngram_match_score`: `ngram_match` score (BLEU)
- `weighted_ngram_match_score`: `weighted_ngram_match` score (BLEU-weighted)
- `syntax_match_score`: `syntax_match` score (AST match)
- `dataflow_match_score`: `dataflow_match` score
Each of the scores is in the range `[0, 1]`, where `1` is the best score.
### Examples
Using pip package (`pip install codebleu`):
```python
from codebleu import calc_codebleu

prediction = "def add ( a , b ) :\n return a + b"
reference = "def sum ( first , second ) :\n return second + first"

result = calc_codebleu([reference], [prediction], lang="python", weights=(0.25, 0.25, 0.25, 0.25), tokenizer=None)
print(result)
# {
#   'codebleu': 0.5537,
#   'ngram_match_score': 0.1041,
#   'weighted_ngram_match_score': 0.1109,
#   'syntax_match_score': 1.0,
#   'dataflow_match_score': 1.0
# }
```
Or using the `evaluate` library (the `codebleu` package is still required):
```python
import evaluate
metric = evaluate.load("k4black/codebleu")
prediction = "def add ( a , b ) :\n return a + b"
reference = "def sum ( first , second ) :\n return second + first"
result = metric.compute(references=[reference], predictions=[prediction], lang="python", weights=(0.25, 0.25, 0.25, 0.25), tokenizer=None)
```
Note: the `lang` argument is required.
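References can also be passed as `list[list[str]]` when several references exist per prediction. A minimal sketch (the second reference string below is a hypothetical variant added for illustration, not from the example above):
```python
from codebleu import calc_codebleu

# Two alternative references for a single prediction; the second one
# is an illustrative variant.
references = [[
    "def sum ( first , second ) :\n return second + first",
    "def total ( a , b ) :\n return a + b",
]]
predictions = ["def add ( a , b ) :\n return a + b"]

result = calc_codebleu(references, predictions, lang="python")
```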
## Limitations and Bias
[//]: # (*Note any known limitations or biases that the metric has, with links and references if possible.*)
This library requires compiling a tree-sitter shared library (`.so` file), so it is platform dependent.
It is currently available for `Linux` (manylinux), `macOS`, and `Windows` with Python 3.8+.
## Citation
```bibtex
@misc{ren2020codebleu,
title={CodeBLEU: a Method for Automatic Evaluation of Code Synthesis},
author={Shuo Ren and Daya Guo and Shuai Lu and Long Zhou and Shujie Liu and Duyu Tang and Neel Sundaresan and Ming Zhou and Ambrosio Blanco and Shuai Ma},
year={2020},
eprint={2009.10297},
archivePrefix={arXiv},
primaryClass={cs.SE}
}
```
## Further References
The source code is available in the [k4black/codebleu](https://github.com/k4black/codebleu) repository on GitHub.