Merge branch 'bigcode-project:main' into integrate_multipl-e
loubnabnl authored Feb 27, 2023
2 parents bd8252f + b8a6d04 commit 71c34cd
Showing 3 changed files with 37 additions and 39 deletions.
16 changes: 10 additions & 6 deletions README.md
@@ -69,28 +69,32 @@ accelerate launch main.py \
--do_sample True \
--n_samples 100 \
--batch_size 10 \
--allow_code_execution=False
--allow_code_execution \
--save_generations
```
* `limit` represents the number of problems to solve; if it's not provided, all problems in the benchmark are selected.
* `allow_code_execution` is for executing the generated code: read the displayed warning before setting it to `True`.
* Some models with custom code on the HF hub like [SantaCoder](https://huggingface.co/bigcode/santacoder) require adding `--trust_remote_code True`
* `allow_code_execution` is for executing the generated code: it is off by default; read the displayed warning before passing this flag to enable execution.
* Some models with custom code on the HF hub like [SantaCoder](https://huggingface.co/bigcode/santacoder) require passing `--trust_remote_code`; for private models, add `--use_auth_token` (see the sketch below).
* `save_generations` saves the post-processed generations in a json file. You can also save references by passing `--save_references`.
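As an illustration, here is a hedged sketch of how these flags might be combined for a model with custom code that also requires authentication (the task choice and token setup are illustrative assumptions, not part of this commit):

```bash
# Sketch: evaluate a custom-code model on HumanEval; assumes a prior `huggingface-cli login`.
accelerate launch main.py \
  --model bigcode/santacoder \
  --tasks humaneval \
  --temperature 0.2 \
  --n_samples 100 \
  --batch_size 10 \
  --trust_remote_code \
  --use_auth_token \
  --allow_code_execution \
  --save_generations
```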

Some tasks don't require code execution, such as
`codexglue_code_to_text-<LANGUAGE>`/`codexglue_code_to_text-python-left`/`conala`/`concode`, which use BLEU evaluation. In addition, we generate one candidate solution for each problem in these tasks, so use `n_samples=1` and `batch_size=1`. (Note that `batch_size` should always be less than or equal to `n_samples`.) A sketch of such a run follows the APPS note below.
* For APPS tasks, you can use `n_samples=1` for strict and average accuracies (from the original APPS paper) and `n_samples>1` for pass@k.
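For instance, a minimal sketch of a single-candidate run on a BLEU-evaluated task (the model name and generation settings are placeholders):

```bash
# Sketch: one greedy candidate per problem for a BLEU-evaluated task; no code execution needed.
accelerate launch main.py \
  --model codeparrot/codeparrot-small \
  --tasks conala \
  --max_length_generation 512 \
  --do_sample False \
  --n_samples 1 \
  --batch_size 1 \
  --save_generations
```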

### Generation only

If you want to generate solutions without executing and evaluating the code, set `generation_only` to True, in addition to the instructions above. This will save the solutions in a json file in the working directory.
If you want to generate solutions without executing and evaluating the code, pass `--generation_only` in addition to the instructions above. This will save the solutions in a json file in the working directory.

This can be useful if you don't want to execute code on the machine you're using for generation, for security or efficiency reasons. For instance, you can run the generations on multiple GPUs, then switch to a CPU machine with multiple workers for the execution, which can save money and time.
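A hedged sketch of such a generation-only run (the model name is a placeholder; since no code is executed, `--allow_code_execution` is not needed):

```bash
# Sketch: generate and save solutions only; evaluation can be run later on another machine.
accelerate launch main.py \
  --model bigcode/santacoder \
  --tasks mbpp \
  --temperature 0.1 \
  --n_samples 15 \
  --batch_size 10 \
  --trust_remote_code \
  --generation_only \
  --save_generations
```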

### Evaluation only

If you already have the generations in a json file from this evaluation harness and want to evaluate them, specify the path of the generations via the `generation_path` argument. You may need to reconfigure `accelerate` to use multiple CPUs. For this mode you can also find an example of setup instructions in `evaluation_setup.sh`.
If you already have the generations in a json file from this evaluation harness and want to evaluate them, specify the path of the generations via the `generations_path` argument. You may need to reconfigure `accelerate` to use multiple CPUs. For this mode, you can also find an example of setup instructions in `evaluation_setup.sh`.

Below is an example; be mindful of specifying the arguments proper to the task you are evaluating on, and note that the `model` value here only serves to document the experiment.

```bash
accelerate launch main.py --tasks mbpp --allow_code_execution=True --generations_path generations.json --model incoder-temperature-08
accelerate launch main.py --tasks mbpp --allow_code_execution --generations_path generations.json --model incoder-temperature-08
```

## Implementing new tasks
39 changes: 19 additions & 20 deletions docs/README.md
@@ -11,19 +11,19 @@
# Documentation

Here we document the tasks available in this benchmark. Code generation models, just like natural language models, can
be evaluated using match-based metrics such as BLEU score. However these metrics fail in capturing the syntactic and
be evaluated using match-based metrics such as BLEU score. However, these metrics fail in capturing the syntactic and
semantic features of code. A more appropriate way to evaluate these models is functional correctness, where a solution
is considered correct if it passes some unit tests, a poplular metric for this is `pass@k`.
is considered correct if it passes some unit tests; a popular metric for this is `pass@k`.
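For reference, `pass@k` is commonly computed with the unbiased estimator from the Codex paper (Chen et al., 2021), where $n$ is the number of generated samples per problem and $c$ is the number of samples that pass the unit tests (stated here as background; the exact implementation in the harness may differ):

$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$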

In this evaluation harness we include tasks with unit tests, but also some tasks with BLEU evaluation, due to the scarcity and evaluation cost of the first type.
In this evaluation harness, we include tasks with unit tests, but also some tasks with BLEU evaluation, due to the scarcity and evaluation cost of the first type.

Before diving into the tasks, here are some instructions that apply to all the benchmarks:
* Adapt `max_length_generation` based on your model's context size and the task; by default it is 512. This value is enough for tasks like HumanEval and MBPP, but some tasks such as APPS require a larger value because the prompts are long; you can use the model's full context size.
* Adapt `batch_size` based on your device memory and `n_samples`; by default it is 1. It should be less than or equal to `n_samples`, but for multiple generations per problem, the larger the batch size the better, since it makes generation faster.
* `allow_code_execution` allows the execution of the model generated (unstrusted) code on your machine, please read carefully the displayed warning before setting it to `True`.
* `allow_code_execution` allows the execution of the model-generated (untrusted) code on your machine; please read the displayed warning carefully before passing this flag (it is off by default).
* You can adapt the text generation parameters by changing the `do_sample`, `top_p` and `temperature` arguments.
* Some models, such as [InCoder](https://huggingface.co/facebook/incoder-6B), might require adding a prefix before the prompt to give a hint about the language. To add the prefix for InCoder to indicate Python, for example, set the `prefix` argument to `"<| file ext=.py |>\n"` (see the sketch after this list).
* The generations are saved with `save_generations` that is set to True, you can visualize the postprocessed model generations used for the evaluaion. You also have the option of saving the references, it can be useful for tasks that use BLEU score and actual solutions as references, just set `save_references` to True.
* The generations are saved when `save_generations` is passed, so you can inspect the post-processed model generations used for the evaluation. You also have the option of saving the references, which can be useful for tasks that use BLEU score and actual solutions as references; just pass `--save_references`.
* For experimenting, you can use the `limit` argument to evaluate on a subset of problems instead of the whole test set; try a number that is proportional to your number of devices.
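As an illustration, a hedged sketch pulling these settings together for InCoder (the task, sampling values and problem subset are assumptions meant to show the knobs, not recommended settings):

```bash
# Sketch: sampling-based generation with a language-hint prefix and a small problem subset.
accelerate launch main.py \
  --model facebook/incoder-6B \
  --tasks humaneval \
  --prefix "<| file ext=.py |>\n" \
  --max_length_generation 512 \
  --do_sample True \
  --temperature 0.2 \
  --top_p 0.95 \
  --n_samples 100 \
  --batch_size 10 \
  --limit 8 \
  --allow_code_execution \
  --save_generations
```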

## Code generation benchmarks with unit tests
@@ -44,15 +44,14 @@ accelerate launch main.py \
--temperature 0.2 \
--n_samples 200 \
--batch_size 10 \
--allow_code_execution=False
--allow_code_execution
```

If you want to evaluate only on the first $n$ problems instead of the whole test set, set the `limit` argument to $n$.
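For example, a hedged sketch restricting the run above to the first 100 problems (the model name is a placeholder):

```bash
# Sketch: same HumanEval settings as above, limited to the first 100 problems.
accelerate launch main.py \
  --model bigcode/santacoder \
  --tasks humaneval \
  --temperature 0.2 \
  --n_samples 200 \
  --batch_size 10 \
  --limit 100 \
  --trust_remote_code \
  --allow_code_execution
```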

### MBPP
[MBPP](https://huggingface.co/datasets/mbpp): consists of around 1,000 crowd-sourced Python programming problems,
designed to be solvable by entry level programmers. Each problem consists of a task description in English, code solution
and 3 automated test cases. We evaluate on the test set of samples from index 11 to 511.
designed to be solvable by entry-level programmers. Each problem consists of a task description in English, a code solution and 3 automated test cases. We evaluate on the test set of samples from index 11 to 511.

* Prompts and generation: We use a few-shot setting with an InCoder-style prompt: we feed the prompt to the model as a docstring and only include one solution, to help the model catch the function name, which is required in the unit tests.
```python
@@ -71,7 +70,7 @@ accelerate launch main.py \
--temperature 0.1 \
--n_samples 15 \
--batch_size 10 \
--allow_code_execution=False \
--allow_code_execution
```

Low temperatures generally work better for small $k$ in pass@k.
@@ -80,11 +79,11 @@ Low temperatures generally work better for small $k$ in pass@k.
[APPS](https://huggingface.co/datasets/codeparrot/apps): is a challenging benchmark for code generation with 10,000 Python problems,
5,000 for training and 5,000 for evaluation. It has three difficulty levels: introductory, interview and competition.
Most papers fine-tune models on the training split before the evaluation, since the problems are challenging and the problem descriptions are long.
However, Chen et al. evaluated Codex-12B in a one-shot setting, althought the details about the prompot format aren't given we propose two evaluation modes:
However, Chen et al. evaluated Codex-12B in a one-shot setting; although the details about the prompt format aren't given, we propose two evaluation modes:
with fine-tuning and in a one-shot setting:
* Prompts & generation

**1- Fine-tuning:** we provide the code to fine tune autioregressive model on this dataset in
**1- Fine-tuning:** we provide the code to fine-tune an autoregressive model on this dataset in
[`finetuning/APPS`](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/finetuning/APPS). To evaluate a fine-tuned model,
we use a prompt format similar to the original paper of Hendrycks et al. There are two types of calls based on whether the function name is provided for the sample or not.

@@ -140,14 +139,14 @@ accelerate launch main.py \
--n_samples 1 \
--temperature 0.1 \
--batch_size 1 \
--allow_code_execution=False
--allow_code_execution
```
We expect a model [finetuned](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/finetuning/APPS) on the train split of APPS.
TODO: add fewshot setup for APPS.
TODO: add few-shot setup for APPS.

## Code generation benchmarks without unit tests

For these tasks, we do single generations and compare the generated code againt reference solutions and compute BLEU score. For the following tasks, we use a two-shot setting where we include 2 inputs and their solutions in the prompt, all preceded by an instruction such as: ` "Answer the following instructions in a one line SQL query:\n"`. The solutions consist of one line so we stop the generation when a new line is generated. 3 languages are present: Python, SQL and Java.
For these tasks, we do single generations, compare the generated code against reference solutions, and compute the BLEU score. For the following tasks, we use a two-shot setting where we include 2 inputs and their solutions in the prompt, all preceded by an instruction such as: ` "Answer the following instructions in a one line SQL query:\n"`. The solutions consist of one line, so we stop the generation when a new line is generated. Three languages are present: Python, SQL and Java.

- [CoNaLa](https://huggingface.co/datasets/neulab/conala) for Python code generation; it has 500 tasks in the test set.
- [Spider](https://huggingface.co/datasets/spider) for SQL code generation; it has 1,034 tasks in the test set.
@@ -164,19 +163,19 @@ accelerate launch main.py \
--temperature 0.1 \
--batch_size 1
```
If you ever get index out of range errors try using a number of problems `limit` that is proportional to the number of devices you are using.
If you ever get index out-of-range errors, try setting `limit` to a number of problems that is proportional to the number of devices you are using.
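For instance, a hedged sketch on a machine with 8 processes, choosing a `limit` that splits evenly across devices (the task, model and exact value are placeholders):

```bash
# Sketch: restrict the number of problems so it divides evenly across 8 processes.
accelerate launch main.py \
  --model codeparrot/codeparrot-small \
  --tasks conala \
  --max_length_generation 512 \
  --n_samples 1 \
  --batch_size 1 \
  --limit 496
```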

## Documentation generation task
Code to text task from [CodeXGLUE](https://huggingface.co/datasets/code_x_glue_ct_code_to_text): is a benchmark for english documentation generation from for 6 programming languages: Python, Go, Ruby, Java, JavaScript and PHP.
Code-to-text task from [CodeXGLUE](https://huggingface.co/datasets/code_x_glue_ct_code_to_text): a benchmark for English documentation generation from code in 6 programming languages: Python, Go, Ruby, Java, JavaScript and PHP.

For Python: we do the evaluation in a zero-shot setting. We have two options:
* in the first one: we give as a prompt the function signature, that we extract it by splitting at the beginning of the docstring. This task is `codexglue_code_to_text-python-left`.
For Python: we evaluate in a zero-shot setting. We have two options:
* in the first one, we give the function signature as the prompt, which we extract by splitting at the beginning of the docstring. This task is `codexglue_code_to_text-python-left`.
* in the second one, we include the full function body (without the docstring) and add this sentence at the end of the prompt: `'\n"""The goal of this function is to:\n'`. This task is `codexglue_code_to_text-python`.
We retrieve the reference solutions from the docstring tokens, similarly to InCoder's approach, since the target docstrings in the dataset include extra context such as argument definitions. We only keep one line in the model generation.

For the other languages (task `codexglue_code_to_text-<language>`): the docstring is not included in the code, so we currently don't extract signatures; we use the full function body followed by a comment in that language saying `\n=begin The goal of this function is to:\n` for Ruby, and `\n/* The goal of this function is to:\n` for the rest. This task is still not well tested; please report any bugs you might find.

For this task we advise using greedy generation. For evaluation we compute the BLEU score.
For this task, we advise using greedy generation. For evaluation, we compute the BLEU score.

Below are the commands to run the evaluation:
```python
@@ -198,4 +197,4 @@ These are classification tasks for Java and C, we provide the code to finetune m

## How to add a new benchmark

We welcome contribution to add new code benchmarks to this evaluation harness. You can find a step by step guide in [`guide.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/guide.md).
We welcome contributions to add new code benchmarks to this evaluation harness. You can find a step-by-step guide in [`guide.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/guide.md).
21 changes: 8 additions & 13 deletions main.py
@@ -43,12 +43,12 @@ def parse_args():
)
parser.add_argument(
"--use_auth_token",
default=False,
action="store_true",
help="Use the token generated when running `huggingface-cli login` (necessary for private model).",
)
parser.add_argument(
"--trust_remote_code",
default=False,
action="store_true",
help="Use a model with custom code, this requires executing code by the author of the model.",
)
parser.add_argument(
@@ -77,20 +77,17 @@ def parse_args():
)
parser.add_argument(
"--postprocess",
type=bool,
default=True,
help="Postprocess model outputs before execution, only off during generation tests",
action="store_false",
help="Postprocess model outputs before execution, always on except during generation tests",
)
parser.add_argument(
"--allow_code_execution",
type=bool,
default=False,
action="store_true",
help="Allow code evaluation to execute external/untrusted Python code on your machine",
)
parser.add_argument(
"--generation_only",
type=bool,
default=False,
action="store_true",
help="Do code generation but no evaluation",
)
parser.add_argument(
@@ -107,14 +104,12 @@
)
parser.add_argument(
"--save_generations",
type=bool,
default=True,
action="store_true",
help="Whether to save code generations",
)
parser.add_argument(
"--save_references",
type=bool,
default=False,
action="store_true",
help="Whether to save reference solutions/tests",
)
return parser.parse_args()
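With these `argparse` changes, the boolean options become plain switches whose presence enables them. A hedged sketch of the resulting command-line semantics (the task and model are taken from the README example above; this is an illustration, not part of the commit):

```bash
# Boolean options are now plain switches: passing them enables them, omitting them keeps the default.
# Previously they took explicit values, e.g. --allow_code_execution=False.
accelerate launch main.py \
  --model incoder-temperature-08 \
  --tasks mbpp \
  --allow_code_execution \
  --save_generations

# Note: --postprocess uses store_false, so passing --postprocess *disables* post-processing
# (only meant for generation tests); omit it to keep post-processing enabled.
```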
