
Commit

Initial commit
jiacheng-ye committed Feb 14, 2023
1 parent 6464c86 commit 8f2494b
Showing 3 changed files with 26 additions and 12 deletions.
19 changes: 8 additions & 11 deletions README.md
@@ -21,22 +21,19 @@ The black-box LM is frozen during the whole procedure.
All required packages can be found in ``requirements.txt``.
You can install them in a new environment with
```shell
conda env create -n icl python=3.7
conda create -n icl python=3.7
conda activate icl

git clone [email protected]:HKUNLP/icl-ceil.git
pip install -r requirements.txt
#[Optional] If you want to experiment on the Break dataset with the LF-EM evaluation metric, clone recursively with the following command to include third-party dependencies:
#git clone --recurse-submodules [email protected]:HKUNLP/HKUNLP.git

# Replace the following line depending on your CUDA version.
pip install torch==1.10.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
```

Optional: If you want to experiment on the Break dataset with the LF-EM evaluation metric, you have to clone recursively with the following command to include third-party dependencies:
```shell
git clone --recurse-submodules [email protected]:HKUNLP/HKUNLP.git
```

Activate the environment by running
```shell
conda activate icl
cd icl-ceil
pip install -r requirements.txt
# If you don't want to use the OpenAI API, just comment out the `openai` package in `requirements.txt`.
```

Set up WandB to track training status for `EPR` and `CEIL` in `scripts/run_epr.sh` and `scripts/run_dpp_epr.sh`:
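A minimal sketch of what that setup might look like, assuming the scripts read the standard `WANDB_*` environment variables; the actual mechanism inside the scripts may differ:
```shell
# Hypothetical values; replace with your own account details.
export WANDB_API_KEY=<your_api_key>
export WANDB_PROJECT=icl-ceil
export WANDB_ENTITY=<your_username>
```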
2 changes: 1 addition & 1 deletion configs/bm25_retriever.yaml
@@ -13,7 +13,7 @@ ds_size: null # number of instances used for the dataset, 'null' re…
index_reader:
  _target_: src.dataset_readers.index_dsr.IndexDatasetReader
  task_name: ${task_name}
  model_name: 'bert-base-uncased'
  model_name: 'bert-base-uncased' # only used for the tokenizer when deduplicating the index dataset
  field: q # the field of the index dataset used for retrieval
  dataset_split: train # the dataset split to use if the file in `dataset_path` does not exist
  dataset_path: null # if a dataset_path (json file) is provided, that file will be loaded as the index dataset
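To make the new `model_name` comment concrete, here is a hypothetical sketch of tokenizer-based deduplication, assuming entries whose tokenized `field` values collide are dropped; `deduplicate` is an illustrative helper, not the actual `IndexDatasetReader` implementation:
```python
# Hypothetical sketch, not the repo's actual dedup code.
from transformers import AutoTokenizer


def deduplicate(examples, model_name="bert-base-uncased", field="q"):
    """Keep only the first example for each distinct token sequence."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    seen, unique = set(), []
    for ex in examples:
        key = tuple(tokenizer(ex[field])["input_ids"])
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique
```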
17 changes: 17 additions & 0 deletions src/utils/tokenizer_util.py
@@ -0,0 +1,17 @@
#!/usr/bin/python3
# -*- coding: utf-8 -*-
from transformers import AutoTokenizer


def model_to_tokenizer(model_name):
    # Map an inference model name to the tokenizer checkpoint to load:
    # OpenAI code models get a codex-like tokenizer, GPT-3 models fall
    # back to GPT-2's tokenizer, and anything else is loaded by name.
    if "code-" in model_name:
        return "SaulLu/codex-like-tokenizer"
    if "gpt3" in model_name:
        return "gpt2"
    return model_name


def get_tokenizer(model_name):
    # 'bm25' is not a neural model, so the string itself is returned as a sentinel.
    if model_name == 'bm25':
        return model_name
    return AutoTokenizer.from_pretrained(model_to_tokenizer(model_name))
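
A hypothetical usage sketch, not part of the commit; `code-davinci-002` is just an example model name:
```python
# Hypothetical usage of the helper above, run from the repo root.
from src.utils.tokenizer_util import get_tokenizer

tokenizer = get_tokenizer("code-davinci-002")  # contains "code-", so the codex-like tokenizer is loaded
print(tokenizer("select the cheapest flight")["input_ids"])

assert get_tokenizer("bm25") == "bm25"  # BM25 needs no neural tokenizer
```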
