
Commit

Initial commit
jiacheng-ye committed Feb 14, 2023
1 parent 6464c86 commit 8f2494b
Showing 3 changed files with 26 additions and 12 deletions.
19 changes: 8 additions & 11 deletions README.md
@@ -21,22 +21,19 @@ The black-box LM is frozen during the whole procedure.
All required packages can be found in ``requirements.txt``.
You can install them in a new environment with
```shell
conda env create -n icl python=3.7
conda create -n icl python=3.7
conda activate icl

git clone [email protected]:HKUNLP/icl-ceil.git
pip install -r requirements.txt
#[Optional] If you want to experiment on the Break dataset with the LF-EM evaluation metric, clone recursively with the following command to include third-party dependencies:
#git clone --recurse-submodules [email protected]:HKUNLP/HKUNLP.git

# Replace the following line depending on your CUDA version.
pip install torch==1.10.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
```

Optional: If you want to experiment on the Break dataset with the LF-EM evaluation metric, you have to clone recursively with the following command to include third-party dependencies:
```shell
git clone --recurse-submodules [email protected]:HKUNLP/HKUNLP.git
```

Activate the environment by running
```shell
conda activate icl
cd icl-ceil
pip install -r requirements.txt
# If you don't want to use the OpenAI API, just comment out the `openai` package in `requirements.txt`.
```

Set up WandB to track training status for `EPR` and `CEIL` in `scripts/run_epr.sh` and `scripts/run_dpp_epr.sh`:
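A minimal sketch of what that setup might look like, assuming the scripts read the standard `WANDB_*` environment variables; the actual mechanism inside the scripts may differ:
```shell
# Hypothetical values; replace with your own account details.
export WANDB_API_KEY=<your_api_key>
export WANDB_PROJECT=icl-ceil
export WANDB_ENTITY=<your_username>
```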
2 changes: 1 addition & 1 deletion configs/bm25_retriever.yaml
@@ -13,7 +13,7 @@ ds_size: null # number of instances used for the dataset, 'null' re…
index_reader:
  _target_: src.dataset_readers.index_dsr.IndexDatasetReader
  task_name: ${task_name}
  model_name: 'bert-base-uncased'
  model_name: 'bert-base-uncased' # only used for the tokenizer when deduplicating the index dataset
  field: q # the field of the index dataset used for retrieval
  dataset_split: train # the dataset split to use if the file in `dataset_path` does not exist
  dataset_path: null # if a dataset_path (json file) is provided, that file will be loaded as the index dataset
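To make the new `model_name` comment concrete, here is a hypothetical sketch of tokenizer-based deduplication, assuming entries whose tokenized `field` values collide are dropped; `deduplicate` is an illustrative helper, not the actual `IndexDatasetReader` implementation:
```python
# Hypothetical sketch, not the repo's actual dedup code.
from transformers import AutoTokenizer


def deduplicate(examples, model_name="bert-base-uncased", field="q"):
    """Keep only the first example for each distinct token sequence."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    seen, unique = set(), []
    for ex in examples:
        key = tuple(tokenizer(ex[field])["input_ids"])
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique
```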
17 changes: 17 additions & 0 deletions src/utils/tokenizer_util.py
@@ -0,0 +1,17 @@
#!/usr/bin/python3
# -*- coding: utf-8 -*-
from transformers import AutoTokenizer


def model_to_tokenizer(model_name):
    # Map an inference model name to the tokenizer checkpoint to load:
    # OpenAI code models get a codex-like tokenizer, GPT-3 models fall
    # back to GPT-2's tokenizer, and anything else is loaded by name.
    if "code-" in model_name:
        return "SaulLu/codex-like-tokenizer"
    if "gpt3" in model_name:
        return "gpt2"
    return model_name


def get_tokenizer(model_name):
    # 'bm25' is not a neural model, so the string itself is returned as a sentinel.
    if model_name == 'bm25':
        return model_name
    return AutoTokenizer.from_pretrained(model_to_tokenizer(model_name))
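
A hypothetical usage sketch, not part of the commit; `code-davinci-002` is just an example model name:
```python
# Hypothetical usage of the helper above, run from the repo root.
from src.utils.tokenizer_util import get_tokenizer

tokenizer = get_tokenizer("code-davinci-002")  # contains "code-", so the codex-like tokenizer is loaded
print(tokenizer("select the cheapest flight")["input_ids"])

assert get_tokenizer("bm25") == "bm25"  # BM25 needs no neural tokenizer
```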
