A large-scale dialog corpus for training next-generation dialog systems.
First, clone the repository.
# download
git clone https://github.com/qywu/DialogCorpus.git
cd DialogCorpus
You can download and process each dataset manually.
# download data for daily_dialog
python daily_dialog/download_data.py
# process the data
python daily_dialog/process_data.py
# the processed data is stored as {folder_name}.json
vi daily_dialog/data/daily_dialog.json
Alternatively, download, process, and join all datasets with a single command.
python prepare_all_data.py \
--download \
--process \
--join
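Once processing has finished, it can help to check what a processed file contains before feeding it to a model. The sketch below is a minimal, hedged example: it only assumes the file is valid JSON at the path shown above (`daily_dialog/data/daily_dialog.json`) and makes no claim about the schema inside. The helper name `summarize_processed_file` is hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def summarize_processed_file(path):
    """Load a processed JSON file and report its top-level type and size.

    Hypothetical helper for illustration; not part of the repository.
    """
    with Path(path).open("r", encoding="utf-8") as f:
        data = json.load(f)
    # No schema is assumed: only the container type and its length are reported.
    size = len(data) if isinstance(data, (list, dict)) else None
    return type(data).__name__, size


if __name__ == "__main__":
    path = Path("daily_dialog/data/daily_dialog.json")
    if path.exists():
        kind, size = summarize_processed_file(path)
        print(f"{path}: top-level {kind} with {size} entries")
    else:
        print(f"{path} not found; run the download and process steps first")
```

Reporting only the container type and length keeps the check independent of how each dataset structures its dialogs.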
- Daily Dialog
  - Removed tokenization spaces around punctuation
- Persona Chat
  - Used Hugging Face's version [link]
  - Restored the casing of lowercased utterances
  - Removed tokenization spaces around punctuation
- Cornell Movie Corpus
  - Ignored UTF-8 errors
  - Extracted names
-
- Nothing specific
-
- Nothing specific
-
- Nothing specific
-
- Nothing specific
-
- Nothing specific
-
- Nothing specific
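Two of the processing notes above (removing tokenization spaces around punctuation, and ignoring UTF-8 errors) can be illustrated with a short sketch. This is not the repository's actual code: the regular expressions and both function names are assumptions chosen for illustration, and the real `process_data.py` scripts may apply different rules.

```python
import re

# Assumed rule 1: drop whitespace before closing punctuation ("hello ," -> "hello,").
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([.,!?;:])")
# Assumed rule 2: reattach common English contractions ("do n't" -> "don't").
_SPACE_BEFORE_CONTRACTION = re.compile(
    r"\s+(n't|'(?:s|m|d|re|ve|ll))\b", re.IGNORECASE
)


def remove_tokenization_spaces(text):
    """Undo the extra spaces a tokenizer leaves around punctuation."""
    text = _SPACE_BEFORE_PUNCT.sub(r"\1", text)
    text = _SPACE_BEFORE_CONTRACTION.sub(r"\1", text)
    return text.strip()


def read_lines_ignoring_utf8_errors(path):
    """Read a text file as UTF-8, silently dropping bytes that fail to decode."""
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return f.read().splitlines()


if __name__ == "__main__":
    print(remove_tokenization_spaces("I do n't know , but it 's fine ."))
```

Dropping undecodable bytes with `errors="ignore"` is lossy; `errors="replace"` is an alternative when it matters to see where the bad bytes were.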
Links