An ASR model for transcribing laughter and speech-laugh in Conversational Speech
The global path to the dataset storage is:

path=/deepstore/datasets/hmi/speechlaugh-corpus # global data path
- Use gdown to download the `.zip` file and then unzip it:

gdown 1VlQlyY3v3wtT2S047lwlTirWisz5mQ18 -O /path/to/data/switchboard.zip
# in this case, /path/to/data can be: [global_path]/switchboard_data
cd /path/to/data # /deepstore/datasets/hmi/speechlaugh-corpus/switchboard_data
unzip switchboard.zip
After unzipping, the data will contain the following folders:

switchboard_data/
|_ audio_wav
|_ transcripts
- Generate the `audio_segments` folder, which is stored in the following path (a minimal segmentation sketch follows below):

path=[global_path]/switchboard_data/audio_segments
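The repository's own segmentation script is not shown here; the following is only a sketch of the idea, assuming each transcript line carries a start time, an end time, and the utterance text. The function name `segment_audio` and the exact transcript layout are assumptions, not the project's actual code.

```python
# Sketch: cut utterance-level clips from a full conversation recording,
# assuming transcript lines of the form "<start_sec> <end_sec> <text>".
import os
import soundfile as sf

def segment_audio(wav_path, transcript_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    audio, sr = sf.read(wav_path)
    with open(transcript_path) as f:
        for i, line in enumerate(f):
            parts = line.strip().split(maxsplit=2)
            if len(parts) < 3:
                continue
            start, end, _text = parts
            clip = audio[int(float(start) * sr):int(float(end) * sr)]
            sf.write(os.path.join(out_dir, f"segment_{i:05d}.wav"), clip, sr)

# Example (paths are placeholders):
# segment_audio("audio_wav/sw02001A.wav", "transcripts/sw02001A.txt",
#               "audio_segments/sw02001A")
```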
Similarly, download the PodcastFillers dataset using gdown and unzip it as follows:

gdown 16qY7Y6KoDcr9jnQb4lofMCDXydjO7yo9 -O ../podcastfillers_data/PodcastFillers.zip
cd ../podcastfillers_data
unzip PodcastFillers.zip
Similarly, use the following commands to download the Buckeye dataset and store it in the corresponding path.

- For the original Buckeye dataset:

gdown 1Vz1cTpTiMGAJoGaO57YPrY0JAGzGLPdy -O ../buckeye_data/Buckeye.zip
cd [global_path]/buckeye_data # /deepstore/datasets/hmi/speechlaugh-corpus/buckeye_data/
unzip Buckeye.zip
The file structure of the Buckeye folder is as follows (a small sketch for reading the .words files appears after this subsection):

Buckeye/
|_ s01
|  |_ s0101a
|  |  |_ s0101a.wav   [original audio]
|  |  |_ s0101a.txt   [sentence-level transcript (no timestamps)]
|  |  |_ s0101a.words [word-level transcript (with timestamps)]
|  |_ s0101b
|  |_ ...
|_ s02
|_ ...
|_ tagged_words_files
- For the clipped corpus (already processed by clipping the audio into separate segments that match the sentence-level transcriptions):

gdown 17mRLTnWhtrrUud25_Ab4lBN1voCqJd7N -O ../buckeye_data/buckeye_refs_wavs.zip
cd [global_path]/buckeye_data # /deepstore/datasets/hmi/speechlaugh-corpus/buckeye_data/
unzip buckeye_refs_wavs.zip
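The `.words` files list one word per line with its end timestamp. The reader below is only a sketch under the usual Buckeye layout (a header terminated by a line containing `#`, then lines of the form `<end_time> <color> <word>; ...`); adjust the parsing if the local files differ.

```python
# Sketch (assumed Buckeye .words layout): collect (start, end, word) triples.
# A word's start time is taken as the end time of the previous entry.
def read_words_file(path):
    entries = []
    with open(path, encoding="latin-1") as f:
        for line in f:                          # skip the header
            if line.strip() == "#":
                break
        prev_end = 0.0
        for line in f:
            head = line.strip().split(";")[0]   # "<end_time> <color> <word>"
            parts = head.split()
            if len(parts) < 3:
                continue
            end_time, word = float(parts[0]), parts[2]
            entries.append((prev_end, end_time, word))
            prev_end = end_time
    return entries

# Example (path is a placeholder):
# words = read_words_file("Buckeye/s01/s0101a/s0101a.words")
```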
- Download these datasets from HuggingFace Datasets and save them to the `data/huggingface_data` folder. However, most of these datasets are cleaned and only contain normal speech.
- If you want a dataset that contains paralinguistic events (e.g. laughter, speechlaugh), it is recommended to download it directly from the AMI Corpus.
Download the datasets from HuggingFace Datasets and store them locally through the following steps:

- First, set the HuggingFace datasets cache to this folder:

export HF_DATASETS_CACHE="../data/huggingface_data"
# or change to the global datasets storage
export HF_DATASETS_CACHE="/deepstore/datasets/hmi/speechlaugh-corpus/huggingface_data"

- Then download the datasets, given the dataset name on HuggingFace, for example (see the sketch below):
  - ami: "edinburghcstr/ami", config "ihm", split="train"
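A minimal sketch of the download with the `datasets` library; passing `cache_dir` is an alternative to exporting `HF_DATASETS_CACHE`, and the variable names here are illustrative.

```python
# Sketch: pull the AMI individual-headset-microphone (ihm) training split.
# cache_dir is optional if HF_DATASETS_CACHE is already exported as above.
from datasets import load_dataset

ami_train = load_dataset(
    "edinburghcstr/ami",
    "ihm",                                   # microphone configuration
    split="train",
    cache_dir="../data/huggingface_data",
)
print(ami_train)
```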
We use Switchboard as the dataset for training, since it contains both laughter and speechlaugh events, which benefits the purpose of our training. The dataset has been preprocessed, audio-matched, cleaned, retokenized, and split into a train set (`swb_train`), a dev set (`swb_eval`), and a test set (`swb_test`). The datasets are stored locally in the following path:

[path\to\directory]\datasets\switchboard\

The summary of these datasets is below:
Train Dataset (70%): Dataset({
    features: ['audio', 'sampling_rate', 'transcript'],
    num_rows: 185402
})
Validation Dataset (10%): Dataset({
    features: ['audio', 'sampling_rate', 'transcript'],
    num_rows: 20601
})
Test Dataset (20%): Dataset({
    features: ['audio', 'sampling_rate', 'transcript'],
    num_rows: 51501
})
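Assuming the splits were saved with `Dataset.save_to_disk` under the directory above, they can be reloaded as in this sketch (the directory string is a placeholder):

```python
# Sketch: reload the preprocessed Switchboard splits saved on disk.
from datasets import load_from_disk

data_dir = "path/to/directory/datasets/switchboard"   # placeholder path
swb_train = load_from_disk(f"{data_dir}/swb_train")
swb_eval = load_from_disk(f"{data_dir}/swb_eval")
swb_test = load_from_disk(f"{data_dir}/swb_test")
print(swb_train)  # Dataset({features: ['audio', 'sampling_rate', 'transcript'], ...})
```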
NOTES: During the training process, we might encounter an enormous amount of locally cached data, especially after the `prepare_dataset` step. This is because the processed tensors are stored as cache files, named `cache-*.arrow`, in the same `dataset_directory`. These cache files are very large and are only used during the training process, so after training, consider removing them to avoid running out of disk space. To do this, we can:
- Check the disk usage of the models directory and the datasets in global storage: navigate to the storage (`dataset_dir`) and use the `du` command:

cd /path/to/storage
du -sh * | sort -hr
- To remove the cache files of these datasets (written next to the data storage because of the flag `load_from_cache_file=True`), set the flag in the dataset mapping step to `load_from_cache_file=False` (see the sketch below) and use the following to remove them:

cd /path/to/storage
rm -rf cache-*
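A minimal sketch of the mapping step with that flag; `prepare_dataset` stands in for the project's own preprocessing function. Note that `load_from_cache_file=False` only forces recomputation; calling `datasets.disable_caching()` in addition prevents new `cache-*.arrow` files from being written at all.

```python
# Sketch: run the mapping step without reusing or (optionally) writing
# cache-*.arrow files. prepare_dataset is the project's own preprocessing
# function (e.g. feature extraction + tokenization), not defined here.
from datasets import disable_caching

disable_caching()                        # optional: no new cache files at all

swb_train = swb_train.map(
    prepare_dataset,
    remove_columns=swb_train.column_names,
    load_from_cache_file=False,          # do not reuse existing cache-* files
    num_proc=4,
)
```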
There are 2 options for evaluating the models: either the original pre-trained model or the finetuned model. The `swb_test` dataset is used for evaluating the models and is expected to be located in the path:

path=[path\to\directory]\datasets\switchboard\swb_test
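A minimal evaluation sketch using word error rate (WER); the checkpoint identifier is a placeholder for either the pre-trained or the finetuned model, and the assumption that the 'audio' column holds a raw waveform with its rate in 'sampling_rate' follows the feature list above.

```python
# Sketch: score a checkpoint on swb_test with WER. The model id below is a
# placeholder; point it at the original pre-trained or the finetuned model.
import numpy as np
import evaluate
from datasets import load_from_disk
from transformers import pipeline

swb_test = load_from_disk("path/to/directory/datasets/switchboard/swb_test")
asr = pipeline("automatic-speech-recognition", model="path-or-hub-id-of-checkpoint")
wer = evaluate.load("wer")

preds, refs = [], []
for sample in swb_test:
    audio = {"raw": np.asarray(sample["audio"], dtype=np.float32),
             "sampling_rate": sample["sampling_rate"]}
    preds.append(asr(audio)["text"].lower())
    refs.append(sample["transcript"].lower())

print("WER:", wer.compute(predictions=preds, references=refs))
```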