Add MLCube support for RNN speech recognition #491

Open
wants to merge 11 commits into
base: master

Conversation

davidjurado
Contributor

@davidjurado davidjurado commented Jun 24, 2021

Used PR #465 as reference.

Current implementation

We'll be updating this section as we merge MLCube PRs and make new MLCube releases.

Project setup

# Create Python environment and install MLCube Docker runner 
virtualenv -p python3 ./env && source ./env/bin/activate && pip install mlcube-docker

# Fetch the RNN speech recognition workload
git clone https://github.com/mlcommons/training && cd ./training
git fetch origin pull/491/head:feature/rnnt_mlcube && git checkout feature/rnnt_mlcube
cd ./rnn_speech_recognition/mlcube
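Before running the setup, it may help to confirm that the tools it relies on are installed. The following is a minimal sketch, not part of this PR: `missing_tools` is a hypothetical helper, and the tool list in the usage note is inferred from the commands above (`docker` is an assumption based on the Docker runner being installed).

```shell
# Hypothetical helper: print the names of the given commands that are not on PATH.
missing_tools() {
  local tool missing=""
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the command name.
    command -v "$tool" >/dev/null 2>&1 || missing="${missing}${tool} "
  done
  # Strip the trailing space before printing.
  printf '%s' "${missing% }"
}
```

For example, `missing_tools python3 virtualenv git docker` prints nothing when all four are available, and a space-separated list of the missing ones otherwise.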

Dataset

The Librispeech dataset will be downloaded, extracted, and processed. Sizes of the dataset in each step:

| Dataset Step | MLCube Task | Format | Size |
| --- | --- | --- | --- |
| Download (Compressed dataset) | download_data | Tar files | ~62 GB |
| Extract (Uncompressed dataset) | download_data | Flac files | ~64 GB |
| Preprocess (Processed dataset) | preprocess_data | Wav files | ~114 GB |
| Total (After all tasks) | All | | ~240 GB |
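Since the three stages add up to roughly 240 GB, a quick free-space check before starting the download can save a failed run. This is a rough sketch under stated assumptions: `free_gb` and `check_space` are hypothetical helpers, not part of MLCube or this PR, and the check treats GB and GiB as interchangeable, which is close enough for an estimate.

```shell
# Sizes in GB taken from the table above: download + extract + preprocess.
required_gb=$((62 + 64 + 114))

# Hypothetical helper: print free space, in whole GB, on the filesystem holding a directory.
free_gb() {
  # df -Pk gives POSIX single-line output in 1024-byte blocks; field 4 is the available space.
  df -Pk "$1" | awk 'NR==2 { printf "%d", $4 / 1024 / 1024 }'
}

# Hypothetical helper: succeed only if the directory has room for all dataset stages.
check_space() {
  dir="$1"
  avail=$(free_gb "$dir")
  if [ "$avail" -lt "$required_gb" ]; then
    echo "Only ${avail} GB free in ${dir}; about ${required_gb} GB needed." >&2
    return 1
  fi
  echo "OK: ${avail} GB free in ${dir}; about ${required_gb} GB needed."
}
```

Calling `check_space .` from the directory that will hold the data reports whether the full dataset is likely to fit.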

Tasks execution

# Download Librispeech dataset. Default path = /workspace/data
# To override it, use data_dir=DATA_DIR
mlcube run --task download_data

# Preprocess the Librispeech dataset: convert .flac audio files to .wav format
# It will use the DATA_DIR path defined in the previous step
mlcube run --task preprocess_data

# Run benchmark. Default paths = ./workspace/data
# Parameters to override: data_dir=DATA_DIR, output_dir=OUTPUT_DIR, parameters_file=PATH_TO_TRAINING_PARAMS
mlcube run --task train

We are targeting pull-type installation, so MLCube images should be available on Docker Hub. If they are not, try this:

mlcube run ... -Pdocker.build_strategy=always
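The three tasks depend on each other's output, so they can be chained in one script that stops at the first failure. This is only a sketch: `run_all_tasks` is a hypothetical wrapper, not part of MLCube, and it assumes that any extra arguments (such as the `-Pdocker.build_strategy=always` option above) can simply be appended to each `mlcube run` call.

```shell
# Hypothetical wrapper: run the download, preprocess, and train tasks in order.
# Any arguments given to the function are appended to every mlcube call.
run_all_tasks() {
  local task
  for task in download_data preprocess_data train; do
    echo "==> mlcube run --task ${task} $*"
    # Stop at the first failing task, since later tasks need its output.
    mlcube run --task "${task}" "$@" || {
      echo "Task ${task} failed; stopping." >&2
      return 1
    }
  done
}
```

Running `run_all_tasks` with no arguments mirrors the three commands in the task execution steps; `run_all_tasks -Pdocker.build_strategy=always` adds the build option to each of them.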

@github-actions

github-actions bot commented Jun 24, 2021

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@davidjurado davidjurado marked this pull request as draft June 28, 2021 15:28
@davidjurado davidjurado marked this pull request as ready for review June 28, 2021 15:29
@davidjurado davidjurado force-pushed the feature/rnnt_mlcube branch from af4e361 to 2b1e1e2 Compare July 22, 2022 16:06
@matthew-frank matthew-frank added rnn_speech_recognition RNN-T model on Librispeech dataset MLCube labels Dec 2, 2022
@johntran-nv johntran-nv requested a review from mwawrzos March 16, 2023 18:50
@mwawrzos
Contributor

Hello @davidjurado! I tried to follow the task execution steps, but the last step failed with the following error:

$ mlcube run --task train
Usage: mlcube.py train [OPTIONS]
Try 'mlcube.py train --help' for help.

Error: Missing option '--output_dir'.
2023-05-19 09:35:17 [...]

Your description says:

# Run benchmark. Default paths = ./workspace/data
# Parameters to override: data_dir=DATA_DIR, output_dir=OUTPUT_DIR, parameters_file=PATH_TO_TRAINING_PARAMS
mlcube run --task train

How do I override output_dir?

@nv-rborkar
Contributor

@davidjurado, can you answer @mwawrzos's question? We can merge this accordingly.

Labels
MLCube rnn_speech_recognition RNN-T model on Librispeech dataset
4 participants