Corpus of oral arguments (recorded speech + official transcripts) of the Supreme Court of the United States (SCOTUS).
Summary: Medium-scale (595 hours 36 minutes 58 seconds) corpus of professionally transcribed formal conversational English speech
Download the corpus manifest and all utterance audio files
- default format here (download only corpus.utterances.jsonl.gz (16 MB) for the corpus text; audio.tar.gz (16 GB) contains all utterance audio)
- Kur format here (both pages above also link the sha256sums of the .tar.gz archives)
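If you want to verify a downloaded archive against the published sha256sum, a minimal Python sketch (the file name is assumed to match the archive you downloaded):

```python
import hashlib

def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so a 16 GB archive never sits in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

print(sha256sum("audio.tar.gz"))  # compare against the published checksum
```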
Scotus Speech is a collection of oral arguments presented before SCOTUS between 2010 and 2018. The conversations (arguments) are formal dialogues between attorneys (counsel) and justices.
Although the conversations follow some procedural rules and formalities, they are otherwise fairly straightforward spoken dialogue.
Scotus Speech is inspired by the LibriSpeech project, which produced a free ASR corpus from Project Gutenberg audiobooks.
As all recordings and transcripts are in the public domain (see the SCOTUS website), the sky is the limit for use cases. Here are a few possibilities:
- training & benchmarking:
  - automatic speech recognition (ASR)
  - speaker diarisation
  - biometric speaker recognition
  - voice synthesis
- full-transcript search of SCOTUS oral arguments
- language modeling of legal dialogue
- academic study of SCOTUS oral arguments
- chatbots / AI
Forced alignment of the transcripts to the audio is performed with the aeneas package (a minimal usage sketch follows the step list below). Running

```
bash steps.sh
```

will run the full pipeline (takes a few days on 4 cores). It is recommended to run the steps.sh lines one by one to make sure there are no intermediate errors. The pipeline steps are:
- scrape the SCOTUS website
- download all transcript PDFs and recording MP3s
- parse PDF transcripts into conversation text, extracting speaker information
- tokenize transcripts to the word and punctuation level
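For orientation, here is a minimal sketch of driving aeneas from Python to align a single recording; the paths and the exact task configuration string are illustrative, not necessarily the settings steps.sh uses:

```python
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# Illustrative configuration: plain-text English input, JSON sync map output.
config = u"task_language=eng|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config)
task.audio_file_path_absolute = "/abs/path/argument.mp3"     # hypothetical path
task.text_file_path_absolute = "/abs/path/transcript.txt"    # hypothetical path
task.sync_map_file_path_absolute = "/abs/path/syncmap.json"  # hypothetical path

ExecuteTask(task).execute()  # compute the audio/text alignment
task.output_sync_map_file()  # write fragment timestamps to syncmap.json
```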
Each step manipulates data in the JSON Lines (JSONL) structured data format. For simple parsing tasks, JSONL enables:
- fast debugging using jq
- small file size (as long as compression is enabled)
- schema flexibility
- portability
It is very easy to export this corpus to formats supported by Kur and Kaldi. A script, convert_corpus.py, is included that does the job for the Kur corpus format. The Kur-formatted corpus is also provided at the link above. In this format, all characters have been stripped except lowercase letters, the single quote ("'"), and the space (" ").
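The authoritative cleanup rules live in convert_corpus.py; as a rough sketch (function name hypothetical), the stripping amounts to something like:

```python
import re

def kur_normalize(text: str) -> str:
    """Approximate Kur-export cleanup: lowercase, keep only a-z,
    the single quote, and the space, then collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]", " ", text)    # replace everything else with spaces
    return re.sub(r" +", " ", text).strip()  # collapse repeated spaces

print(kur_normalize("Mr. Chief Justice, and may it please the Court:"))
# -> "mr chief justice and may it please the court"
```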
I am a fan of the Kur corpus format for its simplicity, but issues can arise with >100k files in a single directory. To avoid this, the "default format" above is a depth-2 folder tree, splitting utterances into one directory per case. The file corpus.utterances.jsonl.gz is somewhat self-explanatory:
- one utterance per line as JSON object
- the JSON key '.utterance_dir' gives the directory within audio/ to grab the utterance mp3 from
- the JSON key '.utterance_file' gives the mp3 file name
- in other words, the audio file for each utterance is at audio/<utterance_dir>/<utterance_file>
- the JSON key '.text' gives the transcript text with minimal formatting changes (mixed upper/lower case, hyphens, etc.)
- you may wish to preprocess this file by adjusting the formatting / characters in each utterance transcript, as done for the Kur format export, depending on your speech recognition engine (see the sketch below)
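Putting those keys together, a minimal sketch for iterating over the manifest and resolving each utterance's audio file (assuming audio.tar.gz has been extracted into audio/ next to the manifest):

```python
import gzip
import json
from pathlib import Path

AUDIO_ROOT = Path("audio")  # extracted from audio.tar.gz

with gzip.open("corpus.utterances.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        utt = json.loads(line)  # one utterance per line
        mp3_path = AUDIO_ROOT / utt["utterance_dir"] / utt["utterance_file"]
        transcript = utt["text"]
        # ... hand (mp3_path, transcript) to your ASR / alignment tooling
```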
Advantages:
- Named (and mostly gendered) speaker labels
- High-quality audio and transcripts
- Punctuation and capitalization
- Public domain data
- Historically significant data
- Some audio contains multiple speakers when the courtroom is rowdy (this may be a disadvantage depending on your ASR goals)
Disadvantages:
- low diversity in:
  - accents
  - conversation topics
  - conversation styles
- often repeated speakers (the justices)
- some repeated utterances (formal procedure)
- no word-level timestamps
The alignment is very good but not 100% perfect. espeak has known failure cases. I would like to configure Festival and see whether it overcomes them. I have also tried AWS Polly as the TTS backend and found it to work extremely well, with no failure cases yet; however, the corpus is 35,652,395 characters, and at Polly's roughly $4 per million characters that works out to about $140 to synthesize the whole thing. Before spending that, I want to do some benchmarks / training / data analysis on the corpus to make sure it's ready.