Some admonishments/guidelines for the investigations below, based on the 2024-09-10 planning meeting
model choice/configuration
We are assuming that we will use Whisper, or some variant (e.g. WhisperX), because it provided what we felt was the best combination of output quality and performance from the tools that were evaluated earlier in 2024 (@edsu may be able to link to that analysis for context?). If we determine that we want to go with a completely different model, we need to write up our reasoning for approval.
We want our solution to provide access to Whisper's tuning parameters so that we can tweak them as needed; completely black-box solutions that run Whisper with no access to configuration aren't acceptable.
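For illustration, here's a minimal sketch of the level of access we want, assuming the openai-whisper Python package (the file path and the specific parameter values are placeholders, not recommendations):

```python
import whisper

# Model size/variant is itself one of the knobs we want exposed.
model = whisper.load_model("large-v3")

result = model.transcribe(
    "media/example_lecture.wav",       # hypothetical input file
    language="en",                     # skip auto-detection when the language is known
    temperature=0.0,                   # greedy decoding; raise to allow fallback sampling
    beam_size=5,                       # beam search width
    condition_on_previous_text=False,  # can reduce repetition/hallucination carry-over
    no_speech_threshold=0.6,           # how readily segments are treated as silence
    compression_ratio_threshold=2.4,   # flags degenerate, repetitive output
    initial_prompt=None,               # optional vocabulary/style hint
)

for segment in result["segments"]:
    print(f"{segment['start']:8.2f} {segment['end']:8.2f} {segment['text']}")
```

A hosted service that only accepts an audio file and returns text would hide all of these knobs, which is exactly what we want to avoid.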
terminology
After some discussion, we settled on the term "speech to text" to encompass text extraction from speech in audio, whether or not there is accompanying video. (There was confusion/lack of consensus about whether "caption" applies to audio-only content, and it also applies to still-image descriptions; meanwhile, "transcript" doesn't quite encompass what captions do for video.)
So e.g. `speechToTextWF` for the workflow, `speech_to_text` as a snake-case variable name, and "speech to text" or "speech-to-text generation" as human-readable terms.
infrastructure provisioning
We would like to avoid (or at least minimize as much as possible) vendor lock-in. We're highly likely to go with AWS to start, since we have more departmental expertise there, but GCP isn't out of the question. The cloud vendor has to be an org with which Stanford has a business agreement, and which is available through Cardinal Cloud, so that might rule out anything other than Amazon and Google. But as much as possible, we should use building blocks that have analogs in multiple major cloud vendors.
Related, but somewhat standalone point: ultimately, we should define and deploy the cloud infrastructure using Terraform. It's meant to be platform agnostic, and the department already uses it. Also, all of our permanent prod/stage/qa cloud infrastructure is deployed assuming that Terraform is the source of truth, so resources created manually (e.g. via the AWS web console or one-off `aws` CLI commands) will cause confusion in the future. It's totally fine to experiment with building blocks by manually spinning them up that way, but once the experiment is done, those should be torn down and defined formally in Terraform.
model usage
⚠️ It is unacceptable for our data to be used to train the models of other orgs. This rules out, for example, OpenAI's hosted Whisper service. This is a SUL-wide rule, at the moment.
Characterizing hallucinations is another motivator (besides model updates) for having tooling for regression testing and comparison between settings, models, etc. Known hallucination patterns we could flag heuristically (see the sketch after this list):
- long runs of "thank you" ("thank you" is often hallucinated when there's silence, and often repeated many times over long stretches of silence)
- inhumanly high speaking rates (regardless of Whisper's output segment lengths, one could calculate a words-per-minute (WPM) rate over the heuristic's own fixed window, say each 5-second block of the transcript, and flag blocks that cross a WPM threshold). Possible false positives: recordings that contain screen readers running at their typically high word rates, the Micro Machines guy.
- "translated by Amara.org" and other known, commonly hallucinated phrases, often watermarks from translation/captioning companies, since those are so common in the training data. Possible false positives: recordings of people talking about Whisper hallucinations.
- English-language recordings whose transcripts contain non-English characters. Possible false positives: sometimes this is correct, e.g. accented characters in loan words, or brief phrases spoken in another language.
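To make these concrete, here's a rough Python sketch of the four heuristics, which could also seed the regression/comparison tooling mentioned above. Everything here is an assumption rather than a settled design: the function names, the thresholds, the seed phrase list, and the segment shape (dicts with "start", "end", and "text" keys, as whisper's `transcribe()` returns in its "segments" list).

```python
import re

# Seed list of watermark-ish phrases known to be hallucinated; hypothetical
# starting values that would grow as we characterize real runs.
KNOWN_HALLUCINATED_PHRASES = [
    "translated by amara.org",
    "thank you for watching",
]

def flag_thank_you_runs(segments, min_repeats=3):
    """Yield segments that are part of a run of consecutive 'thank you' segments."""
    run = []
    for seg in segments:
        if re.fullmatch(r"\s*thank you[.!]*\s*", seg["text"], re.IGNORECASE):
            run.append(seg)
        else:
            if len(run) >= min_repeats:
                yield from run
            run = []
    if len(run) >= min_repeats:
        yield from run

def flag_high_wpm(segments, window_secs=5.0, max_wpm=300.0):
    """Yield (window_start, wpm) for fixed-size windows whose speaking rate
    crosses max_wpm. Words are spread uniformly over each segment's time span,
    so the check is independent of whisper's own segment boundaries."""
    if not segments:
        return
    end_time = max(seg["end"] for seg in segments)
    n_windows = int(end_time // window_secs) + 1
    words_per_window = [0.0] * n_windows
    for seg in segments:
        duration = seg["end"] - seg["start"]
        n_words = len(seg["text"].split())
        if duration <= 0 or n_words == 0:
            continue
        rate = n_words / duration  # words per second, assumed uniform
        t = seg["start"]
        while t < seg["end"]:
            idx = int(t // window_secs)
            slice_end = min((idx + 1) * window_secs, seg["end"])
            words_per_window[idx] += rate * (slice_end - t)
            t = slice_end
    for idx, words in enumerate(words_per_window):
        wpm = words / window_secs * 60.0
        if wpm > max_wpm:
            yield idx * window_secs, wpm

def flag_known_phrases(segments):
    """Yield segments containing known hallucinated watermark phrases."""
    for seg in segments:
        lowered = seg["text"].lower()
        if any(phrase in lowered for phrase in KNOWN_HALLUCINATED_PHRASES):
            yield seg

def flag_non_english_chars(segments):
    """For English-language recordings: yield segments containing non-ASCII
    characters. Expect false positives (loan words, brief non-English speech)."""
    for seg in segments:
        if re.search(r"[^\x00-\x7F]", seg["text"]):
            yield seg
```

Running these over a fixed set of fixture recordings before and after a model or settings change would give a crude regression signal: diffs in what gets flagged.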
todo
- `speechToTextWF` common-accessioning#1341
- `speechToTextWF`: notify speech_to_text_generation_service that there is content to be STTed common-accessioning#1356
- `speechToTextWF` `stage-files`: stage transcription files generated by speech_to_text_generation_service common-accessioning#1360
- `speechToTextWF` `update-cocina`: update cocina structural to reference files generated by speech_to_text_generation_service common-accessioning#1361
- `speechToTextWF` `end-stt`: hand off to `accessionWF` common-accessioning#1362