Some admonishments/guidelines for the investigations below, based on the 2024-09-10 planning meeting
model choice/configuration
We are assuming that we will use Whisper, or some variant (e.g. WhisperX), because it provided what we felt was the best combination of output quality and performance from the tools that were evaluated earlier in 2024 (@edsu may be able to link to that analysis for context?). If we determine that we want to go with a completely different model, we need to write up our reasoning for approval.
We want our solution to provide access to Whisper's tuning parameters so that we can tweak them as needed; completely black-box solutions that run Whisper with no access to configuration aren't acceptable.
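For illustration, here's a minimal sketch of the level of access we want, assuming the openai-whisper Python package (the file path and the specific parameter values are placeholders, not recommendations):

```python
import whisper

# Model size/variant is itself one of the knobs we want exposed.
model = whisper.load_model("large-v3")

result = model.transcribe(
    "media/example_lecture.wav",       # hypothetical input file
    language="en",                     # skip auto-detection when the language is known
    temperature=0.0,                   # greedy decoding; raise to allow fallback sampling
    beam_size=5,                       # beam search width
    condition_on_previous_text=False,  # can reduce repetition/hallucination carry-over
    no_speech_threshold=0.6,           # how readily segments are treated as silence
    compression_ratio_threshold=2.4,   # flags degenerate, repetitive output
    initial_prompt=None,               # optional vocabulary/style hint
)

for segment in result["segments"]:
    print(f"{segment['start']:8.2f} {segment['end']:8.2f} {segment['text']}")
```

A hosted service that only accepts an audio file and returns text would hide all of these knobs, which is exactly what we want to avoid.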
terminology
After some discussion, we settled on the term "speech to text" to encompass text extraction from speech in audio, whether or not there is accompanying video. (There was confusion/lack of consensus about whether "caption" applies to audio-only content, and it also applies to still-image descriptions; meanwhile, "transcript" doesn't quite encompass what captions do for video.)
So e.g. `speechToTextWF` for the workflow, `speech_to_text` as a snake-case variable name, and "speech to text" or "speech-to-text generation" as human-readable terms.
infrastructure provisioning
We would like to avoid (or at least minimize as much as possible) vendor lock-in. We're highly likely to go with AWS to start, since we have more departmental expertise there, but GCP isn't out of the question. The cloud vendor has to be an org with which Stanford has a business agreement, and which is available through Cardinal Cloud, so that might rule out anything other than Amazon and Google. But as much as possible, we should use building blocks that have analogs in multiple major cloud vendors.
Related, but somewhat standalone point: ultimately, we should define and deploy the cloud infrastructure using Terraform. It's meant to be platform agnostic, and the department already uses it. Also, all of our permanent prod/stage/qa cloud infrastructure is deployed assuming that Terraform is the source of truth, so resources created manually (e.g. via the AWS web console or one-off `aws` CLI commands) will cause confusion in the future. It's totally fine to experiment with building blocks by manually spinning them up that way, but once the experiment is done, those should be torn down and defined formally in Terraform.
model usage
⚠️ It is unacceptable for our data to be used to train the models of other orgs. This rules out, for example, OpenAI's hosted Whisper service. This is a SUL-wide rule, at the moment.
Characterizing hallucinations is another motivator (besides model updates) for having tooling for regression testing and comparison between settings, models, etc. Known hallucination patterns we could flag heuristically (see the sketch after this list):
- long runs of "thank you" ("thank you" is often hallucinated when there's silence, and often repeated many times over long stretches of silence)
- inhumanly high speaking rates (regardless of Whisper's output segment lengths, one could calculate a words-per-minute (WPM) rate over the heuristic's own fixed window, say each 5-second block of the transcript, and flag blocks that cross a WPM threshold). Possible false positives: recordings that contain screen readers running at their typically high word rates, the Micro Machines guy.
- "translated by Amara.org" and other known, commonly hallucinated phrases, often watermarks from translation/captioning companies, since those are so common in the training data. Possible false positives: recordings of people talking about Whisper hallucinations.
- English-language recordings whose transcripts contain non-English characters. Possible false positives: sometimes this is correct, e.g. accented characters in loan words, or brief phrases spoken in another language.
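To make these concrete, here's a rough Python sketch of the four heuristics, which could also seed the regression/comparison tooling mentioned above. Everything here is an assumption rather than a settled design: the function names, the thresholds, the seed phrase list, and the segment shape (dicts with "start", "end", and "text" keys, as whisper's `transcribe()` returns in its "segments" list).

```python
import re

# Seed list of watermark-ish phrases known to be hallucinated; hypothetical
# starting values that would grow as we characterize real runs.
KNOWN_HALLUCINATED_PHRASES = [
    "translated by amara.org",
    "thank you for watching",
]

def flag_thank_you_runs(segments, min_repeats=3):
    """Yield segments that are part of a run of consecutive 'thank you' segments."""
    run = []
    for seg in segments:
        if re.fullmatch(r"\s*thank you[.!]*\s*", seg["text"], re.IGNORECASE):
            run.append(seg)
        else:
            if len(run) >= min_repeats:
                yield from run
            run = []
    if len(run) >= min_repeats:
        yield from run

def flag_high_wpm(segments, window_secs=5.0, max_wpm=300.0):
    """Yield (window_start, wpm) for fixed-size windows whose speaking rate
    crosses max_wpm. Words are spread uniformly over each segment's time span,
    so the check is independent of whisper's own segment boundaries."""
    if not segments:
        return
    end_time = max(seg["end"] for seg in segments)
    n_windows = int(end_time // window_secs) + 1
    words_per_window = [0.0] * n_windows
    for seg in segments:
        duration = seg["end"] - seg["start"]
        n_words = len(seg["text"].split())
        if duration <= 0 or n_words == 0:
            continue
        rate = n_words / duration  # words per second, assumed uniform
        t = seg["start"]
        while t < seg["end"]:
            idx = int(t // window_secs)
            slice_end = min((idx + 1) * window_secs, seg["end"])
            words_per_window[idx] += rate * (slice_end - t)
            t = slice_end
    for idx, words in enumerate(words_per_window):
        wpm = words / window_secs * 60.0
        if wpm > max_wpm:
            yield idx * window_secs, wpm

def flag_known_phrases(segments):
    """Yield segments containing known hallucinated watermark phrases."""
    for seg in segments:
        lowered = seg["text"].lower()
        if any(phrase in lowered for phrase in KNOWN_HALLUCINATED_PHRASES):
            yield seg

def flag_non_english_chars(segments):
    """For English-language recordings: yield segments containing non-ASCII
    characters. Expect false positives (loan words, brief non-English speech)."""
    for seg in segments:
        if re.search(r"[^\x00-\x7F]", seg["text"]):
            yield seg
```

Running these over a fixed set of fixture recordings before and after a model or settings change would give a crude regression signal: diffs in what gets flagged.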
todo
- `speechToTextWF` common-accessioning#1341
- `speechToTextWF`: notify speech_to_text_generation_service that there is content to be STTed common-accessioning#1356
- `speechToTextWF` `stage-files`: stage transcription files generated by speech_to_text_generation_service common-accessioning#1360
- `speechToTextWF` `update-cocina`: update cocina structural to reference files generated by speech_to_text_generation_service common-accessioning#1361
- `speechToTextWF` `end-stt`: hand off to `accessionWF` common-accessioning#1362