
Bergbok/Jerma-Subtitle-Search


Subtitles

Video Count    : 2002
Word Count     : 25,273,515
Duration       : 5385:16:33
Oldest Video   : 2011-06-11
Latest Video   : 2025-01-19

Subtitles were obtained using this Python script. Audio is downloaded with yt-dlp, transcribed using WhisperX (large-v3 model), and the resulting subtitles are converted to LRC format with ffmpeg.
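
For readers unfamiliar with LRC, here is a minimal sketch of what the end of that pipeline produces: one `[mm:ss.xx]` time tag followed by the text, per line. This is not the repository's script (which uses ffmpeg for the conversion). The function names are hypothetical, and the segment shape (`start` in seconds plus `text`) is an assumption modeled on typical Whisper-style output.

```python
def to_lrc_timestamp(seconds: float) -> str:
    """Format a time in seconds as an LRC time tag: [mm:ss.xx]."""
    centis = round(seconds * 100)          # work in hundredths to avoid float drift
    minutes, rem = divmod(centis, 6000)    # 6000 hundredths per minute
    secs, cs = divmod(rem, 100)
    # Minutes are not wrapped into hours, so long videos simply get 3+ digits.
    return f"[{minutes:02d}:{secs:02d}.{cs:02d}]"


def segments_to_lrc(segments) -> str:
    """Turn segments (assumed dicts with 'start' and 'text') into LRC text."""
    lines = []
    for seg in segments:
        text = seg["text"].strip()
        if text:  # skip empty segments
            lines.append(f"{to_lrc_timestamp(seg['start'])}{text}")
    return "\n".join(lines)
```

Usage: `segments_to_lrc([{"start": 75.5, "text": " hello there "}])` yields a single line with the tag for 1 minute 15.5 seconds followed by the trimmed text.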

Relevant information gets written to a JSON file, which is then indexed and compressed using this JS script.
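
The index-then-compress step can be illustrated with a rough Python analogue. The actual project does this in JavaScript (the webpage lists MiniSearch and fflate), so everything below is an assumption for illustration only: the function names are hypothetical, the input is assumed to be a mapping of video ID to transcript text, and plain zlib stands in for whatever compression settings the JS script uses.

```python
import json
import re
import zlib
from collections import defaultdict


def build_index(videos: dict) -> dict:
    """Build a tiny inverted index: word -> sorted list of video IDs."""
    index = defaultdict(set)
    for video_id, text in videos.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(video_id)
    return {word: sorted(ids) for word, ids in index.items()}


def compress_index(index: dict) -> bytes:
    """Serialize the index as compact JSON and compress it with zlib."""
    raw = json.dumps(index, separators=(",", ":"), sort_keys=True).encode("utf-8")
    return zlib.compress(raw, 9)  # 9 = maximum compression level


def decompress_index(blob: bytes) -> dict:
    """Inverse of compress_index."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

Precomputing and compressing the index this way means the browser only has to download and inflate one file, rather than tokenizing every transcript on each page load.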

The Python script also supports downloading YouTube's auto-generated subtitles, and optionally only transcribing videos which don't have auto-generated subtitles available.
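
The "only transcribe what lacks auto-generated subtitles" option boils down to a filter like the one sketched here. This is a guess at the logic, not the script's actual code. It assumes each video is described by a metadata dict in the shape yt-dlp returns, with an `automatic_captions` entry mapping language codes to caption tracks; the function names and the `only_missing` flag are hypothetical.

```python
def has_auto_subtitles(info: dict, lang: str = "en") -> bool:
    """True if the metadata dict lists auto-generated captions for `lang`."""
    auto = info.get("automatic_captions") or {}
    return bool(auto.get(lang))


def videos_to_transcribe(infos, lang: str = "en", only_missing: bool = True) -> list:
    """Pick the videos that should be sent through the transcription model."""
    if not only_missing:
        return list(infos)  # transcribe everything
    return [info for info in infos if not has_auto_subtitles(info, lang)]
```

With `only_missing=True`, videos that already have auto-generated captions in the requested language are skipped, which saves transcription time at the cost of inheriting the caption quality issues described under Read More.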

Read More

Initially used YouTube's auto-generated subtitles, but far too many videos either didn't have them available or had censored swears.

Tried using OpenAI's Whisper next, but after transcribing a bunch of videos with it I realized it kinda sucks in some aspects. It hallucinated a lot, especially during sections with no speech. Timestamps were incorrect on some transcriptions, and the first timestamp would always start at zero seconds, which was usually wrong. It's also pretty slow, especially with the bigger models.

Switching to WhisperX mostly solved these problems. However, it's still far from perfect and does have some limitations.

Webpage

Uses Mithril, MiniSearch, lite-youtube-embed and fflate.

screenshot of webpage search results for the query: "GitHub"

Running Locally

# feel free to substitute bun with npm/yarn/whatever
git clone https://github.com/Bergbok/Jerma-Subtitle-Search.git
cd Jerma-Subtitle-Search
git lfs install
git lfs pull
bun install
bun run dev
