(Colab badge) <--- Link to the project in Google Colab
A generative AI pipeline that produces image interpolations from an audio track, built on Stable Diffusion.
Demo videos:

Steve Reich - Music for Pieces of Wood, 30-second extract (fps=7, num_inference_steps=20)

Karlheinz Stockhausen - Helicopter String Quartet, 25 seconds (fps=5, num_inference_steps=30)

Jean-Claude Risset - SUD, 30-second extract (fps=7, num_inference_steps=20)

Antonio Vivaldi - Winter, 15-second extract (fps=7, num_inference_steps=20)
The core of the system is the Stable Diffusion 'img2img' pipeline from Hugging Face. Conditioning embeddings are produced by Meta's ImageBind, a multimodal model that maps audio into the same embedding space as images, so each audio chunk yields an image embedding that drives the generation of a frame.
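As a rough sketch of that audio-to-image step, assuming Meta's ImageBind package and the Diffusers StableUnCLIPImg2ImgPipeline, which accepts precomputed image embeddings (the checkpoint, file names, and exact wiring below are illustrative, in the spirit of Anything2Image, and not necessarily this repo's verbatim code):

```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"  # GPU assumed for fp16

# Embed an audio clip into ImageBind's joint embedding space (1024-dim,
# matching the CLIP ViT-H image embeddings that SD 2.1 unCLIP expects).
bind = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)
with torch.no_grad():
    embeddings = bind({
        ModalityType.AUDIO: data.load_and_transform_audio_data(["track.wav"], device)
    })
audio_embeds = embeddings[ModalityType.AUDIO]  # "track.wav" is a placeholder

# Feed the audio-derived embedding to Stable Diffusion unCLIP as if it were
# an image embedding, and decode one frame.
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
).to(device)
image = pipe(image_embeds=audio_embeds.half(), num_inference_steps=20).images[0]
image.save("frame.png")
```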
The interpolation logic is adapted from nateraw's publicly available stable-diffusion-videos (https://github.com/nateraw/stable-diffusion-videos.git), and the detextifier from iuliaturc's publicly available detextify (https://github.com/iuliaturc/detextify.git). The Stable Diffusion and ImageBind models are integrated via the public code provided by Zeqiang-Lai in Anything2Image (https://github.com/Zeqiang-Lai/Anything2Image.git).
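The frame-to-frame interpolation in such pipelines is typically spherical linear interpolation (slerp) over embeddings or latents, as in stable-diffusion-videos. Below is a minimal, illustrative version of that helper; the function name and the usage comment are assumptions for the sketch, not this repo's exact code:

```python
import numpy as np
import torch

def slerp(t: float, v0: torch.Tensor, v1: torch.Tensor,
          dot_threshold: float = 0.9995) -> torch.Tensor:
    """Spherically interpolate between two embedding/latent tensors at fraction t."""
    a = v0.detach().cpu().float().numpy()
    b = v1.detach().cpu().float().numpy()
    dot = np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b))
    if np.abs(dot) > dot_threshold:
        # Vectors nearly parallel: plain linear interpolation is stable here.
        out = (1 - t) * a + t * b
    else:
        theta = np.arccos(dot)
        out = (np.sin((1 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)
    return torch.from_numpy(out).to(v0.device, dtype=v0.dtype)

# Illustrative use: interpolate between the embeddings of two consecutive audio
# chunks and render one frame per step (fps=7 in the demos above corresponds to
# 7 interpolated frames per second of audio):
# frames = [pipe(image_embeds=slerp(t, emb_a, emb_b).half(),
#                num_inference_steps=20).images[0]
#           for t in np.linspace(0.0, 1.0, num=7, endpoint=False)]
```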