Is there any official source for this integration? I have a query but I'm not sure this is the right forum. @i-am-shreya @ControlNet As in this part of the paper:
Lip Synchronization (LS) is another line of research that requires facial-region-specific spatio-temporal synchronization. This downstream adaptation further elaborates the adaptation capability of MARLIN for face generation tasks. For adaptation, we replace the facial encoder module in Wav2Lip [57] with MARLIN, and adjust the temporal window accordingly, i.e. from 5 frames to T frames. For evaluation, we use the LRS2 [22] dataset, which has 45,838 train, 1,082 val, and 1,243 test videos. Following the prior literature [57, 74], we use Lip-Sync Error-Distance (LSE-D ↓), Lip-Sync Error-Confidence (LSE-C ↑) and Frechet Inception Distance (FID ↓) [38] as evaluation metrics.
Did you folks train Wav2Lip with a MARLIN encoder, and if yes, which of the following did you do?
The flattened face sequences are processed by the Marlin encoder's extract_features method to produce the final face feature map.
Only the final output of the extract_features method is used in the forward pass.
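To make sure I'm describing this precisely, here is a minimal sketch of how I picture option 1, assuming extract_features maps a (B, C, T, H, W) face clip to (B, N, D) token features. MarlinFaceEncoder, embed_dim, and the projection layer are hypothetical names and choices of mine, not from the paper or the repo:

```python
import torch
import torch.nn as nn

class MarlinFaceEncoder(nn.Module):
    """Option 1 sketch: only the final output of extract_features
    is used as the face feature map in the forward pass."""

    def __init__(self, marlin: nn.Module, embed_dim: int = 768, out_dim: int = 512):
        super().__init__()
        self.marlin = marlin  # pretrained MARLIN encoder
        # Hypothetical projection into the decoder's expected channel size.
        self.proj = nn.Linear(embed_dim, out_dim)

    def forward(self, faces: torch.Tensor) -> torch.Tensor:
        # faces: (B, C, T, H, W) flattened face sequence
        feats = self.marlin.extract_features(faces)  # assumed (B, N, D) tokens
        return self.proj(feats)  # single final face feature map for the decoder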
OR
Intermediate Feature Storage: the extract_features method is modified to store selected intermediate outputs from the transformer blocks, ensuring that the number of stored features matches the number of CNN decoder blocks.
Integration with the decoder blocks: during the forward pass of the Wav2Lip model, the decoder blocks process the audio embeddings.
At each decoder block, the corresponding intermediate feature map from face_features is concatenated with the current decoder output.
The features are accessed in reverse order to match the original processing sequence.
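And here is a minimal sketch of how I imagine option 2, assuming the transformer blocks are exposed as an nn.ModuleList; MarlinWithIntermediates, keep_ids, and decode_with_skips are hypothetical names of mine, and the reshaping of token features into spatial maps before concatenation is omitted for brevity:

```python
import torch
import torch.nn as nn

class MarlinWithIntermediates(nn.Module):
    """Option 2 sketch: store selected intermediate transformer-block
    outputs, one per CNN decoder block."""

    def __init__(self, blocks: nn.ModuleList, keep_ids: tuple):
        super().__init__()
        self.blocks = blocks           # MARLIN's transformer blocks
        self.keep_ids = set(keep_ids)  # indices of blocks whose outputs are kept

    def extract_features(self, x: torch.Tensor) -> list:
        feats = []
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i in self.keep_ids:
                feats.append(x)  # len(feats) == number of decoder blocks
        return feats


def decode_with_skips(audio_emb, face_features, decoder_blocks):
    """Decoder pass: at each decoder block, concatenate the matching
    intermediate face feature, accessed in reverse order."""
    x = audio_emb
    for block, feat in zip(decoder_blocks, reversed(face_features)):
        x = block(x)
        # Channel-wise concatenation, mirroring Wav2Lip's skip connections;
        # assumes feat has been reshaped to match x spatially.
        x = torch.cat([x, feat], dim=1)
    return x
```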