Lip-Synching Talking Head Videos at 4K Using Wav2Lip and VQGAN

Hello everyone, in this article I want to talk about a paper that is a follow-up to Wav2Lip: "Towards Generating Ultra-High Resolution Talking-Face Videos with Lip Synchronization".

This paper improves the Wav2Lip model so that it can generate lip-synched videos at 4K resolution. And how does it manage to achieve that?

To grossly simplify the approach: instead of training on full-size images, they train Wav2Lip on images compressed with VQGAN, i.e. in its quantized latent space. This allows them to train with much higher resolution images, up to 4K.

So in many ways, the architecture stays very close to the original Wav2Lip.

Wav2Lip – Training an Expert Discriminator

Let’s first dive into the Wav2Lip paper; then it will be easier to understand what improvements have been made.

First of all, in the original paper, a lip-sync discriminator is trained on a dataset of videos using the SyncNet architecture. SyncNet is a model used for audio-visual synchronization tasks, i.e. judging whether a video and an audio track match.

The discriminator takes two inputs: a sequence of frames from the video, with the lower half of the face extracted and resized to 96×96 pixels, and a segment of audio converted to a mel spectrogram. The mel spectrogram represents sound on the mel scale, which spaces frequencies according to how humans perceive pitch differences, something the linear Hertz scale does not capture well.
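
To make this concrete, here is a minimal sketch of how such a mel spectrogram chunk could be extracted with librosa. The sample rate, number of mel bands, and window/hop lengths below are illustrative assumptions for this article, not the exact values used by Wav2Lip.

```python
# A minimal sketch of extracting a mel spectrogram for a 0.2 s audio window.
# The parameters (16 kHz sample rate, 80 mel bands, FFT/hop lengths) are
# assumptions for illustration, not the paper's values.
import librosa
import numpy as np

def mel_chunk(wav_path: str, start_s: float, duration_s: float = 0.2) -> np.ndarray:
    # Load only the audio segment that corresponds to the 5-frame video window.
    audio, sr = librosa.load(wav_path, sr=16000, offset=start_s, duration=duration_s)
    # Convert to a mel spectrogram: frequencies are binned on the mel scale,
    # which roughly matches human pitch perception.
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=800, hop_length=200, n_mels=80
    )
    # Log-compress, since perceived loudness is roughly logarithmic in power.
    return librosa.power_to_db(mel)
```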

The expert discriminator is then trained so that it can distinguish whether the given frames match the audio. Training is done by feeding the SyncNet model both in-sync and out-of-sync pairs of video and audio. The time window used in Wav2Lip is 5 frames at 25 frames per second, i.e. 0.2 seconds of video.
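
Conceptually, the training objective looks something like the following PyTorch sketch: the face and audio embeddings are compared with cosine similarity and pushed towards 1 for in-sync pairs and 0 for out-of-sync pairs. The encoder modules and tensor shapes are placeholders, not the paper's exact architecture.

```python
# A minimal sketch of the expert discriminator's training objective, assuming
# two encoders (face_encoder, audio_encoder) that map a 5-frame lower-face
# stack and a mel spectrogram chunk to fixed-size embeddings. Shapes and the
# probability mapping are illustrative, not the official implementation.
import torch
import torch.nn.functional as F

def sync_loss(face_encoder, audio_encoder, frames, mels, labels):
    """frames: (B, 3*5, 48, 96) lower-half face stacks, mels: (B, 1, 80, T),
    labels: (B,) with 1.0 for in-sync pairs and 0.0 for out-of-sync pairs."""
    v = face_encoder(frames)  # (B, D) video embedding
    a = audio_encoder(mels)   # (B, D) audio embedding
    # Cosine similarity between the two embeddings, mapped to [0, 1],
    # is treated as the probability that the pair is in sync.
    p_sync = (F.cosine_similarity(v, a) + 1.0) / 2.0
    return F.binary_cross_entropy(p_sync, labels)
```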

The main limitation of Wav2Lip is that it feeds RGB images directly to the SyncNet model during training, which caps the resolution it can handle. In the more recent paper, the VQGAN encoder is used to compress the images, i.e. the model works in the quantized latent space, which makes it possible to train the expert lip-sync discriminator at much higher resolutions. A no-brainer, right?
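
In pseudocode terms, the idea is simply to pass the face crops through a pretrained VQGAN encoder and hand the resulting latents to the expert discriminator. The `vqgan.encode` call below stands in for whatever interface the concrete VQGAN implementation exposes; treat it as an assumption rather than the paper's actual code.

```python
# A hedged sketch of the key idea: encode face crops with a pretrained VQGAN
# encoder and feed the quantized latents (instead of RGB pixels) to the expert
# lip-sync discriminator. `vqgan.encode` is a placeholder interface.
import torch

@torch.no_grad()
def to_latent_space(vqgan, frames):
    """frames: (B, 3, H, W) face crops in [-1, 1]."""
    # The encoder downsamples spatially (e.g. by 16x) and the quantizer snaps
    # each spatial position to its nearest codebook vector.
    quantized, _, _ = vqgan.encode(frames)  # assumed shape: (B, 256, H/16, W/16)
    return quantized

# The expert discriminator is then trained on these latents rather than on
# raw RGB frames, which is what makes higher input resolutions tractable.
```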

Wav2Lip – Training the Lip Generator and the Visual Quality Discriminator

Now that we understand how the lip-sync discriminator works and is trained, we can take a look at how Wav2Lip trains the lip generator and the visual quality discriminator in a GAN setup.

It is important to highlight that the lip generator and the visual quality discriminator are trained at the same time in a GAN setup, while the lip-sync discriminator is trained first, independently, and its weights are then frozen.

The lip generator generates lip sequences for a given audio segment, N frames at a time (5 for Wav2Lip). At the same time, the generator also receives as input a sequence of N concatenated reference frames, i.e. cropped faces from a random section of the video of the same speaker. And lastly, the final input is a sequence of N cropped face frames, with the bottom half masked, that correspond to the audio segment given as input.

The output is the sequence of frames that were masked, with the bottom half filled in with the lip movements corresponding to the audio.
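
Put together, the generator's visual input can be assembled roughly as in the sketch below; the tensor shapes and the simple zero-masking are illustrative assumptions, not taken from the official implementation.

```python
# A rough sketch of how the generator's visual input can be assembled, based
# on the description above: the masked target frames and the reference frames
# are concatenated along the channel dimension. Shapes are illustrative.
import torch

def build_generator_input(target_frames, reference_frames):
    """target_frames, reference_frames: (B, N, 3, H, W) with N = 5."""
    masked = target_frames.clone()
    # Mask the bottom half of each target frame; the generator must fill it in.
    masked[:, :, :, masked.shape[3] // 2:, :] = 0.0
    # Channel-wise concatenation: each time step now has 6 channels
    # (3 masked-target + 3 reference).
    return torch.cat([masked, reference_frames], dim=2)

# The generator receives this tensor together with the mel spectrogram of the
# corresponding audio window and outputs N full frames with the lower half
# filled in.
```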

During training, every time the lip generator outputs a target sequence of frames, these are first assessed by the lip-sync discriminator for sync accuracy, and then judged for visual quality by the visual quality discriminator. Over the course of training, the lip generator and the visual quality discriminator get better and better, until the lip sync is so good that neither the lip-sync discriminator nor the visual quality discriminator can distinguish it from real lip-synching.
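
In loss terms, this amounts to combining a reconstruction loss, the frozen expert's sync loss, and the adversarial loss from the visual quality discriminator, roughly as sketched below. The weights here are illustrative assumptions; the paper tunes its own values.

```python
# A hedged sketch of how the three training signals can be combined, in the
# spirit of Wav2Lip: an L1 reconstruction loss, the (frozen) expert sync loss,
# and the adversarial loss from the visual quality discriminator.
def generator_loss(recon_l1, expert_sync_loss, adversarial_loss,
                   sync_w=0.03, gan_w=0.07):
    # The reconstruction term keeps the generated faces close to ground truth,
    # the sync term is produced by the frozen expert discriminator, and the
    # adversarial term comes from the visual quality discriminator.
    return (1.0 - sync_w - gan_w) * recon_l1 \
        + sync_w * expert_sync_loss \
        + gan_w * adversarial_loss
```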

Anyway, by using VQGAN to compress the images, instead of training on 96×96×3 = 27,648 values per frame, we train on latent vectors of size 96/16 × 96/16 × 256 = 9,216.
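
As a quick sanity check of that arithmetic, assuming a VQGAN with a 16× spatial downsampling factor and 256-dimensional codebook embeddings:

```python
H, W, C = 96, 96, 3
raw_values = H * W * C                       # 96 * 96 * 3 = 27,648 values per frame
latent_values = (H // 16) * (W // 16) * 256  # 6 * 6 * 256 = 9,216 values per frame
print(raw_values, latent_values)             # 27648 9216
```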

References:

  1. K. R. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. V. Jawahar, “A Lip Sync Expert Is All You Need for Speech to Lip Generation in the Wild,” arXiv preprint arXiv:2008.10010, 2020. [Online]. Available: https://arxiv.org/abs/2008.10010
  2. S. Gupta, V. Jain, and V. P. Namboodiri, “Towards Generating Ultra-High Resolution Talking-Face Videos With Lip Synchronization,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 1-10. [Online]. Available: https://openaccess.thecvf.com/content/WACV2023/papers/Gupta_Towards_Generating_Ultra-High_Resolution_Talking-Face_Videos_With_Lip_Synchronization_WACV_2023_paper.pdf