Tacotron-2 and WaveGlow model for Audio Deepfake generation


Audio deepfakes are closely related to text-to-speech (TTS) technology. TTS converts text input into spoken audio: trained on recorded speech, AI models produce natural-sounding voices that can read aloud PDFs, websites, and books.

Tacotron 2 is a neural network architecture for speech synthesis directly from text. Together, the Tacotron 2 and WaveGlow models form a complete text-to-speech system: Tacotron 2 produces mel spectrograms from input text using an encoder-decoder architecture, and WaveGlow, a flow-based generative model, consumes those mel spectrograms to generate speech.

Model architecture

The Tacotron 2 model is a recurrent sequence-to-sequence model with attention that predicts mel spectrograms from text. The encoder transforms the whole text into a hidden feature representation, which the autoregressive decoder then consumes to produce one spectrogram frame at a time. In this implementation, the autoregressive WaveNet vocoder of the original paper is replaced by the flow-based generative WaveGlow.
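As a rough sketch of the data flow, the shapes below are illustrative assumptions based on the public NVIDIA checkpoints (80 mel channels, a hop length of 256 samples, audio at 22,050 Hz), not values taken from the original article:

import torch

# Illustrative shapes only; the actual values depend on the checkpoint.
batch, n_mels, frames = 1, 80, 500

# Tacotron 2: token IDs -> mel spectrogram, one frame per decoder step.
mel = torch.randn(batch, n_mels, frames)

# WaveGlow: mel spectrogram -> raw waveform. With a hop length of 256,
# 500 frames correspond to 500 * 256 = 128,000 samples, i.e. about 5.8 s
# of audio at 22,050 Hz.
audio = torch.randn(batch, frames * 256)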

The example given below executes the following steps:

  1. Load the pre-trained Tacotron 2 and WaveGlow models from torch.hub.
  2. Process the input text into a tensor representation.
  3. Tacotron 2 generates a mel spectrogram from the tensor representation of the input text.
  4. WaveGlow generates audio from the mel spectrogram.
  5. The output is saved as an “audio.wav” file.

Let’s see how the code works:

  • This code requires a GPU (Graphics Processing Unit), since both models are moved to a CUDA device and PyTorch inference is much faster there; a quick availability check is sketched after the install commands below.
  • Install the required libraries such as numpy, scipy, and librosa, which are needed for preprocessing and for producing the output in the notebook.
%%bash
pip install numpy scipy librosa unidecode inflect
apt-get update
apt-get install -y libsndfile1
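  • Before loading the models, it can help to confirm that a CUDA device is actually visible. This is a small sanity-check sketch, not part of the original walkthrough:
import torch

# Fail early with a clear message on a CPU-only runtime.
assert torch.cuda.is_available(), "This walkthrough expects a CUDA-capable GPU."
print(torch.cuda.get_device_name(0))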
  • Import the torch library and load the WaveGlow model, which is pre-trained on the publicly available LJ Speech dataset.
import torch
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow', model_math='fp32')
  • Prepare the WaveGlow model for inference: remove_weightnorm folds the weight-normalisation parametrisation back into plain weights, and eval() switches the model to inference mode.
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda')
waveglow.eval()
  • Load the pre-trained Tacotron 2 model from PyTorch Hub.
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp32')
tacotron2 = tacotron2.to('cuda')
tacotron2.eval()
  • Let’s create an audio file based on the given text.
text = "Welcome to AiEnsured"
  • Format the input text into a tensor using the prepare_input_sequence helper from the TTS utils.
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
sequences, lengths = utils.prepare_input_sequence([text])
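  • Optionally, inspect what the helper produced; the exact shapes and dtypes below are assumptions about the torch.hub utilities' behaviour, so treat them as illustrative:
print(sequences.shape, sequences.dtype)  # e.g. torch.Size([1, 20]), torch.int64
print(lengths)  # per-utterance lengths before padding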
  • Run the chained models to generate the mel spectrogram and then the waveform; torch.no_grad() disables gradient tracking, which saves memory during inference.
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)
audio_numpy = audio[0].data.cpu().numpy()
rate = 22050  # sampling rate of the LJ Speech recordings, in Hz
  • Write the output to an audio file named “audio.wav”.
from scipy.io.wavfile import write
write("audio.wav", rate, audio_numpy)
  • Play the audio file using the IPython module.
from IPython.display import Audio
Audio(audio_numpy, rate=rate)
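
As an optional extra, the intermediate mel spectrogram can be visualised to see what Tacotron 2 actually handed to WaveGlow. This is a quick inspection sketch using matplotlib, not part of the original walkthrough:

import matplotlib.pyplot as plt

# The mel tensor still lives on the GPU; move it to the CPU for plotting.
plt.figure(figsize=(10, 4))
plt.imshow(mel[0].cpu().numpy(), aspect='auto', origin='lower')
plt.xlabel('Decoder frame')
plt.ylabel('Mel channel')
plt.title('Tacotron 2 mel spectrogram')
plt.show()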

References

https://catalog.ngc.nvidia.com/orgs/nvidia/resources/tacotron_2_and_waveglow_for_pytorch

https://pytorch.org/

https://paperswithcode.com/method/tacotron-2

Colab notebook link:

https://colab.research.google.com/drive/1Y1hHA2AnLrVtBxlqoHifr_CuLc2wGZnD#scrollTo=b4a85503

By Hamsini Ramesh