Today, we will implement an automatic speech recognition (ASR) feature using the Wav2Vec2 model provided by Hugging Face’s Transformers library. Wav2Vec2 is a deep learning model for speech recognition that delivers strong results on continuous speech. We will walk through the process of converting speech data into text with this model in detail.
1. Understanding the Wav2Vec2 Model
Wav2Vec2 is a speech recognition model developed by Facebook AI. The model is first pre-trained on large amounts of unlabeled audio through self-supervised learning and is then fine-tuned on transcribed speech, which lets it reach strong performance with far less labeled data than traditional speech recognition methods. Wav2Vec2 takes a raw speech signal as input and converts it into text.
1.1 Structure of Wav2Vec2
The Wav2Vec2 model consists of three main components:
- Feature encoder: a stack of convolutional layers that turns the raw waveform into latent speech representations.
- Transformer encoder: a context network that converts these latent representations into contextualized, high-dimensional representations.
- CTC head: for speech recognition, a linear layer on top of the Transformer predicts a character at each time step, and the output is decoded into text with Connectionist Temporal Classification (CTC).
Only the fine-tuning stage requires speech samples paired with text labels; the pre-training stage learns from unlabeled speech alone.
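You can inspect these components directly on a loaded model (loading is covered in section 3; we load it here only for illustration). A minimal sketch, assuming transformers and torch are installed:

from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")

# The convolutional feature encoder that processes the raw waveform
print(type(model.wav2vec2.feature_extractor).__name__)
# The Transformer context network
print(type(model.wav2vec2.encoder).__name__)
# The linear CTC head that maps hidden states to character logits
print(model.lm_head)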
2. Setting Up the Environment
To use Wav2Vec2, you first need to install the necessary libraries. Use the following command to install the transformers, torchaudio, and torch libraries:
!pip install transformers torchaudio torch
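If you want to confirm that the installation succeeded, a quick check like the following prints the installed versions (the exact numbers will depend on your environment):

import torch
import torchaudio
import transformers

print(transformers.__version__)
print(torch.__version__)
print(torchaudio.__version__)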
3. Loading the Wav2Vec2 Model
Once the libraries are installed, the next step is to load the Wav2Vec2 model. Hugging Face’s transformers library makes this easy. Recent versions of the library recommend Wav2Vec2Processor, which bundles the feature extractor and the tokenizer in a single object (the older standalone Wav2Vec2Tokenizer is deprecated):
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
In the code above, we load a pre-trained model called facebook/wav2vec2-large-960h. This model has been trained on the 960 hours of English speech in the LibriSpeech corpus and expects audio sampled at 16 kHz.
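Because the model is trained with a character-level CTC objective, its output vocabulary is small. A minimal sketch for inspecting it, using the processor loaded above:

# The CTC vocabulary: letters, a word delimiter "|", and a few special tokens
vocab = processor.tokenizer.get_vocab()
print(len(vocab))  # around 32 entries for this checkpoint
print(sorted(vocab, key=vocab.get))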
4. Preparing the Audio File
To use the Wav2Vec2 model, you need an audio file. WAV is a supported audio format. You can use libraries like torchaudio or librosa to read audio files. Below is the code to load an audio file using torchaudio:
import torchaudio
# Path to the audio file
audio_file = "path_to_your_audio_file.wav"
# Load the audio file
waveform, sample_rate = torchaudio.load(audio_file)
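As an alternative, librosa can load and resample the file in a single call. A minimal sketch, assuming librosa is installed and using the same placeholder file path:

import librosa

# librosa resamples to the requested rate and returns a 1-D float array
speech, sr = librosa.load("path_to_your_audio_file.wav", sr=16000)
print(speech.shape, sr)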
5. Performing Speech Recognition
Now we are ready to perform speech recognition using the Wav2Vec2 model. We can pass the loaded audio to the model to convert it into text. Before inputting it to the model, we need to make sure the audio is mono and sampled at 16 kHz, the rate the model was trained on:
# Downmix to mono and resample to 16 kHz if necessary
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
waveform = waveform.squeeze().numpy()  # (channels, time) -> (time,)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt", padding="longest")
Now we can perform recognition through the model:
import torch

with torch.no_grad():
    logits = model(inputs["input_values"]).logits
# Find the index with the highest probability at each time step
predicted_ids = torch.argmax(logits, dim=-1)
# Convert indices to text
transcription = processor.batch_decode(predicted_ids)[0]
Here, the transcription variable holds the text transcribed from the speech.
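For intuition, batch_decode performs greedy CTC decoding: it collapses runs of repeated tokens and removes the padding token, which doubles as the CTC blank. A minimal sketch of the same steps done by hand (for illustration only; batch_decode already does all of this):

ids = predicted_ids[0].tolist()

# Collapse runs of repeated ids, then drop the blank/pad token
collapsed = [i for j, i in enumerate(ids) if j == 0 or i != ids[j - 1]]
tokens = [processor.tokenizer.convert_ids_to_tokens(i)
          for i in collapsed if i != processor.tokenizer.pad_token_id]

# "|" is the word delimiter in the Wav2Vec2 character vocabulary
print("".join(tokens).replace("|", " "))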
6. Complete Code Example
We will summarize the entire speech recognition process by combining all the above steps into a single code block:
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the model and processor
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")

# Path to the audio file
audio_file = "path_to_your_audio_file.wav"

# Load the audio file
waveform, sample_rate = torchaudio.load(audio_file)

# Downmix to mono and resample to 16 kHz if necessary
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
waveform = waveform.squeeze().numpy()  # (channels, time) -> (time,)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt", padding="longest")

# Perform recognition
with torch.no_grad():
    logits = model(inputs["input_values"]).logits
predicted_ids = torch.argmax(logits, dim=-1)

# Convert to text
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
7. Check the Results
Running the above code will print the text result for the given audio file. This completes a simple automatic speech recognition system built on the Wav2Vec2 model. The accuracy of the results may vary depending on the quality and length of the audio file.
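If you have a reference transcript for the audio, you can quantify accuracy with the word error rate (WER). A minimal sketch using the jiwer package (an assumption: jiwer is installed separately with pip install jiwer, and reference_text is a hypothetical variable holding the true transcript; note that this checkpoint outputs uppercase text):

from jiwer import wer

# reference_text: the known correct transcript (hypothetical example value)
reference_text = "HELLO WORLD"
error_rate = wer(reference_text, transcription)
print(f"WER: {error_rate:.2%}")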
8. Conclusion
We have implemented an automatic speech recognition system with the Wav2Vec2 model using Hugging Face’s Transformers library. This example let us walk through the basic pipeline of deep-learning-based speech recognition and see the strong performance of Wav2Vec2 first-hand. Since speech recognition technology is highly applicable in many fields, those interested in this domain are encouraged to keep studying it and build expertise.
9. Additional Resources
For more information, please refer to the following resources:
- wav2vec 2.0 paper: “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” (Baevski et al., 2020), https://arxiv.org/abs/2006.11477
- Hugging Face Transformers documentation for Wav2Vec2: https://huggingface.co/docs/transformers/model_doc/wav2vec2