Recently, the Hugging Face ecosystem has been widely used in both natural language processing (NLP) and computer vision (CV) within artificial intelligence and deep learning. In this article, we will explain how to process an image dataset and train a model using the Hugging Face Transformers library, exploring each step in detail with example code.
1. What are Hugging Face and Transformers?
Hugging Face is a company and open-source community best known for its Transformers library, which provides a wide range of pre-trained models. The library originally focused on natural language processing, making it easy to use models like BERT, GPT-2, and T5, but image models such as the Vision Transformer (ViT) and CLIP have since been added and demonstrate strong performance on computer vision tasks as well.
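As a quick illustration of how little code is needed, the snippet below uses the high-level pipeline API to classify an image with a pretrained ViT checkpoint (a minimal sketch; 'path/to/image.jpg' is a placeholder for any local image file):
from transformers import pipeline
# Load a pretrained ViT image classifier via the high-level pipeline API
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
# 'path/to/image.jpg' is a placeholder; pass any local image path
print(classifier("path/to/image.jpg")[0])  # e.g. {'label': '...', 'score': 0.98}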
2. Required Packages and Environment Setup
Before using the Hugging Face Transformers library, you must first install the required packages. You can install them with the following command.
pip install transformers torchvision torch
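To confirm the installation succeeded, you can optionally print the installed versions:
python -c "import torch, transformers; print(torch.__version__, transformers.__version__)"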
3. Sample Image Dataset
In this tutorial, we will use the CIFAR-10 dataset as sample data. CIFAR-10 consists of 60,000 32×32 color images distributed across 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). This dataset is well suited to image classification problems.
3.1 Loading the Dataset
We will use the torchvision library in Python to load the CIFAR-10 dataset and split it into training and validation sets.
import torch
import torchvision
import torchvision.transforms as transforms
# Set data transformations.
# ViT (google/vit-base-patch16-224) expects 224x224 inputs, so the 32x32
# CIFAR-10 images are resized before being converted to tensors.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
# Load training and validation datasets
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4, shuffle=False, num_workers=2)
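Before moving on, it helps to pull a single mini-batch and confirm the tensor shapes produced by the transforms above (an optional sanity check):
# Fetch one mini-batch and inspect its shapes
images, labels = next(iter(trainloader))
print(images.shape)  # expected: torch.Size([4, 3, 224, 224]) after the Resize above
print(labels.shape)  # expected: torch.Size([4]), class indices in [0, 9]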
3.2 Data Preprocessing
The loaded dataset undergoes a transformation process before training. In the code above, Resize() scales each 32×32 image up to the 224×224 resolution the ViT checkpoint expects, ToTensor() converts images to tensors with values in [0, 1], and Normalize() shifts and scales each channel by a mean and standard deviation of 0.5, mapping values into [-1, 1] (for example, a pixel value of 1.0 becomes (1.0 - 0.5) / 0.5 = 1.0, and 0.0 becomes -1.0).
4. Building the Vision Transformer (ViT) Model
Now we will build the ViT model to classify images from the CIFAR-10 dataset. The model definition can be easily implemented using the Hugging Face Transformers library.
from transformers import ViTForImageClassification, ViTFeatureExtractor
# Initialize ViT model and feature extractor
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
# The pretrained checkpoint ships with a 1000-class ImageNet head, so
# ignore_mismatched_sizes=True is required to replace it with a freshly
# initialized 10-class head.
model = ViTForImageClassification.from_pretrained(
    'google/vit-base-patch16-224',
    num_labels=10,
    ignore_mismatched_sizes=True,
)
The above code loads the pretrained Vision Transformer, where the `num_labels` parameter sets the number of output classes; here it is 10, as CIFAR-10 has 10 classes. Because the checkpoint was trained on ImageNet with 1,000 classes, `ignore_mismatched_sizes=True` tells the library to discard the pretrained classification head and initialize a new one for our 10 classes. (In recent versions of transformers, `ViTImageProcessor` is the recommended replacement for the deprecated `ViTFeatureExtractor`.)
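If you want to double-check that the torchvision transforms from Section 3.1 match the checkpoint's own preprocessing, the feature extractor exposes its statistics (the exact form of the `size` attribute varies across transformers versions):
# Inspect the preprocessing statistics shipped with the checkpoint
print(feature_extractor.image_mean)  # expected: [0.5, 0.5, 0.5]
print(feature_extractor.image_std)   # expected: [0.5, 0.5, 0.5]
print(feature_extractor.size)        # expected input resolution (an int or a dict, depending on version)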
5. Model Training
To train the model, we need to define a loss function and an optimization algorithm. Here, we will use CrossEntropyLoss and the Adam optimizer.
import torch.optim as optim
# Define loss function and optimization algorithm
criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Set device for model training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Train the model
# Train the model; model.train() enables training-mode behavior (e.g. dropout)
model.train()
for epoch in range(10):  # Train for 10 epochs
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)
        # Zero the gradients
        optimizer.zero_grad()
        # Forward + backward + optimize
        outputs = model(inputs).logits
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if i % 2000 == 1999:  # Print every 2000 mini-batches
            print(f"[{epoch + 1}, {i + 1}] loss: {running_loss / 2000:.3f}")
            running_loss = 0.0
The above code trains the model over the specified epochs and mini-batches: it computes the loss for each batch and updates the weights through backpropagation. The `model.train()` call puts the network into training mode so that layers such as dropout behave correctly.
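Once training finishes, you may want to persist the fine-tuned weights. As a sketch, `save_pretrained()` writes the model and preprocessing configuration to a local directory ('./vit-cifar10' is just an example path):
# Save the fine-tuned model and preprocessing config for later reuse
model.save_pretrained('./vit-cifar10')
feature_extractor.save_pretrained('./vit-cifar10')
# They can be reloaded later with:
# model = ViTForImageClassification.from_pretrained('./vit-cifar10')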
6. Model Evaluation
To evaluate the trained model, we will use the test dataset. The method for evaluating the model’s accuracy is as follows.
# Evaluate on the test set; model.eval() switches off training-mode behavior
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        images, labels = images.to(device), labels.to(device)
        outputs = model(images).logits
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f'Accuracy of the network on the 10000 test images: {100 * correct / total:.2f}%')
The above code calculates the model's accuracy on the test dataset by comparing the predicted class with the true label for each image and computing the fraction of correct predictions.
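To make this more concrete, here is a minimal sketch of running inference on a single test image and mapping the predicted index back to a human-readable class name (the `classes` tuple follows the standard CIFAR-10 label order):
# Predict the class of a single test image
classes = ('airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck')
image, label = testset[0]
with torch.no_grad():
    logits = model(image.unsqueeze(0).to(device)).logits
predicted_class = logits.argmax(dim=-1).item()
print(f"predicted: {classes[predicted_class]}, actual: {classes[label]}")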
7. Conclusion
In this article, we explored how to process the CIFAR-10 dataset with the Hugging Face Transformers library and how to build a Vision Transformer model to classify its images. The Hugging Face library makes it easy to construct complex models and fine-tune them on a variety of datasets. We encourage you to continue exploring deep learning use cases with diverse models and datasets.
If you have any questions or need additional information, please feel free to leave a comment.