1. Introduction
The field of artificial intelligence is advancing at a remarkable pace. In particular, deep learning models have achieved
extraordinary performance in computer vision and natural language processing. Among them, the CLIP (Contrastive
Language–Image Pre-Training) model has gained attention for its ability to understand and process text and images jointly.
In this course, we will explain in detail how to load and use CLIP-based pre-trained models with the Hugging Face
Transformers library.
2. Concept of the CLIP Model
The CLIP model, introduced by OpenAI, is trained to connect and understand text and images. It learns from a large-scale
dataset of image-text pairs, which enables it to pick out the description that best matches a given image or to retrieve
the images that best correspond to a given text.
The core idea behind CLIP is contrastive learning: matching text-image pairs are pulled close together in a shared
embedding space, while mismatched pairs are pushed apart. Because the model is trained with natural-language supervision
rather than fixed class labels, it performs remarkably well on zero-shot tasks, recognizing concepts it was never
explicitly labeled for.
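To make the idea concrete, below is a simplified, illustrative sketch of a symmetric contrastive objective. It is not
CLIP's actual training code; the random tensors merely stand in for the outputs of the image and text encoders.

import torch
import torch.nn.functional as F

# Suppose we have a batch of N matching image-text pairs, already encoded
# into D-dimensional embeddings and L2-normalized.
N, D = 4, 512
image_emb = F.normalize(torch.randn(N, D), dim=-1)
text_emb = F.normalize(torch.randn(N, D), dim=-1)

# Cosine similarity matrix: entry (i, j) compares image i with text j.
# 0.07 is a typical temperature value that sharpens the distribution.
logits = image_emb @ text_emb.t() / 0.07

# The matching pairs sit on the diagonal, so the "correct class" for row i is i.
labels = torch.arange(N)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
print(loss)

The loss pulls each image toward its own caption and pushes it away from every other caption in the batch, and does the
same from the text side.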
3. Hugging Face Transformers Library
The Hugging Face Transformers library makes it easy to work with a wide range of pre-trained models, originally focused
on natural language processing (NLP) and now covering vision and multimodal tasks as well. Through this library, users
can load pre-trained models with a single call and handle tasks such as tokenization and data preprocessing. The CLIP
model is among the models it supports.
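As a quick taste of the API, the snippet below loads the tokenizer that accompanies the CLIP checkpoint used later in
this course and tokenizes a short sentence; the caption text is just an arbitrary example.

from transformers import CLIPTokenizer

# Load the CLIP tokenizer and tokenize a sentence
tokenizer = CLIPTokenizer.from_pretrained('openai/clip-vit-base-patch16')
print(tokenizer("a photo of a cat"))  # token IDs and attention mask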
4. Environment Setup
To use the CLIP model, you first need to install the necessary libraries. In a Python environment, you can install the
Transformers library and related packages with the command below. The examples in this course also use the Pillow and
requests packages; they are usually pulled in as dependencies, but you can list them explicitly to be safe.
pip install transformers torch torchvision pillow requests
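If you want to confirm that the installation succeeded, a quick version check such as the one below is enough.

import transformers
import torch

# Print the installed versions to confirm the imports work
print(transformers.__version__)
print(torch.__version__)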
5. Loading the CLIP Model
Now let's look at how to load the CLIP model. The library provides easy access to pre-trained CLIP checkpoints. The
Python code below loads the CLIP model together with its processor.
from transformers import CLIPProcessor, CLIPModel
# Load CLIP model and processor
model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch16')
5.1. Explanation of the Model and Processor
In the code above, the `from_pretrained` method loads the pre-trained CLIP model and its processor. The processor
prepares the input text and images, transforming them into a format the model can understand: it converts images into
normalized tensors and tokenizes the text so that both can be passed as model input.
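If you are curious what the processor actually produces, a quick inspection like the one below can help. The blank RGB
image is only a stand-in so the snippet runs on its own; any real image works the same way.

from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch16')

# A blank image stands in for a real photo, purely to inspect the output format
dummy_image = Image.new("RGB", (224, 224))
inputs = processor(text="a photo of a cat", images=dummy_image, return_tensors="pt")

# The processor returns tokenized text and a preprocessed image tensor
print(inputs.keys())                 # input_ids, attention_mask, pixel_values
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])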
6. Input of Images and Text
The CLIP model takes both images and text as input. The code below demonstrates downloading an example image and
preparing a text description to pair with it.
import requests
from PIL import Image
# Download image
url = "https://example.com/sample.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# Text input
text = "A sample image description"
6.1. Preparing the Image File
In the code above, the requests library downloads the image file and the Pillow library opens it. The URL shown is only
a placeholder; replace it with the URL of an image you actually want to use, or load an image file stored locally, as
shown below.
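A local file can be opened the same way; the path below is just a placeholder for your own image.

from PIL import Image

# Alternative: open an image stored locally instead of downloading one
image = Image.open("my_photo.jpg").convert("RGB")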
7. Inference with the CLIP Model
Now let's feed the image and text into the model and run inference. You can inspect the model's output with the
following code.
# Preprocess input data
inputs = processor(text=text, images=image, return_tensors="pt", padding=True)
# Model inference
outputs = model(**inputs)
# Extract similarity scores
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
print(f"Prediction probability: {probs}")
7.1. Explanation of the Model Inference Process
After preprocessing the input text and image with the `processor`, we pass the result to the model. The model returns
`logits_per_image`, the similarity logits between the image and each candidate text; applying softmax converts them into
a probability distribution over those texts. Note that with a single text the softmax trivially yields a probability of
1.0, so the scores only become meaningful when several candidate texts are compared, as in the snippet below.
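Reusing the model, processor, and image loaded above, the sketch below compares the image against several candidate
descriptions; the captions are arbitrary examples you would replace with your own.

# Compare the image against multiple candidate descriptions
texts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# One probability per candidate text, shape (1, 3)
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)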
8. Example: Utilizing the CLIP Model
Below is the full code showing how to use the CLIP model end to end. It evaluates how well the downloaded image matches
the given text.
import requests
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
# Load the model and processor
model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch16')
# Download image
url = "https://example.com/sample.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# Text input
text = "A sample image description"
# Preprocess input data
inputs = processor(text=text, images=image, return_tensors="pt", padding=True)
# Model inference
with torch.no_grad():
    outputs = model(**inputs)
# Extract similarity scores
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
print(f"Prediction probability: {probs}")
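As a variation on the example above, you can also extract the image and text embeddings themselves and compare them
with cosine similarity, which is useful when you want to index images for search. The snippet reuses the model,
processor, and image defined in the full example; the caption is again an arbitrary placeholder.

import torch.nn.functional as F

with torch.no_grad():
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=["A sample image description"], return_tensors="pt", padding=True)
    image_emb = model.get_image_features(**image_inputs)
    text_emb = model.get_text_features(**text_inputs)

# Cosine similarity between the L2-normalized embeddings
image_emb = F.normalize(image_emb, dim=-1)
text_emb = F.normalize(text_emb, dim=-1)
print((image_emb @ text_emb.t()).item())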
9. Conclusion
In this course, we explained how to load a CLIP-based pre-trained model using the Hugging Face Transformers library and
evaluate the similarity of image-text pairs. The CLIP model has a wide range of applications and can contribute to
building more capable AI systems. We encourage you to keep exploring the possibilities of artificial intelligence with
the many deep learning techniques available today.