In-Depth Course on Hugging Face Transformers: CLIP Preprocessing

One of the latest trends in deep learning is the emergence of various multi-modal models. In particular, OpenAI’s CLIP (Contrastive Language–Image Pretraining) model is a powerful model that learns the relationship between images and text, allowing it to be applied to a wide range of tasks. In this article, we will explore how to use the CLIP model through the Hugging Face library, with a focus on its preprocessing steps.

1. Introduction to the CLIP Model

The CLIP model is trained on image–text pairs with a contrastive objective, so it learns to judge how well a given text description matches an image. Because it is trained with natural language supervision rather than task-specific labels, it can be applied flexibly to many tasks, such as zero-shot image classification, without fine-tuning on a dedicated dataset.

2. CLIP Preprocessing Steps

To use the CLIP model, appropriate preprocessing of the input images and text is required. The preprocessing steps consist of the following:

  1. Loading and resizing the image
  2. Normalizing the image
  3. Tokenizing the text

2.1 Loading and Resizing the Image

The image input to the model must be resized to a consistent size. Typically, the CLIP model requires images of size 224×224. The Python PIL library can be used for this.
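For example, the image can be loaded and resized with Pillow as in the minimal sketch below (the path image.jpg is a placeholder):

from PIL import Image

# Load the image (placeholder path) and convert to RGB
image = Image.open("image.jpg").convert("RGB")

# Resize to the 224x224 resolution expected by the CLIP vision encoder
image = image.resize((224, 224))
print(image.size)  # (224, 224)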

2.2 Image Normalization

To improve the model’s performance, the pixel values of the image need to be normalized. The CLIP model typically uses mean=[0.48145466, 0.4578275, 0.40821073] and std=[0.26862954, 0.26130258, 0.27577711] for normalization.
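For illustration, the same normalization can be applied manually with torchvision transforms (a sketch only; the CLIPProcessor used later performs this step automatically):

from PIL import Image
from torchvision import transforms

# Normalization constants used by CLIP's image preprocessing
clip_mean = [0.48145466, 0.4578275, 0.40821073]
clip_std = [0.26862954, 0.26130258, 0.27577711]

preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),  # scales pixel values to [0, 1]
    transforms.Normalize(mean=clip_mean, std=clip_std),
])

# "image.jpg" is a placeholder path
pixel_values = preprocess(Image.open("image.jpg").convert("RGB"))
print(pixel_values.shape)  # torch.Size([3, 224, 224])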

2.3 Text Tokenization

The text must be encoded using a predefined tokenizer. CLIP uses the BPE (Byte Pair Encoding) model to convert the text into integer indices.
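A minimal sketch of this step using CLIPTokenizer (the CLIPProcessor shown later wraps the same tokenizer; the sample sentence is a placeholder):

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16")

# BPE-tokenize the text and convert the tokens into integer indices
encoded = tokenizer(["a photo of a cat"], padding=True, return_tensors="pt")
print(encoded["input_ids"])       # token IDs, including start-/end-of-text tokens
print(encoded["attention_mask"])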

3. Code Example

Now let’s implement the above preprocessing steps in Python code. In this example, we will use the Hugging Face transformers library and the PIL library. First, we will install the necessary libraries.

pip install transformers torch torchvision pillow

3.1 Image Preprocessing Code


from PIL import Image
import requests
from transformers import CLIPProcessor

# Load CLIPProcessor
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# Image URL
image_url = "https://example.com/image.jpg"  # Change to a valid image URL.
image = Image.open(requests.get(image_url, stream=True).raw)

# Preprocess the image (resizing and normalization are handled by the processor)
inputs = processor(images=image, return_tensors="pt")
print(inputs)  # contains "pixel_values" of shape (1, 3, 224, 224)

3.2 Text Preprocessing Code


# Text input
text = "A label describing the image"
text_inputs = processor(text=[text], return_tensors="pt", padding=True)
print(text_inputs)  # contains "input_ids" and "attention_mask"

4. Model Prediction

Once the image and text have been preprocessed, you can feed them into the model to compute predictions. This can be done with Hugging Face’s CLIPModel, reusing the inputs and text_inputs produced above.


import torch
from transformers import CLIPModel

# Load CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")

# Extract features of the image and text
with torch.no_grad():
    outputs = model(**inputs, **text_inputs)

# Similarity scores between the image(s) and the text(s)
logits_per_image = outputs.logits_per_image  # shape: (number of images, number of texts)
logits_per_text = outputs.logits_per_text    # shape: (number of texts, number of images)
print(logits_per_image)
print(logits_per_text)
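To make the similarity scores easier to interpret, you can compare the image against several candidate descriptions at once and apply a softmax over the image logits. The following sketch reuses the processor, model, and inputs defined above; the candidate labels are placeholders:

# Compare the image against several candidate descriptions (placeholder labels)
candidate_texts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_inputs = processor(text=candidate_texts, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs, **text_inputs)

# Softmax over the text dimension yields one probability per candidate description
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)  # values depend on the image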

5. Conclusion

In this post, we explored the preprocessing steps for the CLIP model using Hugging Face’s transformers library. Preprocessing images and text correctly is a crucial step for getting the most out of the model. With these steps, you can use the CLIP model to measure and better understand the relationships between images and text.
