{"id":36107,"date":"2024-11-01T09:45:48","date_gmt":"2024-11-01T09:45:48","guid":{"rendered":"http:\/\/atmokpo.com\/w\/?p=36107"},"modified":"2024-11-01T09:45:48","modified_gmt":"2024-11-01T09:45:48","slug":"use-of-hugging-face-transformers-extracting-logits-in-clip-inference","status":"publish","type":"post","link":"https:\/\/atmokpo.com\/w\/36107\/","title":{"rendered":"Use of Hugging Face Transformers, Extracting Logits in CLIP Inference"},"content":{"rendered":"<p>As deep learning and the fields of natural language processing and computer vision advance, a variety of models have emerged. Among them, OpenAI&#8217;s CLIP (Contrastive Language-Image Pretraining) is a powerful model that can understand and utilize both text and images simultaneously. In this course, we will detail how to utilize the CLIP model using the Hugging Face Transformers library and extract logits from it.<\/p>\n<h2>1. Overview of the CLIP Model<\/h2>\n<p>CLIP is a model pre-trained on various pairs of images and text. This model can find the image that best matches the given text description or generate the most suitable text description for a given image. The CLIP model mainly uses two types of input: image and text.<\/p>\n<h3>1.1 Structure of CLIP<\/h3>\n<p>CLIP consists of an image encoder and a text encoder. The image encoder uses a CNN (Convolutional Neural Network) or Vision Transformer to convert images into feature vectors. In contrast, the text encoder uses the Transformer architecture to convert text into feature vectors. The outputs of these two encoders are trained to be positioned in the same vector space, enabling similarity measurement between the two domains.<\/p>\n<h2>2. Environment Setup<\/h2>\n<p>To use the CLIP model, you first need to install Hugging Face Transformers and the necessary libraries. The following packages are required:<\/p>\n<ul>\n<li>transformers<\/li>\n<li>torch<\/li>\n<li>PIL (Python Imaging Library)<\/li>\n<\/ul>\n<p>You can install the required libraries as follows:<\/p>\n<pre><code>pip install torch torchvision transformers pillow<\/code><\/pre>\n<h2>3. Loading the CLIP Model and Preprocessing Images\/Text<\/h2>\n<p>Now, let&#8217;s look at how to load the CLIP model and preprocess the images and text using Hugging Face&#8217;s Transformers library.<\/p>\n<pre><code>import torch\nfrom transformers import CLIPProcessor, CLIPModel\nfrom PIL import Image\n\n# Load the CLIP model and processor\nmodel = CLIPModel.from_pretrained(\"openai\/clip-vit-base-patch16\")\nprocessor = CLIPProcessor.from_pretrained(\"openai\/clip-vit-base-patch16\")\n\n# Load and preprocess the image\nimage = Image.open(\"path\/to\/your\/image.jpg\")\n\n# Prepare the text\ntexts = [\"a photo of a cat\", \"a photo of a dog\", \"a photo of a bird\"]\n\n# Preprocess the text and image with the CLIP processor\ninputs = processor(text=texts, images=image, return_tensors=\"pt\", padding=True)<\/code><\/pre>\n<h3>3.1 Explanation of Text and Image Preprocessing<\/h3>\n<p>In the above code, we load the image and preprocess it against the provided list of text using the CLIP processor. In this step, the text and image are transformed into a format suitable for the CLIP model.<\/p>\n<h2>4. 
<h2>4. Inference with the CLIP Model and Logit Extraction</h2>
<p>With the inputs prepared, we feed them to the model and extract the logits.</p>
<pre><code># Switch the model to evaluation mode (disables dropout, etc.)
model.eval()

# Run a forward pass without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Extract the similarity logits
logits_per_image = outputs.logits_per_image  # shape: (num_images, num_texts)
logits_per_text = outputs.logits_per_text    # shape: (num_texts, num_images)</code></pre>
<h3>4.1 Explanation of Logits</h3>
<p>The logits are scores representing the similarity between the image and the text: the higher the logit, the better the match. <code>logits_per_image</code> scores each candidate text against each image, while <code>logits_per_text</code> is simply its transpose, scoring each image against each text.</p>
<h2>5. Interpreting the Results</h2>
<p>Now let's interpret the extracted logits. Passing them through a softmax turns them into a probability distribution over the candidate texts, which makes the matching scores easy to read off.</p>
<pre><code>import torch.nn.functional as F

# Convert logits into probabilities over the candidate texts
probs = F.softmax(logits_per_image, dim=1)

# Print the probability assigned to each text for the (single) image
for i, text in enumerate(texts):
    print(f"'{text}': {probs[0][i].item():.4f}")</code></pre>
<h3>5.1 Interpretation of Probabilities</h3>
<p>Each probability measures how well a text description fits the provided image, relative to the other candidates. The closer a value is to 1, the better that text matches the image. Note that the softmax normalizes over the candidate list, so the probabilities only compare the texts you supplied.</p>
<h2>6. Examples of CLIP Applications</h2>
<p>CLIP can power a variety of applications, for example:</p>
<ul>
<li>Image tagging: generating appropriate tags for images.</li>
<li>Image search: retrieving images based on text queries.</li>
<li>Content-based recommendation: suggesting images tailored to user preferences.</li>
</ul>
<p>The first of these, image tagging, follows almost directly from the logit-extraction code above; a sketch is given below.</p>
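<p>To make the image-tagging idea concrete, here is a minimal sketch that wraps the steps above into a reusable helper. The function name <code>tag_image</code>, the prompt template, and the example tag list are assumptions made for this illustration, not part of the Transformers API:</p>
<pre><code>import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
model.eval()

def tag_image(image_path, candidate_tags, top_k=3):
    """Return the top_k most likely tags for an image, with probabilities."""
    image = Image.open(image_path).convert("RGB")
    # Prompt templates such as "a photo of ..." usually score better than bare tags
    prompts = [f"a photo of {tag}" for tag in candidate_tags]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # One image, so take row 0 of the (num_images, num_texts) logits
    probs = F.softmax(outputs.logits_per_image, dim=1)[0]
    top = probs.topk(min(top_k, len(candidate_tags)))
    return [(candidate_tags[int(i)], float(p)) for p, i in zip(top.values, top.indices)]

# Hypothetical usage:
# print(tag_image("path/to/your/image.jpg", ["a cat", "a dog", "a bird", "a car"]))</code></pre>
<p>Keep in mind that the softmax only compares the candidates you supply, so the usefulness of the resulting tags depends entirely on the candidate list.</p>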
<h2>7. Conclusion</h2>
<p>In this course, we learned how to load the CLIP model with the Hugging Face Transformers library, preprocess images and text, and extract and interpret the logits. CLIP is a very useful tool for problems built on paired image and text data, and we encourage you to build more advanced examples on top of it.</p>
<h2>8. References</h2>
<ul>
<li><a href="https://huggingface.co/docs/transformers/model_doc/clip">Hugging Face CLIP Documentation</a></li>
<li><a href="https://openai.com/research/clip">OpenAI CLIP Paper</a></li>
<li><a href="https://github.com/openai/CLIP">OpenAI CLIP GitHub Repository</a></li>
</ul>
<p><strong>The End!</strong></p>