{"id":36177,"date":"2024-11-01T09:46:24","date_gmt":"2024-11-01T09:46:24","guid":{"rendered":"http:\/\/atmokpo.com\/w\/?p=36177"},"modified":"2024-11-01T09:46:24","modified_gmt":"2024-11-01T09:46:24","slug":"using-hugging-face-transformers-running-wav2vec2-automatic-speech-recognition","status":"publish","type":"post","link":"https:\/\/atmokpo.com\/w\/36177\/","title":{"rendered":"Using Hugging Face Transformers, Running Wav2Vec2 Automatic Speech Recognition"},"content":{"rendered":"<p><body><\/p>\n<p>Today, we will implement automatic speech recognition (ASR) using the Wav2Vec2 model available through Hugging Face&#8217;s Transformers library. Wav2Vec2 is a deep learning model for speech recognition that operates directly on raw audio waveforms. We will walk through the process of converting speech data into text with this model.<\/p>\n<h2>1. Understanding the Wav2Vec2 Model<\/h2>\n<p>Wav2Vec2 is a speech recognition model developed by Facebook AI. It is first pre-trained on unlabeled audio through self-supervised learning and then fine-tuned on labeled speech for transcription. The fine-tuned model takes a raw speech waveform as input and produces text. In particular, it needs far less labeled data than traditional speech recognition methods.<\/p>\n<h3>1.1 Structure of Wav2Vec2<\/h3>\n<p>The fine-tuned Wav2Vec2 model used in this post can be divided into two main components:<\/p>\n<ul>\n<li><strong>Encoder<\/strong>: A convolutional feature encoder followed by a Transformer network turns the raw speech signal into contextual, high-dimensional representations.<\/li>\n<li><strong>CTC head<\/strong>: Rather than a separate decoder network, a linear layer trained with Connectionist Temporal Classification (CTC) maps each representation to character probabilities, which are then decoded into text.<\/li>\n<\/ul>\n<p>Pre-training uses only unlabeled audio, while fine-tuning uses speech samples paired with their corresponding text labels.<\/p>\n<h2>2. Setting Up the Environment<\/h2>\n<p>To use Wav2Vec2, you first need to install the necessary libraries. 
Use the following command to install the <strong>transformers<\/strong>, <strong>torchaudio<\/strong>, and <strong>torch<\/strong> libraries (the leading <strong>!<\/strong> is for notebook environments such as Jupyter or Colab; omit it in a regular shell):<\/p>\n<pre><code>!pip install transformers torchaudio torch<\/code><\/pre>\n<h2>3. Loading the Wav2Vec2 Model<\/h2>\n<p>Once the libraries are installed, the next step is to load the Wav2Vec2 model. Hugging Face&#8217;s <strong>transformers<\/strong> library makes this easy:<\/p>\n<pre><code>from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer\n\n# Download (or load from the local cache) the pre-trained weights and vocabulary\nmodel = Wav2Vec2ForCTC.from_pretrained(\"facebook\/wav2vec2-large-960h\")\ntokenizer = Wav2Vec2Tokenizer.from_pretrained(\"facebook\/wav2vec2-large-960h\")<\/code><\/pre>\n<p>The code above loads a pre-trained model called <strong>facebook\/wav2vec2-large-960h<\/strong>. This model was pre-trained and fine-tuned on 960 hours of English speech from LibriSpeech, sampled at 16 kHz. Recent Transformers releases mark <strong>Wav2Vec2Tokenizer<\/strong> as deprecated in favor of <strong>Wav2Vec2Processor<\/strong>, so you may see a warning when running this code.<\/p>\n<h2>4. Preparing the Audio File<\/h2>\n<p>To use the Wav2Vec2 model, you need an audio file, for example in WAV format. You can use libraries like <strong>torchaudio<\/strong> or <strong>librosa<\/strong> to read audio files. Below is the code to load an audio file using torchaudio:<\/p>\n<pre><code>import torchaudio\n\n# Path to the audio file\naudio_file = \"path_to_your_audio_file.wav\"\n# Load the audio file; waveform has shape (channels, time)\nwaveform, sample_rate = torchaudio.load(audio_file)<\/code><\/pre>\n<h2>5. Performing Speech Recognition<\/h2>\n<p>Now we are ready to perform speech recognition using the Wav2Vec2 model. We pass the loaded waveform to the model to convert it into text. 
Before passing the audio to the model, we need to mix it down to a single channel and match the 16 kHz sample rate the model expects:<\/p>\n<pre><code># Mix down to mono and resample to 16 kHz\nwaveform = waveform.mean(dim=0)  # (channels, time) -> (time,)\nwaveform = torchaudio.functional.resample(waveform, sample_rate, 16000)\ninputs = tokenizer(waveform.numpy(), return_tensors=\"pt\", padding=\"longest\")<\/code><\/pre>\n<p>Now we can perform recognition through the model:<\/p>\n<pre><code>import torch\n\n# Run inference without tracking gradients\nwith torch.no_grad():\n    logits = model(inputs[\"input_values\"]).logits\n\n# Pick the most likely token at each time step\npredicted_ids = torch.argmax(logits, dim=-1)\n# Collapse the CTC output and convert the token IDs to text\ntranscription = tokenizer.batch_decode(predicted_ids)[0]<\/code><\/pre>\n<p>Here, the <strong>transcription<\/strong> variable holds the recognized text.<\/p>\n<h2>6. Complete Code Example<\/h2>\n<p>The code block below combines all of the steps above into a single script:<\/p>\n<pre><code>import torchaudio\nimport torch\nfrom transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer\n\n# Load model and tokenizer\nmodel = Wav2Vec2ForCTC.from_pretrained(\"facebook\/wav2vec2-large-960h\")\ntokenizer = Wav2Vec2Tokenizer.from_pretrained(\"facebook\/wav2vec2-large-960h\")\n\n# Path to the audio file\naudio_file = \"path_to_your_audio_file.wav\"\n# Load the audio file; waveform has shape (channels, time)\nwaveform, sample_rate = torchaudio.load(audio_file)\n\n# Mix down to mono and resample to 16 kHz\nwaveform = waveform.mean(dim=0)  # (channels, time) -> (time,)\nwaveform = torchaudio.functional.resample(waveform, sample_rate, 16000)\ninputs = tokenizer(waveform.numpy(), return_tensors=\"pt\", padding=\"longest\")\n\n# Perform recognition without tracking gradients\nwith torch.no_grad():\n    logits = model(inputs[\"input_values\"]).logits\n    predicted_ids = torch.argmax(logits, dim=-1)\n\n# Convert to text\ntranscription = tokenizer.batch_decode(predicted_ids)[0]\nprint(transcription)<\/code><\/pre>\n<h2>7. Check the Results<\/h2>\n<p>Running the above code will print the transcription of the given audio file. This completes a simple automatic speech recognition system built on the Wav2Vec2 model. 
The accuracy of the results may vary depending on the quality and length of the audio file.<\/p>\n<h2>8. Conclusion<\/h2>\n<p>We have implemented an automatic speech recognition system with the Wav2Vec2 model through Hugging Face&#8217;s Transformers library. This example walked through the basic steps of speech recognition with a deep learning model: loading a pre-trained model, preparing the audio, and decoding the output into text. Speech recognition is applicable in many fields, so readers interested in this area are encouraged to keep exploring it.<\/p>\n<h2>9. Additional Resources<\/h2>\n<p>For more information, please refer to the following resources:<\/p>\n<ul>\n<li><a href=\"https:\/\/huggingface.co\/transformers\/model_doc\/wav2vec2.html\" target=\"_blank\" rel=\"noopener\">Hugging Face &#8211; Wav2Vec2 Documentation<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/pytorch\/fairseq\/tree\/main\/examples\/wav2vec\" target=\"_blank\" rel=\"noopener\">Fairseq Wav2Vec2 GitHub Repository<\/a><\/li>\n<\/ul>\n<p><\/body><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Today, we will implement an automatic speech recognition (ASR) feature using the Wav2Vec2 model provided by Hugging Face&#8217;s Transformers library. Wav2Vec2 is a speech recognition model and one of the latest deep learning models that excels in continuous speech recognition. 
We will explore the process of converting speech data into text using this model in &hellip; <a href=\"https:\/\/atmokpo.com\/w\/36177\/\" class=\"more-link\">\ub354 \ubcf4\uae30<span class=\"screen-reader-text\"> &#8220;Using Hugging Face Transformers, Running Wav2Vec2 Automatic Speech Recognition&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[108],"tags":[],"class_list":["post-36177","post","type-post","status-publish","format-standard","hentry","category---en"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Using Hugging Face Transformers, Running Wav2Vec2 Automatic Speech Recognition - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/atmokpo.com\/w\/36177\/\" \/>\n<meta property=\"og:locale\" content=\"ko_KR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Using Hugging Face Transformers, Running Wav2Vec2 Automatic Speech Recognition - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\" \/>\n<meta property=\"og:description\" content=\"Today, we will implement an automatic speech recognition (ASR) feature using the Wav2Vec2 model provided by Hugging Face&#8217;s Transformers library. Wav2Vec2 is a speech recognition model and one of the latest deep learning models that excels in continuous speech recognition. 
We will explore the process of converting speech data into text using this model in &hellip; \ub354 \ubcf4\uae30 &quot;Using Hugging Face Transformers, Running Wav2Vec2 Automatic Speech Recognition&quot;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/atmokpo.com\/w\/36177\/\" \/>\n<meta property=\"og:site_name\" content=\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\" \/>\n<meta property=\"article:published_time\" content=\"2024-11-01T09:46:24+00:00\" \/>\n<meta name=\"author\" content=\"root\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@bebubo4\" \/>\n<meta name=\"twitter:site\" content=\"@bebubo4\" \/>\n<meta name=\"twitter:label1\" content=\"\uae00\uc4f4\uc774\" \/>\n\t<meta name=\"twitter:data1\" content=\"root\" \/>\n\t<meta name=\"twitter:label2\" content=\"\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04\" \/>\n\t<meta name=\"twitter:data2\" content=\"4\ubd84\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/atmokpo.com\/w\/36177\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/atmokpo.com\/w\/36177\/\"},\"author\":{\"name\":\"root\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7\"},\"headline\":\"Using Hugging Face Transformers, Running Wav2Vec2 Automatic Speech Recognition\",\"datePublished\":\"2024-11-01T09:46:24+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/atmokpo.com\/w\/36177\/\"},\"wordCount\":542,\"publisher\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#organization\"},\"articleSection\":[\"Using Hugging Face\"],\"inLanguage\":\"ko-KR\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/atmokpo.com\/w\/36177\/\",\"url\":\"https:\/\/atmokpo.com\/w\/36177\/\",\"name\":\"Using Hugging Face Transformers, Running Wav2Vec2 Automatic Speech Recognition - 
\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\",\"isPartOf\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#website\"},\"datePublished\":\"2024-11-01T09:46:24+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/atmokpo.com\/w\/36177\/#breadcrumb\"},\"inLanguage\":\"ko-KR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/atmokpo.com\/w\/36177\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/atmokpo.com\/w\/36177\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"\ud648\",\"item\":\"https:\/\/atmokpo.com\/w\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Using Hugging Face Transformers, Running Wav2Vec2 Automatic Speech Recognition\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/atmokpo.com\/w\/#website\",\"url\":\"https:\/\/atmokpo.com\/w\/\",\"name\":\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/atmokpo.com\/w\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"ko-KR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/atmokpo.com\/w\/#organization\",\"name\":\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\",\"url\":\"https:\/\/atmokpo.com\/w\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png\",\"contentUrl\":\"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png\",\"width\":400,\"height\":400,\"caption\":\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\"},\"image\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/bebubo4\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4
ae64b1abd81d7\",\"name\":\"root\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g\",\"caption\":\"root\"},\"sameAs\":[\"http:\/\/atmokpo.com\/w\"],\"url\":\"https:\/\/atmokpo.com\/w\/author\/root\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Using Hugging Face Transformers, Running Wav2Vec2 Automatic Speech Recognition - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/atmokpo.com\/w\/36177\/","og_locale":"ko_KR","og_type":"article","og_title":"Using Hugging Face Transformers, Running Wav2Vec2 Automatic Speech Recognition - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","og_description":"Today, we will implement an automatic speech recognition (ASR) feature using the Wav2Vec2 model provided by Hugging Face&#8217;s Transformers library. Wav2Vec2 is a speech recognition model and one of the latest deep learning models that excels in continuous speech recognition. 
We will explore the process of converting speech data into text using this model in &hellip; \ub354 \ubcf4\uae30 \"Using Hugging Face Transformers, Running Wav2Vec2 Automatic Speech Recognition\"","og_url":"https:\/\/atmokpo.com\/w\/36177\/","og_site_name":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","article_published_time":"2024-11-01T09:46:24+00:00","author":"root","twitter_card":"summary_large_image","twitter_creator":"@bebubo4","twitter_site":"@bebubo4","twitter_misc":{"\uae00\uc4f4\uc774":"root","\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04":"4\ubd84"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/atmokpo.com\/w\/36177\/#article","isPartOf":{"@id":"https:\/\/atmokpo.com\/w\/36177\/"},"author":{"name":"root","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7"},"headline":"Using Hugging Face Transformers, Running Wav2Vec2 Automatic Speech Recognition","datePublished":"2024-11-01T09:46:24+00:00","mainEntityOfPage":{"@id":"https:\/\/atmokpo.com\/w\/36177\/"},"wordCount":542,"publisher":{"@id":"https:\/\/atmokpo.com\/w\/#organization"},"articleSection":["Using Hugging Face"],"inLanguage":"ko-KR"},{"@type":"WebPage","@id":"https:\/\/atmokpo.com\/w\/36177\/","url":"https:\/\/atmokpo.com\/w\/36177\/","name":"Using Hugging Face Transformers, Running Wav2Vec2 Automatic Speech Recognition - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","isPartOf":{"@id":"https:\/\/atmokpo.com\/w\/#website"},"datePublished":"2024-11-01T09:46:24+00:00","breadcrumb":{"@id":"https:\/\/atmokpo.com\/w\/36177\/#breadcrumb"},"inLanguage":"ko-KR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/atmokpo.com\/w\/36177\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/atmokpo.com\/w\/36177\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"\ud648","item":"https:\/\/atmokpo.com\/w\/en\/"},{"@type":"ListItem","position":2,"name":"Using Hugging Face Transformers, Running Wav2Vec2 Automatic Speech 
Recognition"}]},{"@type":"WebSite","@id":"https:\/\/atmokpo.com\/w\/#website","url":"https:\/\/atmokpo.com\/w\/","name":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","description":"","publisher":{"@id":"https:\/\/atmokpo.com\/w\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/atmokpo.com\/w\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"ko-KR"},{"@type":"Organization","@id":"https:\/\/atmokpo.com\/w\/#organization","name":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","url":"https:\/\/atmokpo.com\/w\/","logo":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/","url":"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png","contentUrl":"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png","width":400,"height":400,"caption":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8"},"image":{"@id":"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/bebubo4"]},{"@type":"Person","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7","name":"root","image":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g","caption":"root"},"sameAs":["http:\/\/atmokpo.com\/w"],"url":"https:\/\/atmokpo.com\/w\/author\/root\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack-related-posts":[],"_links":{"self":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36177","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts"}],"about":
[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/comments?post=36177"}],"version-history":[{"count":1,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36177\/revisions"}],"predecessor-version":[{"id":36178,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36177\/revisions\/36178"}],"wp:attachment":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/media?parent=36177"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/categories?post=36177"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/tags?post=36177"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}