Hugging Face Transformers Utilization Course: Moderna COVID-19 Wikipedia Text Retrieval

With advances in deep learning and Natural Language Processing (NLP), the methods available for processing and analyzing text data have diversified. In this post, I will explain in detail how to retrieve COVID-19 information related to Moderna from Wikipedia using the Hugging Face library. Hugging Face Transformers provides many pretrained models widely used in NLP tasks, allowing users to analyze text data with ease.

1. What is Hugging Face?

Hugging Face is a platform that provides various pretrained models and tools to facilitate easy use of NLP models. In particular, the Transformers library includes various state-of-the-art transformer models, such as BERT, GPT-2, and T5, enabling users to perform natural language processing tasks more easily.

1.1 Key Features of Hugging Face

  • Provision of pretrained models: Pretrained models for various NLP tasks are available.
  • Easy utilization of models: Models can be used straightforwardly without writing complex code, as shown in the short example after this list.
  • Large community: User-created models and datasets are shared, providing various options to choose from.
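
As a quick illustration of this ease of use, here is a minimal sketch that loads a sentiment-analysis pipeline with its default pretrained model; the task and the example sentence are arbitrary choices made only for demonstration.

    from transformers import pipeline

    # Load a default pretrained model for sentiment analysis
    classifier = pipeline("sentiment-analysis")

    # Run inference on a sample sentence
    result = classifier("Hugging Face makes NLP models easy to use.")
    print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99}]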

2. Installation and Environment Setup

To use the Hugging Face library, you need to set up a Python environment. The examples below also use PyTorch, so install it together with the other required libraries using the command below.

pip install transformers torch wikipedia-api
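
To check that the setup worked, you can print the installed versions; this is just a quick sanity check.

    import torch
    import transformers

    # Print the installed library versions to confirm the environment is ready
    print("transformers:", transformers.__version__)
    print("torch:", torch.__version__)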

3. Retrieving Information from Wikipedia

To retrieve information related to Moderna and COVID-19 from Wikipedia, we will use wikipedia-api. This library makes it easy to look up Wikipedia pages by title and fetch their content.

3.1 Example of Retrieving Wikipedia Data

The code below is a simple example that fetches the "Moderna" page and prints its content.

    import wikipediaapi

    # Initialize the Wikipedia API client
    # (recent versions of wikipedia-api require a descriptive user agent)
    wiki_wiki = wikipediaapi.Wikipedia(
        user_agent='HFCourseExample/1.0 (contact@example.com)',  # placeholder user agent
        language='en'
    )

    # Retrieve the "Moderna" page
    page = wiki_wiki.page("Moderna")

    # Print the page content
    if page.exists():
        print("Title: ", page.title)
        print("Summary: ", page.summary[0:1000])  # Print the first 1000 characters
    else:
        print("Page does not exist.")

By running the above code, you can retrieve content from the Wikipedia page of Moderna. Now, let’s check for additional information related to COVID-19.

3.2 Retrieving COVID-19 Related Information

Similarly, the code to retrieve information about COVID-19 from Wikipedia is as follows.

    # Retrieve the "COVID-19" page (reusing the wiki_wiki client from above)
    covid_page = wiki_wiki.page("COVID-19")

    # Print the page content
    if covid_page.exists():
        print("Title: ", covid_page.title)
        print("Summary: ", covid_page.summary[0:1000])  # Print the first 1000 characters
    else:
        print("Page does not exist.")

4. Text Preprocessing

The text retrieved from Wikipedia should go through a preprocessing step before being fed into the model. This process involves removing unnecessary characters and symbols and keeping only the information we need.

4.1 Preprocessing Steps

The code below shows how to remove unnecessary characters from the retrieved text and normalize the whitespace.

    import re

    def preprocess_text(text):
        # Remove special characters
        text = re.sub(r'[^A-Za-z0-9\s]', '', text)
        # Replace multiple spaces with a single space
        text = re.sub(r'\s+', ' ', text)
        return text.strip()

    # Example of preprocessing
    processed_text_moderna = preprocess_text(page.summary)
    processed_text_covid = preprocess_text(covid_page.summary)

    print("Processed Moderna Text: ", processed_text_moderna)
    print("Processed COVID-19 Text: ", processed_text_covid)

5. Analyzing Information with Hugging Face Transformers

To analyze the retrieved data, we can use Hugging Face Transformers. Here, we will look at how to input the preprocessed text into the BERT model to extract features.

5.1 Using BERT Model

Let’s use the Hugging Face BERT model to extract features from the preprocessed text. Please refer to the code below.

    from transformers import BertTokenizer, BertModel
    import torch

    # Load the BERT model and tokenizer
    model_name = 'bert-base-multilingual-cased'
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)

    # Tokenize the text and convert it to tensors
    inputs = tokenizer(processed_text_moderna, return_tensors='pt', padding=True, truncation=True)

    # Feed the inputs into the model and extract features
    with torch.no_grad():
        outputs = model(**inputs)

    # Feature vectors (one embedding per token)
    embeddings = outputs.last_hidden_state
    print("Embedding Size: ", embeddings.shape)

6. Practice Example: Summarizing COVID-19 Related Documents

Now we will create a summary based on the COVID-19 information. GPT-2 from the Hugging Face library is a general text-generation model rather than a dedicated summarizer, but it can be prompted to produce a summary-style continuation, for example by appending "TL;DR:" to the input text as in the example below.

6.1 Summarizing with GPT-2 Model

    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    # Load the GPT-2 model and tokenizer
    gpt2_model = GPT2LMHeadModel.from_pretrained('gpt2')
    gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

    # Input text for summarization; appending "TL;DR:" prompts GPT-2 to continue with a summary
    input_text = "COVID-19 is caused by SARS-CoV-2..."
    input_ids = gpt2_tokenizer.encode(input_text + " TL;DR:", return_tensors='pt')

    # Generate the summary-style continuation
    summary_ids = gpt2_model.generate(
        input_ids,
        max_length=50,
        num_beams=5,
        early_stopping=True,
        pad_token_id=gpt2_tokenizer.eos_token_id,  # GPT-2 has no pad token by default
    )
    summary = gpt2_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    print("Generated Summary: ", summary)

Conclusion

In this post, we explored how to retrieve information related to Moderna and COVID-19 from Wikipedia using Hugging Face Transformers, as well as the processes of preprocessing and analyzing the data. Hugging Face is a great tool for easily utilizing the latest natural language processing models, enabling more effective utilization of text data. In the future, we can further develop our data analysis skills through various NLP tasks.

Moreover, Hugging Face continuously adds new models and datasets through collaboration with the community, so ongoing learning and application are encouraged. I hope you will take on diverse NLP tasks and accomplish even more.
