Wikipedia English Keyword Search Using Hugging Face Transformers

The Hugging Face Transformers library has established itself as a powerful tool in the fields of deep learning and natural language processing (NLP). In this course, we will explain how to use the Hugging Face Transformers library along with the Wikipedia API to search for relevant documents on Wikipedia based on a given keyword.

1. What is Hugging Face Transformers?

Hugging Face is a platform that provides libraries for training, running inference with, and deploying natural language processing models. The Transformers library makes it easy to use pre-trained models and is compatible with both PyTorch and TensorFlow. It can be applied to a wide range of NLP tasks, excelling at text classification, question answering, and text generation, among others.
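
For example, the pipeline API gives one-line access to many of these tasks. Here is a minimal sketch for text classification (the pipeline downloads a default sentiment model on first use, so the exact output may vary):

from transformers import pipeline

# One-line text classification with a default pre-trained model
classifier = pipeline('sentiment-analysis')
print(classifier('Hugging Face Transformers makes NLP easy!'))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]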

2. Introduction to the Wikipedia API

Wikipedia is an open online encyclopedia that provides information on a wide range of topics. Its API lets you search for pages by keyword and retrieve their content programmatically, making it easy to pull in exactly the information you need.
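
As a quick illustration, here is a minimal sketch using the wikipedia-api Python wrapper (installed in the next step; the user agent string is a placeholder you should replace with your own):

import wikipediaapi

# Fetch the lead summary of a page by its title
wiki = wikipediaapi.Wikipedia(user_agent='KeywordSearchTutorial/1.0 (example)', language='en')
page = wiki.page('Natural language processing')
if page.exists():
    print(page.summary[:300])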

3. Installing Required Libraries

To install the libraries needed for this task, use the command below. Besides transformers and wikipedia-api, the code in this course also uses torch (PyTorch, the backend Transformers runs on here) and scikit-learn (for cosine similarity in section 7).

pip install transformers torch wikipedia-api scikit-learn

4. Choosing a Hugging Face Model

We will use a pre-trained model to evaluate the relevance of documents. For example, we can use the distilbert-base-uncased model, a smaller, distilled variant of BERT. We will use it to obtain embeddings of documents and then measure the similarity between pairs of documents.
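
To get a feel for the model, the sketch below shows how its tokenizer splits text into WordPiece tokens; each token is later mapped to a 768-dimensional vector by distilbert-base-uncased:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
print(tokenizer.tokenize('Deep learning is fun'))
# ['deep', 'learning', 'is', 'fun'] -- all lowercased, since the model is uncased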

5. Code Explanation

Now, we will write Python code based on the information outlined above. We will include a step-by-step explanation of the code.

5.1 Importing Required Libraries


import wikipediaapi
from transformers import AutoTokenizer, AutoModel
import torch

5.2 Preparing the Model and Tokenizer

Now we will initialize the model and tokenizer using Transformers.


# Initialize the Hugging Face model and tokenizer
model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # inference only: disable dropout

5.3 Implementing the Wikipedia Search Function

Define a function that searches for keywords on Wikipedia and returns relevant documents.


def search_wikipedia(keyword):
    # Recent versions of wikipedia-api require a user agent string
    # identifying your script; the value below is just a placeholder.
    wiki_wiki = wikipediaapi.Wikipedia(
        user_agent='KeywordSearchTutorial/1.0 (example)', language='en'
    )
    page = wiki_wiki.page(keyword)
    if page.exists():
        return page.text
    return None
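
A quick way to sanity-check the function (the page title here is just an example):

text = search_wikipedia('Python (programming language)')
if text:
    print(text[:200])  # first 200 characters of the article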

5.4 Creating Document Embeddings

Create a function that generates embeddings for the retrieved document.


def create_embedding(text):
    # Tokenize, truncating to the model's 512-token input limit
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token vectors into a single document embedding
    return outputs.last_hidden_state.mean(dim=1)
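
Calling it on any text yields one vector per input; for distilbert-base-uncased the hidden size is 768:

vec = create_embedding('A short test sentence.')
print(vec.shape)  # torch.Size([1, 768])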

5.5 Finding Relevant Documents for the Keyword

Use the generated embeddings to find related information and similar pages for the given keyword.


keyword = "Deep Learning"
wiki_text = search_wikipedia(keyword)

if wiki_text:
    embedding = create_embedding(wiki_text)
    print("Title:", keyword)
    print("Content Embedding:", embedding)
else:
    print(f"Could not find a Wikipedia page for {keyword}.")

6. Running the Code and Results

Running the code above prints the title of the matched Wikipedia page and the embedding of its content. These embeddings can later be used to calculate the similarity with other documents.

7. Calculating Similarity

You can also calculate the similarity with other documents, which lets you explore topics related to the input keyword. Let's find similar documents by computing the cosine similarity between embeddings.


from sklearn.metrics.pairwise import cosine_similarity

# Generate two embeddings and calculate the similarity
other_keyword = "Machine Learning"
other_wiki_text = search_wikipedia(other_keyword)

if other_wiki_text:
    other_embedding = create_embedding(other_wiki_text)
    # Each embedding has shape (1, hidden_size), so the result is a 1x1 matrix
    similarity_score = cosine_similarity(embedding.numpy(), other_embedding.numpy())
    print(f"Similarity between {keyword} and {other_keyword}:", similarity_score[0][0])
else:
    print(f"Could not find a Wikipedia page for {other_keyword}.")
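
Since the embeddings are already PyTorch tensors, the same score can be computed without scikit-learn; a minimal alternative sketch:

import torch.nn.functional as F

# cosine_similarity compares along dim=1, matching our (1, 768) embeddings
score = F.cosine_similarity(embedding, other_embedding).item()
print(f"Similarity (PyTorch): {score:.4f}")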

8. Conclusion

In this course, we learned how to use the Hugging Face Transformers library and the Wikipedia API to search for information based on a specific keyword, generate embeddings of the retrieved content, and evaluate its similarity to other documents. These techniques can be applied in various fields such as search engine construction, recommendation systems, and information extraction.

9. Next Steps

Now, based on this basic structure, try to implement additional features. For instance, consider searching and clustering multiple documents, or building a user interface that lets users search for keywords easily. Take advantage of Hugging Face's diverse models and the Wikipedia API to build out more functionality.
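
As a starting point, here is a minimal sketch that ranks a few candidate topics by their similarity to a query keyword. It reuses search_wikipedia, create_embedding, and cosine_similarity from above; the candidate list is arbitrary and error handling is kept minimal:

query_text = search_wikipedia('Deep Learning')
query_embedding = create_embedding(query_text)

candidates = ['Machine learning', 'Neural network', 'Linguistics', 'Photosynthesis']
scores = []
for title in candidates:
    text = search_wikipedia(title)
    if text:
        emb = create_embedding(text)
        scores.append((title, cosine_similarity(query_embedding.numpy(), emb.numpy())[0][0]))

# Print candidates from most to least similar to the query
for title, score in sorted(scores, key=lambda x: x[1], reverse=True):
    print(f'{title}: {score:.4f}')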

10. References

Hugging Face Transformers Documentation: https://huggingface.co/docs/transformers
Wikipedia-API (Python) Documentation: https://pypi.org/project/Wikipedia-API/