Machine Learning and Deep Learning Algorithm Trading, Generating doc2vec Input from Yelp Sentiment Data

Today, machine learning and deep learning play a vital role in financial markets. This article focuses on processing Yelp review data with doc2vec, a document-embedding technique that is useful for sentiment analysis, and on using the resulting vectors as inputs to trading algorithms.

1. Importance of Machine Learning and Deep Learning

Machine learning and deep learning models are well suited to analyzing large volumes of data and making predictions from them. In financial trading in particular, unstructured data such as social media posts, news, and reviews can carry information about price fluctuations beyond what market data alone provides. Such data can be used to build models that support decision-making.

2. Introduction to Yelp Sentiment Data

Yelp is a platform where users leave reviews about restaurants and businesses, including text reviews, ratings, and user information. By performing sentiment analysis on Yelp data, we can identify patterns in positive or negative reviews and use them as predictive indicators for stock prices.
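
As a minimal sketch of how positive and negative reviews might be separated, the star rating can serve as a coarse sentiment label. The thresholds (4 stars and above as positive, 2 stars and below as negative), the function name label_from_stars, and the sample reviews are assumptions made for illustration, not part of any Yelp API.

```python
def label_from_stars(stars, positive_min=4, negative_max=2):
    """Map a star rating (1-5) to a coarse sentiment label.

    The default thresholds are illustrative assumptions.
    """
    if stars >= positive_min:
        return "positive"
    if stars <= negative_max:
        return "negative"
    return "neutral"

# Hypothetical reviews; real ones would come from the collected Yelp data.
sample_reviews = [
    {"text": "Great food and friendly staff.", "stars": 5},
    {"text": "Average experience, nothing special.", "stars": 3},
    {"text": "Cold food and slow service.", "stars": 1},
]

labels = [label_from_stars(r["stars"]) for r in sample_reviews]
print(labels)
```

Labels derived this way can later be used to check whether the learned document vectors separate positive from negative reviews.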

3. Introduction and Necessity of doc2vec

Doc2vec is a technique that represents each document as a fixed-length numeric vector learned from the words the document contains. It builds on word embedding techniques such as word2vec by learning a unique vector for each document alongside the word vectors. These document vectors can then serve as input features for subsequent machine learning models.

3.1 Structure of the doc2vec Model

Doc2vec comes in two main variants: Distributed Bag of Words (DBOW) and Distributed Memory (DM). DBOW ignores the context words in the input and trains the document vector alone to predict words sampled from the document. DM combines the document vector with the vectors of the surrounding context words to predict a target word.
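
The difference between the two variants is easiest to see in the training examples each one would produce. The following is a simplified sketch in plain Python, not the Gensim implementation: the helper names dm_examples and dbow_examples, the symmetric context window, and the sampling scheme are assumptions made for illustration.

```python
import random

def dm_examples(doc_id, tokens, window=2):
    """DM: (document id, context words) -> target word."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        examples.append(((doc_id, tuple(left + right)), target))
    return examples

def dbow_examples(doc_id, tokens, n_samples=3, seed=0):
    """DBOW: document id alone -> a word sampled from the document."""
    rng = random.Random(seed)
    return [(doc_id, rng.choice(tokens)) for _ in range(n_samples)]

tokens = ["great", "food", "friendly", "staff"]
for inputs, target in dm_examples("doc_0", tokens, window=1):
    print("DM  ", inputs, "->", target)
for inputs, target in dbow_examples("doc_0", tokens):
    print("DBOW", inputs, "->", target)
```

In DM the model input contains both the document identifier and neighboring words, while in DBOW the document identifier is the only input, which is why DBOW has no notion of word order.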

4. Data Collection

To collect Yelp review data, an official API or web scraping can be used. As an example, the following code gathers reviews with Python’s requests library and BeautifulSoup.

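import requests
from bs4 import BeautifulSoup

def fetch_yelp_reviews(business_id):
    """Download a business page and return the text of its review paragraphs."""
    url = f'https://www.yelp.com/biz/{business_id}'
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Fail early on HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')
    # The 'comment' class is a placeholder; adjust it to the page's actual markup
    reviews = soup.find_all('p', class_='comment')
    return [review.get_text(strip=True) for review in reviews]

business_id = "example-business-id"
reviews = fetch_yelp_reviews(business_id)
print(reviews)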
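
When the reviews are already available as a local file with one JSON record per line, as in a bulk data export, they can be loaded without any network access. The sketch below assumes each record carries business_id, stars, text, and date fields. These field names and the load_reviews helper are assumptions to check against the actual file, and an in-memory string stands in for the file here.

```python
import io
import json

def load_reviews(lines, business_id=None, limit=None):
    """Parse JSON-lines review records, optionally filtered by business."""
    loaded = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        record = json.loads(line)
        if business_id is not None and record.get("business_id") != business_id:
            continue
        loaded.append(
            {"text": record["text"], "stars": record["stars"], "date": record["date"]}
        )
        if limit is not None and len(loaded) >= limit:
            break
    return loaded

# In-memory stand-in for a review file (hypothetical records).
sample_file = io.StringIO(
    '{"business_id": "b1", "stars": 5.0, "text": "Great food", "date": "2020-01-02"}\n'
    '{"business_id": "b2", "stars": 1.0, "text": "Slow service", "date": "2020-01-03"}\n'
)
dataset_reviews = load_reviews(sample_file, business_id="b1")
print(dataset_reviews)
```

Passing an open file object instead of the StringIO works the same way, because the function only iterates over lines.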

5. Data Preprocessing

The collected reviews must be preprocessed before they are passed to the doc2vec model. This includes text cleaning, tokenization, stopword removal, and stemming.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')
# Tokenizer data required by word_tokenize; newer NLTK releases may need 'punkt_tab' instead
nltk.download('punkt')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_reviews(reviews):
    """Lowercase, tokenize, drop stopwords and non-alphanumeric tokens, then stem."""
    processed_reviews = []
    for review in reviews:
        tokens = nltk.word_tokenize(review.lower())
        filtered_tokens = [stemmer.stem(word) for word in tokens if word.isalnum() and word not in stop_words]
        processed_reviews.append(filtered_tokens)
    return processed_reviews

cleaned_reviews = preprocess_reviews(reviews)
print(cleaned_reviews)

6. Training the doc2vec Model

In this step, we train the doc2vec model on the preprocessed review data. We create and train the model using Gensim’s Doc2Vec class.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tag each tokenized review with a unique identifier
documents = [TaggedDocument(words=review, tags=[str(i)]) for i, review in enumerate(cleaned_reviews)]

model = Doc2Vec(vector_size=50, min_count=2, epochs=40)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for a new document (its tokens should be preprocessed like the training data)
vector = model.infer_vector(['great', 'food'])
print(vector)
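
Once documents are represented as vectors, a common way to compare two reviews is the cosine similarity of their vectors. The following is a minimal sketch in plain Python. The three short vectors are hypothetical stand-ins for the 50-dimensional vectors the model would return.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical vectors standing in for inferred document vectors.
vec_a = [0.2, 0.1, -0.4]
vec_b = [0.4, 0.2, -0.8]   # Same direction as vec_a
vec_c = [-0.2, -0.1, 0.4]  # Opposite direction to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Values near 1 indicate vectors pointing in nearly the same direction, and values near -1 indicate opposite directions.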

7. Designing Trading Strategies

Using the document vectors obtained from the trained doc2vec model, we design trading strategies. For instance, we can develop a return prediction model based on sentiment indices.

7.1 Structure of the Prediction Model

A trading model generally has the following structure:

Data collection and preprocessing
Feature vector generation (including document vectors)
Model training (regression or classification model)
Model evaluation and optimization
Real-time trading execution
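
The feature generation step in the list above can be sketched as follows, assuming that each review vector carries a date and that a next-day return is available for each date. The helper names, the mean-pooling choice, and all sample values are hypothetical. In practice the vectors would come from the doc2vec model and the returns from market data.

```python
from collections import defaultdict

def daily_mean_vectors(dated_vectors):
    """Average the document vectors that share the same date."""
    grouped = defaultdict(list)
    for date, vector in dated_vectors:
        grouped[date].append(vector)
    return {
        date: [sum(column) / len(vectors) for column in zip(*vectors)]
        for date, vectors in grouped.items()
    }

def build_dataset(daily_features, next_day_returns):
    """Pair each day's feature vector with the following day's return."""
    # ISO-formatted date strings sort chronologically
    dates = sorted(d for d in daily_features if d in next_day_returns)
    X = [daily_features[d] for d in dates]
    y = [next_day_returns[d] for d in dates]
    return dates, X, y

# Hypothetical 2-dimensional review vectors and returns.
dated_vectors = [
    ("2024-01-02", [0.25, -0.5]),
    ("2024-01-02", [0.75, 0.5]),
    ("2024-01-03", [-0.5, 1.0]),
]
next_day_returns = {"2024-01-02": 0.012, "2024-01-03": -0.004}

features = daily_mean_vectors(dated_vectors)
dates, X, y = build_dataset(features, next_day_returns)
print(dates, X, y)
```

The resulting X and y can be passed to any regression or classification model. Pairing each day's features only with the following day's return keeps future information out of the inputs.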

8. Model Evaluation

The trained model’s performance should be evaluated on a held-out test dataset. Commonly used metrics include RMSE and MAPE for regression models and accuracy for classification models.

from sklearn.metrics import mean_squared_error
from math import sqrt

# Compare predicted values with actual values
y_true = [2.5, 3.0, 4.5] # Actual values
y_pred = [2.0, 3.5, 4.0] # Predicted values

rmse = sqrt(mean_squared_error(y_true, y_pred))
print(f'RMSE: {rmse}')
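
MAPE and an accuracy-style metric can be computed in a similar way. The sketch below uses plain Python and treats accuracy as directional accuracy, meaning the share of predictions whose sign matches the actual return. That interpretation, the function names, and the sample values are assumptions made for illustration.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; zero targets are skipped."""
    terms = [abs((t - p) / t) for t, p in zip(y_true, y_pred) if t != 0]
    if not terms:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(terms) / len(terms)

def directional_accuracy(y_true, y_pred):
    """Share of predictions whose sign matches the sign of the actual value."""
    hits = sum((t > 0) == (p > 0) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Hypothetical actual and predicted returns.
actual_returns = [0.02, -0.01, 0.04, -0.03]
predicted_returns = [0.01, 0.01, 0.05, -0.02]

print(f"MAPE: {mape(actual_returns, predicted_returns):.1f}%")
print(f"Directional accuracy: {directional_accuracy(actual_returns, predicted_returns):.2f}")
```

MAPE becomes unstable when actual values are close to zero, which is common for daily returns, so it is usually read alongside RMSE rather than on its own.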

9. Conclusion

This tutorial explained how to generate doc2vec vectors from Yelp review data for use as inputs to machine learning and deep learning models. Such vectors can provide useful signals for algorithmic trading, including in real-time settings. With these methods, readers can build their own trading algorithms and then evaluate and tune their performance.

10. References

Le, Q. & Mikolov, T. (2014). Distributed Representations of Sentences and Documents. ICML.
Gensim Documentation. (n.d.). Gensim.
NLTK Documentation. (n.d.). NLTK.

11. Additional Exercises

Additional practice problems are provided for the reader. Try to collect Yelp data, generate vectors using doc2vec, and design various trading algorithms.

Collect and compare data from different business categories
Tune hyperparameters to improve the prediction model
Test and optimize the model with real-time data included

Thank you!