Kartikeya Sharma

Building a Personalized Steam Game Recommendation System Using BERT and LDA

Introduction

Recommendation systems are everywhere, from Netflix suggesting your next binge-worthy show to Amazon recommending products you might like. This tutorial takes inspiration from a research paper that combined sentiment analysis with matrix factorization for recommendations; instead of that approach, we will focus on integrating BERT embeddings with LDA topic modeling.

What We’ll Cover:

  1. Data Preparation: Fetching and cleaning data from the Steam API.
  2. Word Embeddings with BERT: Understanding and implementing BERT for word embeddings.
  3. Topic Modeling with LDA: Using LDA to extract topics from game reviews.
  4. Combining BERT and LDA: Merging the two feature sets to power the recommendation engine.
  5. Building the Streamlit App: Deploying the model in a user-friendly web app.

Step 1: Data Preparation

Before we dive into modeling, we need to fetch and clean the data. We’ll be using data from the Steam Web API, which provides details on thousands of games.
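The original project stores everything in a local SQLite database and the rest of the tutorial reads from it. If you are building that database yourself, here is a minimal fetching sketch; the appdetails and appreviews storefront endpoints are the commonly used public ones, but the two-table schema (game_details, game_reviews) and the single hard-coded app id are assumptions chosen to match the queries below.

import requests
import sqlite3

# Hypothetical example app id; in practice you would loop over a list of app ids
appid = 570

# Fetch basic game details from the public storefront endpoint
details = requests.get(
    'https://store.steampowered.com/api/appdetails',
    params={'appids': appid}
).json()[str(appid)]['data']

# Fetch one page of English reviews for the same game
reviews = requests.get(
    f'https://store.steampowered.com/appreviews/{appid}',
    params={'json': 1, 'language': 'english', 'num_per_page': 100}
).json()['reviews']

# Store both into the SQLite tables used in the rest of the tutorial (assumed schema)
conn = sqlite3.connect('steam_games.db')
conn.execute('CREATE TABLE IF NOT EXISTS game_details (appid INTEGER, name TEXT, description TEXT)')
conn.execute('CREATE TABLE IF NOT EXISTS game_reviews (appid INTEGER, review_text TEXT)')
conn.execute('INSERT INTO game_details VALUES (?, ?, ?)',
             (appid, details['name'], details.get('short_description', '')))
conn.executemany('INSERT INTO game_reviews VALUES (?, ?)',
                 [(appid, r['review']) for r in reviews])
conn.commit()
conn.close()

With the database in place, loading and cleaning the data looks like this: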

import sqlite3
import pandas as pd

# Connect to the SQLite database
conn = sqlite3.connect('steam_games.db')

# Load the game details and reviews into pandas DataFrames
games_df = pd.read_sql_query("SELECT * FROM game_details", conn)
reviews_df = pd.read_sql_query("SELECT * FROM game_reviews", conn)

# Close the connection
conn.close()

# Remove unwanted entries like DLCs, soundtracks, and demos
filtered_games_df = games_df[~games_df['name'].str.contains('soundtrack|OST|demo|DLC|playtest|resource pack', case=False, na=False)]

# Filter reviews based on the filtered games
filtered_reviews_df = reviews_df[reviews_df['appid'].isin(filtered_games_df['appid'])].copy()  # copy so we can add columns later without pandas warnings

filtered_games_df.to_csv('filtered_games_df.csv', index=False)

Key Points: • Use SQLite to store data locally for easy manipulation. • Focus on actual games by removing irrelevant entries such as soundtracks, DLCs, and demos.

Step 2: Word Embeddings with BERT

What Are Word Embeddings?

Word embeddings map words or phrases from a vocabulary to vectors of real numbers. BERT (Bidirectional Encoder Representations from Transformers) provides context-aware embeddings, meaning the word “bank” will have different embeddings in “river bank” and “bank account.”

Implementing BERT for Game Descriptions

from transformers import BertTokenizer, BertModel
import numpy as np

# Load pre-trained BERT model and tokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

# Function to get BERT embeddings
def get_embedding(text):
    inputs = bert_tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    outputs = bert_model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).cpu().detach().numpy()

# Generate embeddings for all game descriptions
embeddings = []
for description in filtered_games_df['description']:
    embeddings.append(get_embedding(description).flatten())

bert_item_feature_matrix = np.array(embeddings)
np.save('bert_item_feature_matrix.npy', bert_item_feature_matrix)

Key Points: • BERT embeddings capture the semantic meaning of game descriptions. • These embeddings are stored in a feature matrix for later use.
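As a quick sanity check, you can compare the stored embeddings with cosine similarity; descriptions of similar games should score noticeably higher than unrelated ones. A minimal sketch (index 0 is just an arbitrary query game):

from sklearn.metrics.pairwise import cosine_similarity

# Similarity of the first game's description embedding against every other game
similarities = cosine_similarity(
    bert_item_feature_matrix[0].reshape(1, -1),
    bert_item_feature_matrix
).flatten()

# Titles of the most semantically similar descriptions, excluding the game itself
top_matches = similarities.argsort()[::-1][1:6]
print(filtered_games_df.iloc[top_matches]['name'])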

Step 3: Topic Modeling with LDA

What is LDA?

Latent Dirichlet Allocation (LDA) is a statistical model that identifies topics within a set of documents. In this case, LDA extracts topics from game reviews.

Implementing LDA for Game Reviews

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import re

# Text preprocessing function
def clean_text(text):
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    text = text.strip()
    return text

# Apply the clean_text function to the reviews
filtered_reviews_df['cleaned_text'] = filtered_reviews_df['review_text'].apply(clean_text)

# Vectorize the reviews
vectorizer = CountVectorizer(max_features=5000, stop_words='english')
reviews_vectorized = vectorizer.fit_transform(filtered_reviews_df['cleaned_text'])

# Fit the LDA model
lda_model = LatentDirichletAllocation(n_components=20, random_state=42)
lda_topic_matrix = lda_model.fit_transform(reviews_vectorized)

# Aggregate the LDA topics per game, row-aligned with filtered_games_df
lda_df = pd.DataFrame(lda_topic_matrix, columns=[f'topic_{i}' for i in range(lda_topic_matrix.shape[1])])
lda_df['appid'] = filtered_reviews_df['appid'].values
lda_topic_matrix_per_game = lda_df.groupby('appid').mean().reindex(filtered_games_df['appid']).fillna(0).to_numpy()

Key Points: • Use LDA to identify major themes in game reviews. • Each game gets a topic distribution vector for better recommendations.
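To make the topics interpretable, it helps to print the highest-weighted words in each one. A short sketch using the fitted vectorizer and LDA model (get_feature_names_out requires a reasonably recent scikit-learn):

# Inspect the top words for each LDA topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda_model.components_):
    top_words = [feature_names[i] for i in topic.argsort()[::-1][:10]]
    print(f"topic_{topic_idx}: {', '.join(top_words)}")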

Step 4: Combining BERT and LDA

Now that we have two feature sets, we combine them to create a robust recommendation system.

Combine BERT and LDA features

combined_feature_matrix = np.hstack((bert_item_feature_matrix, lda_topic_matrix_per_game))
np.save('combined_feature_matrix.npy', combined_feature_matrix)
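With both matrices row-aligned to filtered_games_df, a simple content-based recommender is just nearest neighbours in the combined feature space. One caveat: the 768 BERT dimensions can dominate the 20 topic dimensions, so you may want to scale or weight the two blocks before stacking. A minimal sketch:

from sklearn.metrics.pairwise import cosine_similarity

def recommend_similar(game_index, top_n=5):
    # Cosine similarity of one game's combined features against all games
    scores = cosine_similarity(
        combined_feature_matrix[game_index].reshape(1, -1),
        combined_feature_matrix
    ).flatten()
    # Highest-scoring games, excluding the query game itself
    best = scores.argsort()[::-1][1:top_n + 1]
    return filtered_games_df.iloc[best]['name']

print(recommend_similar(0))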

Step 5: Building the Streamlit App

Finally, we use Streamlit to create a web app where users can input game descriptions and get recommendations.
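The exact app from the original project isn't reproduced here, but a minimal Streamlit sketch could look like the following. It assumes the saved artifacts from the earlier steps plus the get_embedding function from Step 2, and it simply zero-pads the query vector for the LDA topic columns, which is a simplification rather than the only option.

import streamlit as st
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

st.title('Steam Game Recommender')

# Load the precomputed artifacts from the earlier steps
games = pd.read_csv('filtered_games_df.csv')
features = np.load('combined_feature_matrix.npy')

query = st.text_area('Describe the kind of game you want to play')

if st.button('Recommend') and query:
    # Embed the query with BERT (get_embedding from Step 2) and pad the LDA topic columns with zeros
    query_vec = get_embedding(query).flatten()
    query_vec = np.concatenate([query_vec, np.zeros(features.shape[1] - query_vec.shape[0])])
    scores = cosine_similarity(query_vec.reshape(1, -1), features).flatten()
    for idx in scores.argsort()[::-1][:5]:
        st.write(games.iloc[idx]['name'])

Save this as app.py (alongside the definition of get_embedding) and launch it with streamlit run app.py.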

Optimization Paradigm for Supervised Machine Learning

Let the training data (covariates, labels) be \((x_{i}, y_{i})\) for \(i \in \{1, 2, \dots, n\}\). Let \(f_{\theta}(\cdot)\) be our model, where \(\theta\) is the parameter vector, and let \(L(y, \hat{y})\) be the loss function.

We minimize the empirical risk: \(\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(y_{i}, f_{\theta}(x_{i}))\)

Where: \(\hat{y} = f_{\hat{\theta}}(x)\)

Goal: predict well on unseen data, i.e., make the true risk \(\mathbb{E}_{(X, Y) \sim P(X, Y)}\left[L(Y, f_{\hat{\theta}}(X))\right]\) small.


Complications

1. Don’t Have Access to \(P(X, Y)\)

Solution: Collect a test set \((x_{\text{test}, i}, y_{\text{test}, i})\) which we never touch after collection, except to calculate: \(\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} L(y_{\text{test}, i}, f_{\hat{\theta}}(x_{\text{test}, i}))\)


2. The Loss We Care About Is Not Compatible with the Optimizer

Example: The optimizer requires derivatives, but the loss is not differentiable or has zero derivatives.

Solution: Use a surrogate loss that the optimizer can handle, such as the cross-entropy (logistic) loss or the hinge loss in place of 0-1 classification error.

Warning: Only change the training loss function, not the test loss.
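A tiny NumPy sketch of the distinction (labels assumed to be 0/1 and model outputs real-valued scores): the 0-1 error is what gets reported on the test set, while the smooth logistic surrogate is what the optimizer actually minimizes during training.

import numpy as np

def zero_one_loss(y, scores):
    # The loss we care about: fraction of misclassified points (not differentiable)
    return np.mean((scores > 0).astype(int) != y)

def logistic_loss(y, scores):
    # Differentiable surrogate used only during training
    return np.mean(np.log1p(np.exp(-(2 * y - 1) * scores)))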


3. Huge Values in \(\hat{\theta}\) (Overfitting)

Solution A: Add a regularizer during training: \(\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(y_{i}, f_{\theta}(x_{i})) + \lambda R(\theta)\)

Solution B: Perform a hyperparameter search over the regularization strength \(\lambda\) using held-out validation data.
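A concrete instance of both solutions is ridge regression, sketched below with scikit-learn (the toy data and the \(\lambda\) grid are arbitrary placeholders); Ridge's alpha plays the role of \(\lambda\) in \(\lambda R(\theta) = \lambda \lVert\theta\rVert^2\), and the loop is a bare-bones hyperparameter search on a validation split.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Toy data; replace with your own covariates and labels
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(200)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Solution A: the regularizer is the alpha * ||theta||^2 term inside Ridge
# Solution B: search over the regularization strength on the validation set
best_alpha, best_err = None, np.inf
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    err = np.mean((model.predict(X_val) - y_val) ** 2)
    if err < best_err:
        best_alpha, best_err = alpha, err
print(best_alpha, best_err)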


4. Optimizer Might Have Its Own Hyperparameters

Example: Gradient Descent Learning Rate: \(\theta_{t+1} = \theta_t - \eta \nabla_{\theta} L_{\text{train}}(\theta_t)\)
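A few lines of plain gradient descent make the role of the learning rate \(\eta\) concrete; this sketch uses a toy least-squares loss and an arbitrary step count.

import numpy as np

# Toy least-squares problem
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

eta = 0.01           # learning rate: an optimizer hyperparameter, not a model parameter
theta = np.zeros(3)
for _ in range(500):
    grad = 2 * X.T @ (X @ theta - y) / len(y)   # gradient of the training loss at theta_t
    theta = theta - eta * grad                   # theta_{t+1} = theta_t - eta * gradient
print(theta)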

Data Systems Paradigms

Extract: Scrape raw data from all the source systems, e.g., transactions, sensors, log files, experiments, tables, bytestreams etc.

Transform: Apply a series of rules or functions, wrangle data into schema(s)/format(s)

Load: Load data into a data storage solution

ETL (Traditional Warehouses)

Extract (or scrape) from an API or log file, transform into a common schema/format, load in parallel into a “data warehouse”.
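As a toy illustration of the three steps (the events.log file and its timestamp/user_id/event_type fields are hypothetical, and SQLite stands in for the warehouse):

import json
import sqlite3
import pandas as pd

# Extract: read raw events from a JSON-lines log file
with open('events.log') as f:
    raw_events = [json.loads(line) for line in f]

# Transform: wrangle the events into a fixed schema
events_df = pd.DataFrame(raw_events)
events_df['timestamp'] = pd.to_datetime(events_df['timestamp'])
events_df = events_df[['timestamp', 'user_id', 'event_type']]

# Load: write the cleaned table into the warehouse (SQLite as a stand-in)
conn = sqlite3.connect('warehouse.db')
events_df.to_sql('events', conn, if_exists='append', index=False)
conn.close()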

ELT (e.g. Snowflake)

Extract (or scrape) from an API or log file, load without doing much transformation up front, with transformations done later in SQL.

Faster to get going, and more scalable, but requires more data warehousing knowledge (& may be more expensive).

ET (Data Lakes)

No need to “manage” data. Extract directly into a Data Lake, later transform for specific use cases.

Data is dumped in cheaply and massaged as needed for various use cases. Usually code-centric (e.g., Spark).

Data Warehouses ~ 1990s

Data Lake ~ 2010s

[!NOTE] ETLT & the Lakehouse: Modern solutions are likely many-to-many. Sometimes start with a data lake. Empower data scientists to work on ad-hoc use cases. Allow for datasets that “graduate” to a carefully managed warehouse. Some datasets may directly be loaded into a data warehouse.

Databricks uses a Lakehouse, which makes managing such a system much easier.

Important Considerations

Data Discovery, Data Assessment
Data Quality & Integrity
Application Metadata
Behavioral Metadata
Change Metadata
Operationalization (Ops)
Feedback