Introduction
Recommendation systems are everywhere, from Netflix suggesting your next binge-worthy show to Amazon recommending products you might like. This tutorial takes inspiration from a research paper that combined sentiment analysis and matrix factorization for recommendations. Rather than reproducing that approach, we will focus on integrating BERT embeddings with LDA topic modeling.
What We’ll Cover:
- Data Preparation: Fetching and cleaning data from the Steam API.
- Word Embeddings with BERT: Understanding and implementing BERT for word embeddings.
- Topic Modeling with LDA: Using LDA to extract topics from game reviews.
- Combining BERT and LDA: Merging the two feature sets to power the recommendation engine.
- Building the Streamlit App: Deploying the model in a user-friendly web app.
Step 1: Data Preparation
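Before we dive into modeling, we need to fetch and clean the data. The data comes from the Steam Web API, which provides details on thousands of games. The code below assumes it has already been fetched and stored in a local SQLite database, steam_games.db.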
import sqlite3
import pandas as pd
# Connect to the SQLite database
conn = sqlite3.connect('steam_games.db')
# Load the game details and reviews into pandas DataFrames
games_df = pd.read_sql_query("SELECT * FROM game_details", conn)
reviews_df = pd.read_sql_query("SELECT * FROM game_reviews", conn)
# Close the connection
conn.close()
# Remove unwanted entries like DLCs, soundtracks, and demos
filtered_games_df = games_df[~games_df['name'].str.contains('soundtrack|OST|demo|DLC|playtest|resource pack', case=False, na=False)]
# Filter reviews based on the filtered games
filtered_reviews_df = reviews_df[reviews_df['appid'].isin(filtered_games_df['appid'])]
filtered_games_df.to_csv('filtered_games_df.csv', index=False)
Key Points:
• Use SQLite to store data locally for easy manipulation.
• Focus on actual games by removing irrelevant entries such as soundtracks, DLCs, and demos.
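One caveat with the name filter above: because it is a case-insensitive substring match, short tokens such as `OST` and `demo` can also match inside ordinary words. The sketch below uses Python's `re` module on a few made-up titles (not rows from the Steam data) to show the effect, together with a word-boundary variant. The stricter pattern is a possible alternative, not the tutorial's original filter.

```python
import re

# Pattern used by the filter above (plain substring alternation).
SUBSTRING_PATTERN = re.compile(
    r"soundtrack|OST|demo|DLC|playtest|resource pack", re.IGNORECASE
)
# Hypothetical stricter variant: match the same tokens only as whole words.
WORD_PATTERN = re.compile(
    r"\b(?:soundtrack|OST|demo|DLC|playtest|resource pack)\b", re.IGNORECASE
)

# Made-up example titles, for illustration only.
titles = ["Lost Colony", "Demon Hunter", "Space Game Demo", "Space Game OST"]

# Titles each pattern would remove from the games table.
dropped_by_substring = [t for t in titles if SUBSTRING_PATTERN.search(t)]
dropped_by_word = [t for t in titles if WORD_PATTERN.search(t)]

print(dropped_by_substring)
print(dropped_by_word)
```

The word-boundary pattern string can be passed to `str.contains(..., case=False, na=False)` in the same way as the original one.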
Step 2: Word Embeddings with BERT
What Are Word Embeddings?
Word embeddings map words or phrases from a vocabulary to vectors of real numbers. BERT (Bidirectional Encoder Representations from Transformers) provides context-aware embeddings, meaning the word “bank” will have different embeddings in “river bank” and “bank account.”
Implementing BERT for Game Descriptions
import numpy as np
import torch
from transformers import BertTokenizer, BertModel
# Load pre-trained BERT model and tokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
# Function to get a BERT embedding: the mean of the last-layer token vectors
def get_embedding(text):
    inputs = bert_tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():  # no gradients are needed for feature extraction
        outputs = bert_model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).cpu().numpy()
# Generate embeddings for all game descriptions (missing descriptions are treated as empty strings)
embeddings = []
for description in filtered_games_df['description'].fillna(''):
    embeddings.append(get_embedding(description).flatten())
bert_item_feature_matrix = np.array(embeddings)
np.save('bert_item_feature_matrix.npy', bert_item_feature_matrix)
Key Points:
• BERT embeddings capture the semantic meaning of game descriptions.
• These embeddings are stored in a feature matrix for later use.
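To make the pooling step concrete: `last_hidden_state.mean(dim=1)` averages the per-token vectors into a single vector per description. The sketch below reproduces that averaging in plain Python on tiny made-up 3-dimensional token vectors (real `bert-base-uncased` vectors have 768 dimensions). It also accepts an optional mask so that padding tokens could be skipped if several texts were batched together. The function name `mean_pool` is hypothetical and not part of the `transformers` library.

```python
def mean_pool(token_vectors, attention_mask=None):
    """Average token vectors into one vector, skipping tokens whose mask is 0."""
    if attention_mask is None:
        attention_mask = [1] * len(token_vectors)
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Made-up token vectors: two real tokens followed by one padding token.
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [0.0, 0.0, 0.0]]

pooled_all = mean_pool(tokens)                # padding row is averaged in
pooled_masked = mean_pool(tokens, [1, 1, 0])  # padding row is ignored
```

When descriptions are embedded one at a time, as in the loop above, no padding is added, so the plain mean and the masked mean coincide.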
Step 3: Topic Modeling with LDA
What Is LDA?
Latent Dirichlet Allocation (LDA) is a statistical model that identifies topics within a set of documents. In this case, LDA extracts topics from game reviews.
Implementing LDA for Game Reviews
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import re
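# Text preprocessing function
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # drop punctuation
    text = re.sub(r'\d+', '', text)      # drop digits
    text = re.sub(r'\s+', ' ', text)     # collapse whitespace last, after the removals
    return text.strip()
# Apply the clean_text function to the reviews
# (.copy() avoids pandas' SettingWithCopyWarning; missing reviews are treated as empty strings)
filtered_reviews_df = filtered_reviews_df.copy()
filtered_reviews_df['cleaned_text'] = filtered_reviews_df['review_text'].fillna('').apply(clean_text)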
# Vectorize the reviews
vectorizer = CountVectorizer(max_features=5000, stop_words='english')
reviews_vectorized = vectorizer.fit_transform(filtered_reviews_df['cleaned_text'])
# Fit the LDA model
lda_model = LatentDirichletAllocation(n_components=20, random_state=42)
lda_topic_matrix = lda_model.fit_transform(reviews_vectorized)
# Average the review-level topic distributions to get one topic vector per game
lda_df = pd.DataFrame(lda_topic_matrix, columns=[f'topic_{i}' for i in range(lda_topic_matrix.shape[1])])
lda_df['appid'] = filtered_reviews_df['appid'].values
lda_topic_matrix_per_game = lda_df.groupby('appid').mean().to_numpy()
Key Points:
• Use LDA to identify major themes in game reviews.
• Each game gets a topic distribution vector, averaged over its reviews, for use in recommendations.
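The per-game averaging can be illustrated without pandas. The sketch below, on made-up review data, computes the mean topic vector per `appid` and returns the games in sorted `appid` order, which mirrors how `groupby` orders its groups by default. The helper name `average_topics_per_game` is hypothetical.

```python
from collections import defaultdict

def average_topics_per_game(appids, topic_rows):
    """Return {appid: mean topic vector} over all reviews of that game."""
    sums = {}
    counts = defaultdict(int)
    for appid, row in zip(appids, topic_rows):
        if appid not in sums:
            sums[appid] = [0.0] * len(row)
        sums[appid] = [s + v for s, v in zip(sums[appid], row)]
        counts[appid] += 1
    # Sorted keys: the output order follows appid, not the order of the reviews.
    return {a: [s / counts[a] for s in sums[a]] for a in sorted(sums)}

# Made-up data: three reviews, two topics; appid 10 has two reviews.
appids = [10, 20, 10]
topic_rows = [[0.8, 0.2], [0.1, 0.9], [0.4, 0.6]]

per_game = average_topics_per_game(appids, topic_rows)
```

Two consequences are worth keeping in mind: games without any reviews do not appear in the result, and the row order is by `appid` rather than by the order of the games table.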
Step 4: Combining BERT and LDA
Now that we have two feature sets, we combine them to create a robust recommendation system.
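# Combine BERT and LDA features. groupby sorts by appid and skips games without reviews, so realign
# the LDA rows to the BERT row order; games with no reviews get an all-zero topic vector (an assumption).
lda_per_game = lda_df.groupby('appid').mean().reindex(filtered_games_df['appid']).fillna(0.0)
combined_feature_matrix = np.hstack((bert_item_feature_matrix, lda_per_game.to_numpy()))
np.save('combined_feature_matrix.npy', combined_feature_matrix)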
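Horizontal stacking only makes sense when both matrices have one row per game, in the same order. Beyond that, the BERT block contributes 768 columns and the LDA block only 20, so distances computed on the raw concatenation may be dominated by BERT. One possible mitigation, which is not part of the original tutorial, is to scale each block to unit length before concatenating. The sketch below shows the idea in plain Python on made-up low-dimensional vectors; `combine_features` and its `lda_weight` parameter are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return list(vec) if norm == 0 else [v / norm for v in vec]

def combine_features(bert_vec, lda_vec, lda_weight=1.0):
    """Concatenate unit-length BERT and LDA blocks; lda_weight rescales the LDA block."""
    return l2_normalize(bert_vec) + [lda_weight * v for v in l2_normalize(lda_vec)]

# Made-up vectors standing in for one game's BERT and LDA features.
combined = combine_features([3.0, 4.0], [0.0, 2.0])
```

Whether such scaling helps depends on the data, so it is best treated as something to evaluate rather than a default.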
Step 5: Building the Streamlit App
Finally, we use Streamlit to create a web app where users can input game descriptions and get recommendations.
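The app code itself is not shown here, but the lookup behind such an app could be sketched as a cosine-similarity search over the rows of the combined feature matrix. The example below is a minimal, dependency-free sketch on made-up feature rows. It assumes the user picks an existing game by row index, and the function `recommend` is hypothetical. Turning a free-text description into a query vector would need further steps that the tutorial does not cover.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(query_index, feature_rows, names, top_k=2):
    """Return the top_k most similar game names, excluding the query itself."""
    query = feature_rows[query_index]
    scored = [
        (cosine_similarity(query, row), names[i])
        for i, row in enumerate(feature_rows)
        if i != query_index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]

# Made-up names and feature rows, for illustration only.
names = ["Game A", "Game B", "Game C", "Game D"]
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]

recommendations = recommend(0, features, names)
```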
Let the training data (covariates, labels) be \((x_{i}, y_{i})\) for \(i \in \{1, 2, \dots, n\}\).
Let \(f_{\theta}(\cdot)\) be our model, where \(\theta\) is the parameter vector, and let \(L(y, \hat{y})\) be the loss function.
We minimize the empirical risk:
\(\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(y_{i}, f_{\theta}(x_{i}))\)
where the prediction for an input \(x\) is
\(\hat{y} = f_{\hat{\theta}}(x)\)
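A small numerical instance may help. The sketch below assumes a one-parameter model \(f_{\theta}(x) = \theta x\) with squared loss, evaluates the empirical risk on made-up training data, and picks the minimizer from a coarse grid of candidate parameters. The grid search is only for illustration, and real training would use a proper optimizer.

```python
def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def empirical_risk(theta, xs, ys):
    """Average loss of the model f_theta(x) = theta * x over the training set."""
    return sum(squared_loss(y, theta * x) for x, y in zip(xs, ys)) / len(xs)

# Made-up training data generated by y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# Crude minimization: evaluate the risk on a grid of candidate parameters.
candidates = [i / 10 for i in range(0, 41)]  # 0.0, 0.1, ..., 4.0
theta_hat = min(candidates, key=lambda t: empirical_risk(t, xs, ys))
```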
Goal:
- Good performance in the real world on new \(x\) (i.e., \(x\) we didn't see).
- Low generalization error: We assume the \(x\)'s we didn’t see are drawn from some distribution: \(E_{X, Y}[L(y, f_{\hat{\theta}}(x))]\)
- We believe the distribution of \(X\) and \(Y\) exists.
Complications
1. Don’t Have Access to \(P(X, Y)\)
Solution: Collect a test set \((x_{\text{test}, i}, y_{\text{test}, i})\), which we never touch after collection, except to calculate:
\(\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} L(y_{\text{test}, i}, f_{\hat{\theta}}(x_{\text{test}, i}))\)
2. The Loss We Care About Is Not Compatible with the Optimizer
Example: The optimizer requires derivatives, but the loss is not differentiable or has zero derivative almost everywhere (as with 0-1 classification error).
Solution: Use a surrogate loss that works, such as:
- Logistic Loss or Hinge Loss for binary classification.
- Cross Entropy Loss for multi-class classification.
Warning: Only change the training loss function, not the test loss.
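To see why surrogates help, the sketch below compares the 0-1 loss with the hinge and logistic losses for a binary label \(y \in \{-1, +1\}\) and a real-valued model score. The 0-1 loss is flat on either side of zero, so it gives the optimizer no slope to follow, while both surrogates keep decreasing as the score moves in the right direction. The scores are made up for illustration.

```python
import math

def zero_one_loss(y, score):
    """0-1 loss for labels y in {-1, +1}: 1 if the sign of the score is wrong."""
    return 0.0 if y * score > 0 else 1.0

def hinge_loss(y, score):
    """Hinge loss: zero once the example is classified with margin >= 1."""
    return max(0.0, 1.0 - y * score)

def logistic_loss(y, score):
    """Logistic loss: smooth and strictly positive everywhere."""
    return math.log(1.0 + math.exp(-y * score))

# Compare the losses at a few made-up scores for a positive example (y = +1).
for score in (-2.0, 0.5, 3.0):
    print(score, zero_one_loss(1, score), hinge_loss(1, score), logistic_loss(1, score))
```

In line with the warning above, a model trained with a surrogate would still be reported on the test set using the loss that actually matters, for example the 0-1 loss.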
3. Huge Values in \(\hat{\theta}\) (Overfitting)
Solution A: Add a regularizer \(R(\theta)\) during training:
\(\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(y_{i}, f_{\theta}(x_{i})) + R(\theta)\)
- Example: Ridge Regularization, \(R(\theta) = \lambda \lVert \theta \rVert_2^2\).
- Transition from Maximum Likelihood Estimation (MLE) to Maximum A Posteriori Estimation (MAP).
- Introduces a hyperparameter (the regularization strength, \(\lambda\) in the ridge example).
Solution B: Perform hyperparameter search:
- Hold out additional data (validation set) to evaluate how well you’re adjusting the hyperparameter.
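Solutions A and B can be sketched together. The example below assumes a one-parameter model \(f_{\theta}(x) = \theta x\) with squared loss and a ridge penalty \(\lambda \theta^2\), for which the regularized minimizer has the closed form \(\hat{\theta} = \sum_i x_i y_i / (\sum_i x_i^2 + n\lambda)\). It then chooses \(\lambda\) by comparing errors on a held-out validation set. The data and the list of candidate \(\lambda\) values are made up.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form minimizer of (1/n) * sum (y - theta*x)^2 + lam * theta^2."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)

def mean_squared_error(theta, xs, ys):
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data, split into training and validation parts.
train_x, train_y = [1.0, 2.0, 3.0], [2.2, 3.9, 6.1]
val_x, val_y = [4.0, 5.0], [8.0, 10.0]

# Hyperparameter search: fit on the training set for each lambda,
# then keep the lambda with the lowest validation error.
lambdas = [0.0, 0.1, 1.0, 10.0]
val_errors = {
    lam: mean_squared_error(fit_ridge_1d(train_x, train_y, lam), val_x, val_y)
    for lam in lambdas
}
best_lambda = min(val_errors, key=val_errors.get)
```

The test set plays no part in this search. It is still reserved for the final error estimate.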
4. Optimizer Might Have Its Own Hyperparameters
Example: Gradient Descent Learning Rate \(\eta\):
\(\theta_{t+1} = \theta_t - \eta \nabla_{\theta} L_{\text{train}}(\theta_t)\)
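The effect of the learning rate can be seen on a toy problem. The sketch below applies the gradient descent update to the mean squared error of an assumed one-parameter model \(f_{\theta}(x) = \theta x\) on made-up data, once with a small step size and once with a step size that is too large for this problem.

```python
def gradient(theta, xs, ys):
    """Gradient of the mean squared error of f_theta(x) = theta * x."""
    n = len(xs)
    return sum(-2.0 * x * (y - theta * x) for x, y in zip(xs, ys)) / n

def gradient_descent(xs, ys, eta, steps=100, theta0=0.0):
    """Repeat theta <- theta - eta * gradient for a fixed number of steps."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * gradient(theta, xs, ys)
    return theta

# Made-up training data generated by y = 2x, so the minimizer is theta = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

theta_small = gradient_descent(xs, ys, eta=0.05)  # converges toward 2.0
theta_large = gradient_descent(xs, ys, eta=0.5)   # overshoots more each step and diverges
```

Like the regularization strength, the learning rate is usually tuned with the validation set rather than the test set.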
Extract: Scrape raw data from all the source systems, e.g., transactions, sensors, log files, experiments, tables, bytestreams etc.
Transform: Apply a series of rules or functions, wrangle data into schema(s)/format(s)
Load: Load data into a data storage solution
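The three steps can be sketched end to end with the Python standard library. In the example below, an in-memory CSV string stands in for a source system and an in-memory SQLite database stands in for the storage solution. The table name, column names, and rows are all made up for illustration.

```python
import csv
import io
import sqlite3

# Extract: read raw rows (an in-memory CSV stands in for a real source system).
raw_csv = "product_id,price\n1, 9.99 \n2,not_a_number\n3,4.50\n"
raw_rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: enforce a schema (int id, float price) and drop rows that fail.
def transform(row):
    try:
        return int(row["product_id"]), float(row["price"].strip())
    except ValueError:
        return None

clean_rows = [t for t in map(transform, raw_rows) if t is not None]

# Load: write the cleaned rows into storage (SQLite in memory here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)", clean_rows)
loaded = conn.execute(
    "SELECT product_id, price FROM products ORDER BY product_id"
).fetchall()
conn.close()
```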
ETL (Traditional Warehouses)
Extract or scrape from an API or log file,
transform into a common schema/format,
load in parallel into a “data warehouse”.
ELT (e.g. Snowflake)
Extract or scrape from an API or log file,
load without doing a lot of transformation,
with transformations done later in SQL.
Faster to get going, and more scalable, but requires more data warehousing knowledge (& may be more expensive).
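For contrast with ETL, the sketch below shows the ELT ordering: raw values are loaded as text first, and the transformation happens afterwards inside the store, using SQL. An in-memory SQLite database stands in for a cloud warehouse, and the table names and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: land the raw values as text, with no cleaning on the way in.
conn.execute("CREATE TABLE raw_events (user_id TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [("u1", "10.5"), ("u2", "3"), ("u1", "4.5")],  # made-up rows
)

# Transform: done later, inside the store, with SQL.
conn.execute(
    """
    CREATE TABLE spend_per_user AS
    SELECT user_id, SUM(CAST(amount AS REAL)) AS total
    FROM raw_events
    GROUP BY user_id
    """
)
totals = conn.execute(
    "SELECT user_id, total FROM spend_per_user ORDER BY user_id"
).fetchall()
conn.close()
```

Because the raw table is kept, a different transformation can be written later without extracting the data again.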
ET (Data Lakes)
No need to “manage” data. Extract directly into a Data Lake, later transform for specific use cases.
Data is dumped in cheaply and massaged as needed for various use-cases
Usually code-centric (Spark)
Data Warehouses ~ 1990s
- “Single source of truth”: A central, organized repository of data used for analytics throughout an enterprise.
- Design the uber-schema up-front of all of the rectangular tables you’d ever want.
- Extract from trusted sources
- Transform to warehouse schema using custom tools
- Load data warehouse
- Old school ETL solution: Informatica
- Warehouses expect structure
- Transformation is costly, not necessarily just computing but engineering time
Data Lake ~ 2010s
- Emerged during Hadoop/Spark revolution
- “Landing zone”: unconstrained storage for any and all data
- Data is then analyzed on demand
- Extract into files/storage
- Load into storage (easy!)
- Transform on demand for any use.
- Create new files in the lake, catalog files as they go for reuse
- Often code-centric
Note: ETLT & the Lakehouse
Modern solutions are likely many-to-many. Sometimes start with a data lake. Empower data scientists to work on ad-hoc use cases. Allow for datasets that “graduate” to a carefully managed warehouse. Some datasets may be loaded directly into a data warehouse.
Databricks uses a Lakehouse, which makes managing such a system much easier.
Important Considerations
Data Discovery, Data Assessment
- Ad-Hoc: End-users land data, explore it, label it
- Systematic: Crawl/index the data lake for files
- E.g., for CSV/JSON
- Very content-centric: really a form of analytics/prediction
- Try to figure out what type of data you have.
- AI + People!
Data Quality & Integrity
- Boolean Integrity checks
- Often specified by people, also “mined” by AI
- Data changes ALL the time, especially from clients.
- Enforced: can “reject” or “sequester” data that violate.
- e.g., no two products may have the same product ID!
- Data entities (e.g. students, courses, employees for a university)
- Relationships between data
- Constraints
- Data Lineage – where did it come from?
- Audit Trails of Usage – who ran this job, and what did it do?
- Version info for all the above
- Timestamps
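An enforced integrity check of the kind described above, such as unique product IDs, could be sketched as follows. Rows that violate the constraint are set aside ("sequestered") rather than silently dropped, so they can be inspected later. The function name `split_by_unique_id` and the sample batch are hypothetical.

```python
def split_by_unique_id(rows, key="product_id"):
    """Accept the first row per key; sequester later rows that reuse the key."""
    seen = set()
    accepted, sequestered = [], []
    for row in rows:
        if row[key] in seen:
            sequestered.append(row)
        else:
            seen.add(row[key])
            accepted.append(row)
    return accepted, sequestered

# Made-up incoming batch with one duplicate product ID.
batch = [
    {"product_id": 1, "name": "lamp"},
    {"product_id": 2, "name": "desk"},
    {"product_id": 1, "name": "chair"},
]

accepted, sequestered = split_by_unique_id(batch)
```

Keeping the first occurrence is a design choice made for this sketch. A real pipeline might instead reject the whole batch or prefer the most recent row.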
Operationalization (Ops):
- When do jobs kick off, and what do they do?
- How are tests registered, exceptions handled, people alerted?
- How do experiments “graduate” into processes?
Feedback:
- Some are datasets in their own right. If you produce a table, that’s also data!
- Many are new processes that generate new data feeds!
- ML models: Constantly yielding predictions.
- Compare old predictions to new predictions?