Building a Personalized Steam Game Recommendation System Using BERT and LDA

December 07, 2024

Introduction

Recommendation systems are everywhere, from Netflix suggesting your next binge-worthy show to Amazon recommending products you might like. This tutorial takes inspiration from a research paper that combined sentiment analysis and matrix factorization for recommendations. Instead, we will focus on integrating BERT embeddings with LDA topic modeling.

What We’ll Cover:

Data Preparation: Fetching and cleaning data from the Steam API.
Word Embeddings with BERT: Understanding and implementing BERT for word embeddings.
Topic Modeling with LDA: Using LDA to extract topics from game reviews.
Combining BERT and LDA: Merging the two feature sets to power the recommendation engine.
Building the Streamlit App: Deploying the model in a user-friendly web app.

Step 1: Data Preparation

Before we dive into modeling, we need to fetch and clean the data. We’ll be using data from the Steam Web API, which provides details on thousands of games.

import sqlite3
import pandas as pd

# Connect to the SQLite database
conn = sqlite3.connect('steam_games.db')

# Load the game details and reviews into pandas DataFrames
games_df = pd.read_sql_query("SELECT * FROM game_details", conn)
reviews_df = pd.read_sql_query("SELECT * FROM game_reviews", conn)

# Close the connection
conn.close()

# Remove unwanted entries like DLCs, soundtracks, and demos
filtered_games_df = games_df[~games_df['name'].str.contains('soundtrack|OST|demo|DLC|playtest|resource pack', case=False, na=False)]

# Filter reviews based on the filtered games
filtered_reviews_df = reviews_df[reviews_df['appid'].isin(filtered_games_df['appid'])]

filtered_games_df.to_csv('filtered_games_df.csv', index=False)

Key Points: • Use SQLite to store data locally for easy manipulation. • Focus on actual games by removing irrelevant entries such as soundtracks, DLCs, and demos.

Step 2: Word Embeddings with BERT

What Are Word Embeddings?

Word embeddings map words or phrases from a vocabulary to vectors of real numbers. BERT (Bidirectional Encoder Representations from Transformers) provides context-aware embeddings, meaning the word “bank” will have different embeddings in “river bank” and “bank account.”

Implementing BERT for Game Descriptions

from transformers import BertTokenizer, BertModel
import numpy as np

# Load pre-trained BERT model and tokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

# Function to get BERT embeddings
def get_embedding(text):
    inputs = bert_tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    outputs = bert_model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).cpu().detach().numpy()

# Generate embeddings for all game descriptions
embeddings = []
for description in filtered_games_df['description']:
    embeddings.append(get_embedding(description).flatten())

bert_item_feature_matrix = np.array(embeddings)
np.save('bert_item_feature_matrix.npy', bert_item_feature_matrix)

Key Points: • BERT embeddings capture the semantic meaning of game descriptions. • These embeddings are stored in a feature matrix for later use.

Step 3: Topic Modeling with LDA

### What is LDA?

Latent Dirichlet Allocation (LDA) is a statistical model that identifies topics within a set of documents. In this case, LDA extracts topics from game reviews.

Implementing LDA for Game Reviews

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import re

# Text preprocessing function
def clean_text(text):
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    text = text.strip()
    return text

# Apply the clean_text function to the reviews
filtered_reviews_df['cleaned_text'] = filtered_reviews_df['review_text'].apply(clean_text)

# Vectorize the reviews
vectorizer = CountVectorizer(max_features=5000, stop_words='english')
reviews_vectorized = vectorizer.fit_transform(filtered_reviews_df['cleaned_text'])

# Fit the LDA model
lda_model = LatentDirichletAllocation(n_components=20, random_state=42)
lda_topic_matrix = lda_model.fit_transform(reviews_vectorized)

# Save the LDA topics per game
lda_df = pd.DataFrame(lda_topic_matrix, columns=[f'topic_{i}' for i in range(lda_topic_matrix.shape[1])])
lda_df['appid'] = filtered_reviews_df['appid'].values
lda_topic_matrix_per_game = lda_df.groupby('appid').mean().to_numpy()

Key Points: • Use LDA to identify major themes in game reviews. • Each game gets a topic distribution vector for better recommendations.

Step 4: Combining BERT and LDA

Now that we have two feature sets, we combine them to create a robust recommendation system.

Combine BERT and LDA features

combined_feature_matrix = np.hstack((bert_item_feature_matrix, lda_topic_matrix_per_game))
np.save('combined_feature_matrix.npy', combined_feature_matrix)

Step 5: Building the Streamlit App

Finally, we use Streamlit to create a web app where users can input game descriptions and get recommendations.

import streamlit as st
from sklearn.metrics.pairwise import cosine_similarity

@st.cache_data
def load_data():
    games_df = pd.read_csv('filtered_games_df.csv')
    combined_feature_matrix = np.load('combined_feature_matrix.npy')
    return games_df, combined_feature_matrix

def recommend_games(user_input, combined_feature_matrix, games_df):
    user_embedding = get_embedding(user_input)
    similarities = cosine_similarity(user_embedding, combined_feature_matrix)
    top_n = 5
    recommendations = similarities[0].argsort()[-top_n:][::-1]
    return recommendations

# Streamlit app
st.title("Steam Game Recommendation System")

user_input = st.text_input("Describe your ideal game:")
if user_input:
    games_df, combined_feature_matrix = load_data()
    recommendations = recommend_games(user_input, combined_feature_matrix, games_df)
    st.subheader("Top 5 Recommended Games")
    for idx in recommendations:
        game_info = games_df.iloc[idx]
        st.image(f"https://steamcdn-a.akamaihd.net/steam/apps/{game_info['appid']}/header.jpg")
        st.write(f"**{game_info['name']}**")
        st.write(f"Description: {game_info['description']}")
        st.write(f"Price: {game_info['price']}")
        st.write(f"Release Date: {game_info['release_date']}")
        st.write("---")