Applying Machine Learning Is Not Enough: Predicting Ratings From Game Genres

Do not get captivated just because you applied machine learning

Preface

In this blog post we fit a naive Bayes model to predict games’ critic ratings from their genres. The data are scraped from Metacritic’s best PS4 games using a scraper of mine. We do not claim a result of value. An introductory recommender-systems course yields results much more accurate and reliable than the approach presented here. Why am I doing this then? As machine learning is overwhelmingly hyped today, it is nice to have a facet of it in my portfolio. Jump directly to the Discussion for the gist of this post.


Table of Contents

Intro

Data Preprocessing

Applying Machine Learning


Import Libraries and Local Files

# 3rd-party libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
# local-files
import jsonRW as jsRW
import discretizeIntoCategories as discIntCat

Data Cleansing

Read Data

# read its local json file
metacritic_json = jsRW.readJson('metacritic2019ps4_data')
# parse it as a pandas DataFrame
df = pd.DataFrame(metacritic_json)
df

critic_rating developer genres name publisher release_data users_rating
0 91 Square Enix [Role-Playing, Massively Multiplayer] Final Fantasy XIV: Shadowbringers Square Enix Jul 2, 2019 8.3
1 91 PlatinumGames [Role-Playing, Action RPG] NieR: Automata - Game of the YoRHa Edition Square Enix Feb 26, 2019 8.5
2 91 Capcom R&D Division 1 [Action Adventure, Survival] Resident Evil 2 Capcom Jan 25, 2019 8.8
3 90 From Software [Action Adventure, General] Sekiro: Shadows Die Twice Activision Mar 22, 2019 7.9
4 89 Capcom [Role-Playing, Action RPG] Monster Hunter: World - Iceborne Capcom Sep 6, 2019 8.4
... ... ... ... ... ... ... ...
336 39 High Voltage Software [Action, Shooter, Shoot-'Em-Up, Top-Down] Zombieland: Double Tap - Road Trip GameMill Entertainment Oct 15, 2019 4.6
337 37 Square Enix, ilinx inc. [Action Adventure, General] Left Alive Square Enix Mar 5, 2019 8.3
338 36 Void Studios [Role-Playing, Action RPG] Eternity: The Last Unicorn 1C Company Mar 5, 2019 3.6
339 31 Dean 'Rocket' Hall [Sci-Fi, Action, Shooter, First-Person, Arcade] DayZ Bohemia Interactive May 30, 2019 2.9
340 31 Wakefield Interactive [Puzzle, Action] Where the Bees Make Honey Wakefield Interactive Mar 29, 2019 3.2

341 rows × 7 columns


Drop Irrelevant Columns

df = df.drop(['developer', 'name', 'publisher', 'release_data', 'users_rating'], axis=1)
df

      critic_rating  genres
0     91             [Role-Playing, Massively Multiplayer]
1     91             [Role-Playing, Action RPG]
2     91             [Action Adventure, Survival]
3     90             [Action Adventure, General]
4     89             [Role-Playing, Action RPG]
...   ...            ...
336   39             [Action, Shooter, Shoot-'Em-Up, Top-Down]
337   37             [Action Adventure, General]
338   36             [Role-Playing, Action RPG]
339   31             [Sci-Fi, Action, Shooter, First-Person, Arcade]
340   31             [Puzzle, Action]

341 rows × 2 columns

Critic Rating Data Type To Integer

df.dtypes
critic_rating    object
genres           object
dtype: object
df['critic_rating'] = pd.to_numeric(df['critic_rating'])
df.dtypes
critic_rating     int64
genres           object
dtype: object

Discretize Critic Rating

# categories to be mapped as they fall within certain ranges
categories = pd.Series(["very_low", "low", "moderate", "high", "very_high"])
# critic ratings ranges to be mapped
intervals_categories = [0, 20, 40, 60, 80]
# compute categories according to ranges specified
df['category'] = df.apply(discIntCat.numToCat, axis=1, args=('critic_rating', categories, intervals_categories))
# let categories be recognized by pandas
df['category'] = df['category'].astype("category")
# order categories
df['category'] = df['category'].cat.set_categories(categories, ordered=True)    
df

      critic_rating  genres                                           category
0     91             [Role-Playing, Massively Multiplayer]            very_high
1     91             [Role-Playing, Action RPG]                       very_high
2     91             [Action Adventure, Survival]                     very_high
3     90             [Action Adventure, General]                      very_high
4     89             [Role-Playing, Action RPG]                       very_high
...   ...            ...                                              ...
336   39             [Action, Shooter, Shoot-'Em-Up, Top-Down]        low
337   37             [Action Adventure, General]                      low
338   36             [Role-Playing, Action RPG]                       low
339   31             [Sci-Fi, Action, Shooter, First-Person, Arcade]  low
340   31             [Puzzle, Action]                                 low

341 rows × 3 columns
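
The helper discIntCat.numToCat lives in a local file that is not shown in this post. As a rough sketch of an equivalent step, pandas' built-in pd.cut can bin ratings by the same lower bounds. The closing edge of 101 and the choice that each bin includes its lower bound (right=False) are assumptions; the local helper may treat boundary values such as 80 differently.

```python
import pandas as pd

# same labels and lower bounds as above; the closing edge 101 is an assumption
labels = ["very_low", "low", "moderate", "high", "very_high"]
bin_edges = [0, 20, 40, 60, 80, 101]

# toy ratings taken from the rows displayed above
ratings = pd.Series([91, 90, 39, 31])

# right=False makes each bin include its lower edge: [0, 20), [20, 40), ...
binned = pd.cut(ratings, bins=bin_edges, labels=labels, right=False)
print(binned.tolist())
```

With labels supplied, pd.cut returns an ordered categorical by default, so under this approach the separate astype and set_categories calls would not be needed.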


Drop Critic Rating

df = df.drop(['critic_rating'], axis=1)
df

      genres                                           category
0     [Role-Playing, Massively Multiplayer]            very_high
1     [Role-Playing, Action RPG]                       very_high
2     [Action Adventure, Survival]                     very_high
3     [Action Adventure, General]                      very_high
4     [Role-Playing, Action RPG]                       very_high
...   ...                                              ...
336   [Action, Shooter, Shoot-'Em-Up, Top-Down]        low
337   [Action Adventure, General]                      low
338   [Role-Playing, Action RPG]                       low
339   [Sci-Fi, Action, Shooter, First-Person, Arcade]  low
340   [Puzzle, Action]                                 low

341 rows × 2 columns


Obtain Unique Series of Genres

sr_genres = df['genres']
sr_genres
0                [Role-Playing, Massively Multiplayer]
1                           [Role-Playing, Action RPG]
2                         [Action Adventure, Survival]
3                          [Action Adventure, General]
4                           [Role-Playing, Action RPG]
                            ...                       
336          [Action, Shooter, Shoot-'Em-Up, Top-Down]
337                        [Action Adventure, General]
338                         [Role-Playing, Action RPG]
339    [Sci-Fi, Action, Shooter, First-Person, Arcade]
340                                   [Puzzle, Action]
Name: genres, Length: 341, dtype: object
# concatenate genres lists, then filter duplicated elements
unique_genres = np.unique(np.concatenate(sr_genres, axis=0))
unique_genres
array(['2D', '3D', '4X', 'Action', 'Action Adventure', 'Action RPG',
       'Adventure', 'Arcade', 'Automobile', 'Baseball', 'Basketball',
       "Beat-'Em-Up", 'Biking', 'Billiards', 'Boxing / Martial Arts',
       'Business / Tycoon', 'Card Battle', 'Career', 'Civilian', 'Combat',
       'Command', 'Compilation', 'Cricket', 'Dancing', 'Defense',
       'Fantasy', 'Fighting', 'First-Person', 'Flight', 'Football',
       'General', 'Golf', 'Government', 'Ice Hockey', 'Individual',
       'Japanese-Style', 'Light Gun', 'Linear', 'Management', 'Marine',
       'Massively Multiplayer', 'Matching', 'Miscellaneous', 'Music',
       'Open-World', 'Other', 'Party / Minigame', 'Platformer',
       'Point-and-Click', 'Puzzle', 'Racing', 'Real-Time', 'Rhythm',
       'Roguelike', 'Role-Playing', 'Sandbox', 'Sci-Fi', "Shoot-'Em-Up",
       'Shooter', 'Sim', 'Simulation', 'Skate / Skateboard', 'Soccer',
       'Space', 'Sports', 'Strategy', 'Survival', 'Tactical', 'Tactics',
       'Team', 'Third-Person', 'Top-Down', 'Turn-Based', 'Vehicle',
       'Virtual', 'Virtual Life', 'Visual Novel', 'Western-Style',
       'Wrestling'], dtype='<U21')

Remove Spaces, Slashes and Dashes From Genres Names

# replace spaces and dashes with underscores, and remove slashes
def underscoreCleaner(strLis_in):
    tem_string = strLis_in
    tem_string = tem_string.replace(' ', '_')
    tem_string = tem_string.replace('/', '')
    tem_string = tem_string.replace('-', '_')
    return tem_string
temLis = pd.Series(unique_genres)
# apply cleaner
temLis = temLis.apply(underscoreCleaner)
cleanedUniqueGenres = temLis
cleanedUniqueGenres
0                   2D
1                   3D
2                   4X
3               Action
4     Action_Adventure
            ...       
74             Virtual
75        Virtual_Life
76        Visual_Novel
77       Western_Style
78           Wrestling
Length: 79, dtype: object

Create Column For Each Genre. Its Value Corresponds To Whether It is in Game’s Genres

# maps genresList_in to a boolean array, corresponding to whether a genre is in game's genres list
def isGenreIn(row_in, column_in, genresList_in):
    # game's genres list
    row_value = pd.Series(row_in[column_in])
    # all unique genres
    genresSer = pd.Series(genresList_in)
    # check whether each genre in all unique genres is in game's genres list
    # return a boolean array, corresponding to whether genre is found in game's list.
    return genresSer.isin(row_value)
# apply above function
genresAsColumns = df.apply(isGenreIn, axis=1, args=('genres', cleanedUniqueGenres))
# rename columns to all unique genres
genresAsColumns.columns = cleanedUniqueGenres
genresAsColumns

2D 3D 4X Action Action_Adventure Action_RPG Adventure Arcade Automobile Baseball ... Team Third_Person Top_Down Turn_Based Vehicle Virtual Virtual_Life Visual_Novel Western_Style Wrestling
0 False False False False False False False False False False ... False False False False False False False False False False
1 False False False False False False False False False False ... False False False False False False False False False False
2 False False False False False False False False False False ... False False False False False False False False False False
3 False False False False False False False False False False ... False False False False False False False False False False
4 False False False False False False False False False False ... False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
336 False False False True False False False False False False ... False False False False False False False False False False
337 False False False False False False False False False False ... False False False False False False False False False False
338 False False False False False False False False False False ... False False False False False False False False False False
339 False False False True False False False True False False ... False False False False False False False False False False
340 False False False True False False False False False False ... False False False False False False False False False False

341 rows × 79 columns

# map boolean values to integers. True to 1 and False to 0
genresAsColumns = genresAsColumns.astype(int)
genresAsColumns

2D 3D 4X Action Action_Adventure Action_RPG Adventure Arcade Automobile Baseball ... Team Third_Person Top_Down Turn_Based Vehicle Virtual Virtual_Life Visual_Novel Western_Style Wrestling
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
336 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
337 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
338 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
339 0 0 0 1 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
340 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

341 rows × 79 columns
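
One thing worth checking in the table above: row 2 lists Action Adventure among its genres, yet its Action_Adventure column is 0, and row 336 lists Top-Down while Top_Down is 0. That is what one would expect if the cleaned column names (with underscores) are compared against the raw genre names (with spaces and dashes), so that only single-word genres such as Action can match. The following is a minimal sketch, on three toy rows and not the pipeline used in this post, of an encoding that cleans the names on both sides first. It uses scikit-learn's MultiLabelBinarizer; the names clean_genre, toy and encoded are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

def clean_genre(name):
    # same replacements as underscoreCleaner above
    return name.replace(' ', '_').replace('/', '').replace('-', '_')

# three toy rows copied from the genres column shown earlier
toy = pd.DataFrame({'genres': [
    ['Action Adventure', 'Survival'],
    ['Puzzle', 'Action'],
    ['Role-Playing', 'Action RPG'],
]})

# clean every genre inside every game's list, so names match the columns
cleaned = toy['genres'].apply(lambda genres: [clean_genre(g) for g in genres])

# one 0/1 column per distinct cleaned genre
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(cleaned.tolist()),
                       columns=mlb.classes_, index=toy.index)
print(encoded)
```

In this toy encoding the first row gets a 1 under Action_Adventure and a 0 under Action, because the comparison happens between names in the same cleaned form.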


Map Classes To Their Corresponding Numeric Indices

# a mapper from class string to its index
classesArr = np.array(["very_low", "low", "moderate", "high", "very_high"])
catToIndDict = {"very_low":0, "low":1, "moderate":2, "high":3, "very_high":4}
def catToInd(str_in):
    return catToIndDict[str_in]
classes = df['category']
classes
0      very_high
1      very_high
2      very_high
3      very_high
4      very_high
         ...    
336          low
337          low
338          low
339          low
340          low
Name: category, Length: 341, dtype: category
Categories (5, object): [very_low < low < moderate < high < very_high]
# apply above mapper
classes = classes.apply(lambda x: catToInd(x))
classes
0      4
1      4
2      4
3      4
4      4
      ..
336    1
337    1
338    1
339    1
340    1
Name: category, Length: 341, dtype: category
Categories (5, int64): [0 < 1 < 2 < 3 < 4]
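
Because the category column is already an ordered categorical, pandas keeps an integer code for each value, following the declared category order. A small sketch, on toy values, of reading those codes directly instead of going through a dictionary:

```python
import pandas as pd

# same category order as declared above
order = ["very_low", "low", "moderate", "high", "very_high"]

# toy values standing in for df['category']
toy = pd.Series(["very_high", "low", "high"])
toy = toy.astype(pd.CategoricalDtype(categories=order, ordered=True))

# .cat.codes gives the position of each value in the category order
codes = toy.cat.codes
print(codes.tolist())  # [4, 1, 3]
```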

Gaussian Naive Bayes

Convert X and Y To Numpy Arrays

genresAsColumns = genresAsColumns.to_numpy()
classes = classes.to_numpy()
genresAsColumns
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
classes
array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1])

Split Data For Testing and Training

X_train, X_test, y_train, y_test = train_test_split(genresAsColumns, classes, test_size=0.3, random_state=0)

Fit Model

# initialize a Gaussian naive Bayes model
gnb = GaussianNB()
# fit the model on game genres and classes
clf = gnb.fit(X_train, y_train)

Predicting Test Data

y_pred = clf.predict(X_test)

Accuracy

# number of correctly labeled test points, and total number of test points
correctlyLabeledNum = (y_test == y_pred).sum()
totalPointsNum = X_test.shape[0]
print("Number of Correctly Labeled: ", correctlyLabeledNum)
print("Number of Total Points: ", totalPointsNum)
Number of Correctly Labeled:  17
Number of Total Points:  103
print("Accuracy Percentage: ", correctlyLabeledNum/totalPointsNum)
Accuracy Percentage:  0.1650485436893204
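
To put 0.165 in context, it helps to compare it against always guessing the most common class. The counts below were tallied by hand from the classes array printed earlier (76 fours, 206 threes, 53 twos, 6 ones). The share is computed on the full dataset, not on the 103-row test split, so it is only an approximate yardstick.

```python
# class index -> number of games, tallied from the classes array above
class_counts = {4: 76, 3: 206, 2: 53, 1: 6}
total = sum(class_counts.values())

# the most frequent class and its share of the whole dataset
majority_class, majority_count = max(class_counts.items(), key=lambda kv: kv[1])
majority_share = majority_count / total

# accuracy reported above: 17 correct out of 103 test points
reported_accuracy = 17 / 103

print("Majority class index:", majority_class)
print("Majority share: %.3f" % majority_share)
print("Reported accuracy: %.3f" % reported_accuracy)
```

Under these counts, always answering high (index 3) would be right about 60% of the time on the full data, well above the accuracy reported above.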

Predicting Upcoming Games

A Mapper From A Series Element To Its Corresponding Index

def indexFromName(series_in, str_in):
    return series_in[series_in == str_in].index[0]
indexFromName(cleanedUniqueGenres, '2D')
0

Construct Array of Indices From Strings

def arrayFromNames(series_in, strLis_in):
    # Initialize an array of zeros, one entry per known genre (79 here)
    gameGenres = np.zeros(len(series_in), dtype=int)
    
    # for each input genre, set its corresponding value to 1
    for name in strLis_in:
        gameGenres[indexFromName(series_in, name)] = 1
        
    return gameGenres

Predict Class Name From Genres Strings

def predictClassFromGenres(strLis_in):
    # map genres strings to their corresponding indices
    gameGenreArray = arrayFromNames(cleanedUniqueGenres, strLis_in)
    # prediction of class numeric value
    gamePred = clf.predict([
        gameGenreArray
    ])
    # return predicted class name
    return classesArr[gamePred[0]]

Doom Eternal

predictClassFromGenres(['Action', 'Shooter', 'First_Person', 'Arcade'])
'low'

Control: The Foundation

predictClassFromGenres(['Action_Adventure', 'General'])
'low'

Resident Evil 3

predictClassFromGenres(['Action_Adventure', 'Survival'])
'moderate'

Discussion

Let’s now analyze the embarrassing results I reached. A realization of ignorance is not as bad as an ignorance of being ignorant. For the latter, there is no chance of remediation; for the former, I am skeptical that a chance is guaranteed.

There is no pattern to be fitted. I have seen plenty of computer science students who care only about machine learning models and take no interest in the data itself! That is like claiming astronomy is all about telescopes. In fact, data science rests on our understanding of real-life data and on whether we can discover and verify patterns in them. Machine learning models are a toolbox that helps the data scientist reveal insights in data, but they are not the principal goal. In our case, it is well known that genres are not indicators of a game’s quality at all. If the data contain no pattern, then the hypothesized pattern will not emerge from whatever model you apply. I would doubt myself if the model reached a high accuracy rate.

The feature vectors ridiculously oversimplify the item. Two third-person action-adventure games probably have totally different play styles and themes. Reducing games to their genres is like describing a student’s skills by their faculty. Is graduating from a CS major an indicator of a student’s skills? They might be either a lazy or a dedicated student. A curious and challenging inquiry arises here. How do we represent aesthetics in terms of numbers? How do we objectively measure a game’s degree of fun? Is it even possible for science to someday reach objective measures of human feelings? The only thing I am sure of is that no one is sure of the answers to these questions (sounds like a self-contradictory statement, right?).

Finally, note that naive Bayes assumes the features are independent of each other given the class, which is not the case here. Action games are more likely to also be adventure games, for instance.
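
One way to probe the independence assumption is to compare how often a genre appears overall with how often it appears among games that carry another genre. The sketch below does this on a tiny made-up indicator table; the values are invented for illustration and are not taken from the Metacritic data. Strictly speaking, naive Bayes assumes independence within each class, so a fuller check would repeat the comparison per rating category.

```python
import pandas as pd

# hypothetical 0/1 genre indicators for six imaginary games
toy = pd.DataFrame({
    "Action":    [1, 1, 1, 1, 0, 0],
    "Adventure": [1, 1, 1, 0, 0, 0],
})

# P(Adventure): share of all games tagged Adventure
p_adventure = toy["Adventure"].mean()

# P(Adventure | Action): share of Action games also tagged Adventure
p_adventure_given_action = toy.loc[toy["Action"] == 1, "Adventure"].mean()

print(p_adventure, p_adventure_given_action)
```

If the two genres were independent, the two numbers would be roughly equal; a large gap suggests the assumption is violated.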

Mostafa Touny
Software Engineering Undergrad