Applying Machine Learning Is Not Enough: Predicting Ratings From Game Genres

Do not get carried away just because you applied machine learning

Mar 13, 2020

Preface

In this blog post we fit a naive Bayes model to predict games' critic ratings from their genres. The data were scraped from Metacritic's best PS4 games with a scraper of mine. We do not claim a result of value. An introductory recommender-systems course yields results far more accurate and reliable than the approach presented here. Why am I doing this then? As machine learning is overwhelmingly hyped today, it is nice to have a facet of it in my portfolio. Jump directly to the Discussion for the gist of this post.

Table of Contents

Intro

Preface

Data Preprocessing

Data Cleansing
Discretize Critic Rating
Obtain Unique Series of Genres
Create Column For Each Genre. Its Value Corresponds To Whether It is in Game’s Genres

Applying Machine Learning

Naive Gaussian Bayes
Predicting Upcoming Games
Discussion

Import Libraries and Local Files

# 3rd-party libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
# local-files
import jsonRW as jsRW
import discretizeIntoCategories as discIntCat

Data Cleansing

Read Data

# read the local JSON file
metacritic_json = jsRW.readJson('metacritic2019ps4_data')
# parse it as a pandas DataFrame
df = pd.DataFrame(metacritic_json)

df

	critic_rating	developer	genres	name	publisher	release_data	users_rating
0	91	Square Enix	[Role-Playing, Massively Multiplayer]	Final Fantasy XIV: Shadowbringers	Square Enix	Jul 2, 2019	8.3
1	91	PlatinumGames	[Role-Playing, Action RPG]	NieR: Automata - Game of the YoRHa Edition	Square Enix	Feb 26, 2019	8.5
2	91	Capcom R&D Division 1	[Action Adventure, Survival]	Resident Evil 2	Capcom	Jan 25, 2019	8.8
3	90	From Software	[Action Adventure, General]	Sekiro: Shadows Die Twice	Activision	Mar 22, 2019	7.9
4	89	Capcom	[Role-Playing, Action RPG]	Monster Hunter: World - Iceborne	Capcom	Sep 6, 2019	8.4
...	...	...	...	...	...	...	...
336	39	High Voltage Software	[Action, Shooter, Shoot-'Em-Up, Top-Down]	Zombieland: Double Tap - Road Trip	GameMill Entertainment	Oct 15, 2019	4.6
337	37	Square Enix, ilinx inc.	[Action Adventure, General]	Left Alive	Square Enix	Mar 5, 2019	8.3
338	36	Void Studios	[Role-Playing, Action RPG]	Eternity: The Last Unicorn	1C Company	Mar 5, 2019	3.6
339	31	Dean 'Rocket' Hall	[Sci-Fi, Action, Shooter, First-Person, Arcade]	DayZ	Bohemia Interactive	May 30, 2019	2.9
340	31	Wakefield Interactive	[Puzzle, Action]	Where the Bees Make Honey	Wakefield Interactive	Mar 29, 2019	3.2

341 rows × 7 columns

Drop Irrelevant Columns

df = df.drop(['developer', 'name', 'publisher', 'release_data', 'users_rating'], axis=1)

df

	critic_rating	genres
0	91	[Role-Playing, Massively Multiplayer]
1	91	[Role-Playing, Action RPG]
2	91	[Action Adventure, Survival]
3	90	[Action Adventure, General]
4	89	[Role-Playing, Action RPG]
...	...	...
336	39	[Action, Shooter, Shoot-'Em-Up, Top-Down]
337	37	[Action Adventure, General]
338	36	[Role-Playing, Action RPG]
339	31	[Sci-Fi, Action, Shooter, First-Person, Arcade]
340	31	[Puzzle, Action]

341 rows × 2 columns

Critic Rating Data Type To Integer

df.dtypes

critic_rating    object
genres           object
dtype: object

df['critic_rating'] = pd.to_numeric(df['critic_rating'])

df.dtypes

critic_rating     int64
genres           object
dtype: object

Discretize Critic Rating

# categories to be mapped as they fall within certain ranges
categories = pd.Series(["very_low", "low", "moderate", "high", "very_high"])
# critic ratings ranges to be mapped
intervals_categories = [0, 20, 40, 60, 80]

# compute categories according to ranges specified
df['category'] = df.apply(discIntCat.numToCat, axis=1, args=('critic_rating', categories, intervals_categories))
# let categories be recognized by pandas
df['category'] = df['category'].astype("category")
# order categories
df['category'] = df['category'].cat.set_categories(categories, ordered=True)

df

	critic_rating	genres	category
0	91	[Role-Playing, Massively Multiplayer]	very_high
1	91	[Role-Playing, Action RPG]	very_high
2	91	[Action Adventure, Survival]	very_high
3	90	[Action Adventure, General]	very_high
4	89	[Role-Playing, Action RPG]	very_high
...	...	...	...
336	39	[Action, Shooter, Shoot-'Em-Up, Top-Down]	low
337	37	[Action Adventure, General]	low
338	36	[Role-Playing, Action RPG]	low
339	31	[Sci-Fi, Action, Shooter, First-Person, Arcade]	low
340	31	[Puzzle, Action]	low

341 rows × 3 columns
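The helper `discIntCat.numToCat` lives in a local file that is not shown here. As a rough equivalent, `pd.cut` can bin a numeric column into ordered categories in one call. The sketch below is an assumption about how the bins behave: it treats them as left-closed ([0, 20), [20, 40), and so on) and adds an upper edge of 101 so that a rating of 100 is included. The real helper may treat boundary values differently.

```python
import pandas as pd

# toy ratings; labels and lower edges mirror the ones used above
ratings = pd.Series([91, 80, 79, 39, 5], name="critic_rating")
labels = ["very_low", "low", "moderate", "high", "very_high"]
# assumption: left-closed bins, with 101 capping the top bin
edges = [0, 20, 40, 60, 80, 101]

# pd.cut returns an ordered categorical when labels are given
category = pd.cut(ratings, bins=edges, labels=labels, right=False)
print(category.tolist())
```

With this form the separate `astype("category")` and `set_categories` steps are not needed, since the result is already an ordered categorical.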

Drop Critic Rating

df = df.drop(['critic_rating'], axis=1)

df

	genres	category
0	[Role-Playing, Massively Multiplayer]	very_high
1	[Role-Playing, Action RPG]	very_high
2	[Action Adventure, Survival]	very_high
3	[Action Adventure, General]	very_high
4	[Role-Playing, Action RPG]	very_high
...	...	...
336	[Action, Shooter, Shoot-'Em-Up, Top-Down]	low
337	[Action Adventure, General]	low
338	[Role-Playing, Action RPG]	low
339	[Sci-Fi, Action, Shooter, First-Person, Arcade]	low
340	[Puzzle, Action]	low

341 rows × 2 columns

Obtain Unique Series of Genres

sr_genres = df['genres']

sr_genres

0                [Role-Playing, Massively Multiplayer]
1                           [Role-Playing, Action RPG]
2                         [Action Adventure, Survival]
3                          [Action Adventure, General]
4                           [Role-Playing, Action RPG]
                            ...                       
336          [Action, Shooter, Shoot-'Em-Up, Top-Down]
337                        [Action Adventure, General]
338                         [Role-Playing, Action RPG]
339    [Sci-Fi, Action, Shooter, First-Person, Arcade]
340                                   [Puzzle, Action]
Name: genres, Length: 341, dtype: object

# concatenate genres lists, then filter duplicated elements
unique_genres = np.unique(np.concatenate(sr_genres, axis=0))

unique_genres

array(['2D', '3D', '4X', 'Action', 'Action Adventure', 'Action RPG',
       'Adventure', 'Arcade', 'Automobile', 'Baseball', 'Basketball',
       "Beat-'Em-Up", 'Biking', 'Billiards', 'Boxing / Martial Arts',
       'Business / Tycoon', 'Card Battle', 'Career', 'Civilian', 'Combat',
       'Command', 'Compilation', 'Cricket', 'Dancing', 'Defense',
       'Fantasy', 'Fighting', 'First-Person', 'Flight', 'Football',
       'General', 'Golf', 'Government', 'Ice Hockey', 'Individual',
       'Japanese-Style', 'Light Gun', 'Linear', 'Management', 'Marine',
       'Massively Multiplayer', 'Matching', 'Miscellaneous', 'Music',
       'Open-World', 'Other', 'Party / Minigame', 'Platformer',
       'Point-and-Click', 'Puzzle', 'Racing', 'Real-Time', 'Rhythm',
       'Roguelike', 'Role-Playing', 'Sandbox', 'Sci-Fi', "Shoot-'Em-Up",
       'Shooter', 'Sim', 'Simulation', 'Skate / Skateboard', 'Soccer',
       'Space', 'Sports', 'Strategy', 'Survival', 'Tactical', 'Tactics',
       'Team', 'Third-Person', 'Top-Down', 'Turn-Based', 'Vehicle',
       'Virtual', 'Virtual Life', 'Visual Novel', 'Western-Style',
       'Wrestling'], dtype='<U21')
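The same unique list can also be built without leaving pandas. A small sketch on toy data (the three rows are made up for illustration), using `Series.explode` to get one row per genre:

```python
import pandas as pd

# toy stand-in for df['genres']
genres = pd.Series([
    ["Role-Playing", "Action RPG"],
    ["Action", "Shooter"],
    ["Puzzle", "Action"],
])

# explode() yields one row per list element; unique() keeps first-seen order,
# so sort explicitly to match the sorted output of np.unique
unique_sorted = sorted(genres.explode().unique())
print(unique_sorted)
```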

Remove Spaces, Slashes and Dashes From Genres Names

# replace spaces and dashes with underscores, and drop slashes
def underscoreCleaner(strLis_in):
    tem_string = strLis_in
    tem_string = tem_string.replace(' ', '_')
    tem_string = tem_string.replace('/', '')
    tem_string = tem_string.replace('-', '_')
    return tem_string

temLis = pd.Series(unique_genres)

# apply cleaner
temLis = temLis.apply(underscoreCleaner)

cleanedUniqueGenres = temLis

cleanedUniqueGenres

0                   2D
1                   3D
2                   4X
3               Action
4     Action_Adventure
            ...       
74             Virtual
75        Virtual_Life
76        Visual_Novel
77       Western_Style
78           Wrestling
Length: 79, dtype: object
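One side effect of chaining the three replacements: a name like 'Boxing / Martial Arts' ends up with a double underscore ('Boxing__Martial_Arts'), because both spaces around the slash become underscores and the slash itself is dropped. If single underscores are preferred, a regular expression can collapse any run of separators at once. The function name `clean_genre` below is hypothetical, and using it would change some column names relative to the ones shown above.

```python
import re

def clean_genre(name):
    # collapse any run of spaces, slashes and dashes into a single underscore
    return re.sub(r"[\s/\-]+", "_", name)

print(clean_genre("Boxing / Martial Arts"))  # Boxing_Martial_Arts
print(clean_genre("Shoot-'Em-Up"))           # Shoot_'Em_Up
```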

Create Column For Each Genre. Its Value Corresponds To Whether It is in Game’s Genres

# maps genresList_in to a boolean array, corresponding to whether a genre is in game's genres list
def isGenreIn(row_in, column_in, genresList_in):
    # game's genres list
    row_value = pd.Series(row_in[column_in])
    # all unique genres
    genresSer = pd.Series(genresList_in)
    # check whether each genre in all unique genres is in game's genres list
    # return a boolean array, corresponding to whether genre is found in game's list.
    return genresSer.isin(row_value)

# apply above function
genresAsColumns = df.apply(isGenreIn, axis=1, args=('genres', cleanedUniqueGenres))

# rename columns to all unique genres
genresAsColumns.columns = cleanedUniqueGenres

genresAsColumns

	2D	3D	4X	Action	Action_Adventure	Action_RPG	Adventure	Arcade	Automobile	Baseball	...	Team	Third_Person	Top_Down	Turn_Based	Vehicle	Virtual	Virtual_Life	Visual_Novel	Western_Style	Wrestling
0	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
1	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336	False	False	False	True	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
337	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
338	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
339	False	False	False	True	False	False	False	True	False	False	...	False	False	False	False	False	False	False	False	False	False
340	False	False	False	True	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False

341 rows × 79 columns

# map boolean values to integers. True to 1 and False to 0
genresAsColumns = genresAsColumns.astype(int)

genresAsColumns

	2D	3D	4X	Action	Action_Adventure	Action_RPG	Adventure	Arcade	Automobile	Baseball	...	Team	Third_Person	Top_Down	Turn_Based	Vehicle	Virtual	Virtual_Life	Visual_Novel	Western_Style	Wrestling
0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336	0	0	0	1	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
337	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
338	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
339	0	0	0	1	0	0	0	1	0	0	...	0	0	0	0	0	0	0	0	0	0
340	0	0	0	1	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

341 rows × 79 columns
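A detail worth checking in the tables above: rows 2 and 3 list 'Action Adventure' among their genres, yet their Action_Adventure column is False/0. That is what one would expect if the lookup compares the cleaned names (with underscores) against the raw genre strings stored in `df['genres']`, in which case only names the cleaner leaves untouched, such as 'Action' or 'Arcade', can match. A minimal sketch of one way around it, on a toy frame: encode against the raw names first and rename the columns afterwards. It uses scikit-learn's `MultiLabelBinarizer`, which is a swapped-in technique and not what this post ran.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# toy stand-in for df, with raw (uncleaned) genre names
toy = pd.DataFrame({"genres": [
    ["Action Adventure", "Survival"],
    ["Role-Playing", "Action RPG"],
    ["Puzzle", "Action"],
]})

mlb = MultiLabelBinarizer()
# fit on the raw genre lists so every raw name gets its own 0/1 column
encoded = pd.DataFrame(mlb.fit_transform(toy["genres"]), columns=mlb.classes_)
# only now rename the columns to the underscore style
encoded.columns = [
    c.replace(" ", "_").replace("/", "").replace("-", "_") for c in encoded.columns
]
print(encoded)
```

Each toy row now has exactly two ones, one per listed genre.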

Map Classes To Their Corresponding Numeric Indices

# a mapper from class string to its index
classesArr = np.array(["very_low", "low", "moderate", "high", "very_high"])
catToIndDict = {"very_low":0, "low":1, "moderate":2, "high":3, "very_high":4}
def catToInd(str_in):
    return catToIndDict[str_in]

classes = df['category']

classes

0      very_high
1      very_high
2      very_high
3      very_high
4      very_high
         ...    
336          low
337          low
338          low
339          low
340          low
Name: category, Length: 341, dtype: category
Categories (5, object): [very_low < low < moderate < high < very_high]

# apply above mapper
classes = classes.apply(lambda x: catToInd(x))

classes

0      4
1      4
2      4
3      4
4      4
      ..
336    1
337    1
338    1
339    1
340    1
Name: category, Length: 341, dtype: category
Categories (5, int64): [0 < 1 < 2 < 3 < 4]
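Since the category column already carries an explicit order, pandas can hand out the integer indices directly through `.cat.codes`, without a hand-written dictionary. A short sketch on toy values (the variable names are illustrative):

```python
import pandas as pd

labels = ["very_low", "low", "moderate", "high", "very_high"]
# toy ordered categorical with the same category order as above
category = pd.Series(["very_high", "low", "high"]).astype(
    pd.CategoricalDtype(categories=labels, ordered=True)
)

# codes follow the position of each label in `labels`
codes = category.cat.codes
print(codes.tolist())  # [4, 1, 3]
```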

Naive Gaussian Bayes

Convert X and Y To Numpy Arrays

genresAsColumns = genresAsColumns.to_numpy()

classes = classes.to_numpy()

genresAsColumns

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

classes

array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1])

Split Data For Testing and Training

X_train, X_test, y_train, y_test = train_test_split(genresAsColumns, classes, test_size=0.3, random_state=0)

Fit Model

# initialize a Gaussian naive Bayes model
gnb = GaussianNB()
# fit the model on game genres and classes
clf = gnb.fit(X_train, y_train)

Predicting Test Data

y_pred = clf.predict(X_test)

Accuracy

# Number of correctly labeled points out of total, respectively
correctlyLabeledNum = (y_test == y_pred).sum()
totalPointsNum = X_test.shape[0]
print("Number of Correctly Labeled: ", correctlyLabeledNum)
print("Number of Total Points: ", totalPointsNum)

Number of Correctly Labeled:  17
Number of Total Points:  103

print("Accuracy (fraction correct): ", correctlyLabeledNum/totalPointsNum)

Accuracy (fraction correct):  0.1650485436893204
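To put 16.5% in context, it helps to compare against a classifier that ignores the features altogether. In the `classes` array above, 206 of the 341 games fall in class 3 (high), so always answering "high" would be right roughly 60% of the time if the test split mirrors the full data. The sketch below shows how such a baseline can be computed with scikit-learn's `DummyClassifier`. It also fits `BernoulliNB`, the naive Bayes variant designed for 0/1 features, as a possible alternative to `GaussianNB`. The data are synthetic stand-ins (random genre flags and skewed labels), so the printed numbers say nothing about the Metacritic data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(0)
# synthetic stand-in: 341 games, 79 sparse 0/1 genre flags, labels skewed towards class 3
X = (rng.random((341, 79)) < 0.05).astype(int)
y = rng.choice([1, 2, 3, 4], size=341, p=[0.02, 0.16, 0.60, 0.22])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "majority baseline": DummyClassifier(strategy="most_frequent"),
    "GaussianNB": GaussianNB(),
    "BernoulliNB": BernoulliNB(),
}
# score() returns the fraction of correctly labeled test points
scores = {name: m.fit(X_train, y_train).score(X_test, y_test) for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

A model that scores below the majority baseline is doing worse than not looking at the genres at all, which is a useful sanity check before interpreting any accuracy figure.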

Predicting Upcoming Games

A Mapper From A Series Element To Its Corresponding Index

def indexFromName(series_in, str_in):
    return series_in[series_in == str_in].index[0]

indexFromName(cleanedUniqueGenres, '2D')

Construct Array of Indices From Strings

def arrayFromNames(series_in, strLis_in):
    # initialize an array of zeros, one entry per known genre (79 here)
    gameGenres = np.zeros(len(series_in), dtype=int)

    # for each input genre, set its corresponding value to 1
    for name in strLis_in:
        gameGenres[indexFromName(series_in, name)] = 1

    return gameGenres
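Note that a misspelled genre name makes `indexFromName` fail with a bare IndexError, since the filtered series is empty. A possible variant, sketched here under the hypothetical name `genres_to_vector`, builds a name-to-position dictionary once and reports unknown names explicitly:

```python
import numpy as np
import pandas as pd

def genres_to_vector(all_genres, names):
    # map each genre name to its position once, instead of scanning the series per name
    position = {g: i for i, g in enumerate(all_genres)}
    unknown = [n for n in names if n not in position]
    if unknown:
        raise ValueError(f"unknown genres: {unknown}")
    vec = np.zeros(len(position), dtype=int)
    # set the entries of the requested genres to 1
    vec[[position[n] for n in names]] = 1
    return vec

# toy stand-in for cleanedUniqueGenres
toy_genres = pd.Series(["2D", "Action", "First_Person", "Shooter"])
print(genres_to_vector(toy_genres, ["Action", "Shooter"]))  # [0 1 0 1]
```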

Predict Class Name From Genres Strings

def predictClassFromGenres(strLis_in):
    # map genres strings to their corresponding indices
    gameGenreArray = arrayFromNames(cleanedUniqueGenres, strLis_in)
    # prediction of class numeric value
    gamePred = clf.predict([
        gameGenreArray
    ])
    # return predicted class name
    return classesArr[gamePred[0]]

Doom Eternal

predictClassFromGenres(['Action', 'Shooter', 'First_Person', 'Arcade'])

'low'

Control: The Foundation

predictClassFromGenres(['Action_Adventure', 'General'])

'low'

Resident Evil 3

predictClassFromGenres(['Action_Adventure', 'Survival'])

'moderate'

Discussion

Let’s now analyze the embarrassing results I reached. Realizing one's ignorance is not as bad as being ignorant of it. In the latter case there is no chance of remediation; in the former, I am skeptical that remediation is guaranteed either.

There is no pattern to be fitted. I have seen plenty of computer science students who care only about machine learning models and take no interest in the data itself! That is like claiming astronomy is all about telescopes. In fact, data science rests on our understanding of real-life data and on whether we can discover and verify patterns in them. Machine learning models are a toolbox that lets data scientists reveal insights in data, but they are not the principal goal. In our case, it is well known that genres are not indicators of a game’s quality. If the data contain no pattern, the hypothesized pattern will not emerge from whatever model you apply. I would have doubted myself had the model reached a high accuracy rate.

The feature vector ridiculously oversimplifies the item. Two third-person action-adventure games probably have totally different play styles and themes. Reducing games to their genres is like describing a student’s skills by their faculty. Is graduating with a CS major an indicator of a student’s skills? They might be either lazy or dedicated. A curious and challenging inquiry arises here. How do we represent aesthetics in terms of numbers? How do we objectively measure a game’s degree of fun? Is it even possible for science to someday arrive at objective measures of human feelings? The only thing I am sure of is that no one is sure of the answers to these questions (sounds like a self-contradictory statement, right?).

Finally, note that naive Bayes assumes the features are independent of each other given the class, which is not the case here. Action games are more likely to be adventure games, for instance.
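One simple way to probe that assumption is to compare how often a genre appears overall with how often it appears among games that carry another genre. The sketch below does this on a made-up 0/1 table (the values are hypothetical, not the scraped data). Strictly speaking, naive Bayes assumes independence within each class, so a fuller check would repeat this per rating class; that step is skipped here for brevity.

```python
import pandas as pd

# toy 0/1 genre table (hypothetical values)
toy = pd.DataFrame({
    "Action":    [1, 1, 1, 0, 0, 1, 0, 1],
    "Adventure": [1, 1, 0, 0, 0, 1, 0, 0],
    "Puzzle":    [0, 0, 0, 1, 1, 0, 1, 0],
})

# overall share of adventure games vs. share among action games
p_adventure = toy["Adventure"].mean()
p_adventure_given_action = toy.loc[toy["Action"] == 1, "Adventure"].mean()
print(p_adventure, p_adventure_given_action)

# pairwise correlations between genre flags; values far from 0 hint at dependence
print(toy.corr().round(2))
```

If the two probabilities differ noticeably, as they do in this toy table, the genre flags are not independent.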