[Python] Principal Component Analysis and K-means Clustering with the IMDB Movie Dataset

Hello! Today’s post is the first one in which I present my results in Python. Although I love R and I’m loyal to it, Python is popular with many data scientists. It is quite easy to learn and it has a lot of great functions and libraries.

In this post, I implement two unsupervised learning methods: 1. Principal Component Analysis and 2. K-means Clustering. A reader with no background in Machine Learning might think, “what the hell is unsupervised learning?” I will try my best to explain this concept.

Unsupervised Learning

OK, let’s imagine you are going backpacking in a new country. Isn’t it exciting? But you do not know much about the country: its food, culture, language and so on. Still, from day one you start making sense of things: you learn which new dishes to eat (and which to avoid), and you find your way to that beach.

In this example, you have lots of information but initially do not know what to do with it. There is no clear guidance and you have to find the way by yourself. Like the traveler, an unsupervised learning method is given only a set of inputs, with no labeled answers, and has to find structure in them on its own. Principal Component Analysis and K-means clustering are two of the most famous examples of unsupervised learning. I will explain them a little later.


Before I begin talking about how I analyzed the data, let’s talk about the data itself. There are 5,043 movies in total, each with 28 attributes. The attributes range from the director’s name to the number of Facebook likes.


1. Data Cleaning

In Statistics class, we often get clean data: no missing values, no NA values. But in reality, clean data is just a dream. There are always some messy parts in the data, and it’s our job to trim the data into a usable shape before running the analysis.

Here are some libraries you need for this post.


First, let’s filter the data to keep only the numeric columns and drop the ones that contain text. To do so, I created a Python list, “num_list”, containing the names of the numeric columns.


By the way, when it comes to Python, the pandas library is a must-have. Using pandas, we can create a new dataframe (movie_num) containing just the numeric columns.
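A minimal sketch of this filtering step. The tiny DataFrame and its values are made up for illustration, and picking the columns with select_dtypes is an assumption on my side; typing the numeric column names into num_list by hand gives the same result.

```python
import pandas as pd

# Toy stand-in for the movie data; the values are invented for illustration.
movie = pd.DataFrame({
    "director_name": ["A", "B", "C"],
    "duration": [120.0, 95.0, 143.0],
    "gross": [1.5e8, 2.0e7, 3.1e8],
    "imdb_score": [7.1, 6.3, 8.0],
})

# Collect the names of the numeric columns, then keep only those columns.
num_list = movie.select_dtypes(include="number").columns.tolist()
movie_num = movie[num_list]
print(num_list)
```

The text column (director_name) is left out, so movie_num can go straight into numerical methods such as PCA.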


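With the “fillna” function, we can easily fill in the NaN values so that no missing values remain. (To discard the rows that contain them instead, pandas offers “dropna”.)

The variables are also measured on very different scales, and some of their distributions are skewed, so we standardize them: each variable is rescaled to have mean 0 and standard deviation 1.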
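Here is a small sketch of both steps. The data is invented, filling with 0 is just one possible choice of fill value, and using scikit-learn's StandardScaler is an assumption about how the standardization might be done.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numeric data with a few missing values (invented for illustration).
movie_num = pd.DataFrame({
    "duration": [120.0, np.nan, 143.0, 95.0],
    "gross": [1.5e8, 2.0e7, np.nan, 3.1e8],
})

# Replace missing values with 0 (dropna() would drop the rows instead).
filled = movie_num.fillna(value=0)

# Rescale every column to mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(filled.values)
print(X_std.round(2))
```

After this step every column contributes on the same scale, which matters for PCA because it is driven by variance.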


2. Correlation Analysis

Hexbin Plot

Let’s look at some hexbin visualisations first to get a feel for how the different features relate to one another. A hexbin plot divides the plane into hexagonal cells and colors each cell by the number of data points that fall into it; in the plots here, the lighter the cell, the more movies it contains.


This is a hexbin plot of IMDB score against gross revenue. We can see it’s lighter around scores between 6 and 7.



This is a hexbin plot of IMDB score against duration. Again, the region around scores between 6 and 7 is lighter.
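For readers who want to draw a plot like these, here is a hedged sketch with matplotlib's hexbin. The data is synthetic (random scores and revenues, not the IMDB values), and the grid size and colormap are arbitrary choices of mine.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n = 500
imdb_score = rng.normal(6.5, 1.0, n)              # synthetic scores
gross = rng.lognormal(mean=17, sigma=1, size=n)   # synthetic revenue

fig, ax = plt.subplots()
hb = ax.hexbin(imdb_score, gross, gridsize=15, cmap="viridis")
ax.set_xlabel("imdb_score")
ax.set_ylabel("gross")
fig.colorbar(hb, label="movies per cell")

counts = hb.get_array()  # number of points that fell into each hexagon
print(counts.max())
```

With the viridis colormap, cells holding more points are drawn in lighter colors, so the lightest region marks where the data is densest.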

We can examine the correlations more closely using a Pearson correlation plot.


As we can see from the heatmap, there are regions (groups of features) with fairly strong positive linear correlations, shown by the darker shades: the top left-hand corner and the bottom right quarter. This is a good sign, as it means there are linearly correlated features that PCA can combine into fewer projections.
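The numbers behind such a heatmap are just a Pearson correlation matrix. A small sketch with pandas, on synthetic columns whose names and relationships are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
budget = rng.normal(50, 10, 200)
df = pd.DataFrame({
    "budget": budget,
    "gross": 2 * budget + rng.normal(0, 5, 200),   # built to track budget
    "noise": rng.normal(0, 1, 200),                # unrelated column
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr(method="pearson")
print(corr.round(2))
```

Passing a matrix like corr to Seaborn's heatmap function is one way to turn it into a colored plot.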

3. Explained Variance Measure & Principal Component Analysis

Now you know what unsupervised learning is (I hope so). Next, let me explain principal component analysis. The explanation will not be as entertaining as the one for unsupervised learning, but I’ll try my best!

Principal component analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset. It’s often used to make data easy to explore and visualize.  Principal components are dimensions along which your data points are most spread out:

<Image: principal components as the directions along which the data points are most spread out, from https://www.quora.com/What-is-an-intuitive-explanation-for-PCA>

Let me give you an example. Imagine that you are a nutritionist trying to explore the nutritional content of food. What is the best way to differentiate food items? By vitamin content? Protein levels? Or perhaps a combination of both?

Knowing the variables that best differentiate your items has several uses:

1. Visualization. Using the right variables to plot items will give more insights.

2. Uncovering Clusters. With good visualizations, hidden categories or clusters could be identified. Among food items for instance, we may identify broad categories like meat and vegetables, as well as sub-categories such as types of vegetables.

The question is, how do we derive the variables that best differentiate items?

So, the first step to answer this question is Principal Component Analysis.

A principal component can be expressed by one or more existing variables. For example, we may use a single variable, vitamin C, to differentiate food items. Because vitamin C is present in vegetables but absent in meat, the resulting plot (below, left) will differentiate vegetables from meat, but the meat items will be clumped together.

To spread the meat items out, we can use fat content in addition to vitamin C levels, since fat is present in meat but absent in vegetables. However, fat and vitamin C levels are measured in different units. So to combine the two variables, we first have to normalize them, meaning to shift them onto a uniform standard scale, which allows us to calculate a new variable: vitamin C minus fat. Combining the two variables helps to spread out both vegetable and meat items.

The spread can be further improved by adding fiber, of which vegetable items have varying levels. This new variable, (vitamin C + fiber) minus fat, achieves the best data spread yet.
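To make the food example concrete, here is a small sketch that normalizes the three variables and combines them by hand. The nutrient values are invented, and the equal weights (+1, +1, -1) are the hand-picked ones from the example; a real PCA would work out the weights from the data.

```python
import numpy as np

# Invented nutrient values for six foods (three vegetables, three meats).
foods     = ["spinach", "broccoli", "carrot", "beef", "pork", "chicken"]
vitamin_c = np.array([28.0, 89.0, 6.0, 0.0, 0.0, 0.0])
fiber     = np.array([2.2, 2.6, 2.8, 0.0, 0.0, 0.0])
fat       = np.array([0.4, 0.4, 0.2, 15.0, 21.0, 3.6])

def zscore(x):
    """Shift to mean 0 and scale to standard deviation 1."""
    return (x - x.mean()) / x.std()

# New variable: (vitamin C + fiber) minus fat, all on the same scale.
combined = zscore(vitamin_c) + zscore(fiber) - zscore(fat)
for name, value in zip(foods, combined):
    print(f"{name:>8}: {value:+.2f}")
```

On these made-up numbers the vegetables all land on the positive side and the meats on the negative side, and the items within each group get different values instead of clumping together.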


So, that’s my explanation of Principal Component Analysis. Let me apply it to this dataset and show how it works.

Explained Variance Measure

I will be using a particular measure called Explained Variance which will be useful in this context to help us determine the number of PCA projection components we should be looking at.

Before calculating the explained variance, we need the eigenvectors and eigenvalues. The eigenvectors and eigenvalues of a covariance (or correlation) matrix represent the “core” of a PCA: the eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, each eigenvalue is the variance of the data along the corresponding new feature axis.
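A minimal NumPy sketch of this step, on synthetic standardized data rather than the movie data. Using eigh (the routine for symmetric matrices) is my choice here, not necessarily what the original code does.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 rows, 3 features, the first two strongly correlated.
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.3 * rng.normal(size=200), rng.normal(size=200)])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each column

cov = np.cov(X, rowvar=False)              # 3 x 3 covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)   # columns of eig_vecs are eigenvectors

# eigh returns eigenvalues in ascending order; sort the pairs from largest to smallest.
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
print(eig_vals.round(3))
```

The eigenvalues add up to the total variance (the trace of the covariance matrix), which is what makes "share of variance per component" a meaningful quantity.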



After sorting the eigenpairs, the next question is “how many principal components are we going to choose for our new feature subspace?”. The explained variance tells us how much information (variance) can be attributed to each of the principal components.


From the plot above, it can be seen that approximately 90% of the variance can be explained by the first 9 principal components. Therefore, for the purposes of this post, let’s implement PCA with 9 components (although, to make sure we are not excluding useful information, one should really aim for a variance level of 95% or greater, which corresponds to about 12 components).
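The same reasoning in code, as a sketch on synthetic data (so the resulting count is not the 9 from the movie data). It uses scikit-learn's PCA, whose explained_variance_ratio_ attribute is each eigenvalue divided by the sum of all eigenvalues.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 300 rows, 6 features with very different spreads.
X = rng.normal(size=(300, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.2])

pca = PCA().fit(X)                        # keep all components
var_exp = pca.explained_variance_ratio_   # share of variance per component
cum_var_exp = np.cumsum(var_exp)          # running total

# Smallest number of components whose cumulative share reaches 90%.
n_components = int(np.argmax(cum_var_exp >= 0.90)) + 1
print(cum_var_exp.round(3), n_components)
```

scikit-learn can also pick the count for you: PCA(n_components=0.90, svd_solver="full") keeps just enough components to pass the 90% level.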


There do not seem to be any discernible clusters. However, keeping in mind that our PCA projections contain another 7 components, looking at plots of the other components may be fruitful. For now, let us try a 3-cluster KMeans (just as a naive guess) to see if we are able to visualize any distinct clusters.

4. Visualization with K-means clustering


This KMeans plot looks more promising: if our simple assumption of three clusters turns out to be right, we can observe 3 distinguishable clusters via this color scheme. However, I would also like to generate a KMeans visualization for other combinations of the projections against one another. I will use Seaborn’s convenient pairplot function to do the job. Basically, pairplot automatically plots all the features in the dataframe (in this case our PCA-projected movie data) in a pairwise manner. I will pairplot the first 3 projections against one another, and the resultant plot is given below:
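A hedged sketch of the PCA-then-KMeans pipeline with scikit-learn. The input is synthetic (three well-separated blobs in 10 dimensions, standing in for the standardized movie data), and the variable names are mine.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the movie data: three blobs in 10 dimensions.
centers = rng.normal(scale=6.0, size=(3, 10))
X = np.vstack([c + rng.normal(size=(100, 10)) for c in centers])
X_std = StandardScaler().fit_transform(X)

# Project onto 9 principal components, the count chosen in the post.
x_9d = PCA(n_components=9).fit_transform(X_std)

# Naive guess of 3 clusters; each row gets a label 0, 1 or 2.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(x_9d)
print(np.bincount(labels))
```

Coloring a scatter plot of any two projections by labels gives the kind of picture described above; putting the first three projections and the labels into a dataframe is what a pairplot would need.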




Random Forest with Pokemon dataset (with correlation plot) with R

I was so pleased that so many people read my first statistical blog post: “which u.s. state does produce the most beer?”. 

So, I decided to write another one. For this blog post, I worked on the Pokemon dataset from Kaggle.

Before talking about random forest, which is one of the most popular machine learning methods, I will begin with a correlation plot. It’s like an appetizer before the main course.


Brief Idea of the dataset

Last time, I felt bad that I forgot to give a brief idea of the data. From now on, I will make sure to show what the data looks like in each post.

Here is a snapshot of the head of the dataset. It has 13 variables in total and 800 Pokemon.


  • Number: ID for each pokemon
  • Name: Name of each pokemon
  • Type 1: Each pokemon has a type, this determines weakness/resistance to attacks
  • Type 2: Some pokemon are dual type and have 2
  • Total: sum of all stats that come after this, a general guide to how strong a pokemon is
  • HP: hit points, or health, defines how much damage a pokemon can withstand before fainting
  • Attack: the base modifier for normal attacks (eg. Scratch, Punch)
  • Defense: the base damage resistance against normal attacks
  • SP Atk: special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)
  • SP Def: the base damage resistance against special attacks
  • Speed: determines which pokemon attacks first each round


Correlation Plot

R has a “corrplot” package, and it offers high-quality visualization for correlation analysis.

Since correlation analysis is for numerical variables, let’s separate variables into categorical and numerical.





Simple, isn’t it? Next time you have a simple statistical report to submit for homework or work, you can load this library and just call the “corrplot” function. It looks like the Total variable (the dependent variable) is quite strongly correlated with the rest of the variables. Interestingly, Speed is barely correlated with Defense but slightly correlated with special defense.


Random Forest

I know this concept might be new to many readers of this post, so let me try to explain it as simply as possible.


<Image: decision tree from https://www.edureka.co/blog/decision-trees/>

I think many of you have seen a tree like this. It is called a decision tree. If you understand it, you are halfway to understanding random forest. There are two keywords here: random and forest. A random forest is a collection of many decision trees. Instead of relying on a single decision tree, you build many of them, say 100. And you know what a collection of trees is called: a forest. To improve accuracy, the trees are randomized: each one is grown on a random sample of the data and considers only a random subset of the features, and the forest combines their predictions.
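The post itself uses R, but the "collection of trees" idea is easy to check in a few lines of Python. This is a sketch with scikit-learn on synthetic data; the four made-up stats and the choice of max_features=2 are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 4))             # four made-up stats
y = X.sum(axis=1) + rng.normal(0, 5, size=200)     # target: their noisy total

# 100 trees, each split considering only 2 randomly chosen features.
forest = RandomForestRegressor(n_estimators=100, max_features=2, random_state=0)
forest.fit(X, y)

# For regression, the forest's prediction is the average of its trees' predictions.
tree_preds = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
print(len(forest.estimators_))
print(np.allclose(tree_preds.mean(axis=0), forest.predict(X[:5])))
```

Averaging many randomized trees is what smooths out the mistakes any single tree would make.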


Let’s begin!


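Step 1:  Divide the dataset into a training set and a test set (for validation).

The first step is to randomly split the rows of the dataset into two parts. (Random feature selection, where the forest picks “k” features out of the total “m” at each split, comes later and is handled by the model itself.)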


In this example, I randomly assigned 70% of the data to the “training” set and the remaining 30% to the “test” set. Strictly speaking, a single split like this is called hold-out validation; “Cross Validation” repeats the same idea over several different splits.
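A Python sketch of a 70/30 random split using scikit-learn's train_test_split (the post does this in R). The arrays are placeholders with 800 rows, matching the size of the Pokemon data but not its contents.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(800 * 3).reshape(800, 3)   # placeholder features, 800 rows
y = np.arange(800)                       # placeholder target (row ids)

# Randomly send 70% of the rows to training and the rest to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)
print(len(X_train), len(X_test))
```

Fixing random_state makes the split reproducible, which helps when comparing models later.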

Then, what is Cross Validation?

<Image: Cross Validation from https://www.edureka.co/blog/implementation-of-decision-tree>

This image is quite self-explanatory, but let me elaborate. The example shows 5-fold cross validation. The data is divided into five folds, and each fold serves as the test set exactly once, so no two test sets overlap. A measurement metric (e.g. Mean Absolute Error or Root Mean Squared Error) is computed for each fold, and the final measure of performance is the average over the folds. This way the result depends much less on one particular random split.
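Here is a short Python sketch of 5-fold cross validation with scikit-learn's KFold. The data is synthetic, and LinearRegression is only a stand-in model to keep the example quick; any model with fit and predict would slot in the same way.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_errors, seen = [], []
for train_idx, test_idx in kf.split(X):
    # Train on four folds, measure the error on the held-out fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_errors.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))
    seen.extend(test_idx.tolist())

print(np.mean(fold_errors))   # performance averaged over the 5 folds
```

Every row index ends up in seen exactly once, which is the "test sets do not overlap" property in practice.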

Step 2:  Build the random forest model


Luckily, R is open source, so there are a lot of packages that make people’s lives easier. The randomForest package provides the randomForest function, which makes building a random forest model easy. After building the model on the training set, test its predictions on the test set.


Step 3: Variable Importance


After building the random forest model, you can examine variable importance for the model. Again, I’m using ggplot to create nicer looking graphs.  We can see that generation is the least important while special defense is the most significant in the model.
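For comparison, here is how variable importance can be read off a forest in Python. This is a sketch with scikit-learn on synthetic data; its impurity-based feature_importances_ is related to, but not the same number as, the importance measures reported by R's randomForest, and the three columns are invented.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "attack": rng.uniform(0, 100, 300),
    "defense": rng.uniform(0, 100, 300),
    "generation": rng.integers(1, 7, 300),   # unrelated to the target below
})
# Synthetic target that depends strongly on attack and weakly on defense.
y = 3 * X["attack"] + X["defense"] + rng.normal(0, 5, 300)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importance = pd.Series(forest.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))
```

The importances sum to 1, so each value can be read as that variable's share of the model's total impurity reduction.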

Step 4: Examine how the model is performing


I’m using two measurements here: R-squared and MSE. R-squared is the proportion of the variability in the response that the model explains, and MSE (mean squared error) is the average of the squared differences between the predicted and the actual values. In short, the higher the R-squared and the lower the MSE, the better the model.
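Both measures are simple enough to compute by hand. A small Python sketch with four made-up actual and predicted values (not the Pokemon results):

```python
import numpy as np

actual    = np.array([300.0, 450.0, 520.0, 610.0])
predicted = np.array([310.0, 440.0, 500.0, 630.0])

# MSE: average of the squared prediction errors.
mse = np.mean((actual - predicted) ** 2)

# R-squared: 1 minus (unexplained variation / total variation around the mean).
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(mse)                  # 250.0
print(round(r_squared, 4))  # 0.9805
```

Note that MSE is in squared units of the response, so its size only means something relative to the scale of the values being predicted.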



As a result, R squared is 0.93 and MSE is 994.81. It means that the model explains 93% of the variability of the response data around its mean.

I hope you guys enjoyed reading this!

Data Source: https://www.kaggle.com/abcsds/pokemon