Random Forest with Pokemon dataset (with correlation plot) with R

I was so pleased that so many people read my first statistical blog post, “Which U.S. State does produce the most beer?”.

So, I decided to write another one. For this blog post, I worked with the Pokemon dataset from Kaggle.

Before talking about random forest, which is one of the most popular machine learning methods, I will begin with a correlation plot. It’s like an appetizer before the main course.


Brief Idea of the dataset

Last time, I felt bad that I forgot to give a brief idea of the data. From now on, I will make sure to show what the data looks like in each post.

Here is a snapshot of the head of the dataset. The dataset has 13 variables in total and 800 Pokemon.

Screen Shot 2017-08-13 at 3.35.56 PM

  • Number: ID for each pokemon
  • Name: Name of each pokemon
  • Type 1: Each pokemon has a type, this determines weakness/resistance to attacks
  • Type 2: Some pokemon are dual type and have 2
  • Total: sum of all stats that come after this, a general guide to how strong a pokemon is
  • HP: hit points, or health, defines how much damage a pokemon can withstand before fainting
  • Attack: the base modifier for normal attacks (e.g. Scratch, Punch)
  • Defense: the base damage resistance against normal attacks
  • SP Atk: special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)
  • SP Def: the base damage resistance against special attacks
  • Speed: determines which pokemon attacks first each round


Correlation Plot

R has a “corrplot” package that offers high-quality visualization for correlation analysis.

Since correlation analysis is for numerical variables, let’s separate the variables into categorical and numerical ones.

Screen Shot 2017-08-13 at 3.41.31 PM

Screen Shot 2017-08-13 at 3.42.05 PM




Simple, isn’t it? Now when you have a simple statistical report to submit for homework or work, you can load this library and simply call the “corrplot” function. It looks like the Total variable (the dependent variable) is quite correlated with the rest of the variables, which makes sense because Total is the sum of the individual stats. Interestingly, speed is barely related to defense but slightly related to special defense.
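Since the code above only survives as screenshots, here is a minimal sketch of the same idea. The data frame is a made-up stand-in for the Pokemon data (the column names and values are my assumptions, not the real file), and the corrplot call only runs if the package is installed.

```r
# Toy stand-in for the Pokemon data; names and values are made up.
set.seed(1)
pokemon <- data.frame(
  Name    = paste0("mon", 1:50),
  Type1   = sample(c("Fire", "Water", "Grass"), 50, replace = TRUE),
  HP      = round(runif(50, 20, 120)),
  Attack  = round(runif(50, 20, 120)),
  Defense = round(runif(50, 20, 120)),
  Speed   = round(runif(50, 20, 120)),
  stringsAsFactors = FALSE
)
pokemon$Total <- with(pokemon, HP + Attack + Defense + Speed)

# Split the columns by type: only numeric ones go into the correlation.
is_num      <- sapply(pokemon, is.numeric)
numeric_df  <- pokemon[, is_num]
category_df <- pokemon[, !is_num]

# Pearson correlation matrix of the numeric columns.
cor_mat <- cor(numeric_df)
print(round(cor_mat, 2))

# Draw the plot only if the corrplot package is available.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "circle")
}
```

Because Total is built as a sum of the other columns in this toy data, it should show up as positively correlated with each of them, just like in the real plot.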


Random Forest

I know this concept might be new to many readers of this post, but let me try to explain it as simply as possible.


<Image: decision tree from https://www.edureka.co/blog/decision-trees/>

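I think many of you have seen a tree like this. It is called a decision tree. If you understand it, you are halfway to understanding random forest. There are two keywords here: random and forest. A random forest is a collection of many decision trees. Instead of relying on a single decision tree, you build many of them, say 100, and combine their predictions (an average for regression, a majority vote for classification). And you know what a collection of trees is called: a forest. The “random” part is what buys the extra accuracy: each tree is trained on a random sample of the rows and considers only a random subset of the features at each split, so the trees make different mistakes that tend to cancel out.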


Let’s begin!


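Step 1: Divide the dataset into a training set and a test set (for validation).

Within the random forest algorithm itself, the first step is to randomly select “k” features from the total “m” features. Before fitting anything, though, the data has to be split.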

Screen Shot 2017-08-13 at 3.59.07 PM

In this example, I randomly assigned 70% of the data to the “training” set and the remaining 30% to the “test” set. Strictly speaking, a single split like this is a hold-out validation; “Cross Validation” repeats the same idea over several splits.
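A 70/30 split like this can be sketched in base R as follows. The data frame here is a made-up placeholder with 800 rows (matching the dataset size), and the seed is only there to make the example repeatable.

```r
set.seed(42)

# Placeholder data: 800 rows with a made-up Total column.
n <- 800
pokemon <- data.frame(id = seq_len(n), Total = round(runif(n, 180, 780)))

# Randomly pick 70% of the row numbers for training.
train_idx <- sample(seq_len(n), size = round(0.7 * n))

train <- pokemon[train_idx, ]   # 70% of the rows
test  <- pokemon[-train_idx, ]  # the remaining 30%

print(c(train = nrow(train), test = nrow(test)))
```

Using negative indices for the test set guarantees that no row lands in both sets.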

Then, what is Cross Validation?

<Image: Cross Validation from https://www.edureka.co/blog/implementation-of-decision-tree>

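This image is quite self-explanatory, but let me elaborate. The example is 5-fold cross validation. The data is divided into five parts (folds); each fold takes one turn as the test set while the other four are used for training, so the test sets never overlap. A metric (e.g. Mean Absolute Error, Root Mean Squared Error) is computed on each fold, and the final measure of performance is the average over the folds. Because every observation is used for testing exactly once, the estimate depends less on one lucky or unlucky split.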
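To make the fold idea concrete, here is a small 5-fold sketch in base R. I am using a plain linear model (lm) on made-up data as a stand-in, simply because it needs no extra package; the same loop works with any model that has a predict method.

```r
set.seed(7)

# Made-up data: y depends linearly on x plus some noise.
n <- 100
toy <- data.frame(x = runif(n))
toy$y <- 3 * toy$x + rnorm(n, sd = 0.2)

# Assign every row to one of k folds at random (20 rows per fold here).
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: train on the other folds, test on this one, record RMSE.
rmse_per_fold <- sapply(seq_len(k), function(i) {
  fit  <- lm(y ~ x, data = toy[fold != i, ])
  pred <- predict(fit, newdata = toy[fold == i, ])
  sqrt(mean((toy$y[fold == i] - pred)^2))
})

# The cross-validated score is the average over the folds.
cv_rmse <- mean(rmse_per_fold)
print(rmse_per_fold)
print(cv_rmse)
```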

Step 2:  Build the random forest model

Screen Shot 2017-08-13 at 3.59.42 PM

Screen Shot 2017-08-13 at 3.59.46 PM

Luckily, R is open source, so there are a lot of packages that make people’s lives easier. The randomForest package provides the randomForest function, which makes building a random forest model very easy. After building the model on the training set, test its predictions on the test set.
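Here is a hedged sketch of what that fit-then-predict step can look like. The data is synthetic, the formula and the ntree value are my choices rather than the settings in the screenshots, and the model part only runs if the randomForest package is installed.

```r
set.seed(123)

# Synthetic stand-in: Total is roughly the sum of three made-up stats.
n <- 200
toy <- data.frame(
  HP     = runif(n, 20, 120),
  Attack = runif(n, 20, 120),
  Speed  = runif(n, 20, 120)
)
toy$Total <- toy$HP + toy$Attack + toy$Speed + rnorm(n, sd = 5)

# 70/30 split into training and test sets.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

if (requireNamespace("randomForest", quietly = TRUE)) {
  # Fit a regression forest of 100 trees on the training set.
  rf_fit <- randomForest::randomForest(Total ~ ., data = train, ntree = 100)

  # Predict Total for the rows the model has never seen.
  rf_pred <- predict(rf_fit, newdata = test)
  print(head(rf_pred))
} else {
  message("randomForest is not installed; skipping the model fit.")
}
```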


Step 3: Variable Importance

Screen Shot 2017-08-13 at 4.00.03 PM

Screen Shot 2017-08-13 at 3.23.07 PM

After building the random forest model, you can examine the variable importance for the model. Again, I’m using ggplot to create nicer-looking graphs. We can see that generation is the least important variable while special defense is the most important one in the model.
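A minimal sketch of pulling the importance scores out of a fitted forest is below. The data is synthetic (the Noise column is unrelated to the response by construction), and I print a sorted table instead of reproducing the ggplot chart. The code needs the randomForest package and is skipped otherwise.

```r
set.seed(2017)

# Synthetic data: Total depends on Attack and Speed, not on Noise.
n <- 300
toy <- data.frame(
  Attack = runif(n, 20, 150),
  Speed  = runif(n, 20, 150),
  Noise  = runif(n)
)
toy$Total <- 2 * toy$Attack + toy$Speed + rnorm(n, sd = 5)

if (requireNamespace("randomForest", quietly = TRUE)) {
  # importance = TRUE asks the forest to record permutation importance.
  rf_fit <- randomForest::randomForest(Total ~ ., data = toy,
                                       ntree = 200, importance = TRUE)

  # type = 1 returns the permutation measure (%IncMSE for regression).
  imp <- randomForest::importance(rf_fit, type = 1)

  # Tidy into a data frame sorted from most to least important.
  imp_df <- data.frame(variable = rownames(imp), inc_mse = imp[, 1],
                       row.names = NULL)
  imp_df <- imp_df[order(imp_df$inc_mse, decreasing = TRUE), ]
  print(imp_df)
}
```

A data frame in this shape can be handed to ggplot for a bar chart.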

Step 4: Examine how the model is performing

Screen Shot 2017-08-13 at 4.18.55 PM

Screen Shot 2017-08-13 at 4.21.10 PM

I’m using two measurements here: R-squared and MSE. R-squared indicates how close the data are to the fitted values, expressed as the share of the variance in the response that the model explains. MSE (mean squared error) is the average squared difference between the predicted and the actual values. In short, a higher R-squared and a lower MSE make a better model.


Screen Shot 2017-08-13 at 4.22.15 PM

As a result, R-squared is 0.93 and MSE is 994.81. It means that the model explains 93% of the variability of the response data around its mean, and that a typical prediction error is about 31.5 points (the square root of the MSE).
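For anyone who wants to compute the two numbers by hand, here is a small sketch. The helper names and the four example values are made up, and R-squared is computed with the usual 1 - SSE/SST formula, which may differ slightly from how the screenshots did it.

```r
# Mean squared error: average of the squared prediction errors.
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

# R-squared: 1 minus (residual sum of squares / total sum of squares).
r_squared <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - mean(actual))^2)
  1 - sse / sst
}

# Made-up actual and predicted Total values.
actual    <- c(300, 450, 520, 600)
predicted <- c(310, 440, 500, 610)

print(mse(actual, predicted))        # (100 + 100 + 400 + 100) / 4 = 175
print(r_squared(actual, predicted))  # 1 - 700 / 48675
```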

I hope you guys enjoyed reading this!

Data Source: https://www.kaggle.com/abcsds/pokemon

Which U.S. State does produce the most beer?/R using ggplot package with map

Well, I was trying not to talk about what I do for a living on this blog, as I initially wanted it to be a space outside of my work. But loving statistics even outside of work hours is part of me, so I decided to write my first statistics blog post.

Today, I finished the work I was supposed to do a little earlier than expected, so I had more than an hour of free time. At the same time, I was craving a pint of beer, but I did not want to risk holding a can of beer in the workplace. What I did to combat this thirst for beer was look into the craft beer dataset on Kaggle. I knew the dataset had been there for more than 6 months. I had been quite interested in the data but had never taken action to analyze it. Today, I was like “why not?” and dug into it.

As a fan of going to art museums, I find visualizing to be one of my favorite parts of R programming. You know, it’s human nature to be attracted to a beautiful facade.

Based on the craft beer data, I would like to show how to make pretty maps using the ggplot2 package.


Step 1: Read the two datasets (you can find the data in the source link at the bottom). Load the ‘maps’ and ‘ggplot2’ packages.

Screen Shot 2017-08-10 at 8.33.16 PM

Step 2: The first column of the brewery dataset is just named “X”. Let’s change the column name to “brewery_id”, just like in the beer dataset. This is how you can change a column name.

Screen Shot 2017-08-10 at 8.35.30 PM

Since “brewery_id” is now a common variable, we can merge the two datasets into one using the merge function. In this case, we are merging the datasets by brewery id.
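The rename-and-merge step can be sketched like this. The two small data frames are made up to mimic the shape of the real files (only “X”, “brewery_id” and “abv” come from the post; the other column names are my assumptions).

```r
# Made-up brewery table whose id column came in with the name "X".
breweries <- data.frame(
  X       = c(0, 1, 2),
  brewery = c("A Brewing", "B Brewing", "C Brewing"),
  state   = c("OR", "CO", "CA"),
  stringsAsFactors = FALSE
)

# Made-up beer table that already uses "brewery_id".
beers <- data.frame(
  brewery_id = c(0, 0, 1, 2),
  beer       = c("Pale", "Stout", "IPA", "Lager"),
  abv        = c(0.045, 0.060, 0.070, 0.050),
  stringsAsFactors = FALSE
)

# Rename the column "X" to "brewery_id".
names(breweries)[names(breweries) == "X"] <- "brewery_id"

# Join the two tables on the shared key.
beer.data <- merge(beers, breweries, by = "brewery_id")
print(beer.data)
```

Every beer row picks up the state of its brewery, which is what the map needs later.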

Step 3: For this example, I would like to show a map for each abv (alcohol by volume) level. In other words, I will show the frequency of each abv level on the U.S. map. Generally, the average abv level for beer is 4.5% (from: https://www.livescience.com/32735-how-much-alcohol-is-in-my-drink.html).

Screen Shot 2017-08-10 at 8.39.17 PM

To see a distribution, a histogram is the best tool. This is the way to draw a histogram using the ggplot package. “beer.data$abv” is the variable whose distribution we would like to see, and col = "blue" sets the line color.
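As a rough sketch of such a histogram (not necessarily the exact call in the screenshot), the code below uses made-up abv values and ggplot2’s geom_histogram. The bin width and the white fill are my choices, and there is a base R fallback in case ggplot2 is not installed.

```r
# Made-up abv values stored as fractions (0.045 means 4.5%).
set.seed(10)
beer.data <- data.frame(abv = round(runif(200, 0.03, 0.10), 3))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # colour sets the outline of the bars, fill sets their inside.
  p <- ggplot(beer.data, aes(x = abv)) +
    geom_histogram(binwidth = 0.005, colour = "blue", fill = "white") +
    labs(x = "abv", y = "count")
  print(p)
} else {
  # Base R fallback when ggplot2 is not available.
  hist(beer.data$abv, breaks = 20, border = "blue")
}
```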

Screen Shot 2017-08-10 at 8.50.14 PM

<Image: beer histogram>

The highest number of observations is around 0.045. Based on this, I classified abv levels lower than 4.5% as “low”. Most of the data points are concentrated between 4.5% and 6%, so I defined that range as “medium”. There are quite a few data points over 6%, so I defined 6%-8% as “high”. Abv levels higher than 8% I named “very high”.

Step 4: We want to create a U.S. map, so let’s make a frequency table by state. Don’t forget to convert the state variable to a factor. R has a built-in vector, “state.abb”, with the 50 state abbreviations, but it does not include “DC”. DC is the heart of the US and there are people living there, so let’s include it, too.
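One way to sketch this classification is with base R’s cut function. The abv values are made up, and how the exact boundary values (4.5%, 6%, 8%) are assigned is my assumption: here each boundary goes to the higher class.

```r
# Made-up abv values as fractions.
abv <- c(0.035, 0.045, 0.050, 0.060, 0.075, 0.080, 0.099)

# right = FALSE makes the intervals closed on the left: [0.045, 0.06), etc.
abv_level <- cut(
  abv,
  breaks = c(-Inf, 0.045, 0.06, 0.08, Inf),
  labels = c("low", "medium", "high", "very high"),
  right  = FALSE
)

print(data.frame(abv, abv_level))
print(table(abv_level))
```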
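Here is a small sketch of that frequency table. The vector of states is made up; the point is that setting the factor levels to state.abb plus “DC” keeps states with zero beers in the table.

```r
# The 50 built-in abbreviations plus DC.
all_states <- c(state.abb, "DC")

# Made-up state column, one entry per beer.
beer_states <- c("OR", "CO", "CA", "CO", "DC")

# Fixing the levels up front keeps states that never appear (count 0).
state_f <- factor(beer_states, levels = all_states)

# One row per state with its count in the Freq column.
freq <- as.data.frame(table(state = state_f))
print(head(freq))
print(freq[freq$Freq > 0, ])
```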

Screen Shot 2017-08-10 at 8.54.56 PM

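Step 5: The last line in this picture brings in the map background. The map_data function comes from the ggplot2 package and pulls its outlines from the maps package. The tolower function converts text to lowercase, which is needed because the map data uses lowercase state names.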
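To show why tolower matters, here is a sketch that builds a lookup from state abbreviations to the lowercase names the map data uses. The lookup table is my own helper, not something from the screenshots, and the map_data call only runs if ggplot2 and maps are installed.

```r
# Lookup from abbreviation to the lowercase name used as "region" in the map.
lookup <- data.frame(
  state  = c(state.abb, "DC"),
  region = tolower(c(state.name, "District of Columbia")),
  stringsAsFactors = FALSE
)
print(head(lookup))

if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("maps", quietly = TRUE)) {
  # Outline of the lower 48 states plus DC, one row per polygon point.
  states_map <- ggplot2::map_data("state")
  print(head(states_map))

  # Region names in the map that the lookup can be joined against.
  print(head(intersect(lookup$region, states_map$region)))
}
```

Note that the "state" map covers the contiguous states and DC, so Alaska and Hawaii will not be drawn.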

Screen Shot 2017-08-10 at 8.59.28 PM

Step 6: Now comes a much more complicated ggplot with the map. FYI, I set the limits with longitude for x and latitude for y. labs is a great function for setting the x-axis label, the y-axis label and the title. We want to see the frequency for each state, so fill=freq.
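The screenshots are gone, so the following is only a sketch of one common way to draw such a map with ggplot2’s geom_map, not necessarily the code I used back then. The frequencies are made up, and the plot is drawn only if ggplot2 and maps are installed.

```r
# Made-up frequencies keyed by lowercase state name.
state_freq <- data.frame(
  region = c("oregon", "california", "colorado", "texas"),
  freq   = c(12, 30, 25, 9),
  stringsAsFactors = FALSE
)

if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("maps", quietly = TRUE)) {
  library(ggplot2)
  states_map <- map_data("state")

  p <- ggplot(state_freq, aes(map_id = region)) +
    # Fill each state polygon by its frequency.
    geom_map(aes(fill = freq), map = states_map, colour = "white") +
    # Set the limits: longitude on x, latitude on y.
    expand_limits(x = states_map$long, y = states_map$lat) +
    labs(x = "Longitude", y = "Latitude", title = "Beers per state (toy data)")
  print(p)
}
```

States missing from the data frame are simply not drawn, so in practice you would want one row per state, including the zeros.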

Screen Shot 2017-08-10 at 9.02.13 PM

Screen Shot 2017-08-10 at 9.03.07 PM

As a result,


<Images: low alcohol, medium alcohol, high alcohol, very high>

We can see that West Coast people love lower-abv beer. Look at Oregon and California. They definitely produce more low-abv beers than the East Coast. From medium to very high, Colorado produces the highest number of beers, but California is consistently above average as well. Other notable producers appear to be the four states on Lake Michigan, and Texas. Utah makes a large showing for low-alcohol beers (no wonder). The East Coast looks pretty reserved on these maps.

Source: https://www.kaggle.com/nickhould/craft-cans