[R][Visualization] Radar Plot with Scotch whisky data

When I had an Islay single malt for the first time, it was mind-blowing. In my first foray into the world of whiskies, I took the plunge into the smokiest, peatiest beast of them all: Laphroaig. The smell wasn't pleasant at first, but everything changed when I took a sip from the glass. That same night, dreams of owning a smoker were replaced by the desire to roam the landscape of smoky single malts. What draws me in even more is how the same whisky can taste different as it ages.

As a relatively new scotch whisky fan, I wanted to investigate whether distilleries within a given region do in fact share taste characteristics. For this, I used a dataset profiling 86 distilleries based on 12 flavor categories.

The dataset comes from a professor in the Mathematics & Statistics department at the University of Strathclyde who seems passionate about whisky: it profiles 86 Scotch whisky distilleries. Alongside 12 taste categories, it records each distillery's postcode, latitude, and longitude. In this post, I will focus on creating a radar plot for each whisky to show its range of tastes.

Data Description

Instead of downloading the file, I just read the data directly from the website.


Previewing the data:

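A minimal sketch of those two steps (the URL is the dataset's published home on the Strathclyde site, so treat it, and the column layout, as assumptions):

```r
# Read the distillery profiles straight from the web; the first column is a row ID
whiskies <- read.csv("https://outreach.mathstat.strath.ac.uk/outreach/nessie/datasets/whiskies.txt",
                     row.names = 1)

# Preview the first few rows
head(whiskies)
```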

The 86 malt whiskies are scored from 0 to 4 in 12 different taste categories, including sweetness, smokiness, and nuttiness.

Then I subsetted the data, excluding the columns that are unnecessary for this post.

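A sketch of that subsetting step (the flavor column names below follow the published file and should be checked against it):

```r
# Keep the distillery name plus the 12 flavor scores; drop postcode and coordinates
taste_cols <- c("Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
                "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral")
whiskies_sub <- whiskies[, c("Distillery", taste_cols)]
```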

Required Libraries


Here are the libraries we need. For this post, the ggRadar function is the main one, and it comes from the ggiraphExtra package.
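A plausible setup (the screenshot may have loaded more, but these two are what the plot itself needs):

```r
library(ggplot2)       # ggRadar returns a ggplot object
library(ggiraphExtra)  # provides ggRadar()
```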

Main Code


For this example, I just selected whiskies I know; you can use R's sample function if you want to see random whisky taste plots. As you can see, the ggRadar call is pretty straightforward. Since we want a taste profile for each distillery, set group = Distillery.
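A sketch of the main call (the distillery picks are mine, and the spellings must match the data file exactly):

```r
# A few familiar distilleries; sample() would give a random selection instead
picks <- subset(whiskies_sub,
                Distillery %in% c("Laphroig", "Lagavulin", "Glenfiddich", "Macallan"))

# One colored polygon per distillery, with the 12 flavor scores as spokes
ggRadar(data = picks, aes(group = Distillery),
        rescale = FALSE)  # the scores already share a common 0-4 scale
```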

 

Ta-da!

[Radar plots: flavor profiles of the selected distilleries]

Now, with this code, we passionate whisky explorers can easily identify a whisky's flavors and explore different kinds!


[R] Google Map Visualization

Hello! For this post, I will show how to visualize spatial data on Google Maps using R. It is simpler than you might think.

What is Spatial Data?

It is data that identifies the geographic location of features and boundaries. The dataset I'm using today has a longitude variable and a latitude variable, so we can locate the data points accurately on the map.

Now that you roughly know what spatial data is, let's jump into the map visualization.

First, install and load the libraries


For ggmap, you need the ggplot2 package.

The ggmap package can fetch Google Maps imagery, so we can bring up the map of any city we want.
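A minimal sketch (note: current ggmap releases require a Google Maps API key via register_google(), which this 2017 post predates):

```r
library(ggplot2)  # ggmap builds on ggplot2
library(ggmap)
# ggmap::register_google(key = "YOUR_KEY")  # needed on newer ggmap versions
```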

Second, call the Google map image

For example, say I want to see a Google map of London. In this case, I can simply use the qmap command from ggmap and set the location to London.

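Roughly:

```r
# Fetch and plot a Google map centered on London
qmap(location = "London")
```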

[Google map of London]

Then you get a nice image of the London Google map.

 

But the data I'm using is about crimes in Houston, so let's switch the location to Houston instead.

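A sketch of that switch; conveniently, the Houston crime data ships with ggmap itself:

```r
data(crime, package = "ggmap")  # Houston crimes, January to August 2010

houston <- qmap(location = "Houston", zoom = 11)
houston
```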

Using the names command, we can get an overview of the variables in the data.

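For example:

```r
names(crime)  # includes "lon" and "lat" along with the offense details
```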

For spatial data, as I mentioned in the first paragraph, “lon” and “lat” variables are necessary.

Using the dim command, we can get the number of rows and columns. From January 2010 to August 2010 in Houston, there were 86,314 recorded crimes. Quite extraordinary!

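A one-liner:

```r
dim(crime)  # rows = individual recorded crimes, columns = variables
```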

Point Data Visualization


We can simply use geom_point from the ggplot2 package to build the point map. In this case, I wanted to see the frequencies of the different types of crimes.
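A sketch of the overlay (lon, lat, and offense are columns of the shipped crime data; the alpha and size values are guesses at the post's look):

```r
houston +
  geom_point(data = crime,
             aes(x = lon, y = lat, colour = offense),
             alpha = 0.3, size = 1)
```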

[Point map of Houston crimes, colored by offense type]

Pink is clearly dominant, indicating that theft was the most common crime in Houston from January to August 2010. The second most frequent crime is burglary (the colors are easy to confuse; I just hope it's not murder). Auto theft occurred only occasionally.

 

Heat Map

If you want to see the density and frequency of the crimes, a heat map is more effective.


In this case, we can use stat_density2d for this kind of visualization.
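A sketch using the usual polygon-density idiom (the yellow-to-red palette is a guess at the post's look):

```r
houston +
  stat_density2d(data = crime,
                 aes(x = lon, y = lat,
                     fill = after_stat(level), alpha = after_stat(level)),
                 geom = "polygon") +
  scale_fill_gradient(low = "yellow", high = "red") +
  guides(alpha = "none")  # hide the redundant alpha legend
```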

[Heat map of Houston crime density]

From this heat map, we can see which areas are the most crime-ridden. Luckily, the campus areas are relatively safe. The heat map also agrees with the point map: the spots dense with points in the first map show up red here, and the red area is the heart of downtown Houston. I hope things have improved since then, but it looks like we'd better be careful around downtown Houston.


[Python] Principal Component Analysis and K-means clustering with IMDB movie datasets

Hello! Today's post is the first where I present results in Python. Although I love R and I'm loyal to it, Python is widely loved by many data scientists. It is quite easy to learn and has a lot of great functions.

In this post, I implement two unsupervised learning methods: 1. Principal Component Analysis and 2. K-means clustering. A reader with no background in machine learning might think, "What the hell is unsupervised learning?" I will try my best to explain the concept.

Unsupervised Learning

OK, imagine you are going backpacking in a new country. Isn't it exciting? But you don't know much about the country: its food, culture, language, and so on. From day one, though, you start making sense of the place, learning to eat new cuisines (including what not to eat) and finding your way to that beach.

In this example, you have lots of information but initially no idea what to do with it. There is no clear guidance, and you have to find the way by yourself. Like this traveling example, unsupervised learning is the method of training a machine learning model with only a set of inputs. Principal Component Analysis and K-means clustering are the most famous examples of unsupervised learning; I will explain them a little later.

Data

Before I begin talking about how I analyzed the data, let's talk about the data itself. There are 5,043 movies in total, with 28 attributes ranging from director name to the number of Facebook likes.


1. Data Cleaning

In statistics classes, we often get clean data: no missing values, no NAs. In reality, clean data is just a dream. There is always some messy part, and it's our job to trim the data into usable shape before running the analysis.

Here are some libraries you need for this post.

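A plausible setup block (the CSV file name follows the Kaggle distribution of this dataset and is an assumption):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# IMDB 5000 movie dataset
movie = pd.read_csv("movie_metadata.csv")
```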

First, let's do some filtering to keep only the numeric columns and drop the ones with words. To do this, I created a Python list, num_list, containing the numeric column names.


By the way, when it comes to Python, the pandas library is a must-have item. Using pandas, we can create a new dataframe, movie_num, containing just the numbers.
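A sketch; the post hand-writes num_list, but inferring the numeric columns by dtype has the same effect:

```python
# Numeric columns only
num_list = movie.select_dtypes(include=[np.number]).columns.tolist()
movie_num = movie[num_list]
```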


Using the fillna function, we can easily handle NaN values (fillna fills them in; dropna would discard them instead).
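For example, filling with each column's median (the post's exact fill value is not visible, so the median here is an assumption):

```python
# Replace NaNs with the column median
movie_num = movie_num.fillna(movie_num.median())
```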

If the distributions of certain variables are skewed, we can standardize them.

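A sketch with scikit-learn's StandardScaler:

```python
# Rescale every column to zero mean and unit variance
X = StandardScaler().fit_transform(movie_num.values)
```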

2. Correlation Analysis

Hexbin Plot

Let's look at some hexbin visualizations first to get a feel for how the different pairs of features relate to one another. In a hexbin plot, the lighter the hexagonal pixels, the more data points fall into that bin.

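A sketch of the first plot (the column names imdb_score and gross follow the Kaggle file):

```python
# Hexbin of IMDB score against gross revenue; lighter bins hold more movies
sub = movie[["imdb_score", "gross"]].dropna()
plt.hexbin(sub["imdb_score"], sub["gross"], gridsize=40)
plt.xlabel("imdb_score")
plt.ylabel("gross")
plt.show()
```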

[Hexbin plot: IMDB score vs. gross revenue]

This is a hexbin plot of IMDB score against gross revenue. We can see it is lighter around scores between 6 and 7.

 

[Hexbin plot: IMDB score vs. duration]

This is a hexbin plot of IMDB score against duration (in minutes). Again, scores between 6 and 7 are the lightest.

We can examine the correlations further using a Pearson correlation plot.

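A sketch with seaborn:

```python
# Pearson correlation matrix of the numeric features
corr = movie_num.corr()
sns.heatmap(corr, square=True, cmap="coolwarm")
plt.show()
```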

[Pearson correlation heatmap of the numeric features]

As we can see from the heatmap, there are regions of features with quite positive linear correlations amongst each other, given the darker shade of the colours: the top left-hand corner and the bottom right quarter. This is a good sign, as it means we may be able to find linearly correlated features on which to perform PCA projections.

3. Explained Variance Measure & Principal Component Analysis

Now you know what unsupervised learning is (I hope so). Next, let me explain principal component analysis. The explanation won't be as entertaining as the unsupervised learning one, but I'll try my best!

Principal component analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset. It’s often used to make data easy to explore and visualize.  Principal components are dimensions along which your data points are most spread out:

[Illustration: principal components are the directions along which the data points are most spread out]

<From: https://www.quora.com/What-is-an-intuitive-explanation-for-PCA>

Let me give you an example. Imagine that you are a nutritionist trying to explore the nutritional content of food. What is the best way to differentiate food items? By vitamin content? Protein levels? Or perhaps a combination of both?

Knowing the variables that best differentiate your items has several uses:

1. Visualization. Using the right variables to plot items will give more insights.

2. Uncovering Clusters. With good visualizations, hidden categories or clusters could be identified. Among food items for instance, we may identify broad categories like meat and vegetables, as well as sub-categories such as types of vegetables.

The question is, how do we derive the variables that best differentiate items?

So, the first step to answer this question is Principal Component Analysis.

A principal component can be expressed by one or more existing variables. For example, we may use a single variable, vitamin C, to differentiate food items. Because vitamin C is present in vegetables but absent in meat, the resulting plot (below, left) will differentiate vegetables from meat, but the meat items will be clumped together.

To spread the meat items out, we can use fat content in addition to vitamin C levels, since fat is present in meat but absent in vegetables. However, fat and vitamin C are measured in different units, so to combine the two variables we first have to normalize them, meaning shift them onto a uniform standard scale. This allows us to calculate a new variable, (vitamin C - fat). Combining the two variables helps to spread out both vegetable and meat items.

The spread can be further improved by adding fiber, of which vegetable items have varying levels. This new variable, ((vitamin C + fiber) - fat), achieves the best data spread yet.

 

So, that's my explanation of principal component analysis, with a hint of clustering thrown in. Now let me apply Principal Component Analysis to this dataset and show how it works.

Explained Variance Measure

I will be using a particular measure called explained variance, which is useful in this context to help us determine how many PCA projection components we should keep.

Before calculating explained variance, we need the eigenvectors and eigenvalues. The eigenvectors and eigenvalues of a covariance (or correlation) matrix represent the core of a PCA: the eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues explain the variance of the data along the new feature axes.

 

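A sketch of that step, working from the standardized matrix X (eigh is used because the covariance matrix is symmetric):

```python
# Eigendecomposition of the covariance matrix
cov_mat = np.cov(X.T)
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)

# Pair each eigenvalue with its eigenvector, sorted by decreasing eigenvalue
eig_pairs = sorted([(eig_vals[i], eig_vecs[:, i]) for i in range(len(eig_vals))],
                   key=lambda p: p[0], reverse=True)
```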

After sorting the eigenpairs, the next question is: how many principal components are we going to choose for our new feature subspace? The explained variance tells us how much information (variance) can be attributed to each of the principal components.

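A sketch of the explained variance computation and plot:

```python
# Percent of total variance carried by each component, plus the running total
tot = eig_vals.sum()
var_exp = [(val / tot) * 100 for val, _ in eig_pairs]
cum_var_exp = np.cumsum(var_exp)

plt.bar(range(1, len(var_exp) + 1), var_exp, label="individual")
plt.step(range(1, len(var_exp) + 1), cum_var_exp, where="mid", label="cumulative")
plt.xlabel("Principal component")
plt.ylabel("Explained variance (%)")
plt.legend()
plt.show()
```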

[Plot: individual and cumulative explained variance per principal component]

From the plot above, approximately 90% of the variance can be explained by the first 9 principal components. For the purposes of this notebook, then, let's implement PCA with 9 components (although, to ensure we are not excluding useful information, one should really aim for a variance level of 95% or greater, which corresponds to about 12 components).
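A sketch of the projection and the first scatter:

```python
# Project the standardized data onto the first 9 principal components
pca = PCA(n_components=9)
X_9d = pca.fit_transform(X)

# Scatter of the first two projections
plt.scatter(X_9d[:, 0], X_9d[:, 1], alpha=0.3)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```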

[Scatter plot of the first two PCA projections]

There do not seem to be any discernible clusters. However, keeping in mind that our PCA projection contains another 7 components, looking at plots of those other components may be fruitful. For now, let us try a 3-cluster KMeans (just as a naive guess) and see whether we can visualize any distinct clusters.

5. Visualization with K-means Clustering
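A sketch of the clustering step:

```python
# Naive guess: 3 clusters, fit on the 9-dimensional projection
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_9d)

# Color the first two projections by cluster assignment
plt.scatter(X_9d[:, 0], X_9d[:, 1], c=labels, alpha=0.4)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```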

[Scatter of the PCA projections colored by the 3 KMeans clusters]

This KMeans plot looks more promising: if our simple clustering assumption turns out to be right, we can observe 3 distinguishable clusters in this color scheme. However, I would also like to generate KMeans visualizations for the other possible combinations of projections plotted against one another. I will use Seaborn's convenient pairplot function for the job: pairplot automatically plots all the features of a dataframe (in this case, our PCA-projected movie data) in a pairwise manner. I will pairplot the first 3 projections against one another; the resulting plot is given below.
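A sketch of that pairplot:

```python
# Pairwise plots of the first three projections, colored by cluster
plot_df = pd.DataFrame(X_9d[:, :3], columns=["PC1", "PC2", "PC3"])
plot_df["cluster"] = labels
sns.pairplot(plot_df, hue="cluster")
plt.show()
```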


[Seaborn pairplot of the first three PCA projections, colored by cluster]

 

Game of Thrones Battle Analysis

Today is a big day for GOT fans: the final episode of season 7 airs. Before you watch it, I have prepared a Game of Thrones battle analysis. The data is not the most up to date, but I think it will give you good insight into the battles of GOT.

Data

The GOT battle data has 38 observations with 25 variables.


The variables include attacker, defender, family, year and outcome of the battle.

Analysis

1. Does the Size of the Army Decide the Outcome of the Battle?

[Scatter plot of attacker size, with linear and loess smoothing lines]

In this graph there is a blue line and a pink line. The blue line is a smoothing line fit by linear regression, while the pink line is a loess smoother. Linear regression is straightforward: you look for the straight line that minimizes the residual sum of squares. Loess is a nonparametric method that allows for nonlinearity.
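For reference, a minimal sketch of how those two smoothing lines could be drawn (the file and column names follow the Kaggle battles dataset and are assumptions):

```r
library(ggplot2)

battles <- read.csv("battles.csv")

ggplot(battles, aes(x = attacker_size, y = defender_size)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "blue") +    # straight-line fit
  geom_smooth(method = "loess", se = FALSE, colour = "pink")   # local, nonlinear fit
```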

A larger army does not guarantee victory. At the Battle of Castle Black, for example, Mance Rayder marched with roughly 100,000 wildlings and was beaten by a defending force of fewer than 1,500.

Since I already demonstrated these kinds of ggplots in earlier posts, I'll skip the rest of the code.

2. Which King Fought the Most Battles?

[Bar chart: number of battles per attacking king]

(pink indicates NA)

We can see that Joffrey/Tommen Baratheon attacked the most, followed by Robb Stark. Joffrey never participated in a battle directly, but his brutal attitude and shocking decisions are unforgettable.

3. How Have the Commanders of the Attacking Kings Performed?

[Plot: battles fought and won per attacking commander]

It looks like Gregor Clegane fought the most battles and won all of them for Joffrey.

 

4. What Are the Different Types of Battles Fought, and What Are Their Counts?

[Bar chart: counts of each battle type]

Pitched battle is the most common battle type, followed by siege.

5. In Which Regions Were Battles Fought, and Who Were the Attackers?

[Plot: battle regions by attacking king]

According to the data, all the kings fought in the North. Joffrey fought most of his battles in the Riverlands, mainly to defend King's Landing, because the Riverlands sit between everything and everything else.

 

6. Types of Battles and the Attacking Kings

[Plot: battle types used by each attacking king]

Stannis Baratheon had the largest army among all the kings, so he never needed an ambush. Robb Stark, on the other hand, was slowly building his position, so he used ambushes the most. The data shows he ambushed most of the time and built up his army before his bloody death at the Red Wedding.

 

7. Kings and Their Army Strength

[Plot: army strength per king]

8. Kings vs Kings

[Plot: which kings fought which other kings]

Joffrey fought against almost all the other kings except the wildlings: he was quite far from the Wall and had no need to cross it. Robb Stark, however, was quite focused. His quest was to take revenge on his old friends, the heirs of Robert Baratheon, who killed his father Ned Stark.

 

Things to do next

  1. Work on character death prediction
  2. Battle outcome prediction

[R] Create a word cloud with Harry Potter

Nowadays I'm trying to teach myself text analysis. I came across word clouds while exploring text and sentiment analysis. Sentiment analysis is quite tricky, but I'm learning it and hope to demonstrate it in the near future. Let's start with the word cloud.

Data

I found this Harry Potter dataset on GitHub. I installed the harrypotter package like this:

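A sketch, assuming the commonly used GitHub repo for this package:

```r
# The harrypotter package lives on GitHub, not CRAN
# install.packages("devtools")
devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
```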

This package contains the full text of all seven books, so the data is entirely text this time.

 

Word Cloud

Now I will demonstrate how to create a word cloud. It is purely data visualization, so it involves very little statistics (or none at all).

Step 1 Install these packages

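A plausible set (the exact list in the original screenshot may differ slightly):

```r
library(tm)            # corpus handling and text cleaning
library(SnowballC)     # stemming
library(wordcloud)     # draws the cloud
library(RColorBrewer)  # color palettes
```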

Those are the packages you need to create a word cloud.

Step 2  Create Corpus


A corpus is a collection of (natural language) text documents, usually large and well structured. In this case, I created a corpus from Philosopher's Stone.
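A sketch (philosophers_stone is the character vector the package exports, one chapter per element; VCorpus keeps the later tm_map transformations flexible):

```r
docs <- VCorpus(VectorSource(philosophers_stone))
```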

Step 3 Convert the corpus to a plain text document

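A sketch mirroring the post's step (recent tm versions may warn about this transformation):

```r
# Convert each element of the corpus to a plain text document
docs <- tm_map(docs, PlainTextDocument)
```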

Step 4 Clean the text and remove stopwords

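A sketch of the cleaning pass:

```r
# Lower-case everything, then strip punctuation, numbers, and stopwords
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
```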

In this way, we can remove a lot of redundant words from the text. Examples from stopwords('english') are:

"i", "me", "my", "myself", "we", "our", "ours", "ourselves", ...

Through this process, we trim the text down to the essential words we need.

Step 5 Create the new corpus from the polished text and perform stemming
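A sketch:

```r
# Stemming transforms words into their most basic form ("looking", "looked" -> "look")
docs <- tm_map(docs, stemDocument)
```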

As the comment in the code says, stemming transforms words into their most basic form.

Step 6 Lastly, create the word cloud
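A sketch (wordcloud accepts a tm corpus directly and computes the word frequencies itself):

```r
wordcloud(docs, max.words = 200, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```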

max.words controls the maximum number of words in the cloud. Adding colors makes the word cloud look prettier.

Result


<Philosopher's Stone>

Not surprisingly, Harry is mentioned most frequently. It looks like J.K. Rowling likes to use 'said' and 'look'. Ron and Hagrid appear slightly more often than Hermione and Dumbledore.


<Chamber of Secrets>

In this book, Ron is more prominent than in the previous one, and Malfoy is relatively more popular here too. In Chamber of Secrets, looking directly into the basilisk's eyes caused death, so 'eyes' is also one of the most popular words. If you look closely, you can also see Lockhart.


<Prisoner of Azkaban>

Now we can see that Snape appears in a non-green color, which implies he may become an important figure as the series goes on. As you may remember, this was the first novel involving Sirius Black, and you can see his name here. And Hagrid is back again.


<Goblet of Fire>

In Goblet of Fire, we can see Dumbledore's significance rise. Since Harry attended the World Cup with the Weasley family, we can see Weasley here as well.


<Order of the Phoenix>

Not so different from the previous one, but we can notice that Umbridge appears here.


<Half-Blood Prince>

In this book, Dumbledore appears more than Ron and Hermione. If you know the plot of this installment, it's clear why Dumbledore appears more than the pair. We can also see Slughorn, which makes it distinctive from the other books. Malfoy and Snape are back again in non-green colors.


<Deathly Hallows>

In this novel, the main storyline is the risky and important adventure with Ron and Hermione, so no wonder they are the largest words besides 'Harry' and 'said'. Since this book reveals the legendary wand, 'wand' appears almost as prominently as Dumbledore. Among the verbs you can see 'think', which may imply more internal monologue within the characters. Voldemort didn't appear much in the earlier word clouds, but we can see his name in this one.

<Summary>

 

We can see that Harry is truly the center of the series (the title is Harry Potter, after all). To me, it's surprising that Voldemort didn't appear as much as I expected. Since 'said' is the most frequently used verb, much of the novels is built on conversation. Besides Harry, the names Ron, Hermione, and Dumbledore are the most important in the series.