[Book Review] Game of Thrones

I’m very slow at catching trends, and I have only now started reading the “A Song of Ice and Fire” series. As you know, reading this series is a big commitment: the five books add up to more than 4,000 pages, and George R. R. Martin hasn’t finished the series yet (I hope the 6th book comes out soon!).


Although I already knew parts of the story (I had heard too many spoilers), it was quite interesting to read, and this book has some unique characteristics that set it apart from other fantasy novels (e.g. The Lord of the Rings, Harry Potter). As many of you have already seen in the TV drama, good characters do not always make it to the end (think of Eddard Stark), and likeable characters have flaws; they are not the epitome of all goodness that you find in some fantasy novels (e.g. Aragorn in The Lord of the Rings). I think those characteristics make this book more realistic and approachable. The characters are more like the gods of Greek mythology: none of them is perfect. Plus, if you are fascinated by medieval English history, you will find a lot of resemblances; GRRM did a great job of incorporating historical facts into this story.

The book isn’t without its flaws, of course. Although different characters narrate different chapters, there is absolutely no change in tone from character to character, to the point where the eight-year-old thinks, acts, and talks exactly like the forty-year-olds in the book. Certain characters are absent for much too long, resulting in implausible leaps from Mindset A to Mindset Z (Daenerys goes from “I don’t want to marry Khal Drogo and I don’t want to be queen of anything!” to calling Drogo “my sun-and-stars” and planning how she’s going to take back her family’s throne in the space of two chapters, with nothing in between to explain how she got to that point), and certain characters who should have had chapters devoted to their particular mindset are absent from the book (what I wouldn’t give to have read a chapter written from Cersei’s perspective).

But those are minor quibbles. This is a good fantasy book, because it subverts so many familiar fantasy tropes. Tropes like the idea of good guys and bad guys, and nothing in between. This isn’t The Lord of the Rings, where the good guys are noble and awesome and handsome and will win the big final battle and the bad guys are literally pure evil and ugly and will suffer for their foolish attempts at conquest. Martin was strongly influenced by the Wars of the Roses, and the similarities are clear: there’s no single good guy who deserves to have the throne over everyone else; instead we have several powerful families, all of them varying degrees of evil, fighting and clawing over what is, at the end of the day, just a stupid crown. The guy who won the crown from the original ruler, King Robert, is our typical fantasy hero, but he finds that after fifteen years of ruling, actually running a kingdom is a lot less fun than fighting for one. And that’s the way things go: it’s easy to depose the crazy despot, but what happens when you take his place and have to start thinking about taxes and actually governing this country that you fought so hard for? It sucks, that’s what happens.

At the end of this book, I was amazed by the world created in GRRM’s brain. He must have been living in that world mentally in order to describe it and build the story in such an elaborate way. Now I’m turning the first page of A Clash of Kings.




[R] Harry Potter Sentiment Analysis

Last time, I created word clouds based on Harry Potter. In this post, I will discuss how emotions change throughout each chapter for each book.

  1. Download these libraries

Screen Shot 2017-09-16 at 3.11.45 PM

This time, you also need to install “sentimentr”. Lots of useful work can be done by tokenizing at the word level, but sometimes it is useful or necessary to look at different units of text. For example, some sentiment analysis algorithms look beyond unigrams (i.e. single words) to try to understand the sentiment of a sentence as a whole. These algorithms try to understand that

I’m not having a good day.

is a sad sentence, not a happy one, because of negation. The sentimentr R package is an example of such a sentiment analysis algorithm. For these, we may want to tokenize text into sentences.
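To make the negation point concrete, here is a toy sketch in Python (the post itself uses R). The mini-lexicon, the negator list and the three-word window are all my own assumptions for illustration; this is not how sentimentr works internally.

```python
import re

# Hypothetical mini-lexicon; real lexicons such as AFINN are far larger.
LEXICON = {"good": 3, "happy": 3, "sad": -2, "bad": -3}
NEGATORS = {"not", "no", "never"}


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def unigram_score(sentence):
    """Sum word scores one word at a time, ignoring context."""
    return sum(LEXICON.get(word, 0) for word in tokenize(sentence))


def negation_aware_score(sentence, window=3):
    """Flip a word's score when a negator appears in the `window` words before it."""
    words = tokenize(sentence)
    total = 0
    for i, word in enumerate(words):
        score = LEXICON.get(word, 0)
        if score and NEGATORS & set(words[max(0, i - window):i]):
            score = -score
        total += score
    return total


sentence = "I'm not having a good day."
print(unigram_score(sentence))         # 3  (looks happy)
print(negation_aware_score(sentence))  # -3 (negation flips it)
```

The word-by-word score calls the sentence positive, while the sentence-level version catches the "not".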

2. Tokenize text into sentences.

Screen Shot 2017-09-16 at 3.11.50 PM

The argument token = "sentences" attempts to break up text by punctuation.
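As a rough illustration of what "breaking up text by punctuation" means, here is a naive Python sketch. The splitting rule is my own simplification, not the rule the R tokenizer actually uses, and the sample text is made up.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


print(split_sentences("It was dark. Harry woke up! Where was he?"))
# ['It was dark.', 'Harry woke up!', 'Where was he?']

# Limitation of the naive rule: abbreviations are split too.
print(split_sentences("Mr. Smith left."))
# ['Mr.', 'Smith left.']
```

This is also why sentence tokenizing only "attempts" to split correctly: titles like "Mr." can create false sentence breaks.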

3. Break up the  text by chapter and sentence.

Screen Shot 2017-09-16 at 3.11.56 PM

This will allow us to assess the net sentiment by chapter and by sentence. First, I track the sentence numbers, and then I create an index that tracks the progress through each chapter. I then unnest the sentences into words. This gives us a tibble that has individual words by sentence within each chapter.

4. Join “afinn” lexicon and compute the net sentiment score

Screen Shot 2017-09-16 at 3.12.02 PM

Now, as before, I join the AFINN lexicon and compute the net sentiment score for each chapter. The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
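The idea of a net score per chapter can be sketched in a few lines of Python. The word scores below are hypothetical stand-ins in the same -5 to 5 range (not the real AFINN values), and the sentences are made up.

```python
from collections import defaultdict

# Hypothetical AFINN-style scores (not the real lexicon values).
SCORES = {"happy": 3, "love": 3, "dark": -2, "dead": -3, "afraid": -2}

# (chapter, sentence) pairs; made-up text for illustration.
sentences = [
    (1, "harry was happy"),
    (1, "the corridor was dark"),
    (2, "he was afraid and the owl was dead"),
]


def net_sentiment_by_chapter(rows, scores):
    """Sum the scores of all scored words, grouped by chapter."""
    totals = defaultdict(int)
    for chapter, sentence in rows:
        for word in sentence.split():
            totals[chapter] += scores.get(word, 0)
    return dict(totals)


print(net_sentiment_by_chapter(sentences, SCORES))  # {1: 1, 2: -5}
```

Words that are not in the lexicon simply contribute 0, which is the same effect as dropping them in a join.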

5. Visualize using ggplot

Screen Shot 2017-09-16 at 3.12.18 PM




<Philosopher’s Stone>

This book has the fewest chapters of all seven books. The sentiment ranges from -20 to 15, which is also the narrowest range. We can observe that the first chapter is emotionally neutral, while chapter 17 contains both the most negative and the most positive passages. We can also see that this book has a relatively happy ending.




<Chamber of Secrets>

It also has a narrow range of emotions, even with more chapters. About 25% of the way through chapter 1, there is a quite conspicuous negative part, and I wonder what it was about.



<Prisoner of Azkaban>

It looks like Prisoner of Azkaban does not have many emotionally positive parts. The highest score is lower than in the previous two books, and the minimum value is lower too, which indicates that the net sentiment is lower overall. In particular, about 50% of the way through chapter 16, we can see a dark red color. It indicates that Prisoner of Azkaban got darker than the previous ones. But still, it has a happy ending.


<Goblet of Fire>

From this book on, J.K. Rowling started to include more chapters: Goblet of Fire has 37 chapters. The emotional range is similar to that of the previous three books. Compared to Prisoner of Azkaban, there are some noticeable blue parts, which may be because Harry getting high scores in the Triwizard Tournament was quite exciting. But there are also some red parts, which may include Harry being scorned by his friends and the death of Cedric. That may be why the ending is relatively neutral.



<Order of the Phoenix>

I feel this one is slightly more colorful than the previous ones. There is a lot of blue around the middle of the story, but as it goes on, red becomes pretty dominant. Considering that Sirius Black was killed near the end, it makes sense that the ending is not a happy one.



<Half Blood Prince>

It is somehow less colorful than Order of the Phoenix. We should also notice that the highest value of the whole series is in this book. For example, past 75% of chapter 4, the net score is around 30 (I forgot why). Also, the darkest red part is around 50% of chapter 28, and it may be the moment when Dumbledore was killed.


<Deathly Hallows>

Interestingly, this book has the lowest net score: -40. It occurs in chapter 17, after 50% progress. According to the story, it is the part where Harry confronts Bathilda, who turns into a snake. Besides that, we can see that negative and neutral sentiment is dominant in this one. But we know that it ends well!


[Travel] Auschwitz Concentration Camp

In summer 2011, I took a 10-day family trip through Eastern Europe. Visiting Auschwitz in Poland was obviously not the most pleasant part of the trip, but it was the most memorable and shocking. In fact, my family initially thought of skipping it since it might be emotionally disturbing. But since it’s one of the most historically significant memorials, we eventually decided to visit the Auschwitz concentration camp.

It was about a one-hour drive from Krakow, Poland. The scenery looked quite ordinary, but I kept thinking about how the people forced into the camp would have felt while looking at the same view from a completely packed train heading to Auschwitz.

On that day, Auschwitz was quite crowded with visitors from all around the world. It was mandatory to be accompanied by a guide to look around the concentration camp, so we had a guide who spoke fluent English. Before we looked around the facilities in Auschwitz, the guide led us to the museum to provide background information.

The Nazis decided to build the giant concentration camp in Auschwitz (Polish: Oświęcim) because it is geographically near the center of Europe and could easily be reached by railroad. Because of the location, the Nazis thought they could easily transport Jews and other “inferior” people from all around Europe. Plus, Poland had one of the largest Jewish populations in Europe at that time. The Nazis were able to gather a lot of Jewish people by telling them that they would be given shelter.

Screen Shot 2017-09-10 at 10.35.21 AM.png

Here are the statistics on the estimated number of Jews deported to Auschwitz. As you can see, most of the people deported to Auschwitz were from Hungary and Poland. Surprisingly, the Nazis even deported people from Norway. Among those people, 1.1 million were killed.


auschwitz statistics


<Auschwitz Concentration Camp 1>

In this picture, you can see a lot of identical-looking buildings; that’s where many inmates who were capable of hard labor were housed. Men who were capable of hard labor were sent to Auschwitz Concentration Camp 1, while women and young people were sent to Auschwitz Concentration Camp 2. Old people, who were not considered able to work, were killed as soon as they arrived at the camp.


According to “Man’s Search for Meaning” by Viktor Frankl, who survived the camp, a Nazi officer decided with a movement of his finger who would be sent to the camp and who would be killed immediately: right meant you would survive to labor in the camp, left meant you would die immediately in the gas chamber.



This is the monumental entrance of the concentration camp with the notorious slogan ‘Arbeit macht frei’, which means “Work makes you free”. The same slogan appears not only in Auschwitz but also in other concentration camps such as Dachau, Germany. The inmates showed resistance in a subtle way by flipping the B upside-down.

Inside these buildings, the exhibits show what inmates were forced to hand over to the Nazis before entering the camp. Indeed, the Nazis took pretty much everything they could, even hair and leg casts, and it left me speechless.

Screen Shot 2017-09-10 at 10.54.21 AM.png

Since many inmates thought they would get a new shelter, they brought a lot of things like these mugs and dishes.

Screen Shot 2017-09-10 at 10.58.08 AM.png

And there are enormous piles of shoes.

Screen Shot 2017-09-10 at 10.59.02 AM.png

And these portmanteaus with the owners’ names on them.


There were some people who thought it would be better to commit suicide than to continue living in the camp under the worst, most unhygienic conditions. This is where they tried to end their lives, and there are also towers from which Kapos could watch those people.



<Auschwitz Death Wall>

The condemned were led to the wall for execution. SS men shot several thousand people there—mostly Polish political prisoners and, above all, members of clandestine organizations.


Screen Shot 2017-09-10 at 11.04.12 AM.png


Screen Shot 2017-09-10 at 11.03.59 AM.png

This is the reconstruction inside the buildings. Imagine tons of people packed into those buildings. We can see how utterly terrible the living conditions in the camp were.



<Chimney of gas chamber>


<Gas Chamber>

If you look closely at the whiter part of the chamber, most of it is nail scratches.



Screen Shot 2017-09-10 at 11.17.47 AM.png

Next to the gas chamber is where Rudolf Höss, the longest-serving commandant of the Auschwitz concentration camp, was executed. While many inmates were being killed, he often held parties with other Nazi officers in his house near Auschwitz. This house is also located close to the gas chamber.

Screen Shot 2017-09-10 at 11.21.25 AM.png


Of course, visiting Auschwitz was not the pleasant part of the trip, but I believe every person needs to visit this place. Visiting it gave me a good opportunity to contemplate how cruel humans can be, and it encouraged me to read more of the journals about survival in the camp. The German government has officially apologized to the victims and financially supports the running of this place. Indeed, there were also a lot of Germans visiting this place, or other concentration camps in Germany, to learn from the mistakes of the past and try not to repeat them. As Mark Twain is often quoted as saying, “History doesn’t repeat itself but it often rhymes”; we should not forget this history and should try as best as we can not to repeat it.








[Python] Principal Component Analysis and K-means clustering with IMDB movie datasets

Hello, today’s post is the first one in which I present results in Python! Although I love R and I’m loyal to it, Python is widely loved by many data scientists. Python is quite easy to learn and it has a lot of great libraries.

In this post, I implement two unsupervised learning methods: 1. Principal Component Analysis and 2. K-means clustering. A reader with no background in machine learning might think, “what the hell is unsupervised learning?” I will try my best to explain this concept.

Unsupervised Learning

OK, let’s imagine you are going backpacking in a new country. Isn’t it exciting? But you do not know much about the country: its food, culture, language, etc. However, from day 1 you start making sense of things there, learning to eat new cuisines (including what not to eat) and finding your way to that beach.

In this example, you have lots of information but you do not know what to do with it initially. There is no clear guidance and you have to find the way by yourself. Like this traveling example, unsupervised learning means training a machine learning model with only a set of inputs, without any labeled outputs. Principal Component Analysis and K-means clustering are among the most famous examples of unsupervised learning. I will explain them a little bit later.


Before I begin talking about how I analyzed the data, let’s talk about the data. There are a total of 5,043 movies with 28 attributes. The attributes range from director name to the number of Facebook likes.

Screen Shot 2017-09-07 at 9.41.16 PM

1. Data Cleaning

In statistics class, we often get clean data: no missing values, no NA values. But in reality, clean data is just a dream. There are always some messy parts in the data, and it’s our job to clean the data into a usable form before running the analysis.

Here are some libraries you need for this post.

Screen Shot 2017-09-07 at 9.46.26 PM

First, let’s do some filtering to extract only the numeric columns and not the ones with words. So, I created a Python list, “num_list”, containing the numeric column names.

Screen Shot 2017-09-07 at 9.45.39 PM

By the way, when it comes to using Python, the pandas library is a must-have. Using pandas, we can create a new dataframe (movie_num) containing just the numeric columns.

Screen Shot 2017-09-07 at 9.48.19 PM

By using the pandas function “fillna”, we can easily replace NaN values with a value of our choice (“dropna” would discard them instead).

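If certain variables are skewed or measured on very different scales, we can standardize them so that each has mean 0 and unit variance. This puts the features on a common scale, although it does not remove the skew itself.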

Screen Shot 2017-09-07 at 9.50.24 PM
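For readers who cannot see the screenshots, here is a minimal plain-Python sketch of the two cleaning ideas: filling missing values and standardizing a column. The function names, the fill value of 0 and the numbers are my own assumptions, not the code from the screenshot; in practice pandas and scikit-learn’s StandardScaler do this for whole dataframes.

```python
import math
from statistics import mean, pstdev


def fill_missing(values, fill=0.0):
    """Replace None/NaN entries with `fill` (the same idea as pandas fillna)."""
    return [
        fill if v is None or (isinstance(v, float) and math.isnan(v)) else v
        for v in values
    ]


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant column carries no spread
    return [(v - mu) / sigma for v in values]


gross = [2.0, float("nan"), 4.0, 6.0]   # made-up column with one missing value
filled = fill_missing(gross, fill=0.0)  # [2.0, 0.0, 4.0, 6.0]
z = standardize(filled)
print(filled)
print([round(v, 3) for v in z])
```

Filling with 0 is only one choice; filling with the column mean or median is a common alternative and changes the results.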

2. Correlation Analysis

Hexbin Plot

Let’s look at some hexbin visualisations first to get a feel for how the different features relate to one another. In a hexbin plot, the color of each hexagonal bin shows how many data points fall into it; in the plots here, the lighter the hexagon, the more movies are concentrated there.

Screen Shot 2017-09-07 at 9.52.24 PM

Screen Shot 2017-09-07 at 9.16.31 PM

This is a hexbin plot of IMDB score against gross revenue. We can see it’s lighter for scores between 6 and 7.


Screen Shot 2017-09-07 at 9.16.22 PM

This is a hexbin plot of IMDB score against duration (movie running time). Again, the region with scores between 6 and 7 is lighter.

We can examine the correlation further using a Pearson correlation plot.

Screen Shot 2017-09-07 at 9.58.22 PM.png

Screen Shot 2017-09-07 at 9.17.40 PM.png

As we can see from the heatmap, there are regions (features) with quite positive linear correlations amongst each other, given the darker shade of the colours: the top left-hand corner and the bottom right quarter. This is a good sign, as it means we may be able to find linearly correlated features on which we can perform PCA projections.
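Each cell of a heatmap like this is a Pearson correlation between two columns. A minimal sketch of that calculation in plain Python, with made-up numbers and hypothetical column names:

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict of equally long numeric columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values, chosen so the relationships are perfectly linear.
columns = {
    "imdb_score": [5.0, 6.0, 7.0, 8.0],
    "duration": [90.0, 100.0, 110.0, 120.0],
    "rank": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(columns)
print(round(matrix["imdb_score"]["duration"], 3))  # 1.0
print(round(matrix["imdb_score"]["rank"], 3))      # -1.0
```

Values near 1 or -1 mean two features carry nearly the same information, which is exactly the redundancy PCA can compress.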

3. Explained Variance Measure & Principal Component Analysis

Now you know what unsupervised learning is (I hope so). Next, let me explain principal component analysis. The explanation will not be as entertaining as the one for unsupervised learning, but I’ll try my best!

Principal component analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset. It’s often used to make data easy to explore and visualize.  Principal components are dimensions along which your data points are most spread out:

Screen Shot 2017-09-07 at 10.18.24 PM.png


Let me give you an example. Imagine that you are a nutritionist trying to explore the nutritional content of food. What is the best way to differentiate food items? By vitamin content? Protein levels? Or perhaps a combination of both?

Knowing the variables that best differentiate your items has several uses:

1. Visualization. Using the right variables to plot items will give more insights.

2. Uncovering Clusters. With good visualizations, hidden categories or clusters could be identified. Among food items for instance, we may identify broad categories like meat and vegetables, as well as sub-categories such as types of vegetables.

The question is, how do we derive the variables that best differentiate items?

So, the first step to answer this question is Principal Component Analysis.

A principal component can be expressed by one or more existing variables. For example, we may use a single variable, vitamin C, to differentiate food items. Because vitamin C is present in vegetables but absent in meat, the resulting plot (below, left) will differentiate vegetables from meat, but the meat items will be clumped together.

To spread the meat items out, we can use fat content in addition to vitamin C levels, since fat is present in meat but absent in vegetables. However, fat and vitamin C levels are measured in different units. So to combine the two variables, we first have to normalize them, meaning to shift them onto a uniform standard scale, which allows us to calculate a new variable: (vitamin C - fat). Combining the two variables helps to spread out both vegetable and meat items.

The spread can be further improved by adding fiber, of which vegetable items have varying levels. This new variable, (vitamin C + fiber) - fat, achieves the best data spread yet.
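The food example can be tried out with a few lines of Python. The nutrient values below are made up for illustration, and the combination is hand-picked as in the text; real PCA finds the weights automatically.

```python
from statistics import mean, pstdev


def zscore(values):
    """Normalize a column to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up nutrient values; note the different units per column.
foods = ["spinach", "broccoli", "beef", "pork"]
vitamin_c = [28.0, 89.0, 0.0, 1.0]  # mg
fiber = [2.2, 2.6, 0.0, 0.0]        # g
fat = [0.4, 0.4, 15.0, 14.0]        # g

# (vitamin C + fiber) - fat, computed on the normalized columns.
combined = [
    c + f - t
    for c, f, t in zip(zscore(vitamin_c), zscore(fiber), zscore(fat))
]
for name, score in zip(foods, combined):
    print(f"{name}: {score:.2f}")
```

The vegetables end up with positive scores and the meats with negative ones, and items within each group are spread apart instead of clumped together.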


So, that’s my explanation of Principal Component Analysis (and, along the way, of why we look for clusters with K-means). Let me apply Principal Component Analysis to this dataset and show how it works.

Explained Variance Measure

I will be using a particular measure called Explained Variance which will be useful in this context to help us determine the number of PCA projection components we should be looking at.

Before calculating explained variance, we need to get eigenvectors and eigenvalues. The eigenvectors and eigenvalues of a covariance (or correlation) matrix represent the “core” of a PCA: the eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues explain the variance of the data along the new feature axes.


Screen Shot 2017-09-07 at 10.04.38 PM.png

After sorting the eigenpairs, the next question is “how many principal components are we going to choose for our new feature subspace?”. The explained variance tells us how much information (variance) can be attributed to each of the principal components.

Screen Shot 2017-09-07 at 10.05.06 PM.png

Screen Shot 2017-09-07 at 10.06.02 PMScreen Shot 2017-09-07 at 9.19.19 PM

From the plot above, it can be seen that approximately 90% of the variance can be explained by the first 9 principal components. Therefore, for the purposes of this notebook, let’s implement PCA with 9 components (although to ensure that we are not excluding useful information, one should really go for a 95% or greater variance level, which corresponds to about 12 components).
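Once the eigenvalues are known, the explained variance is just each eigenvalue divided by their sum, accumulated from largest to smallest. A minimal sketch in plain Python; the eigenvalues are hypothetical, not the ones from the movie data:

```python
def explained_variance(eigenvalues):
    """Return (individual ratios, cumulative ratios), largest eigenvalue first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    ratios = [value / total for value in ordered]
    cumulative, running = [], 0.0
    for ratio in ratios:
        running += ratio
        cumulative.append(running)
    return ratios, cumulative


def components_needed(eigenvalues, threshold):
    """Smallest number of components whose cumulative ratio reaches `threshold`."""
    _, cumulative = explained_variance(eigenvalues)
    for count, reached in enumerate(cumulative, start=1):
        if reached >= threshold - 1e-12:  # small tolerance for float rounding
            return count
    return len(cumulative)


eigenvalues = [1.0, 4.0, 2.0, 3.0]  # hypothetical values, unsorted on purpose
ratios, cumulative = explained_variance(eigenvalues)
print([round(r, 2) for r in ratios])       # [0.4, 0.3, 0.2, 0.1]
print(components_needed(eigenvalues, 0.90))  # 3
print(components_needed(eigenvalues, 0.95))  # 4
```

Raising the threshold from 90% to 95% costs extra components, which is the same trade-off as choosing between 9 and about 12 components here.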

Screen Shot 2017-09-07 at 10.07.13 PMScreen Shot 2017-09-07 at 9.21.01 PM

There do not seem to be any discernible clusters. However, keeping in mind that our PCA projections contain another 7 components, perhaps looking at plots with the other components may be fruitful. For now, let us try a 3-cluster KMeans (just as a naive guess) to see if we are able to visualize any distinct clusters.
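To show what a 3-cluster KMeans actually does, here is a small self-contained sketch in plain Python. It is not the library call used in this post: the farthest-point initialization is my own simplification to keep the example deterministic (libraries typically start from random or k-means++ centroids), and the points are made up.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Start from the first point, then repeatedly add the farthest point."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=10):
    """Alternate between assigning points and moving centroids."""
    centroids = init_centroids(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Three made-up, well-separated groups of 2-D points (e.g. two PCA projections).
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10), (20, 0), (20, 1), (21, 0)]
labels, centroids = kmeans(points, k=3)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which points share a label.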

4. Visualization with K-means clustering

Screen Shot 2017-09-07 at 10.09.22 PMScreen Shot 2017-09-07 at 9.21.52 PM

This KMeans plot looks more promising: if our simple clustering assumption turns out to be right, we can observe 3 distinguishable clusters via this color scheme. However, I would also like to generate a KMeans visualization for other possible combinations of the projections against one another. I will use Seaborn’s convenient pairplot function to do the job. Basically, pairplot automatically plots all the features in the dataframe (in this case our PCA-projected movie data) in a pairwise manner. I will pairplot the first 3 projections against one another, and the resultant plot is given below:

Screen Shot 2017-09-07 at 10.10.36 PM.png

Screen Shot 2017-09-07 at 9.23.08 PM


Game of Thrones Battle Analysis

Today is a big day for GOT fans: it’s the day of the last episode of season 7. Before you watch this episode, I prepared a Game of Thrones battle analysis. The data itself is not the most up-to-date, but I think it will give you good insight into the battles in GOT.


The GOT battle data has 38 observations with 25 variables.


Screen Shot 2017-08-27 at 11.50.36 AM.png

The variables include attacker, defender, family, year and outcome of the battle.


1. Does the Size of the Army Often Decide the Outcome of the Battle?

Screen Shot 2017-08-27 at 11.52.20 AMattacker size

So, in this graph there is a blue line and a pink line. The blue line is the smoothing line from linear regression, while the pink line is the smoothing line from loess. Linear regression is straightforward: you are looking for the straight line that minimizes the residual sum of squares. Loess is a nonparametric method that fits many small local regressions, which allows for non-linearity.
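The post’s plots are made in R, but the straight-line fit itself is easy to sketch in Python. The numbers below are made up and the function names are my own; this only illustrates the least-squares idea, not the plotting code.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residual_sum_of_squares(x, y, slope, intercept):
    """The quantity that the fitted line minimizes."""
    return sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))


x = [1.0, 2.0, 3.0, 4.0]  # made-up values
y = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(x, y)
print(slope, intercept)  # 2.0 1.0
print(residual_sum_of_squares(x, y, slope, intercept))  # 0.0
```

Any other line through these points would give a larger residual sum of squares, which is what "best fit" means here.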

A larger army does not guarantee victory. For example, according to the data, Mance Rayder defeated Stannis Baratheon brutally: Stannis was marching with 100,000 soldiers, while Mance Rayder had fewer than 1,500 troops.

Since I already demonstrated these kinds of ggplots, I’ll just skip the code part.

2. Which King Fought the Most Battles?



(pink indicates NA)

We can see that Joffrey/Tommen Baratheon attacked the most, followed by Robb Stark. Joffrey never participated in a battle directly, but his brutal attitude and shocking decisions are unforgettable.

3. How Have the Commanders of the Attacking Kings Performed?


Screen Shot 2017-08-27 at 11.42.28 AM


It looks like Gregor Clegane fought the most battles and won all of them for Joffrey.


4. What Are the Different Types of Battles Fought, and What Are Their Counts?


battle type

Pitched battle is the most common battle type followed by siege.
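Counts like these come from simply tallying one column. A quick Python sketch with a handful of made-up rows; the column names are my assumption about the data layout, and the real data has 38 battles:

```python
from collections import Counter

# Hypothetical rows for illustration only.
battles = [
    {"attacker_king": "Joffrey/Tommen Baratheon", "battle_type": "pitched battle"},
    {"attacker_king": "Robb Stark", "battle_type": "ambush"},
    {"attacker_king": "Joffrey/Tommen Baratheon", "battle_type": "siege"},
    {"attacker_king": "Robb Stark", "battle_type": "pitched battle"},
    {"attacker_king": None, "battle_type": "siege"},  # NA attacker
    {"attacker_king": "Stannis Baratheon", "battle_type": "pitched battle"},
]

# Tally battle types, and attacking kings with NA rows left out.
type_counts = Counter(row["battle_type"] for row in battles)
king_counts = Counter(
    row["attacker_king"] for row in battles if row["attacker_king"] is not None
)

print(type_counts.most_common())
# [('pitched battle', 3), ('siege', 2), ('ambush', 1)]
print(king_counts.most_common(1))
```

Keeping or dropping the NA rows is a choice worth stating, since it changes the totals in the bar chart.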

5. In Which Regions Were Battles Fought, and Who Were the Attackers?

Screen Shot 2017-08-27 at 11.43.48 AM

According to the data, all the kings fought in The North. Joffrey fought most of his battles in The Riverlands. Joffrey’s fights are mainly to defend King’s Landing, because The Riverlands lie between everything and everything else.


6. Types of Battles and the Attacker Kings


Screen Shot 2017-08-27 at 11.44.29 AM

Stannis Baratheon had the largest army of all the kings. He never needed to use an ambush; on the other hand, Robb Stark was slowly building his position, so he used ambushes the most. The data shows he ambushed most of the time and built up his army before his bloody death at the Red Wedding.


7. Kings and Their Army Strength


Screen Shot 2017-08-27 at 11.45.43 AM


8. Kings vs Kings


Screen Shot 2017-08-27 at 11.45.10 AM

Joffrey fought against almost all the other kings except the wildlings. He was quite far from the Wall and had no need to cross it. However, Robb Stark was quite focused… His quest was to take revenge on his old friends, the heirs of Robert Baratheon, who killed his father, Ned Stark.


Things to do next

  1. Work on character death analysis prediction
  2. Battle prediction

[Book Review] Cosmos by Carl Sagan

I had heard of this book’s reputation since I was young, but I did not dare to approach it because science often feels convoluted and abstract to me (I’m a math person, not a science person). But once I watched the documentary version narrated by Carl Sagan, I was immediately hooked and finally picked up the book.

Before I started reading, I thought this book would be full of scientific knowledge and explanations, but it goes far beyond that. Unlike didactic science scholars, Carl Sagan put his best effort into making astronomy approachable to the public. He has managed to put into simple words concepts that have scared away so many people for so long. In this book, Sagan encompasses the whole of human existence and the universe, with a focus on science.

For example, he also discussed:

– evolution,
– Kepler, astrology and acceptance of truth in spite of what outcome is desired,
– Venus and Mars, including the made-up belief of life on Mars a century ago,
– the Voyager spacecrafts’ Grand Tour of the Outer Planets (a rare alignment),
– ancient Greek scientists,
– Relativity,
– atoms, elements, and how stars make them,
– Creation Myths, including Hindu ones that are longer than the currently estimated age of the Universe,
– genes, DNA, the brain, and books: the progression of how and how much information we can store and access,
– SETI, and Jean-François Champollion’s translation of Egyptian hieroglyphs,
– the Library of Alexandria.

So even people who don’t love science that much will find some parts of the book interesting, and will marvel at how interconnected science and those different aspects (e.g. myth) are.


After giving us a general idea of our ‘cosmic address’, Sagan moves on to Darwin and his discovery of Natural Selection as the engine of Evolution. This has to be one of the finest explanations of Darwinian Natural Selection, where Sagan uses the extraordinary example of Heike crabs to demonstrate the strange but beautiful ways in which ‘survival of the fittest’ is manifested. But he doesn’t keep us here for long. After giving the best possible ‘lecture’ on Evolution, he takes us further to see the harmony of the worlds: the planets, and how the stars follow fixed patterns that can be mathematically explained, a most singular achievement of humans to have discovered the language of Nature. Kepler gave us the laws of planetary motion, laws that not only explained the elliptical orbit of Earth but also inspired a generation of mathematicians and physicists to inquire further into the nature and behavior of the heavenly bodies.

As the book progresses, Sagan’s obsession with extra-terrestrial life becomes more and more apparent. He admits that as a child, he spent hours contemplating about the possibility of intelligent life on other planets. Although our search for intelligent life has been a failure (even on Earth), Sagan aspires to make contact with the dwellers of distant worlds. The possibility of life elsewhere, is not too ‘fantastic’ altogether. As we observe the immensity of the observable universe, we can be more than certain that life does exist elsewhere but we don’t know what it will be like. Space travel and Alien Contact are not stuff of science fiction anymore but a possibility in waiting.

The concluding chapters touch on two matters of colossal significance, namely Nuclear Weapons and Climate Change. These two man-made disasters are ticking time bombs that can obliterate our species, and we have done precious little to stop them. We are destroying this planet, poisoning our oceans and destroying species after species, and have been for centuries now. Man is without a doubt the most deadly predator in the history of life on Earth. And now we are on the path to self-annihilation.


With its last chapter blaming selfish and greedy humans, his book is a wake-up call. A world ridden with ignorance and greed will need to forego the idiotic bliss of being certain about everything. We don’t need good answers to everything; what we need instead are good questions. A good question is oftentimes more educating than its answer. How can we love this world if we are awaiting an apocalypse? How can we love our environment and its safekeepers, the plants and the animals, without recognizing that they are our distant cousins? Life, wherever it exists on this planet, is our kin. And we are bullying, butchering and asphyxiating it everywhere. What a shame!

This book was first published in 1980, and we are still enjoying it. Its messages still speak to greedy and egocentric human beings. It is sad that humans have not improved much in this respect since 1980. What would Carl Sagan say about the current world if he were still alive?

Carl Sagan does more than just educate you about the wonders of science and the universe; he makes you fall in love with them.


[R] Create word cloud with Harry Potter

Nowadays I’m trying to learn text analysis by myself. I came across word clouds while exploring text/sentiment analysis. Sentiment analysis is quite tricky, but I’m learning it. I hope I can demonstrate it in the near future, but let’s start with a word cloud.


I found this Harry Potter dataset here. I installed the Harry Potter package using this:

Screen Shot 2017-08-21 at 9.38.51 AM

This package contains the full text of all seven books, so the data is entirely text this time.


Word Cloud

Now I will demonstrate how to create a word cloud. It’s purely data visualization, so it involves very little statistics (or it’s free of statistics).

Step 1 Install these packages

Screen Shot 2017-08-21 at 9.42.15 AM

Those are the packages that you need for creating word cloud.

Step 2  Create Corpus

Screen Shot 2017-08-21 at 9.45.28 AM

A corpus is a collection of documents containing (natural language) text. It’s usually large and well structured. In this case, I created a corpus from Philosopher’s Stone.

Step 3 Let’s convert the corpus to plain text document

Screen Shot 2017-08-21 at 9.58.57 AM

Step 4  Let’s remove stopwords and other redundant text

Screen Shot 2017-08-21 at 10.00.09 AM

In this way, we can remove a lot of redundant stuff from the text. Examples of stopwords(‘english’) are:

Screen Shot 2017-08-21 at 10.02.45 AM

From this process, we can trim the text and extract the essential words that we need.

Step 5 Create the new corpus with the polished one and perform stemming.

Screen Shot 2017-08-21 at 10.04.25 AM

As I wrote in the comment, stemming reduces words to their most basic form.

Step 6 Lastly, create word cloud

Screen Shot 2017-08-21 at 10.06.08 AM

max.words controls the maximum number of words shown in the word cloud. Adding colors makes the word cloud look prettier.
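Under the hood, a word cloud is just a word-frequency table drawn with font sizes. Here is a rough Python sketch of the steps above (the post itself uses R). The stopword list is a tiny hypothetical subset, the stemmer is a crude stand-in for a real stemming algorithm, and the text is made up.

```python
import re
from collections import Counter

# Tiny hypothetical stopword subset; real lists are much longer.
STOPWORDS = {"the", "a", "and", "to", "of", "was", "at", "he", "his"}


def naive_stem(word):
    """Very rough stemmer: strips a few common suffixes (not a real algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def word_frequencies(text, max_words=100):
    """Lowercase, drop punctuation and stopwords, stem, then count."""
    words = re.findall(r"[a-z]+", text.lower())
    kept = [naive_stem(word) for word in words if word not in STOPWORDS]
    return Counter(kept).most_common(max_words)


text = "Harry looked at Ron. Ron looked at the owl and Harry smiled."
top = word_frequencies(text, max_words=3)
print(top)  # [('harry', 2), ('look', 2), ('ron', 2)]
```

The max_words argument plays the same role as max.words: it caps how many of the most frequent words get drawn.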



<Philosopher’s Stone>

Not surprisingly, Harry is the most frequently mentioned. It looks like JK Rowling likes to use ‘said‘ and ‘look‘. Ron and Hagrid appear slightly more often than Hermione and Dumbledore.



<Chamber of Secrets>

In this book, Ron is more prominent than in the previous one. Compared to the last one, Malfoy is relatively more popular here. In Chamber of Secrets, looking directly into the basilisk’s eyes causes death, so ‘eyes‘ is also one of the most popular words. If you look closely, you can also see Lockhart.



<Prisoner of Azkaban>

Now we can see that Snape appears in a non-green color. It implies that Snape becomes a more important figure as the series goes on. As you remember, this was the first novel that involved Sirius Black, and you can see his name here. And Hagrid is back again.


<Goblet of Fire>

In Goblet of Fire, we can see that the significance of Dumbledore has risen. Since Harry went to the Quidditch World Cup with the Weasley family, we can see Weasley here as well.



<Order of the Phoenix>

Not so different from the previous one, but we can notice that Umbridge appears here.


<Half Blood Prince>

In this book, Dumbledore appears more than Ron and Hermione. If you know the plot of this book, it makes sense that Dumbledore appears more than the couple. Also, we can see Slughorn, which makes it distinct from the other books. Malfoy and Snape are back again in non-green colors.


<Deathly Hallows>

In this novel, the main part is the risky and important adventure with Ron and Hermione. No wonder they are the largest words besides ‘Harry’ and ‘said’. Since this book reveals that there is a legendary wand, ‘wand’ appears almost as prominently as Dumbledore. As for verbs, you can see ‘think’, which may imply that there are more internal monologues from the characters. Voldemort didn’t appear much in the earlier word clouds, but we can see his name in this one.



We can see that Harry is truly the center of the series, which is no surprise since the title is already Harry Potter. To me, it’s surprising that Voldemort didn’t appear as much as I thought. Since ‘said‘ is the most frequently used verb, it indicates that a large part of the novels is based on conversation. Besides Harry, the characters Ron, Hermione and Dumbledore are the most important in this series.