Well, I was trying not to talk about what I do for a living on this blog, since I originally wanted it to be a space outside of my work. But loving statistics even outside of work hours is part of who I am, so I decided to write my first statistics blog post.
Today, I finished my work a little earlier than I expected, so I had more than an hour of free time. At the same time, I was craving a pint of beer, but I did not want to risk being seen holding a can of beer at the workplace. What I did to fight this thirst for beer was look into the craft beer dataset on Kaggle. I had known the dataset was there for more than six months. I had been quite interested in the data, but I had never gotten around to analyzing it. Today, I was like “why not?” and dug into it.
As a fan of art museums, I find visualization one of my favorite parts of R programming. You know, it’s human nature to be attracted to a beautiful facade.
Based on the craft beer data, I would like to show how to make pretty maps using the ggplot2 package.
Step 1: Read the two datasets (you can find the data in the source section at the bottom). Load the ‘maps’ and ‘ggplot2’ packages.
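Step 1 might look something like the sketch below. The file names beers.csv and breweries.csv, and the column layout, are assumptions about the Kaggle download, so the real reads are left as comments. Two tiny made-up tables stand in for them so the snippet runs on its own.

```r
# install.packages(c("ggplot2", "maps"))  # run once if the packages are missing
library(ggplot2)
library(maps)

# With the downloaded files in the working directory (file names assumed):
# beer      <- read.csv("beers.csv")
# breweries <- read.csv("breweries.csv")

# Made-up stand-ins with an assumed layout, so the sketch is runnable.
beer <- read.csv(text = ",abv,name,brewery_id
0,0.045,Sample Lager,0
1,0.072,Sample IPA,1")

breweries <- read.csv(text = ",name,city,state
0,Sample Brewing,Portland,OR
1,Example Ales,Denver,CO")

# A header cell left blank is named "X" by read.csv.
names(breweries)
```

The blank first header cell is what usually produces a column called “X” when a CSV was saved with row numbers.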
Step 2: The first column of the breweries dataset is named just “X”. Let’s change that column name to “brewery_id”, just like in the beer dataset. Renaming a column takes a single assignment.
Since “brewery_id” is now a variable common to both datasets, we can combine them into one using the merge function. In this case, we are merging the datasets by brewery ID.
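Here is a minimal sketch of the rename and the merge. The rows are made up, and the suffixes argument is my own addition: it assumes both tables carry a “name” column and keeps the two apart after merging.

```r
# Made-up rows standing in for the two Kaggle tables.
breweries <- data.frame(
  X     = c(0, 1),
  name  = c("Sample Brewing", "Example Ales"),
  state = c("OR", "CO")
)
beer <- data.frame(
  abv        = c(0.045, 0.072, 0.090),
  name       = c("Sample Lager", "Sample IPA", "Sample Stout"),
  brewery_id = c(0, 1, 1)
)

# Rename the first column so both tables share a key.
names(breweries)[1] <- "brewery_id"

# Match each beer to its brewery; suffixes label the two "name" columns.
beer.data <- merge(beer, breweries, by = "brewery_id",
                   suffixes = c(".beer", ".brewery"))
beer.data
```

By default merge keeps only rows whose key appears in both tables, so a beer without a matching brewery would be dropped.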
Step 3: For this example, I would like to draw one map per ABV (alcohol by volume) level. In other words, I will show how many beers each state has at each ABV level on a U.S. map. Generally, the average ABV for beer is 4.5% (from: https://www.livescience.com/32735-how-much-alcohol-is-in-my-drink.html).
To see the distribution, a histogram is a good choice, and the ggplot2 package can draw one. “beer.data$abv” is the variable whose distribution we would like to see, and col=“blue” sets the line color.
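A histogram along these lines could be drawn as follows. The ABV values are made up, and the bin width, white fill, and axis labels are my own choices rather than anything from the original plot.

```r
library(ggplot2)

# Made-up ABV values (as proportions) standing in for the merged data.
beer.data <- data.frame(
  abv = c(0.040, 0.045, 0.045, 0.048, 0.050, 0.052, 0.055,
          0.060, 0.065, 0.070, 0.082, 0.095, NA)
)

abv.hist <- ggplot(beer.data, aes(x = abv)) +
  # colour is the bar outline; na.rm drops missing ABV values quietly
  geom_histogram(binwidth = 0.005, colour = "blue", fill = "white",
                 na.rm = TRUE) +
  labs(x = "ABV", y = "Number of beers")
abv.hist
```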
The highest number of observations is around 0.045 (4.5%). Based on this, I classified a beer as “low” when its ABV is lower than 4.5%. Most of the data points are concentrated between 4.5% and 6%, so I defined that range as “medium”. There are quite a few data points above 6%, so I defined 6%-8% as “high”. For ABV higher than 8%, I named it “very high”.
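One way to sketch this classification is with cut. The text does not say which group a beer at exactly 6% or 8% belongs to, so putting each boundary value in the higher group (right = FALSE) is an assumption.

```r
# Made-up ABV values, including the boundary cases and a missing value.
abv <- c(0.040, 0.045, 0.050, 0.060, 0.075, 0.080, 0.099, NA)

# right = FALSE makes each interval closed on the left: [0.045, 0.06) is "medium".
abv.level <- cut(
  abv,
  breaks = c(-Inf, 0.045, 0.06, 0.08, Inf),
  labels = c("low", "medium", "high", "very high"),
  right  = FALSE
)

table(abv.level, useNA = "ifany")
```

Beers with a missing ABV stay NA instead of being forced into a group.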
Step 4: We want to create a U.S. map, so let’s make a frequency table by state. Don’t forget to convert the state variable to a factor. R has a built-in vector, “state.abb”, which does not include “DC”. DC is the heart of the U.S. and there are people living there, so let’s include it, too.
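A frequency table like this can be sketched with table. The rows are made up, the column name abv.level is hypothetical, and trimws is a precaution that assumes the state codes may carry stray spaces.

```r
# Made-up rows; the leading spaces are deliberate, to show why trimws() can help.
beer.data <- data.frame(
  state     = c(" OR", " OR", " CO", " DC", " CA"),
  abv.level = c("low", "low", "high", "medium", "low")
)

# state.abb holds the 50 state abbreviations; DC is added by hand.
state.levels <- c(state.abb, "DC")
beer.data$state <- factor(trimws(beer.data$state), levels = state.levels)

# One row per state and ABV level, with zero counts kept for empty states.
freq <- as.data.frame(
  table(state = beer.data$state, abv.level = beer.data$abv.level),
  responseName = "freq"
)
head(freq[freq$freq > 0, ])
```

Because state is a factor with all 51 levels, states with no beers still appear in the table with a count of 0.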
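Step 5: The last line of the code for this step brings in the map background. The map_data function is provided by ggplot2 and relies on the maps package for the map outlines. The tolower function converts text to lowercase, which is needed because the map data stores state names in lowercase.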
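The sketch below shows one way this step could go. The lookup table and the made-up counts are my own illustration; the idea is to turn each abbreviation into the lowercase full name used by the map data, then fetch the state outlines.

```r
library(ggplot2)
library(maps)

# Abbreviation -> lowercase full name, with DC added by hand.
state.lookup <- data.frame(
  state  = c(state.abb, "DC"),
  region = tolower(c(state.name, "District of Columbia"))
)

# Made-up counts keyed by abbreviation.
freq <- data.frame(state = c("OR", "CO", "DC"), freq = c(2, 1, 1))
freq$region <- state.lookup$region[match(freq$state, state.lookup$state)]

# Outlines of the contiguous states plus DC, one row per polygon vertex.
us <- map_data("state")
```

The “state” map covers the lower 48 states and DC only, so Alaska and Hawaii will not be drawn.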
Step 6: Now comes a more complicated ggplot, this time with the map. FYI, I set the plot limits using longitude for x and latitude for y. labs is a great function for giving clear names to the x-axis, the y-axis and the title. We want to see the frequency for each state, so fill=freq.
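As a rough sketch, a map like this can be built with geom_map. The counts are made up, and the faceting by ABV level, the white borders, and the title text are my own assumptions rather than the original code.

```r
library(ggplot2)
library(maps)

us <- map_data("state")

# Made-up counts: one row per state (lowercase name) and ABV level.
freq <- data.frame(
  region    = rep(c("oregon", "california", "colorado"), times = 2),
  abv.level = factor(rep(c("low", "high"), each = 3), levels = c("low", "high")),
  freq      = c(5, 7, 2, 3, 6, 9)
)

beer.map <- ggplot(freq, aes(map_id = region)) +
  # map_id links each row to the polygon with the same region name
  geom_map(aes(fill = freq), map = us, colour = "white") +
  # longitude and latitude of the outlines set the plotting area
  expand_limits(x = us$long, y = us$lat) +
  facet_wrap(~ abv.level) +
  labs(x = "Longitude", y = "Latitude",
       title = "Number of craft beers by state and ABV level",
       fill = "Beers")
beer.map
```

Only states that have a row in the data get filled in, so with the full frequency table from Step 4 every state on the map would be colored.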
As a result,
We can see that West Coast people love lower-ABV beer. Look at Oregon and California: they produce more low-ABV beers than the East Coast. From medium to very high, Colorado produces the highest number of beers, but California consistently produces a higher-than-average number. Other notable producers appear to be the four states on Lake Michigan, and Texas. Utah makes a large showing for low-alcohol beers (no wonder). The East Coast seems pretty reserved in this map.