Wednesday, 11 January 2017

7 Creating a Dashboard - Part II

The Dashboard on this blog

Data Source

7 Creating a Dashboard - Part 1

Basic Column Chart Showing All Data

Data Source

3 Specifying Range of Data and Selecting Columns : Google Charts with Google Docs data

Sheet 2 Chart - Sheet, Range, Cols

Creating a Google Chart using data pulled from Google Docs

Data from a Spreadsheet

Basic Google Chart

The first Pi-chart and here i can put some text

Sample Codes for illustration of ggplot2

## Sample Codes for illustration of ggplot2
## Susheel Shukla 09-01-2017

### install & load ggplot library
install.packages("ggplot2")
library("ggplot2")

### show info about the data
head(diamonds)
head(mtcars)

### comparison qplot vs ggplot

# qplot bar chart (clarity is categorical, so this is a bar chart rather than a histogram)
qplot(clarity, data=diamonds, fill=cut, geom="bar")

# ggplot bar chart -> same output
ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()

P.S. Both qplot() and ggplot() produce the same graph.

### how to use qplot

# scatterplot

qplot(wt, mpg, data=mtcars)

# transform input data with functions

qplot(log(wt), mpg - 10, data=mtcars)

# add aesthetic mapping (hint: how does mapping work)

qplot(wt, mpg, data=mtcars, color=qsec)

# change size of points (hint: color/colour, hint: set aesthetic/mapping)

qplot(wt, mpg, data=mtcars, color=qsec, size=3)

qplot(wt, mpg, data=mtcars, colour=qsec, size=I(3)) #aesthetics can be set to a constant value instead of mapping
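
The difference between mapping an aesthetic and setting it to a constant may be easier to see in ggplot() syntax. The following is a minimal sketch, not part of the original script: inside aes() a value is mapped through a scale, while outside aes() it is used as given. The names p_mapped and p_set are only illustrative.

```r
library(ggplot2)

# Mapping: size is tied to the qsec column, so ggplot2 builds a size scale and a legend.
p_mapped <- ggplot(mtcars, aes(wt, mpg, colour = qsec, size = qsec)) +
  geom_point()

# Setting: size is a constant given outside aes(), so there is no scale or legend for it.
p_set <- ggplot(mtcars, aes(wt, mpg, colour = qsec)) +
  geom_point(size = 3)

# layer_data() returns the values each layer will actually draw.
unique(layer_data(p_set)$size)             # the constant we set
length(unique(layer_data(p_mapped)$size))  # many sizes, one per distinct qsec value
```

In qplot(), wrapping the value in I() plays the role of moving it outside aes().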

# use alpha blending

qplot(wt, mpg, data=mtcars, alpha=qsec) # values between 0 (transparent) and 1 (opaque)

# continuous scale vs. discrete scale

head(mtcars)

qplot(wt, mpg, data=mtcars, colour=cyl)

levels(mtcars$cyl) # returns NULL: cyl is numeric, so it has no factor levels

qplot(wt, mpg, data=mtcars, colour=factor(cyl))
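
Why the two plots above get different colour scales can be checked directly in base R. This is a small illustrative sketch; cyl_f is a name introduced here, not one used elsewhere in the script.

```r
# cyl is stored as a number, so ggplot2 treats it as continuous (a colour gradient).
class(mtcars$cyl)      # "numeric"
levels(mtcars$cyl)     # NULL

# Wrapping it in factor() gives discrete levels, so each level gets its own colour.
cyl_f <- factor(mtcars$cyl)
levels(cyl_f)          # "4" "6" "8"
table(cyl_f)           # number of cars per cylinder count
```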

# use different aesthetic mappings

qplot(wt, mpg, data=mtcars, shape=factor(cyl))

qplot(wt, mpg, data=mtcars, size=qsec)

# combine mappings (hint: hollow points, geom-concept, legend combination)

qplot(wt, mpg, data=mtcars, size=qsec, color=factor(carb))

qplot(wt, mpg, data=mtcars, size=qsec, color=factor(carb), shape=I(1))

qplot(wt, mpg, data=mtcars, size=qsec, shape=factor(cyl), geom="point")

qplot(wt, mpg, data=mtcars, size=factor(cyl), geom="point")

# bar-plot

qplot(factor(cyl), data=mtcars, geom="bar")

# flip plot by 90°

qplot(factor(cyl), data=mtcars, geom="bar") + coord_flip()

# difference between fill/color bars

qplot(factor(cyl), data=mtcars, geom="bar", fill=factor(cyl))

qplot(factor(cyl), data=mtcars, geom="bar", colour=factor(cyl))

# fill by variable

qplot(factor(cyl), data=mtcars, geom="bar", fill=factor(gear))

# use different display of bars (stacked, dodged, identity)

head(diamonds)

qplot(clarity, data=diamonds, geom="bar", fill=cut, position="stack")

qplot(clarity, data=diamonds, geom="bar", fill=cut, position="dodge")

qplot(clarity, data=diamonds, geom="bar", fill=cut, position="fill")

qplot(clarity, data=diamonds, geom="bar", fill=cut, position="identity")

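# freqpoly bins a continuous x by default; clarity is discrete, so count its levels with stat="count"
ggplot(diamonds, aes(clarity, group=cut, colour=cut)) + geom_freqpoly(stat="count", position="identity")

ggplot(diamonds, aes(clarity, group=cut, colour=cut)) + geom_freqpoly(stat="count", position="stack")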

Quick Introduction to ggplot2


This is a bare-bones introduction to ggplot2, a visualization package in R. It assumes no knowledge of R.

Preview

Let’s start with a preview of what ggplot2 can do.

Given Fisher’s iris data set and one simple command…

## R Codes

head(iris)

library(ggplot2)

qplot(Sepal.Length, Petal.Length, data = iris, color = Species)

…we can produce this plot of sepal length vs. petal length, colored by species.

R Basics

Vectors

Vectors are a core data structure in R, and are created with c(). Elements in a vector must be of the same type.

numbers = c(23, 13, 5, 7, 31)

names = c("edwin", "alice", "bob")

Elements are indexed starting at 1, and are accessed with [] notation.

#indexing

numbers[1]

names[1]
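
The two rules above (one type per vector, indexing from 1) can be tried out with a short sketch. The vector mixed is a made-up example for illustration.

```r
# Mixing numbers and strings coerces everything to the most general type.
mixed <- c(1, "two", 3)
class(mixed)          # "character"

numbers <- c(23, 13, 5, 7, 31)
numbers[1]            # first element: 23
numbers[c(1, 3)]      # first and third elements: 23 5
numbers[-1]           # a negative index drops that element
length(numbers)       # 5
```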

Data frames

Data frames are like matrices, but with named columns of different types (similar to database tables).

books = data.frame(

title = c("harry potter", "war and peace", "lord of the rings"), # column named "title"

author = c("rowling", "tolstoy", "tolkien"),

num_pages = c(350, 875, 500) # numeric column, unlike the text columns above

)

You can access columns of a data frame with $.

##column access

books$title

You can also create new columns with $.
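
As a small sketch of creating a column with $, the example below rebuilds the books data frame and adds one derived column. It assumes num_pages is stored as numbers; is_long and the 400-page threshold are hypothetical choices made for illustration.

```r
books <- data.frame(
  title = c("harry potter", "war and peace", "lord of the rings"),
  author = c("rowling", "tolstoy", "tolkien"),
  num_pages = c(350, 875, 500)
)

# Assigning to a name that does not exist yet creates a new column.
books$is_long <- books$num_pages > 400

books$is_long     # FALSE TRUE TRUE
ncol(books)       # 4
```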

ggplot2

install.packages("ggplot2")

library(ggplot2)

Scatterplots with qplot()

## scatterplot with qplot()

head(iris) # by default head displays the first 6 rows

head(iris, n=10) # now first 10 rows

qplot(Sepal.Length, Petal.Length, data = iris)

# Plot Sepal.Length vs. Petal.Length, using data from the `iris` data frame.

# * First argument `Sepal.Length` goes on the x-axis.

# * Second argument `Petal.Length` goes on the y-axis.

# * `data = iris` means to look for this data in the `iris` data frame.

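#To see where each species falls in this plot, we can color each point by species, by adding a color = Species argument.

##R Code:

qplot(Sepal.Length, Petal.Length, data = iris, color = Species)

#Similarly, we can let the size of each point denote petal width, by adding a size = Petal.Width argument.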

#R Code:

qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width)

# We see that Iris setosa flowers have the narrowest petals.

#R Code:

qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width, alpha = I(0.7))

# By setting the alpha of each point to 0.7, we reduce the effects of overplotting

##Finally, let's fix the axis labels and add a title to the plot.

#R Code:

qplot(Sepal.Length, Petal.Length, data = iris, color = Species,

xlab = "Sepal Length", ylab = "Petal Length",

main = "Sepal vs. Petal Length in Fisher's Iris data")

##Other common geoms

##In the scatterplot examples above, we implicitly used a point geom, the default when you supply two arguments to qplot().

# These two invocations are equivalent.

qplot(Sepal.Length, Petal.Length, data = iris, geom = "point")

qplot(Sepal.Length, Petal.Length, data = iris)

##Barcharts: geom = "bar"

movies = data.frame(

director = c("spielberg", "spielberg", "spielberg", "jackson", "jackson"),

movie = c("jaws", "avatar", "schindler's list", "lotr", "king kong"),

minutes = c(124, 163, 195, 600, 187)

)

# Plot the number of movies each director has.

qplot(director, data = movies, geom = "bar", ylab = "# movies")

# By default, the height of each bar is simply a count.

# But we can also supply a different weight.

# Here the height of each bar is the total running time of the director's movies.

qplot(director, weight = minutes, data = movies, geom = "bar", ylab = "total length (min.)")
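
To see what the two bar charts above are drawing, the bar heights can be computed by hand in base R. This is only a cross-check sketch; it repeats the director and minutes columns so that it runs on its own.

```r
movies <- data.frame(
  director = c("spielberg", "spielberg", "spielberg", "jackson", "jackson"),
  minutes = c(124, 163, 195, 600, 187)
)

# Without a weight, each bar is the number of rows per director.
counts <- table(movies$director)
counts

# With weight = minutes, each bar is the sum of minutes per director.
totals <- tapply(movies$minutes, movies$director, sum)
totals
```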

##Line charts: geom = "line"

qplot(Sepal.Length, Petal.Length, data = iris, geom = "line", color = Species)

# Using a line geom doesn't really make sense here, but hey.

# `Orange` is another built-in data frame that describes the growth of orange trees.

qplot(age, circumference, data = Orange, geom = "line",

colour = Tree,

main = "How does orange tree circumference vary with age?")

# We can also plot both points and lines.

qplot(age, circumference, data = Orange, geom = c("point", "line"), colour = Tree)
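
For readers moving on from qplot(), which recent ggplot2 releases mark as deprecated, the last plot can be written in full ggplot() syntax. This is a sketch of one equivalent form: each geom becomes its own layer added with +.

```r
library(ggplot2)

# aes() holds the mappings that qplot() took as direct arguments.
p <- ggplot(Orange, aes(age, circumference, colour = Tree)) +
  geom_point() +   # layer 1: one point per measurement
  geom_line()      # layer 2: one line per tree (colour also defines the groups)

p
```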

Please go through my next blog here to see ggplot2 in detail.

***

Data Visualization Techniques using 'R'

Why do we need visualization techniques?

Because they tell the story hidden in an ever-increasing volume of data. Visualization is the art of converting numbers into knowledge.

Seven types of charts are used frequently in R:

1. Scatter Plot
2. Histogram
3. Bar Chart
4. Box Plot
5. Area Chart
6. Heat Map
7. Correlogram

Selecting the Right Chart Type

There are four basic presentation types

1. Comparison
2. Composition
3. Distribution
4. Relationship

Which one is best suited to your data depends on three questions:

1. How many variables do you want to show in single chart?
2. How many data points will you display for each variable?
3. Will you display values over a period of time, or among items or groups?

The figure below summarizes how to select the right chart type.

Let's go through these charts one by one using the Big Mart Dataset (find it here).

1. Scatter Plot

A scatter plot is used to see the relationship between two continuous variables.

Here is the R code for a simple scatter plot using ggplot() and geom_point():

library(ggplot2)
data = Big.Mart.Dataset...Sheet1
ggplot(data, aes(Item_Visibility, Item_MRP)) + geom_point() +
scale_x_continuous("Item Visibility", breaks = seq(0,0.35,0.05))+
scale_y_continuous("Item MRP", breaks = seq(0,270,by = 30))+
theme_bw()

Now we can show a third variable in the same chart. Let's add the categorical variable Item_Type, which gives the category of each item. Each category is drawn in a different color in the chart below.

## The R code with addition of 1 category:

ggplot(data, aes(Item_Visibility, Item_MRP)) + geom_point(aes(color = Item_Type)) +
scale_x_continuous("Item Visibility", breaks = seq(0,0.35,0.05))+
scale_y_continuous("Item MRP", breaks = seq(0,270,by = 30))+
theme_bw() + labs(title="Scatterplot")

Let's make this clearer by creating a separate plot for each Item_Type.

The R code for separate item_type:

ggplot(data, aes(Item_Visibility, Item_MRP)) + geom_point(aes(color = Item_Type)) +
scale_x_continuous("Item Visibility", breaks = seq(0,0.35,0.05))+
scale_y_continuous("Item MRP", breaks = seq(0,270,by = 30))+
theme_bw() + labs(title="Scatterplot") + facet_wrap( ~ Item_Type)

2. Histogram

Histograms are used to plot continuous variables. A histogram breaks the data into bins and shows the frequency distribution across these bins. We can always change the bin size and see its effect on the visualization.

From our Mart dataset, if we want to know the count of items by cost, we can plot a histogram of the continuous variable Item_MRP as shown below:

The R code for the above plot:

ggplot(data, aes(Item_MRP)) + geom_histogram(binwidth = 2)+
scale_x_continuous("Item MRP", breaks = seq(0,270,by = 30))+
scale_y_continuous("Count", breaks = seq(0,200,by = 20))+
labs(title = "Histogram")
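
The effect of bin size can be tried out without the Big Mart file. The sketch below uses the built-in faithful data set as a stand-in, which is an assumption made only so the code is self-contained; the two binwidth values are arbitrary.

```r
library(ggplot2)

# Narrow bins show fine detail but look noisy.
p_narrow <- ggplot(faithful, aes(waiting)) + geom_histogram(binwidth = 2)

# Wide bins smooth the shape but can hide structure.
p_wide <- ggplot(faithful, aes(waiting)) + geom_histogram(binwidth = 10)

# Each row of layer_data() is one bar; every observation lands in exactly one bin.
nrow(layer_data(p_narrow))
nrow(layer_data(p_wide))
sum(layer_data(p_narrow)$count)   # equals nrow(faithful)
```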

3. Bar Chart

A bar chart is used when we want to plot a categorical variable, or a combination of a categorical and a continuous variable.

From our Mart data, if we want to know the number of marts established in a particular year, a bar chart is the more suitable option.

The R code for the above graph is here:

ggplot(data, aes(Outlet_Establishment_Year)) + geom_bar(fill = "red")+
scale_x_continuous("Establishment Year", breaks = seq(1985,2010)) +
scale_y_continuous("Count", breaks = seq(0,1500,150)) +
coord_flip()+ labs(title = "Bar Chart") + theme_gray()

Vertical Bar Chart

As a variation, we can remove the coord_flip() call to get vertical bars. The code below plots Item_Weight by Item_Type with stat = "identity", so the height of each bar is the total Item_Weight for that Item_Type rather than a count.

The R code for the above chart is here:

ggplot(data, aes(Item_Type, Item_Weight)) + geom_bar(stat = "identity", fill = "darkblue") +
scale_x_discrete("Item Type")+
scale_y_continuous("Item Weight", breaks = seq(0,15000, by = 500))+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + labs(title = "Bar Chart")

Stacked Bar Chart

The R code for above graph is here:

ggplot(data, aes(Outlet_Location_Type, fill = Outlet_Type)) + geom_bar()+

labs(title = "Stacked Bar Chart", x = "Outlet Location Type", y = "Count of Outlets")

4. Box Plot

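Box plots are used to plot a combination of a continuous and a categorical variable. They are useful for visualizing the spread of the data and detecting outliers. A box plot shows five summary statistics: the minimum, the 25th percentile, the median, the 75th percentile and the maximum. In geom_boxplot(), the whiskers extend only to the most extreme values within 1.5 times the interquartile range of the box, and points beyond them are drawn individually.

The black points are outliers.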

The R code for above graph is here:

ggplot(data, aes(Outlet_Identifier, Item_Outlet_Sales)) + geom_boxplot(fill = "red")+
scale_y_continuous("Item Outlet Sales", breaks= seq(0,15000, by=500))+
labs(title = "Box Plot", x = "Outlet Identifier")
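
The numbers behind a box plot can be computed by hand. The sketch below uses a small made-up vector x (hypothetical values, not from the Big Mart data) and the default 1.5 × IQR rule to separate whiskers from outliers.

```r
# Hypothetical sample with one obviously extreme value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

# Quartiles define the box.
q <- unname(quantile(x, c(0.25, 0.5, 0.75)))
iqr <- q[3] - q[1]

# Fences: anything outside them is drawn as an individual outlier point.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

q          # 25th percentile, median, 75th percentile
whiskers   # ends of the whiskers
outliers   # points plotted separately
```

Here only the value 50 falls outside the fences, so the whiskers stop at 1 and 9.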

5. Area Chart

An area chart is used to show continuity across a variable or dataset. It is very similar to a line chart and is commonly used for time series plots. It can also be used to plot continuous variables and analyze the underlying trends.

From our dataset, when we want to analyze the trend of item outlet sales, an area chart can be plotted as shown below. It shows the number of records that fall in each range of sales.

Here is the R code for the above chart:

ggplot(data, aes(Item_Outlet_Sales)) + geom_area(stat = "bin", bins = 30, fill = "steelblue") + scale_x_continuous(breaks = seq(0,11000,1000))+ labs(title = "Area Chart", x = "Item Outlet Sales", y = "Count")

6. Heat Map

A heat map uses the intensity (density) of colors to display the relationship between two or more variables in a 2-D image. Two variables form the axes and a third is shown by the intensity of color.

From our dataset, if we want to know the cost of each item type in every outlet, we can plot a heat map as shown below using three variables from our Mart data: Item_MRP, Outlet_Identifier and Item_Type.

The dark portions indicate that Item MRP is close to 50. The brighter portions indicate that Item MRP is close to 250.

Here is the R code:

ggplot(data, aes(Outlet_Identifier, Item_Type))+

geom_raster(aes(fill = Item_MRP))+

labs(title ="Heat Map", x = "Outlet Identifier", y = "Item Type")+

scale_fill_continuous(name = "Item MRP")
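
When several rows share the same outlet and item type, one option is to reduce them to a single value per cell before plotting. The sketch below assumes a small made-up data frame (sales, with hypothetical outlet names and prices) and uses the mean as the summary; this is an illustrative choice, not the approach used for the chart above.

```r
library(ggplot2)

# Hypothetical data: several items per outlet / item-type combination.
sales <- data.frame(
  outlet = c("OUT1", "OUT1", "OUT1", "OUT2", "OUT2", "OUT2"),
  item_type = c("Dairy", "Dairy", "Snacks", "Dairy", "Snacks", "Snacks"),
  mrp = c(50, 70, 120, 60, 200, 240)
)

# One row per cell: the mean MRP for each outlet and item type.
cell_means <- aggregate(mrp ~ outlet + item_type, data = sales, FUN = mean)
cell_means

# geom_tile() draws one coloured rectangle per row of cell_means.
p <- ggplot(cell_means, aes(outlet, item_type)) +
  geom_tile(aes(fill = mrp)) +
  scale_fill_continuous(name = "Mean MRP")
```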

7. Correlogram

A correlogram is used to inspect the level of correlation among the variables in a dataset. The cells of the matrix can be shaded or colored to show the correlation value.

The darker the color, the higher the correlation between variables. Positive correlations are displayed in blue and negative correlations in red. Color intensity is proportional to the correlation value.

From our dataset, let's check the correlation between item cost, weight and visibility, along with outlet establishment year and outlet sales, in the plot below.

In our example, we can see that item cost and outlet sales are positively correlated, while item weight and its visibility are negatively correlated.

The R code for the correlogram is here:

install.packages("corrgram")
library(corrgram)

corrgram(data, order=NULL, panel=panel.shade, text.panel=panel.txt,
main="Correlogram")
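
The numbers that a correlogram shades can be inspected directly with base R's cor(). The sketch below uses four columns of the built-in mtcars data as a stand-in for the Mart variables, which is an assumption made so the code runs on its own.

```r
# Pick a few numeric columns; cor() needs numeric input.
vars <- mtcars[, c("mpg", "wt", "hp", "qsec")]

# Pairwise Pearson correlations, one cell per pair of variables.
cm <- cor(vars)
round(cm, 2)

# Sign gives the direction, magnitude gives the strength.
cm["mpg", "wt"]   # negative: heavier cars get fewer miles per gallon
cm["wt", "hp"]    # positive: heavier cars tend to have more horsepower
```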

These are the visualization techniques we commonly use in R.

***