Data Visualization Techniques using 'R'
Why do we need visualization techniques?
Because visualization tells the story hidden in ever-growing volumes of data. Converting numbers into knowledge is an art.
There are 7 types of charts that we use frequently in R:
1. Scatter Plot
2. Histogram
3. Bar Chart
4. Box Plot
5. Area Chart
6. Heat Map
7. Correlogram
Selecting the Right Chart Type
There are four basic presentation types:
1. Comparison
2. Composition
3. Distribution
4. Relationship
Which one best suits your data depends on the answers to these questions:
1. How many variables do you want to show in single chart?
2. How many data points will you display for each variable?
3. Will you display values over a period of time, or among items or groups?
The figure below explains how to select the right chart type.
Let's understand these charts one by one using the Big Mart Dataset (find it here).
1. Scatter Plot
A scatter plot is used to see the relationship between two continuous variables.
Here is the R code for a simple scatter plot using ggplot() and geom_point():
library(ggplot2)
data = Big.Mart.Dataset...Sheet1
ggplot(data, aes(Item_Visibility, Item_MRP)) + geom_point() +
scale_x_continuous("Item Visibility", breaks = seq(0,0.35,0.05))+
scale_y_continuous("Item MRP", breaks = seq(0,270,by = 30))+
theme_bw()
Now we can view a third variable in the same chart. Let's take a categorical variable (Item_Type), which gives the category of each item. Each category is drawn in a different color in the chart below.
The R code with the category added:
ggplot(data, aes(Item_Visibility, Item_MRP)) + geom_point(aes(color = Item_Type)) +
scale_x_continuous("Item Visibility", breaks = seq(0,0.35,0.05))+
scale_y_continuous("Item MRP", breaks = seq(0,270,by = 30))+
theme_bw() + labs(title="Scatterplot")
Let's make this clearer by creating a separate plot for each Item_Type.
The R code for a separate plot per Item_Type:
ggplot(data, aes(Item_Visibility, Item_MRP)) + geom_point(aes(color = Item_Type)) +
scale_x_continuous("Item Visibility", breaks = seq(0,0.35,0.05))+
scale_y_continuous("Item MRP", breaks = seq(0,270,by = 30))+
theme_bw() + labs(title="Scatterplot") + facet_wrap( ~ Item_Type)
2. Histogram
Histograms are used to plot continuous variables. A histogram breaks the data into bins and shows the frequency distribution of these bins. We can always change the bin size and see its effect on the visualization.
From our Mart dataset, if we want to know the count of items on the basis of their cost, we can plot a histogram of the continuous variable Item_MRP as shown below:
The R code for the above plot:
ggplot(data, aes(Item_MRP)) + geom_histogram(binwidth = 2)+
scale_x_continuous("Item MRP", breaks = seq(0,270,by = 30))+
scale_y_continuous("Count", breaks = seq(0,200,by = 20))+
labs(title = "Histogram")
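To see how the bin size changes the picture, the same histogram can be drawn with several bin widths. This is a minimal sketch, assuming a data frame with a numeric Item_MRP column. The simulated `demo` data frame is only a stand-in so the snippet runs on its own, and the helper name plot_mrp_histogram is hypothetical.

```r
library(ggplot2)

# Stand-in data so the snippet runs on its own (NOT the Big Mart data);
# replace `demo` with the real `data` frame.
set.seed(1)
demo <- data.frame(Item_MRP = runif(500, min = 30, max = 270))

# Hypothetical helper: histogram of Item_MRP for a given bin width.
plot_mrp_histogram <- function(df, binwidth) {
  ggplot(df, aes(Item_MRP)) +
    geom_histogram(binwidth = binwidth) +
    labs(title = paste("Histogram, binwidth =", binwidth),
         x = "Item MRP", y = "Count")
}

# Narrow bins show fine detail; wide bins smooth it out.
plots <- lapply(c(2, 10, 30), function(w) plot_mrp_histogram(demo, w))
for (p in plots) print(p)
```

A wider bin width gives fewer, taller bars, so it is worth trying a few values before settling on one.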
3. Bar Chart
A bar chart is used when we want to plot a categorical variable or a combination of a categorical and a continuous variable.
From our Mart data, if we want to know the number of marts established in a particular year, a bar chart is the more suitable option.
The R code for the above graph is here:
ggplot(data, aes(Outlet_Establishment_Year)) + geom_bar(fill = "red")+
scale_x_continuous("Establishment Year", breaks = seq(1985,2010)) +
scale_y_continuous("Count", breaks = seq(0,1500,150)) +
coord_flip()+ labs(title = "Bar Chart") + theme_gray()
Vertical Bar Chart
As a variation, we can remove coord_flip() to draw the bars vertically. The chart here plots Item_Weight for each Item_Type.
The R code for the above chart is here:
ggplot(data, aes(Item_Type, Item_Weight)) + geom_bar(stat = "identity", fill = "darkblue") +
scale_x_discrete("Item Type")+
scale_y_continuous("Item Weight", breaks = seq(0,15000, by = 500))+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + labs(title = "Bar Chart")
Stacked Bar Chart
The R code for above graph is here:
ggplot(data, aes(Outlet_Location_Type, fill = Outlet_Type)) + geom_bar()+
labs(title = "Stacked Bar Chart", x = "Outlet Location Type", y = "Count of Outlets")
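The height of each stacked segment is the number of rows for one combination of location type and outlet type. The sketch below cross-tabulates those counts with table() and then draws a proportional variant with position = "fill". The `demo` rows and their category labels are stand-ins chosen for illustration, not values taken from the Big Mart data.

```r
library(ggplot2)

# Stand-in rows (NOT the Big Mart data); column names follow the article.
demo <- data.frame(
  Outlet_Location_Type = c("Tier 1", "Tier 1", "Tier 2",
                           "Tier 3", "Tier 3", "Tier 3"),
  Outlet_Type = c("Grocery Store", "Supermarket", "Supermarket",
                  "Grocery Store", "Supermarket", "Supermarket")
)

# Counts behind the stacked bars: one cell per (location, outlet type) pair.
counts <- table(demo$Outlet_Location_Type, demo$Outlet_Type)
print(counts)

# position = "fill" rescales every bar to 1 so proportions can be compared.
p <- ggplot(demo, aes(Outlet_Location_Type, fill = Outlet_Type)) +
  geom_bar(position = "fill") +
  labs(title = "Stacked Bar Chart (proportions)",
       x = "Outlet Location Type", y = "Share of Outlets")
print(p)
```

The default stacking is better for comparing totals, while the "fill" variant is better for comparing the mix within each bar.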
4. Box Plot
Box plots are used to plot a combination of a continuous and a categorical variable. This plot is useful for visualizing the spread of the data and detecting outliers. It shows a five-number summary: the minimum, the 25th percentile, the median, the 75th percentile and the maximum. In geom_boxplot() the whiskers reach only as far as 1.5 times the interquartile range beyond the box, and anything farther out is drawn as a separate point.
The black points are outliers.
The R code for the above graph is here:
ggplot(data, aes(Outlet_Identifier, Item_Outlet_Sales)) + geom_boxplot(fill = "red")+
scale_y_continuous("Item Outlet Sales", breaks= seq(0,15000, by=500))+
labs(title = "Box Plot", x = "Outlet Identifier")
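The numbers behind a box plot can also be computed directly. This is a small sketch using base R's fivenum() and boxplot.stats(). The `sales` vector is a made-up stand-in, not Big Mart data. Note that geom_boxplot() computes its hinges from quantiles, so its box edges can differ slightly from the base R hinges shown here.

```r
# Stand-in sales values (NOT the Big Mart data) with one extreme value.
sales <- c(100, 200, 300, 400, 500, 600, 700, 800, 5000)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum.
summary5 <- fivenum(sales)
print(summary5)

# Values more than 1.5 * IQR beyond the hinges are flagged as outliers.
outliers <- boxplot.stats(sales)$out
print(outliers)
```

Here the hinges are 300 and 700, so the interquartile range is 400 and any value above 700 + 1.5 * 400 = 1300 is flagged. Only 5000 qualifies.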
5. Area Chart
An area chart is used to show continuity across a variable or dataset. It is very similar to a line chart and is commonly used for time series plots. It can also be used to plot continuous variables and analyze the underlying trends.
From our dataset, when we want to analyze the trend of item outlet sales, an area chart can be plotted as shown below. It shows the count of outlets on the basis of sales.
Here is the R code for the above chart:
ggplot(data, aes(Item_Outlet_Sales)) + geom_area(stat = "bin", bins = 30, fill = "steelblue") +
scale_x_continuous(breaks = seq(0,11000,1000))+
labs(title = "Area Chart", x = "Item Outlet Sales", y = "Count")
6. Heat Map
A heat map uses the intensity (density) of colors to display relationships between two or more variables in a 2-D image. Two variables form the axes and a third is shown by the intensity of color.
From our dataset, if we want to know the cost of each item in every outlet, we can plot a heat map as shown below using three variables from our Mart data: Item_MRP, Outlet_Identifier and Item_Type.
The dark portion indicates that Item MRP is close to 50. The brighter portion indicates that Item MRP is close to 250.
Here is the R code:
ggplot(data, aes(Outlet_Identifier, Item_Type))+
geom_raster(aes(fill = Item_MRP))+
labs(title ="Heat Map", x = "Outlet Identifier", y = "Item Type")+
scale_fill_continuous(name = "Item MRP")
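Each combination of outlet and item type usually contains many items, so a single tile has to stand for several MRP values. One possible refinement, sketched below, is to average Item_MRP per cell before plotting. Using the mean is an assumption made here for illustration, not the article's method, and the `demo` rows are stand-ins rather than Big Mart data.

```r
library(ggplot2)

# Stand-in rows (NOT the Big Mart data); column names follow the article.
demo <- data.frame(
  Outlet_Identifier = c("OUT1", "OUT1", "OUT1", "OUT2", "OUT2"),
  Item_Type = c("Dairy", "Dairy", "Snacks", "Dairy", "Snacks"),
  Item_MRP = c(50, 150, 200, 120, 80)
)

# Average MRP per (outlet, item type) cell; the mean is an assumed summary.
cell_means <- aggregate(Item_MRP ~ Outlet_Identifier + Item_Type,
                        data = demo, FUN = mean)

# One tile per cell, colored by the cell's mean MRP.
p <- ggplot(cell_means, aes(Outlet_Identifier, Item_Type)) +
  geom_tile(aes(fill = Item_MRP)) +
  labs(title = "Heat Map (mean MRP per cell)",
       x = "Outlet Identifier", y = "Item Type") +
  scale_fill_continuous(name = "Mean Item MRP")
print(p)
```

With one row per cell, the color of each tile no longer depends on the order of the rows in the data.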
7. Correlogram
A correlogram is used to test the level of correlation among the variables available in the dataset. The cells of the matrix can be shaded or colored to show the correlation value.
The darker the color, the higher the correlation between variables. Positive correlation is displayed in blue and negative correlation in red. Color intensity is proportional to the correlation value.
From our dataset, let's check the correlation between item cost, weight and visibility, along with outlet establishment year and outlet sales, in the plot below.
In our example, we can see that item cost and outlet sales are positively correlated, while item weight and its visibility are negatively correlated.
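The code for this plot is not shown here. One possible sketch, not necessarily how the original figure was produced, builds the correlation matrix with cor() and draws it with geom_tile(). The `demo` data frame is simulated purely so the snippet runs on its own, and its positive link between Item_MRP and Item_Outlet_Sales is built in by construction, so it says nothing about the real data.

```r
library(ggplot2)

# Simulated stand-in columns (NOT the Big Mart data); names follow the article.
# The MRP-sales relationship below is built in on purpose for illustration.
set.seed(42)
n <- 200
mrp <- runif(n, 30, 270)
demo <- data.frame(
  Item_MRP = mrp,
  Item_Weight = runif(n, 5, 20),
  Item_Visibility = runif(n, 0, 0.3),
  Item_Outlet_Sales = 15 * mrp + rnorm(n, sd = 500)
)

# Pairwise Pearson correlations, reshaped to one row per matrix cell.
cor_mat <- cor(demo, use = "pairwise.complete.obs")
cor_long <- as.data.frame(as.table(cor_mat))
names(cor_long) <- c("Var1", "Var2", "Correlation")

# Blue for positive, red for negative, stronger color for larger |r|.
p <- ggplot(cor_long, aes(Var1, Var2, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "red", mid = "white", high = "blue",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(title = "Correlogram", x = NULL, y = NULL)
print(p)
```

To use it on the real data, pass only the numeric columns to cor(), for example the item cost, weight, visibility, establishment year and sales columns named above.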











