
Advanced R - K-Means Method of Clustering

This example shows how to use the within-cluster sum of squares (WSS) to choose an appropriate number of clusters, k, and then uses R to perform a k-means analysis. The task is to group 620 high school seniors based on their grades in three subject areas: English, mathematics, and science. Each grade is an average over the student's high school career and takes a value from 0 to 100.

 
# Load the necessary R libraries
> library(plyr)
> library(ggplot2)
> library(cluster)
> library(lattice)
> library(graphics)
> library(grid)
> library(gridExtra)
> library(cowplot)
 
# Import the CSV file containing the grades
> grades<-read.csv("grades.csv", header=TRUE, sep=",")
> grades<-as.data.frame(grades)
 
# Let's take a look at the structure of the dataset
> str(grades)
 
R output:

'data.frame': 620 obs. of 4 variables:
$ Student: int 1 2 3 4 5 6 7 8 9 10 ...
$ English: int 99 99 98 95 95 96 98 95 98 99 ...
$ Math : int 96 96 97 100 96 97 96 98 96 99 ...
$ Science: int 97 97 97 95 96 96 97 98 96 95 ...
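
If grades.csv is not at hand, a synthetic data frame with the same four columns can stand in for it. The sketch below is only an assumption-laden substitute: the group sizes and means are loosely borrowed from the cluster summary printed later in this example, the spread (sd = 5) and the seed are arbitrary, and the helper sim_subject is hypothetical. Its values are simulated, so the outputs shown in this post will not be reproduced exactly.

```r
# Hypothetical stand-in for grades.csv: simulated values, NOT the author's data.
set.seed(42)  # arbitrary seed, only for reproducibility

# Assumed group sizes and group means (English, Math, Science)
n <- c(158, 242, 220)
group_means <- rbind(c(97, 93, 95),
                     c(86, 80, 82),
                     c(73, 64, 66))

# Draw one subject's grades for all three groups, then clamp to 0..100
sim_subject <- function(j) {
  raw <- unlist(lapply(1:3, function(g) {
    rnorm(n[g], mean = group_means[g, j], sd = 5)
  }))
  as.integer(pmin(100, pmax(0, round(raw))))
}

grades <- data.frame(
  Student = seq_len(sum(n)),
  English = sim_subject(1),
  Math    = sim_subject(2),
  Science = sim_subject(3)
)

str(grades)
```

The resulting data frame has the same shape as the one described above (620 rows, integer grades), so the remaining steps can be run against it unchanged.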
 
# Exclude the first column, Student (the student ID), from the clustering analysis
> grades_km<-as.matrix(grades[,c("Student", "English", "Math", "Science")])
> grades_km.process<-grades_km[,c(2:4)]
 
# Let's take a look at the first ten rows of the processed dataset grades_km.process, which we will use for our clustering analysis
> grades_km.process[1:10,]
 
R output:
English Math Science
[1,] 99 96 97
[2,] 99 96 97
[3,] 98 97 97
[4,] 95 100 95
[5,] 95 96 96
[6,] 96 97 96
[7,] 98 96 97
[8,] 95 98 98
[9,] 98 96 96
[10,] 99 99 95
 
# Use WSS (the within-cluster sum of squares) to determine the value of k.
# We will try k = 1, 2, ..., 10.
# For each k, the option nstart = 30 specifies that the k-means algorithm is run 30 times, each time starting from k randomly chosen initial centroids; the run with the smallest WSS is kept.
# The resulting WSS values are stored in the vector wss.
> wss<-numeric(10)
> for (k in 1:10) wss[k]<-sum(kmeans(grades_km.process, centers=k, nstart=30)$withinss)
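
To make the quantity in the loop above concrete: for a given clustering, WSS is the sum, over all points, of the squared Euclidean distance from each point to the centroid of its assigned cluster. The following is a minimal sketch that recomputes it by hand and compares it with what kmeans() reports. The data are simulated (an assumption, not the grades), and the names x, sq_dist and wss_manual are only illustrative.

```r
set.seed(1)  # arbitrary seed for a reproducible illustration

# Simulated stand-in data (not the grades): two loose groups in 3 dimensions
x <- rbind(matrix(rnorm(150, mean = 60, sd = 5), ncol = 3),
           matrix(rnorm(150, mean = 90, sd = 5), ncol = 3))

km <- kmeans(x, centers = 2, nstart = 30)

# Squared Euclidean distance from every point to the centroid of its cluster
sq_dist <- rowSums((x - km$centers[km$cluster, ])^2)

# Sum within each cluster, then over all clusters
wss_by_cluster <- tapply(sq_dist, km$cluster, sum)
wss_manual <- sum(wss_by_cluster)

print(wss_by_cluster)   # hand-computed, one value per cluster
print(km$withinss)      # the same values as reported by kmeans()
print(c(manual = wss_manual, kmeans = km$tot.withinss))
```

Since tot.withinss is the sum of withinss, `kmeans(...)$tot.withinss` could be used in the loop in place of `sum(kmeans(...)$withinss)`.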
 
# Plot each wss against the respective number of centroids
> plot(1:10, wss, type="b", xlab="Number of Clusters", ylab="Within Sum of Squares")
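
The elbow in a WSS plot is usually judged by eye. As a rough complement, the drop in WSS gained by each additional cluster can be tabulated. The sketch below does this on simulated data with three groups (an assumption, not the grades). The columns wss_drop and pct_of_total are an informal heuristic introduced here for illustration, not a formal selection rule.

```r
set.seed(7)  # arbitrary seed for a reproducible illustration

# Simulated stand-in data with three well-separated groups (not the grades)
x <- rbind(matrix(rnorm(300, mean = 65, sd = 4), ncol = 3),
           matrix(rnorm(300, mean = 80, sd = 4), ncol = 3),
           matrix(rnorm(300, mean = 95, sd = 4), ncol = 3))

# Same loop as in the text, using tot.withinss (the sum of withinss)
wss <- numeric(10)
for (k in 1:10) wss[k] <- kmeans(x, centers = k, nstart = 30)$tot.withinss

# Reduction in WSS obtained by going from k - 1 to k clusters (NA for k = 1)
wss_drop <- c(NA, -diff(wss))

elbow <- data.frame(k            = 1:10,
                    wss          = round(wss, 1),
                    wss_drop     = round(wss_drop, 1),
                    pct_of_total = round(100 * wss_drop / wss[1], 1))
print(elbow)
```

A value of k after which wss_drop becomes small relative to the earlier drops corresponds to the bend one would look for in the plot.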
 
# The plot (which, unfortunately, cannot be pasted from R onto Wyzant) suggests that we use k = 3
# Let's examine the k-means algorithm results at k = 3
> km.3<-kmeans(grades_km.process, centers=3, nstart=30)
> km.3
 
R output:

K-means clustering with 3 clusters of sizes 158, 242, 220

Cluster means:
English Math Science
1 97.18987 93.34177 94.82278
2 85.85124 79.79339 81.52479
3 73.21364 64.41818 65.73182

Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[61] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[121] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 1 2
[181] 2 1 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[241] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[301] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[361] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 2 3 2 3 2 2 2 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[421] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3
[481] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[541] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[601] 2 2 3 3 2 2 2 2 1 1 2 2 2 3 3 2 3 2 2 2

Within cluster sum of squares by cluster:
[1] 6778.886 22748.665 34413.664
(between_SS / total_SS = 76.8 %)

Available components:

[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter"
[9] "ifault" 
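
The components listed at the end of the printed summary can also be read directly from the fitted object. The sketch below, run on simulated data (an assumption, not the grades), shows how the "between_SS / total_SS" percentage is obtained and checks that the total sum of squares splits into a within-cluster and a between-cluster part. It also adds up the three within-cluster values printed above, which gives the total WSS for that particular k = 3 run.

```r
set.seed(3)  # arbitrary seed for a reproducible illustration

# Simulated stand-in data with three groups (not the grades)
x <- rbind(matrix(rnorm(300, mean = 65, sd = 4), ncol = 3),
           matrix(rnorm(300, mean = 80, sd = 4), ncol = 3),
           matrix(rnorm(300, mean = 95, sd = 4), ncol = 3))

km <- kmeans(x, centers = 3, nstart = 30)

print(km$size)            # number of points in each cluster
print(km$centers)         # one row of coordinates per centroid
print(table(km$cluster))  # the same counts, derived from the assignment vector

# Share of the total variation accounted for by the clustering;
# this is the "between_SS / total_SS" figure in the printed summary
pct_between <- 100 * km$betweenss / km$totss
print(round(pct_between, 1))

# The decomposition behind that ratio: totss = tot.withinss + betweenss
print(all.equal(km$totss, km$tot.withinss + km$betweenss))

# Total WSS of the k = 3 run shown above (values copied from that output)
wss_k3_printed <- sum(c(6778.886, 22748.665, 34413.664))
print(wss_k3_printed)
```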
 
# Let's visualize the data and assigned clusters.
# First, prepare the data and clustering results for plotting
 
> grades.plot<-as.data.frame(grades[,c(2:4)])
> grades.plot$cluster<-factor(km.3$cluster)
> km.centers<-as.data.frame(km.3$centers)
 
# Let's plot the data and assigned clusters.
> plot.1<-ggplot(data=grades.plot, aes(x=English, y=Math, color=cluster))+geom_point()+theme(legend.position="right")+geom_point(data=km.centers, aes(x=English, y=Math, color=as.factor(c(1,2,3))), size=10, alpha=.3, show.legend=FALSE)
> plot.2<-ggplot(data=grades.plot, aes(x=English, y=Science, color=cluster))+geom_point()+theme(legend.position="right")+geom_point(data=km.centers, aes(x=English, y=Science, color=as.factor(c(1,2,3))), size=10, alpha=.3, show.legend=FALSE)
> plot.3<-ggplot(data=grades.plot, aes(x=Math, y=Science, color=cluster))+geom_point()+theme(legend.position="right")+geom_point(data=km.centers, aes(x=Math, y=Science, color=as.factor(c(1,2,3))), size=10, alpha=.3, show.legend=FALSE)
 
# Combine three plots on one page
> plot_grid(plot.1, plot.2, plot.3, labels=c("A", "B", "C"), ncol=1, nrow=3)
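
For a quick look without ggplot2 or cowplot, base R's pairs() draws every pairing of the three subjects in a single figure. This is only a sketch of an alternative, not the approach used above. It runs on simulated data (an assumption, not the grades) whose column names mirror the grade columns, and it does not mark the centroids.

```r
set.seed(5)  # arbitrary seed for a reproducible illustration

# Simulated stand-in data with three groups (not the grades)
x <- rbind(matrix(rnorm(300, mean = 65, sd = 4), ncol = 3),
           matrix(rnorm(300, mean = 80, sd = 4), ncol = 3),
           matrix(rnorm(300, mean = 95, sd = 4), ncol = 3))
colnames(x) <- c("English", "Math", "Science")

km <- kmeans(x, centers = 3, nstart = 30)

# Scatterplot matrix of all column pairs, one colour per assigned cluster
pairs(x, col = km$cluster, pch = 19,
      main = "k-means clusters (k = 3)")
```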
 
 
 

Yuning L.
