This data analysis project uses a dataset that contains bone measures and living habits of birds provided by Dr. D. Liu of Beijing Museum of Natural History. The data contains ecological categories of the birds and measures of length and diameter of five bones from the bird’s skeleton. The analysis explores the data and builds models to classify the bird’s ecological classes based on its bone information.
The data is downloaded from the Kaggle website. The dataset contains ecological classification of birds and the measure (length and diameter) of five pieces of bones on the wings and legs. The ecological classes include:
The bones measured are indicated on the following graph:
library(knitr)
library(ggplot2)
library(reshape2)
library(gridExtra)
library(grid)
library(scatterplot3d)
library(e1071)
library(caret)
Let’s first check the size and content of the data.
# load the data
bird.data <- read.csv("bird.csv")
# check the dimension
n.row <- nrow(bird.data)
n.col <- ncol(bird.data)
print (paste0("# of predictors: ", n.col))
## [1] "# of predictors: 12"
print (paste0("# of observations: ", n.row))
## [1] "# of observations: 420"
# show how the data look like
kable(bird.data[1:5, ])
id | huml | humw | ulnal | ulnaw | feml | femw | tibl | tibw | tarl | tarw | type |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 80.78 | 6.68 | 72.01 | 4.88 | 41.81 | 3.70 | 5.50 | 4.03 | 38.70 | 3.84 | SW |
1 | 88.91 | 6.63 | 80.53 | 5.59 | 47.04 | 4.30 | 80.22 | 4.51 | 41.50 | 4.01 | SW |
2 | 79.97 | 6.37 | 69.26 | 5.28 | 43.07 | 3.90 | 75.35 | 4.04 | 38.31 | 3.34 | SW |
3 | 77.65 | 5.70 | 65.76 | 4.77 | 40.04 | 3.52 | 69.17 | 3.40 | 35.78 | 3.41 | SW |
4 | 62.80 | 4.84 | 52.09 | 3.73 | 33.95 | 2.72 | 56.27 | 2.96 | 31.88 | 3.13 | SW |
There are 7 observations with missing values among the total 420 observations. Due to the samll proportion of missing values, We can just remove those 7 data points from our analysis.
# remove samples with missing values (7 in 420)
data.complete <- bird.data[rowSums(is.na(bird.data)) == 0, ]
For classification problems, it is important to check the proportion of each class. Severe imbalance of the classes can cause difficulties of correctly classifying the classes with fewer observations.
# compute count for each class
data.count <- melt(table(data.complete[, n.col]))
# class label
label <- c("Scansorial", "Raptors", "Singing", "Swimming", "Terrestrial", "Wading")
# commpute percentage
percent <- rev(data.count$value/sum(data.count$value))
# compute label position
position <- cumsum(rev(data.count$value)) - 0.5*rev(data.count$value)
# check the number of birds in each class
ggplot(data=data.count, aes(x=factor(1), y=value, fill = Var1)) +
geom_bar(width = 1, stat = "identity")+
coord_polar(theta = "y") +
geom_text(aes(label = sprintf("%1.2f%%", 100*percent), y = position)) +
scale_fill_brewer(palette = "Set2", name = c("Living Habit"),
label = label) +
ggtitle("Figure 1. Proportion of Birds with Different Living Habits") +
theme(plot.title = element_text(hjust = 0.5, size = 14))
More than half of the birds are classified as singing or swimming birds. Terrestrial birds are in the smallest group.
As the sizes of bones in the same bird are natually correlated with each other due to factors such as the overall size of the bird, the bone measures are highly likely to be correlated. However, correlation between predictors leads to multicollinearity problem for models. Let’s first confirm whether there is strong correlation between the predictors.
#correlation between predictors
cor.matrix <- cor(data.complete[, 2:(n.col-1)])
# plot heatmap
melted_cormat <- melt(cor.matrix)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile()+
scale_fill_continuous(low = "azure", high = "steelblue",
name = "Correlation") +
ggtitle("Figure 3. Correlation Between the Bone Measures") +
theme(plot.title = element_text(hjust = 0.5, size = 14))
The above heatmap clearly demonstrates strong correlation among the bone measures. Some of the bones also seem to correlate more than others, such as humerus and ulna (both are on the wings).
In order to avoid the multicollinearity problem, we can use principal component analysis to decompose the predictors into orthogonal variables.
# perform PCA
pca <- prcomp(data.complete[2:(n.col-1)], scale=T)
# check the variance explained by each PC
# total variance of the data
total.var <- sum(pca$sdev^2)
# compute %variance explained by each PC
percent.var <- pca$sdev^2/total.var
# cumulative variance explained
cum.percent <- c()
for (i in 1:(n.col-2)){
cum.percent <- c(cum.percent, sum(percent.var[1:i]))
}
# plot the accumulative variance against # of PCs
df <- data.frame(percent = cum.percent,
num.PC = seq(1, n.col-2, 1))
ggplot(df, aes(x = num.PC, y = percent))+
geom_point()+
geom_line()+
ggtitle("Figure 4. Cumulative Variance Explained by Principal Components")+
theme(plot.title = element_text(hjust = 0.5, size = 13))+
xlab("Number of Principal Components")+
ylab("Percentage of Variance Explained")
Consistent with high correlation among bone measures, the first principal component explains 85% of the variance. The first three principal components explain over 95% of the variance.
score <- data.frame(pca$rotation)
score$names <- rownames(score)
score.df <- melt(score, value.name = "score")
ggplot(score.df, aes(x = names, y = score))+
geom_bar(stat = "identity")+
coord_flip()+
facet_wrap(~ variable)+
ggtitle("Figure 5. Bone Measures Represented by Each Principal Component")+
theme(plot.title = element_text(hjust = 0.5, size = 14))+
ylab("Bone measures")
The first principal component captures the common trend of the bone measures. The later principal components start to pick up smaller difference among the measures.
# combine the PCs with class label into a df
pca.df <- data.frame(pca$x)
pca.df$type <- data.complete$type
# plot and color by sample type
p1<- ggplot(pca.df, aes(x = PC1, y = PC2, color = type)) +
geom_point() +
scale_color_discrete(name = "Ecological Class", labels = label)+
ggtitle("6a. Ecological Classes Shown by PC1 and PC2")+
theme(plot.title = element_text(size = 13))
p2<- ggplot(pca.df, aes(x = PC3, y = PC4, color = type)) +
geom_point() +
scale_color_discrete(name = "Ecological Class", labels = label)+
ggtitle("6b. Ecological Classes Shown by PC3 and PC4")+
theme(plot.title = element_text(size = 13))
grid.arrange(p1, p2, ncol=2,
top = textGrob("Figure 6. Ecological Classes Shown by Principal Components",
gp=gpar(fontsize=16, fontface = 'bold')))
We can see from the above plots that the ecologial classes are hard to be separated by the principal components. Consistent with the previous boxplots of the bone measures, three types of birds (raptors, swimming and wading) have higher variance compared to the other three types. It makes sense for the birds to share similar bone characteristics even if they have very different living habits, especially when some of the living habits may co-exist in a brid. For example, the singing birds may share the bone measures of scansorial birds if they also climb trees.
Based on the above plots, it may be difficult to classify the all six living habits according to the bone measures. We will try it in Section 3.2. In addition, we can also use some of the living habits that do not co-exist in birds to perform the classification. For example, the water birds and non-water birds can be a distinct and clear classification. In order to perform the classification, the ecological classes are further grouped as follows:
The water birds are characterized by the higher bone measure variance compared to other birds. The two groups are largely separable on the following plot.
# assign new types to the data
pca.df$new.type <- "Other"
pca.df$new.type[pca.df$type == "SW" | pca.df$type == "W"] <- "Water"
pca.df$new.type <- as.factor(pca.df$new.type)
# 3-D plot to visualize the two groups
colors = c("red", "blue")
colors <- colors[as.numeric(pca.df$new.type)]
plot <- scatterplot3d(pca.df[c(1, 3:4)],
color = colors, angle = 60, pch = 1,
main = "Figure 7. Water v.s Other Birds by the PC 2-4")
legend(plot$xyz.convert(-15, 3, 2), legend = levels(pca.df$new.type),
col = c("red", "blue"), pch =1)
Although there is still overlap between the two groups, we can try classification models on the data based on these two ecological distinct groups.
First, let’s split training and testing sets, and apply PCA transformation.
# add the new type to original data
data.complete$new.type <- pca.df$new.type
# number of samples
n.sample <- nrow(data.complete)
# separate training and testing sets
set.seed(322)
index <- sample(seq(1, n.sample, 1), n.sample)
# use 30% as testing set
n.split <- floor(n.sample*0.7)
train.split <- data.complete[index[1:n.split], ]
test.split <- data.complete[index[(n.split+1):n.sample], ]
# apply PCA to the training data
pca.2 <- prcomp(train.split[,2:(n.col-1)], scale = TRUE)
train <- data.frame(pca.2$x)
train[c("type", "new.type")] <- train.split[c("type", "new.type")]
# project the test data to the same PCA scale
test <- data.frame(predict(pca.2, newdata = test.split[, 2:(n.col-1)]))
test[c("type", "new.type")] <- test.split[c("type", "new.type")]
Let’s first check the balance between these two groups:
table(pca.df$new.type)
##
## Other Water
## 233 180
The numbers of birds in the two classes are comparable. We can directly apply classification models to the data without necessarily adjusting the class weight.
Let’s try logistic regression and support vector machines (SVM) for this classification task. Both models are known to achieve good classification performance, but with advantages in different cases.
In order to decide which principal components to include into the logistic regression model, a predictor selection procedure is performed. First, models with one PC as the predictor are compared to the intercept model, and the predictor with the lowest deviance (most significant p-value from likelihood ratio test) is added to the model. Then a second predictor is selected in the same way. The procedure is repeated until no predictor yields any statistical significant improvement of the model.
Sample code for the first step:
# function to store model results
store <- function(model, variable, null.deviance, null.df){
# compute p-value from Chi-square distribution
p.val <- 1- pchisq(null.deviance - deviance(model),
length(model$coefficients) - null.df)
# organize the result into a dataframe
result <- data.frame(model = variable,
deviance = deviance(model),
num.coef = length(model$coefficients),
p.value = p.val)
return (result)
}
# fit an intercept model
model.0 <- glm(new.type ~ 1, data = train, family = binomial)
result <- store(model.0, "1", deviance(model.0), 1)
# compare to model with 1 predictor
variables <- names(pca.df)[1:10]
for (name in variables){
formula = as.formula(paste0("new.type ~", name))
model <- glm(formula, data = train, family = binomial)
result <- rbind(result, store(model, paste0("1+", name), deviance(model.0), 1))
}
# check the comparison
kable(result)
model | deviance | num.coef | p.value |
---|---|---|---|
1 | 393.6036 | 1 | 1.0000000 |
1+PC1 | 344.1887 | 2 | 0.0000000 |
1+PC2 | 372.7438 | 2 | 0.0000049 |
1+PC3 | 363.6009 | 2 | 0.0000000 |
1+PC4 | 318.9894 | 2 | 0.0000000 |
1+PC5 | 392.8406 | 2 | 0.3824042 |
1+PC6 | 388.1939 | 2 | 0.0200261 |
1+PC7 | 391.5699 | 2 | 0.1538545 |
1+PC8 | 393.5016 | 2 | 0.7495383 |
1+PC9 | 391.0191 | 2 | 0.1079148 |
1+PC10 | 389.7865 | 2 | 0.0507344 |
PC3 is added into the model.
The rest of the procedure is not presented (check the rmarkdown file for details). Here is the final model and all the previsou steps of the model selection.
model | deviance | num.coef | p.value | |
---|---|---|---|---|
1 | 1 | 393.6036 | 1 | 1 |
4 | 1+PC3 | 363.6009 | 2 | 0 |
11 | 1+PC3+PC1 | 299.5575 | 3 | 0 |
13 | 1+PC3+PC1+PC4 | 155.4380 | 4 | 0 |
The final model includes the principal component #1, 3 and 4.
# check performance of the final model
model.log <- glm(new.type ~ PC1+PC3+PC4, data = train,
family = binomial)
# predict on test set
pred <- predict(model.log, newdata = test)
# convert to binary result
y.pred <- rep("Other", nrow(test))
y.pred[pred > 0.5] <- "Water"
# classification accuracy
print (paste0("Classification accuracy of the logistic model is ",
round(mean(y.pred == test$new.type), 4)))
## [1] "Classification accuracy of the logistic model is 0.9274"
# check confusion matrix
table <- confusionMatrix(y.pred, test$new.type)
table$table
## Reference
## Prediction Other Water
## Other 66 9
## Water 0 49
The logsitic model has a high classificatio accuracy of over 90%. Depending on the need, we can further adjust the class weight to achieve better classification on centain groups.
Then let’s check if support vector machines can result in similar classification accuracy using the same set of predictors.
Linear Kernel
# perform SVM with linear kernel
# (the best cost is around the default 1)
svm.linear <- svm(new.type ~ PC1+PC3+PC4,
data = train, kernel = "linear")
# predict on test set
pred <- predict(svm.linear, newdata = test)
# classification accuracy
print (paste0("Classification accuracy of linear SVM is ",
round(mean(pred == test$new.type), 4)))
## [1] "Classification accuracy of linear SVM is 0.9274"
The performance of linear SVM is comparable to the logistic regression. The high accuracy suggests that the decision boundary is linear in this case. However, let’s try a radial kernel to see if the accuracy can be even better.
Radial Basis Kernel
set.seed(322)
# perform SVM with RBF kernel using default 10-fold CV
svm.rbf <- tune(svm, new.type ~ PC1+PC3+PC4,
data = train, kernel = "radial",
ranges = list(gamma = seq(0.01, 0.2, 0.02),
cost = seq(1, 200, 20)))
svm.rbf$best.parameters
## gamma cost
## 20 0.19 21
plot(svm.rbf, main = "Figure 8. Visualize the Grid Search Result")
# predict on test set
pred <- predict(svm.rbf$best.model, newdata = test)
# classification accuracy
print (paste0("Classification accuracy of the radial SVM is ",
round(mean(pred == test$new.type), 4)))
## [1] "Classification accuracy of the radial SVM is 0.9516"
With tuned Radial SVM, the classification accuracy is further improved to over 95%.
After the success in the classification of water vs non-water birds, let’s also try the original classification tasks of the six ecological classes using SVM. Although the groups are not visually separable from the above plots, the performance of SVM is surprisingly good.
Linear Kernel:
# linear SVM on the multi-class task
log.2 <- tune(svm, type ~ ., data = train, kernel = "linear",
ranges = list(cost = 10^seq(-3, 3, 1)))
# predict on test set
pred <- predict(log.2$best.model, newdata = test)
# classification accuracy
print (paste0("Multi-class classification accuracy of the Linear SVM is ",
round(mean(pred == test$type), 4)))
## [1] "Multi-class classification accuracy of the Linear SVM is 0.9032"
# confusion matrix
table.2 <- confusionMatrix(pred, test$type)
table.2$table
## Reference
## Prediction P R SO SW T W
## P 8 0 1 0 0 0
## R 0 13 0 0 0 0
## SO 2 0 34 0 0 0
## SW 0 0 0 30 0 4
## T 1 0 0 0 7 0
## W 0 0 0 4 0 20
The classification accuracy is surprisingly high, and the classification of most classes are good. Interestinly, as mentioned above that swimming birds (SW) and walding birds (W) are similar, they tend to be misclassified to each other in this multi-class classification task.
RBF Kernel:
# RBF SVM on the multi-class task
rbf.2 <- tune(svm, type ~ ., data = train, kernel = "radial",
ranges = list(cost = seq(1, 40, 4),
gamma = c(seq(0.001, 0.009, 0.002),
seq(0.01, 1, 0.02))))
# predict on test set
pred <- predict(rbf.2$best.model, newdata = test)
# classification accuracy
print (paste0("Multi-class classification accuracy of the RBF SVM is ",
round(mean(pred == test$type), 4)))
## [1] "Multi-class classification accuracy of the RBF SVM is 0.9516"
# confusion matrix
table.3 <- confusionMatrix(pred, test$type)
table.3$table
## Reference
## Prediction P R SO SW T W
## P 9 0 0 0 0 0
## R 0 13 1 0 0 0
## SO 1 0 34 0 0 0
## SW 0 0 0 34 0 3
## T 1 0 0 0 7 0
## W 0 0 0 0 0 21
The RBF kernel improves the classification accuracy to over 95%. The improvement is mainly on better separation of the swimming and walding birds, with some improvement on other classes except terretorial birds (which only has a few samples in the test set.)
After performing classification using supervised learning methods, let’s further explore connections between the bone measures and types of living habits using clustering methods. Due to the unsupervised nature of clustering, it may reveal natual connection between the features/classes.
First, in order to understand how the bone measures relate, the bone measures are clustered by hierachical clustering method. The distance between bone measures are defined according to Euclidean distance, and feasures close with each other are clustered together. Based on the order clustering is performed, hierachical clustering uses “complete”, “single” and “average” linkage methods. The following result uses default complete linkage. Interestingly, these three linkage methods produce exactly the same clustering pattern.
# apply hierachical clustering on bone measures
clusters.bones <- hclust(dist(t(data.complete[2:11])))
# plot the dendrogram
plot(clusters.bones,
main = "Figure 9. Hierarchical clustering of the bone measures",
xlab = "Bone measures", sub="")
# draw rectangles around clusters
rect.hclust(clusters.bones, h = 800, which = c(1, 2, 3),
border = c("green", "red", "blue"))
The results are consistent with our previous knowledge of the bone measures. All the width measures are clustered together (in green box), and are separate from the length measures. The difference between the width measures are very small. In comparison, the length measures are further separated into two clusters. Interestingly, one of the clusters includes humerus and ulna, both are in bird wings (in red box). The three bone in bird legs form another cluster (in blue box). It suggests that the bones in wings and legs are evolved separately for different functions and living habits.
Besides using supervised classification models, we can also apply unsupervised classification methods on the dataset. k-means clustering is tested in this case.
The number of clusters is chosen by examining percentage of variance explained against the number of clusters using the “elbow method”. The optimal number of clusters is chosen at the number where adding clusters does not explain significant amount of additional variance. Since deciding the optimal number of clusters is more or less subjective, the method is “semi-supervised”.
# compute % variance explained for different k values
var.explained <- c()
for (k in 1:15){
model <- kmeans(data.complete[2:11], centers = k)
var.explained <- c(var.explained,
model$betweenss/(model$tot.withinss + model$betweenss))
}
#plot vairance explained against k values
df <- data.frame(k = 1:15, var.explained = var.explained)
ggplot(df, aes(x = k, y = var.explained))+
geom_point()+
geom_line()+
xlab("Number of Clusters")+
ylab("Percent Variance Explained")+
ggtitle("Figure 10. Percent Variance Explained by Different Numbers of Clusters")+
theme(plot.title = element_text(hjust = 0.5, size = 12))
The percentage of variance explained keeps increasing until after nine clusters, so k is set to nine.
k-means clustering with \(k=9\) is performed on the data. The living habits classified into each of the cluster is visualized in the following graph. For each cluster, the majority living habit is indicated as the “classified type”.
# perform k-means clustering with 9 clusters
num.clust <- 9
kmeans_4 <- kmeans(data.complete[2:11], centers = num.clust)
# combine results with actual type
type.df <- data.frame(actual = data.complete$type,
class = kmeans_4$cluster)
# split by cluster
type.split <- split(type.df, type.df$class)
# classify the cluster according to the majority type
types <- sapply(type.split,
function(x) {c = table(x$actual); names(c)[c == max(c)]})
# organize classified type into the data.frame
num_in_clust <- sapply(type.split, nrow)
clust.type <- c()
for (i in 1:num.clust){
clust.type <- c(clust.type, rep(types[i], num_in_clust[i]))
}
type.df$kmeans.type <- clust.type
# plot stacked bar chart
ggplot(type.df, aes(x = class, fill = actual))+
geom_bar(stat = "count")+
scale_fill_brewer(palette = "Set2", name = c("Actual type"),
label = label) +
ggtitle("Figure 11. Actual ecological classes in each cluster") +
theme(plot.title = element_text(hjust = 0.5, size = 14))+
scale_x_continuous("Classified type", breaks = 1:num.clust, labels = types)
k-means clustering does not perform well on this data. Most of the clusters are not dominated by single living habits. The two big classes, swimming and singing, either are represented by more than two clusters, and highly mixed with other classes. The final classified types by majority in each cluster are mainly singing and swimming. Therefore, k-means clustering is unable to correctly classify bird living habits using the bone measures.
Both logistic regression and support vector machines are able to classify birds into the correct ecological classes according to their bone measures. The high accuracy of the classification indicates that the bird’s living habits are related to its bone structure. It makes sense because a bird’s living habits are the results of natual selection and evolution of the bird’s body structures. The different bone structures enable the birds to adapt to the different living habits. Besides using the models to perform classification using the bone measures, details of the bone structures may also help the studies of how birds adapt to the different living habits.
Hierachical clustering is able to separate bone width and length measures, and further group bone lengths from bones in wings and legs. It suggests that the bones in wings/legs are typically envolved together. Wings are legs are probably envolved separated for different living habits. In addition, k-means clustering is not suitable for grouping birds with similar living habits. Instead, the supervised classification models are preferred for this task.