This report is the course project for the Coursera course Practical Machine Learning.
In this project we’ll use data about personal activity to predict the manner in which barbell lift participants performed their exercises. The data is from here.
data_train = read.csv("pml-training.csv")
We will read in the data and do some data cleaning before we go to the model training step. First, remove columns that have too many NA or empty values.
# remove columns that have too many NAs or empty strings
# i.e. columns where over 90% of the values are NA or ""
num_nas = apply(data_train,2,function(x) sum(is.na(x)))
num_nulls = apply(data_train,2,function(x) sum(x=="", na.rm=TRUE))
columns = names(data_train)[num_nas<nrow(data_train)*0.9 & num_nulls<nrow(data_train)*0.9]
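To see what this filter keeps, here is a minimal self-contained sketch on a toy data frame (not the project data). It uses `sapply`, which iterates over the columns of a data frame directly and so avoids `apply`'s coercion to a character matrix:

```r
# toy data frame: one fully populated column, one mostly-NA column,
# and one mostly-empty-string column
toy <- data.frame(
  good         = 1:10,
  mostly_na    = c(rep(NA_real_, 9), 1),
  mostly_empty = c(rep("", 9), "x"),
  stringsAsFactors = FALSE
)
num_nas   <- sapply(toy, function(x) sum(is.na(x)))
num_nulls <- sapply(toy, function(x) sum(x == "", na.rm = TRUE))
# keep columns where fewer than 90% of values are NA and fewer than 90% are ""
keep <- names(toy)[num_nas < nrow(toy) * 0.9 & num_nulls < nrow(toy) * 0.9]
keep
```

Only the fully populated column survives the 90% threshold.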
Then we noticed that, out of the remaining columns, the first few are not useful as features: they are row indices, user names, timestamps, and window flags.
# the first seven columns are not useful
# as they are indices, names, timestamps, window flags
columns = setdiff(columns,names(data_train)[1:7])
# keep only the selected columns
data_train = data_train[,columns]
Now that we have cleaned our dataset, we can perform the model training.
We will try different classifiers (see here for all the models the caret package provides) to predict the “classe” target variable in the dataset.
Let’s set up k-fold cross validation first.
suppressPackageStartupMessages(library(caret))
# set up k-fold cross validation
set.seed(42)
cv_control <- trainControl(method="cv", number=5, savePredictions = TRUE)
# preprocessing (e.g. centering and scaling) could also be applied
# by passing a preProcess argument to train()
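Under the hood, `method="cv"` partitions the training rows into stratified folds, roughly as caret's `createFolds` does. A toy sketch with synthetic labels (not the project data):

```r
suppressPackageStartupMessages(library(caret))
set.seed(42)
# synthetic class labels standing in for "classe"
y <- factor(rep(c("A", "B"), each = 50))
# 5 stratified folds: a list of held-out index vectors
folds <- createFolds(y, k = 5)
sapply(folds, length)
```

Each of the 5 folds holds out roughly a fifth of the rows, with the class proportions preserved within each fold.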
Then let’s try different classifiers.
In this report, we’ll try CART, Random Forest, K Nearest Neighbors and Stochastic Gradient Boosting.
cart_model <- train(classe~., data=data_train, trControl=cv_control, method="rpart")
confusionMatrix(predict(cart_model,newdata = data_train),data_train$classe)
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 5080 1581 1587 1449 524
B 81 1286 108 568 486
C 405 930 1727 1199 966
D 0 0 0 0 0
E 14 0 0 0 1631
Overall Statistics
Accuracy : 0.4956
95% CI : (0.4885, 0.5026)
No Information Rate : 0.2844
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3407
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9104 0.33869 0.50468 0.0000 0.45218
Specificity 0.6339 0.92145 0.78395 1.0000 0.99913
Pos Pred Value 0.4970 0.50850 0.33040 NaN 0.99149
Neg Pred Value 0.9468 0.85310 0.88225 0.8361 0.89008
Prevalence 0.2844 0.19351 0.17440 0.1639 0.18382
Detection Rate 0.2589 0.06554 0.08801 0.0000 0.08312
Detection Prevalence 0.5209 0.12889 0.26638 0.0000 0.08383
Balanced Accuracy 0.7721 0.63007 0.64431 0.5000 0.72565
rf_model <- train(classe~., data=data_train, trControl=cv_control, method="ranger",verbose=F)
confusionMatrix(predict(rf_model,newdata = data_train),data_train$classe)
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 5580 0 0 0 0
B 0 3797 0 0 0
C 0 0 3422 0 0
D 0 0 0 3216 0
E 0 0 0 0 3607
Overall Statistics
Accuracy : 1
95% CI : (0.9998, 1)
No Information Rate : 0.2844
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
Detection Rate 0.2844 0.1935 0.1744 0.1639 0.1838
Detection Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
knn_model <- train(classe~., data=data_train, trControl=cv_control, method="knn")
confusionMatrix(predict(knn_model,newdata = data_train),data_train$classe)
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 5531 58 11 12 8
B 16 3638 42 2 30
C 12 50 3331 81 23
D 19 27 24 3104 41
E 2 24 14 17 3505
Overall Statistics
Accuracy : 0.9739
95% CI : (0.9715, 0.976)
No Information Rate : 0.2844
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9669
Mcnemar's Test P-Value : 5.335e-16
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9912 0.9581 0.9734 0.9652 0.9717
Specificity 0.9937 0.9943 0.9898 0.9932 0.9964
Pos Pred Value 0.9842 0.9759 0.9525 0.9655 0.9840
Neg Pred Value 0.9965 0.9900 0.9944 0.9932 0.9936
Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
Detection Rate 0.2819 0.1854 0.1698 0.1582 0.1786
Detection Prevalence 0.2864 0.1900 0.1782 0.1638 0.1815
Balanced Accuracy 0.9924 0.9762 0.9816 0.9792 0.9841
gbm_model <- train(classe~., data=data_train, trControl=cv_control, method="gbm",verbose = F)
confusionMatrix(predict(gbm_model,newdata = data_train),data_train$classe)
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 5520 71 0 3 5
B 43 3648 72 8 24
C 10 76 3311 100 20
D 5 2 34 3087 36
E 2 0 5 18 3522
Overall Statistics
Accuracy : 0.9728
95% CI : (0.9704, 0.975)
No Information Rate : 0.2844
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9656
Mcnemar's Test P-Value : 9.084e-16
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9892 0.9608 0.9676 0.9599 0.9764
Specificity 0.9944 0.9907 0.9873 0.9953 0.9984
Pos Pred Value 0.9859 0.9613 0.9414 0.9757 0.9930
Neg Pred Value 0.9957 0.9906 0.9931 0.9922 0.9947
Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
Detection Rate 0.2813 0.1859 0.1687 0.1573 0.1795
Detection Prevalence 0.2853 0.1934 0.1792 0.1612 0.1808
Balanced Accuracy 0.9918 0.9757 0.9774 0.9776 0.9874
As we can see from the 5-fold cross-validated model training, the random forest does the best, with 100% accuracy and a 95% confidence interval of (0.9998, 1). Note that the confusion matrices above are computed on the training data, so they overstate out-of-sample performance; the cross-validated accuracies stored in each fitted model (e.g. rf_model$results) are a better estimate of the out-of-sample error. So we’ll choose the random forest model to predict on our test dataset.
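caret can also compare models on their held-out fold accuracy directly via `resamples()`. A minimal self-contained sketch on the built-in iris data (the same call works on the four model objects trained above):

```r
suppressPackageStartupMessages(library(caret))
set.seed(42)
ctrl <- trainControl(method = "cv", number = 5)
# two quick models on iris, sharing the same folds setup
cart_fit <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)
knn_fit  <- train(Species ~ ., data = iris, method = "knn",   trControl = ctrl)
# collect per-fold accuracies and summarize them side by side
res <- resamples(list(cart = cart_fit, knn = knn_fit))
summary(res)$statistics$Accuracy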
data_test = read.csv("pml-testing.csv")
# keep the same feature columns used in training
# ("classe" is not present in the test set)
feature_columns = setdiff(columns,'classe')
data_test = data_test[,feature_columns]
# do the predictions
test_preds = predict(rf_model,data_test)
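The table below can be produced directly from the prediction vector; a sketch using a short hypothetical stand-in vector in place of test_preds:

```r
# hypothetical stand-in for test_preds (a factor of predicted classes)
preds <- factor(c("B", "A", "B", "A", "A"), levels = c("A", "B", "C", "D", "E"))
# pair each prediction with its test-case index
pred_table <- data.frame(index = seq_along(preds), classe = preds)
print(pred_table)
```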
So the predictions are:
index | classe |
---|---|
1 | B |
2 | A |
3 | B |
4 | A |
5 | A |
6 | E |
7 | D |
8 | B |
9 | A |
10 | A |
11 | B |
12 | C |
13 | B |
14 | A |
15 | E |
16 | E |
17 | A |
18 | B |
19 | B |
20 | B |