This post was originally published at ODSC. Check out their curated blog posts!
With the holidays long gone, and likely everyone’s New Year’s resolutions along with them, I figured I would spend some time working with data instead of gym weights. Lacking any real inspiration of my own, I turned to a friend, who pointed me to the University of California, Irvine’s Machine Learning data repository. Specifically, he sent me the link to the PAMAP2 Physical Activity Monitoring Data Set.
This dataset contains longitudinal data from 9 subjects. Each subject is hooked up to various biometric sensors that provide a reading each second, and as the subject moves through her day, her current activity, such as driving or playing soccer, is recorded as well. For a fictitious example, suppose Meghan was wearing these biometric sensors while driving her car to soccer practice. The sensors would provide second-by-second data, such as heart rate, during the drive. Next, she would start her workout and the sensors would again provide data, presumably with an elevated heart rate. Considering that automotive telematics and personal fitness trackers produce similar data, I was intrigued to explore modeling this data set as a classification problem.
Since so many New Year’s resolutions center on working out, I figure it’s a fairly timely post. First, I will show you how to get the data and organize it. Then comes a little exploratory data analysis (EDA), followed by preprocessing and partitioning, and finally we apply the K-Nearest Neighbors (KNN) algorithm to the data. The goal is to use biometric data to classify what activity a person is doing.
Set Up
This data is housed in a zip file on the UCI site. Considering the size, I opted to use data.table since it reads and manipulates data efficiently. I also use pbapply for applying functions with a progress bar; this helps me understand how long my code will take when working with millions of rows. Next, the plyr package is great for data manipulation and preprocessing. Lastly, although I could build a KNN model in multiple packages, even caret, I chose klaR because it builds KNN models quickly.
library(data.table)
library(pbapply)
library(plyr)
library(caret)
library(klaR)
After downloading the zip file, you will have to unpack the ".dat" files. Each file contains the tabled data, ordered chronologically, for one subject. There are two folders with subject data. Rather than specify files individually, you can programmatically scan the folders and then amass all the biometric data into a single data table. The list.files function will search for a pattern within a specified folder, so temp1 and temp2 become string vectors with the full file paths for any files ending in ".dat". I concatenated the two file objects into a single object, temp.
temp1 <- list.files(path='~/pmap/PAMAP2_Dataset/Protocol', pattern="*.dat",full.names = T)
temp2 <- list.files(path='~/pmap/PAMAP2_Dataset/Optional', pattern="*.dat",full.names = T)
temp<-c(temp1,temp2)
fread will read any table into your R session. It is applied to each individual file path in temp, and the results are unified using rbindlist, which row-binds the list containing the individual subject data.
activity.data<- rbindlist(pblapply(temp, fread),fill=T)
Instead of imputing missing values I decided to omit records containing any NA values; this is a large data set and really only a hobby post, so dropping rows is good enough here. If you want to impute, and therefore model on more records, I usually use the VIM package; its hotdeck function in particular has been helpful in the past. In this case I pass the base complete.cases function into the bracketed data table. The complete.cases function creates a TRUE/FALSE Boolean output, where TRUE represents a row without any NA values. The data table will automatically retain only the TRUE rows with the code below.
activity.data<-activity.data[complete.cases(activity.data),]
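If you did want to impute instead of dropping rows, a minimal sketch with the VIM package’s hotdeck function might look like the following. This is just an aside that was not run for this post; hotdeck fills each NA from a similar “donor” row and, by default, appends TRUE/FALSE indicator columns that you would drop before modeling.
# Sketch only -- assumes the VIM package is installed; not run for this post.
library(VIM)
imputed <- hotdeck(as.data.frame(activity.data))
# hotdeck appends "_imp" indicator columns by default; drop them before modeling.
imputed <- imputed[, !grepl('_imp$', names(imputed))]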
The variable names are mostly incomprehensible, so the code below changes them. I simply paste the hand, chest, and ankle labels to a sequence of numbers coinciding with the data frame’s column numbers. Then I set colnames to a character vector combining the non-measurement inputs with those measurement names.
hands<-paste0('hand', seq(4,20))
chest<-paste0('chest', seq(21,37))
ankle<-paste0('ankle', seq(38,54))
colnames(activity.data)<-c('timestamp','Y_activityID','heart_rate',hands,chest,ankle)
The dependent, or Y, feature is a multi-class factor corresponding to a person’s activity. Although the value is stored as an integer, the data set’s PDF documentation defines the actual activities. To re-map the target feature I first create y.code, a numeric vector with the existing activity codes. Then I create y.class as a string vector with the corresponding activity names.
y.code<-c(1,2,3,4,5,6,7,9,10,11,12,13,16,17,18,19,20,24,0)
y.class<-c('lying','sitting','standing','walking','running','cycling',
           'Nordic_walking','watching_TV','computer_work','car_driving',
           'ascending_stairs','descending_stairs','vacuum_cleaning','ironing',
           'folding_laundry','house_cleaning','playing_soccer','rope_jumping','other')
The mapvalues function accepts a vector of values to change, followed by “from” and “to” parameters. The code passes in the activity.data$Y_activityID vector followed by y.code and y.class, and it overwrites the existing activity.data$Y_activityID. The second line then converts the remapped values from characters to a factor.
activity.data$Y_activityID<-mapvalues(activity.data$Y_activityID,
from = y.code, to = y.class)
activity.data$Y_activityID<-as.factor(activity.data$Y_activityID)
The target feature is now a factor corresponding to the data dictionary. Check it out with sample and an integer, e.g.
sample(activity.data$Y_activityID,10)
Quick EDA
Although not the point of the post, it’s a good idea to perform EDA any time you are modeling. At a minimum I like to tally the target feature; this will help you understand whether you have severely unbalanced targets, which affects how you construct a model matrix. Use table on Y_activityID to print the tally.
table(activity.data$Y_activityID)
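Raw counts can be hard to compare across 19 activities, so I also like to look at the class proportions; this is just base R on top of the same table call.
# Class shares make any imbalance easier to see than raw counts.
round(prop.table(table(activity.data$Y_activityID)), 3)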
You can also make a quick visual by nesting the table output inside barplot. So that the labels do not get cut off, specify margins in your graphics device. This is done with par, declaring mar with integer values that provide the cushion around the edges of the plot. Then take the previous code and nest it in barplot.
op <- par(mar=c(11,4,4,2))
barplot(table(activity.data$Y_activityID),las=2, col='darkred')
The activity distribution from the 9 subjects.
The basic EDA function summary can be applied to a data frame and will return information for each vector. To save time on this large data set, I took a random sample of the entire data table. It’s easy to sample a data table using the code below: you can use sample within the indexing code by first passing in .N followed by the number of records to sample. This code grabs 10,000 records to create eda.sample. Calling summary on the subset will calculate the information faster.
eda.sample<-activity.data[sample(.N,10000)]
summary(eda.sample)
A screenshot of the sampled activity data showing the summary information for some inputs.
An Irresponsibly Quick KNN Explanation
The KNN algorithm is an analogy-based method, meaning its predictions come from similar, or analogous, records. It’s a common-sense approach. Let’s say you have the data shown below in a scatter plot with two classes, red and green.
This visual represents your training set because the target, red or green, is known. Now you are presented with a new record, shown as a grey triangle in the graph below. Would you guess the unknown record is red or green?
If you look at the nearest neighbors to the triangle, you may guess the new record is a red dot. The new record is analogous to the closest records.
A tuning parameter of KNN is the number of nearest neighbors, K. You have to specify more than one neighbor in case a new point sits equidistant from both classes. For example, this graph shows a more centered grey triangle. If you are restricted to a single neighbor, you cannot pick a class because the triangle is exactly in between opposing markers; it is hard to choose a color when you are looking for the single closest red or green marker. Using K = 3 in KNN improves the result. For the sake of this illustration I added arrows to the closest dots: one of the three neighbors is red and the other two are green, so the estimated probability of green is two out of three, about 67%.
Keep in mind that distance here is Euclidean, meaning the straight-line distance to the nearest known records. Remember your Pythagorean theorem days in geometry? That’s the stuff of Euclidean distance. Also, this data is complex, and the distances are measured in a high-dimensional space, not the two dimensions shown.
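To make the distance and voting ideas concrete, here is a small, self-contained sketch in base R. The coordinates and the new “grey triangle” point are made up for illustration and are not taken from the activity data.
# Toy KNN by hand; all coordinates below are invented for illustration.
train <- data.frame(x = c(1.0, 0.5, 2.2, 3.0, 2.6, 3.4),
                    y = c(1.0, 1.5, 1.8, 3.0, 2.4, 2.6),
                    class = c('red','red','red','green','green','green'))
new.point <- c(x = 2.4, y = 2.2)   # the grey triangle

# Euclidean (straight-line) distance from the new point to every training point.
dists <- sqrt((train$x - new.point['x'])^2 + (train$y - new.point['y'])^2)

# Vote among the K = 3 nearest neighbors.
k <- 3
nearest <- train$class[order(dists)][1:k]
prop.table(table(nearest))   # 2 of the 3 neighbors are green, so ~67% green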
Centering & Scaling
The problem with measuring Euclidean distance is that inputs with different orders of magnitude will impact the KNN algorithm disproportionately. For example, if you were modeling customer outcomes where income is measured in thousands and the number of children is (likely) a single digit, the distances between incomes will dwarf the distances between child counts. So for this approach you have to center and scale your inputs.
To understand the impact of centering and scaling, apply it to the eda.sample data. The scale function can be applied to the data frame with the center and scale parameters set to TRUE. Keep in mind that you do not want to scale the dependent variable, just the inputs. I also don’t scale, or even model on, the timestamp feature. This removes the temporal aspect of the modeling, since from second to second a subject is likely doing the same activity. You could engineer an input that captures the longitudinal information in the timestamp, but I simply omit the timestamp and the target by using the index 3:54 below.
eda.scale<-scale(eda.sample[,3:54, with=F],center=T,scale=T)
Now you can compare the summary output for eda.scale to eda.sample. Notice the mean of every vector is now 0. Centering a vector subtracts the mean from each individual value, and scaling then divides the centered value by the vector’s standard deviation. Essentially this normalizes each value to its distance from a mean of 0 and puts all inputs on the same scale, so no single attribute dictates a larger Euclidean distance.
summary(eda.scale)
A portion of the eda.scale summary with mean at zero.
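If you want to convince yourself that scale is just doing that arithmetic, you can reproduce a single column by hand; heart_rate is used here only because it is a familiar input.
# Reproduce scale() by hand for one column.
hr <- eda.sample$heart_rate
hr.manual <- (hr - mean(hr)) / sd(hr)
# Should be TRUE, up to floating-point error.
all.equal(as.numeric(eda.scale[, 'heart_rate']), hr.manual)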
Now that you understand the centering and scaling function, let’s apply it to the entire data set. The first input to scale is now activity.data[,3:54, with=F]. The second line simply adds the dependent activity back onto the new scaled inputs.
activity.scale<-as.data.frame(scale(activity.data[,3:54, with=F],center=T,scale=T))
activity.scale$Y_activityID<-activity.data$Y_activityID
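To preview where this is headed, here is a hedged sketch of a partition and a KNN fit. The 80/20 split, the kn value of 3, and the use of klaR’s sknn function are my assumptions about one reasonable way to finish, not the post’s exact code, and on the full table you would likely fit on a subsample of rows for speed.
# Sketch of partitioning and fitting; assumptions noted above, not the post's exact code.
set.seed(1234)
idx <- sample(nrow(activity.scale), floor(0.8 * nrow(activity.scale)))  # 80/20 split
train.data <- activity.scale[idx, ]
test.data  <- activity.scale[-idx, ]

# Fit a simple KNN classifier and score the holdout set.
knn.fit  <- sknn(Y_activityID ~ ., data = train.data, kn = 3)
knn.pred <- predict(knn.fit, test.data)
mean(knn.pred$class == test.data$Y_activityID)   # holdout accuracy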
Now that you have the data set, check out the rest of this post at ODSC!