Reproducible Research: Peer Assessment 1
Show any code that is needed to
- Load the data (i.e. read.csv())
data <- read.csv("activity.csv")
- Process/transform the data (if necessary) into a format suitable for your analysis
data$interval2 <- data$interval
data <- transform(data, interval=factor(interval))
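# Per-interval mean, aligned row by row with the data (does not rely on recycling or row order)
data$avgStepsIntervalAcrossAllDays <- ave(data$steps, data$interval, FUN = function(x) mean(x, na.rm = TRUE))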
For this part of the assignment, you can ignore the missing values in the dataset.
- Make a histogram of the total number of steps taken each day
stepsperday <- tapply(data$steps,data$date,sum)
hist(stepsperday,breaks=20, main="Histogram of total steps per day", xlab="Steps per day")
- Calculate and report the mean and median total number of steps taken per day
stepsperday_mean <- mean(stepsperday, na.rm=T)
print(stepsperday_mean)
## [1] 10766
stepsperday_median <- median(stepsperday, na.rm=T)
print(stepsperday_median)
## [1] 10765
The mean of steps per day is 1.0766 × 10^4, i.e. about 10766.
The median of steps per day is 10765.
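The daily totals above are computed without `na.rm`, so a day with missing values keeps an `NA` total and is dropped later by `mean(..., na.rm=T)`. A minimal sketch on made-up toy values (the vectors `steps` and `day` are illustrative, not taken from activity.csv) of how that differs from summing with `na.rm = TRUE`:

```r
# Toy data: day "b" has no recorded steps at all (values are made up).
steps <- c(5, 7, NA, NA, 4, 8)
day   <- rep(c("a", "b", "c"), each = 2)

totals_na   <- tapply(steps, day, sum)                # day "b" stays NA
totals_zero <- tapply(steps, day, sum, na.rm = TRUE)  # day "b" becomes 0

mean(totals_na, na.rm = TRUE)  # averages over days "a" and "c" only
mean(totals_zero)              # pulled down by the artificial zero
```

Keeping the `NA` total and removing it at the summary step avoids treating an unrecorded day as a day with zero steps.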
- Make a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)
# levels() returns character labels; convert them to numbers for the x-axis
xvalues <- as.numeric(levels(data$interval))
yvalues <- tapply(data$steps,data$interval, mean, na.rm=T)
plot(x=xvalues, y=yvalues, xlab="5-minute intervals", ylab="average number of steps in 5-minute intervals across all days", type="l")
- Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
interval_with_yvalues_max <- names(which(yvalues == max(yvalues)))
print(interval_with_yvalues_max)
## [1] "835"
This interval contains the maximum number of steps: 835.
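The label "835" may be easier to read as a clock time. The sketch below assumes that the interval identifiers encode the start time as HHMM without leading zeros (so 835 would mean 08:35); the source does not state this explicitly, and `interval_to_clock` is a hypothetical helper introduced only for illustration.

```r
# Hypothetical helper: convert an interval identifier such as 835 into "08:35".
# Assumption: identifiers encode the start time as HHMM without leading zeros.
interval_to_clock <- function(interval) {
  interval <- as.integer(as.character(interval))  # also handles factor input
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

interval_to_clock(c(0, 55, 100, 835, 2355))
```

Under that assumption, the interval with the maximum average number of steps would start at 08:35.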
Note that there are a number of days/intervals where there are missing values (coded as NA). The presence of missing days may introduce bias into some calculations or summaries of the data.
- Calculate and report the total number of missing values in the dataset (i.e. the total number of rows with NAs)
data_log_na <- is.na(data)
# summary(data_log_na) reveals that there are only NAs in the first column, so the number of NAs equals the number of rows with NAs
rows_with_na <- sum(data_log_na[,1])
print(rows_with_na)
## [1] 2304
The total number of missing values is 2304.
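The count above relies on the observation that only the first column contains NAs. A minimal sketch, using a made-up toy data frame in place of activity.csv, of a count that does not depend on which column holds the missing values:

```r
# Toy data frame standing in for activity.csv (values are made up).
toy <- data.frame(
  steps    = c(NA, 3, 0, NA, 7),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02", "2012-10-03"),
  interval = c(0, 5, 0, 5, 0)
)

# Rows with at least one NA, regardless of which column holds it.
rows_with_na_toy <- sum(!complete.cases(toy))

# NAs per column, to confirm where the missing values sit.
na_per_column <- colSums(is.na(toy))
```

`complete.cases()` flags rows without any NA, so negating it counts incomplete rows directly.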
- Devise a strategy for filling in all of the missing values in the dataset. The strategy does not need to be sophisticated. For example, you could use the mean/median for that day, or the mean for that 5-minute interval, etc.
# Replace each missing step count with the mean for its 5-minute interval
missing_steps <- data_log_na[, 1]
data$steps[missing_steps] <- data$avgStepsIntervalAcrossAllDays[missing_steps]
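The strategy used here (fill a missing value with the mean of its 5-minute interval) can be checked on a small example. This is a sketch on made-up toy values, not the real data; the names `toy`, `interval_mean` and `toy_filled` are illustrative.

```r
# Toy data: two days, three intervals; day 1 misses intervals 0 and 10
# (values are made up).
toy <- data.frame(
  steps    = c(NA, 4, NA, 2, 6, 8),
  date     = rep(c("2012-10-01", "2012-10-02"), each = 3),
  interval = rep(c(0, 5, 10), times = 2)
)

# Mean per interval, aligned row by row with the data (no reliance on row order).
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing entries; observed values stay untouched.
toy_filled <- toy
missing <- is.na(toy$steps)
toy_filled$steps[missing] <- interval_mean[missing]
```

With only two days, each missing value simply takes the value observed on the other day for the same interval.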
- Create a new dataset that is equal to the original dataset but with the missing data filled in.
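# Remove the helper column with the average steps per interval across all days
# (selected by name: column 4 is interval2, not the average column)
data_new <- data[, names(data) != "avgStepsIntervalAcrossAllDays"]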
- Make a histogram of the total number of steps taken each day and Calculate and report the mean and median total number of steps taken per day. Do these values differ from the estimates from the first part of the assignment? What is the impact of imputing missing data on the estimates of the total daily number of steps?
stepsperday_newdata <- tapply(data_new$steps,data_new$date,sum)
hist(stepsperday_newdata,breaks=20, xlab="Steps per day, imputing missing data", main="Histogram of total steps per day, imputing missing data")
stepsperday_newdata_mean <- mean(stepsperday_newdata, na.rm=T)
print(stepsperday_newdata_mean)
## [1] 10766
stepsperday_newdata_median <- median(stepsperday_newdata, na.rm=T)
print(stepsperday_newdata_median)
## [1] 10766
The mean of steps per day with imputing the missing data is 1.0766 × 10^4, i.e. about 10766.
The median of steps per day with imputing the missing data is 1.0766 × 10^4, i.e. about 10766.
ANSWER TO LAST QUESTION OF POINT 4:
The mean values are equal. The median values are slightly different (10765 before imputation, 10766 after).
There are eight days with NAs. Each of these eight days receives, for every interval, the mean for that interval calculated across all days. The 2304 missing values correspond to 8 × 288 intervals, so each of these days is missing entirely, and its imputed daily total is the sum of the interval means, which equals the mean of the observed daily totals (10766). Adding eight days at exactly the mean leaves the mean unchanged and pulls the median towards it. This is visible in the histograms: the frequency of the bin that contains the mean rises from 10 in the first histogram to 18 in the second, and the difference (18 - 10 = 8) equals the number of days with missing values.
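The effect on the mean and median can be reproduced on a small scale. The sketch below uses a made-up matrix (rows are intervals, columns are days) in which one day is entirely missing; all values and names are illustrative.

```r
# Toy step counts: rows are intervals, columns are days; day "d3" is entirely
# missing (values are made up for illustration).
steps <- cbind(d1 = c(10, 20, 30),
               d2 = c(30, 40, 50),
               d3 = c(NA, NA, NA),
               d4 = c(20, 0, 10))

totals_before  <- colSums(steps)               # NA for the missing day
interval_means <- rowMeans(steps, na.rm = TRUE)

# Impute: the missing day receives the per-interval means.
steps_filled <- steps
steps_filled[, "d3"] <- interval_means
totals_after <- colSums(steps_filled)

mean(totals_before, na.rm = TRUE)    # mean of the observed daily totals
mean(totals_after)                   # unchanged by the imputation
median(totals_before, na.rm = TRUE)  # median of the observed daily totals
median(totals_after)                 # moves towards the mean
```

The imputed day's total is the sum of the interval means, which is the mean of the observed daily totals, so the mean cannot change while the median can.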
For this part the weekdays() function may be of some help. Use the dataset with the filled-in missing values.
- Create a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.
data_new$factordate<-weekdays(as.Date(data_new$date,"%Y-%m-%d"))
data_new$interval2 <- data$interval2
# Collapse the day names into two levels (weekdays() returns locale-dependent
# names, so this comparison assumes an English locale)
is_weekend <- data_new$factordate %in% c("Saturday", "Sunday")
data_new$factordate <- factor(ifelse(is_weekend, "weekend", "weekday"))
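Because `weekdays()` returns day names in the current locale, comparing against "Saturday" and "Sunday" only works in an English locale. A minimal locale-independent sketch, assuming dates in the same "%Y-%m-%d" format as above (the example dates and the names `wday` and `day_type` are illustrative):

```r
# Example dates in ISO format: a Friday, Saturday, Sunday and Monday.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday, in any locale.
wday <- as.POSIXlt(dates)$wday
day_type <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))
```

Setting `levels` explicitly keeps both levels in the factor even if a subset of the data happens to contain only one of them.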
- Make a panel plot containing a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).
# Average steps per interval within each day type: one row per interval and level,
# so the plotted line is not retraced once per day
data_new_splitted <- aggregate(steps ~ interval2 + factordate, data = data_new, FUN = mean)
library(lattice)
xyplot(steps ~ interval2 | factordate, data=data_new_splitted, layout = c(1, 2), type="l", xlab="Interval", ylab="Avg. number of steps")