---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---

Loading and preprocessing the data

Show any code that is needed to

  1. Load the data (i.e. read.csv())
data <- read.csv("activity.csv")
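  2. Process/transform the data (if necessary) into a format suitable for your analysis
# Keep a numeric copy of the interval for plotting later
data$interval2 <- data$interval

# Treat the interval as a factor for grouping
data <- transform(data, interval=factor(interval))

# Mean steps per interval across all days, aligned to each row (no reliance on recycling or row order)
data$avgStepsIntervalAcrossAllDays <- ave(data$steps, data$interval, FUN=function(x) mean(x, na.rm=TRUE))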
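
The preprocessing step attaches a per-interval mean to every row of the data frame. As a minimal sketch on made-up toy data (the `toy` data frame and its values are illustrative assumptions, not part of the assignment data), base R's `ave()` returns one value per input row, so the result stays aligned with each row's interval even when the rows are not sorted:

```r
# Toy data (made up for illustration): 2 days x 3 intervals
toy <- data.frame(
  steps    = c(10, NA, 30, 20, 40, 50),
  date     = c("d1", "d1", "d1", "d2", "d2", "d2"),
  interval = c(0, 5, 10, 0, 5, 10)
)

# Shuffle the rows so they are no longer ordered by date and interval
toy <- toy[c(3, 1, 5, 2, 6, 4), ]

# ave() computes the group mean and returns it once per row,
# matched to that row's interval
toy$avg <- ave(toy$steps, toy$interval,
               FUN = function(x) mean(x, na.rm = TRUE))

print(toy)
```

Assigning the 288-element result of `tapply()` to a longer column only gives the same answer when the rows are ordered by date and then by interval, because it depends on recycling.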

What is mean total number of steps taken per day?

For this part of the assignment, you can ignore the missing values in the dataset.

  1. Make a histogram of the total number of steps taken each day
stepsperday <- tapply(data$steps,data$date,sum) 

hist(stepsperday,breaks=20, main="Histogram of total steps per day", xlab="Steps per day")

[Figure: histogram of total steps per day]

  2. Calculate and report the mean and median total number of steps taken per day
stepsperday_mean <- mean(stepsperday, na.rm=T)

print(stepsperday_mean)
## [1] 10766
stepsperday_median <- median(stepsperday, na.rm=T)

print(stepsperday_median)
## [1] 10765

The mean total number of steps per day is 10766 (1.0766 × 10^4).
The median total number of steps per day is 10765.
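
How missing days enter the daily totals matters for these two numbers. The sketch below uses made-up toy data (the `toy` data frame is an illustrative assumption) to contrast summing without `na.rm`, as in the analysis above, with summing with `na.rm = TRUE`, which would silently turn a fully missing day into a total of zero:

```r
# Toy data (made up): day "d2" is entirely missing
toy <- data.frame(
  steps = c(100, 200, NA, NA, 300, 500),
  date  = c("d1", "d1", "d2", "d2", "d3", "d3")
)

# Without na.rm, a day containing NAs gets an NA total
totals_na <- tapply(toy$steps, toy$date, sum)

# With na.rm = TRUE, the fully missing day becomes a total of 0
totals_zero <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

mean(totals_na, na.rm = TRUE)  # averages d1 and d3 only: 550
mean(totals_zero)              # the artificial zero day lowers the mean
```

Keeping the NA totals and dropping them in `mean()` and `median()` ignores the missing days instead of counting them as days with no steps.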

What is the average daily activity pattern?

  1. Make a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)
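# Interval identifiers converted to numbers, so the x-axis is numeric rather than character
xvalues <- as.numeric(levels(data$interval))

yvalues <- tapply(data$steps,data$interval, mean, na.rm=TRUE)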

plot(x=xvalues, y=yvalues, xlab="5-minute intervals", ylab="average number of steps in 5-minute intervals across all days", type="l")

[Figure: time series of the average number of steps per 5-minute interval across all days]

  2. Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
interval_with_yvalues_max <- names(which(yvalues == max(yvalues)))

print(interval_with_yvalues_max)
## [1] "835"

This interval contains the maximum number of steps: 835.
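
A compact alternative for locating the peak is `which.max()`. The sketch below uses made-up interval means (the names and values in `avg_steps` are illustrative, not taken from the dataset). Note that `which.max()` returns only the first maximum, whereas `which(y == max(y))` returns every tied position.

```r
# Toy interval means (made up), named by interval identifier
avg_steps <- c("0" = 2, "5" = 1, "10" = 7, "15" = 4)

# Position of the first maximum; the result carries the element's name
peak <- which.max(avg_steps)

peak_interval <- names(peak)        # interval identifier as a string
peak_value    <- avg_steps[[peak]]  # the maximum average itself

print(peak_interval)
print(peak_value)
```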

Imputing missing values

Note that there are a number of days/intervals where there are missing values (coded as NA). The presence of missing days may introduce bias into some calculations or summaries of the data.

  1. Calculate and report the total number of missing values in the dataset (i.e. the total number of rows with NAs)
data_log_na <- is.na(data)

# summary(data_log_na) reveals that there are only NAs in the first column, so the number of NAs equals the number of rows with NAs

rows_with_na <- sum(data_log_na[,1])

print(rows_with_na)
## [1] 2304

The total number of missing values is 2304.

  2. Devise a strategy for filling in all of the missing values in the dataset. The strategy does not need to be sophisticated. For example, you could use the mean/median for that day, or the mean for that 5-minute interval, etc.
for (i in 1:nrow(data)){
        
        if (data_log_na[i,1] == TRUE){
                
                data$steps[i] <- data$avgStepsIntervalAcrossAllDays[i]
                
        }
        
}
  3. Create a new dataset that is equal to the original dataset but with the missing data filled in.
# Drop the helper column with the per-interval averages (selected by name, because column 4 is interval2)

data_new <- data[, names(data) != "avgStepsIntervalAcrossAllDays"]
  4. Make a histogram of the total number of steps taken each day and calculate and report the mean and median total number of steps taken per day. Do these values differ from the estimates from the first part of the assignment? What is the impact of imputing missing data on the estimates of the total daily number of steps?
stepsperday_newdata <- tapply(data_new$steps,data_new$date,sum)

hist(stepsperday_newdata,breaks=20, xlab="Steps per day, imputing missing data", main="Histogram of total steps per day, imputing missing data")

[Figure: histogram of total steps per day, imputing missing data]

stepsperday_newdata_mean <- mean(stepsperday_newdata, na.rm=T)

print(stepsperday_newdata_mean)
## [1] 10766
stepsperday_newdata_median <- median(stepsperday_newdata, na.rm=T)

print(stepsperday_newdata_median)
## [1] 10766

The mean total number of steps per day after imputing the missing data is 10766 (1.0766 × 10^4).
The median total number of steps per day after imputing the missing data is 10766 (1.0766 × 10^4).

ANSWER TO LAST QUESTION OF POINT 4:
The mean values are equal; the median values differ slightly (10765 before imputing, 10766 after). The 2304 missing values correspond to eight days with no data at all (8 × 288 intervals). Each missing interval on those days receives the mean for that interval across all days, so each of the eight days ends up with a daily total equal to the overall mean of 10766. Adding eight days that sit exactly at the mean leaves the mean unchanged and pulls the median onto the mean. The histograms show the same effect: the frequency of the bin that contains the mean rises from 10 in the first histogram to 18 in the second, and the difference (18 - 10 = 8) equals the number of days with missing values.
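
The imputation and its effect on the mean can be sketched without a loop. The example below is a minimal illustration on made-up toy data (the `toy` data frame and the helper names `interval_mean`, `filled`, `before`, and `after` are assumptions for this sketch). It fills each missing value with its interval mean and then compares the mean daily total before and after:

```r
# Toy data (made up): 3 days x 2 intervals; day "d2" is entirely missing
toy <- data.frame(
  steps    = c(10, 30, NA, NA, 20, 60),
  date     = rep(c("d1", "d2", "d3"), each = 2),
  interval = rep(c(0, 5), times = 3)
)

# Per-interval means over the observed days, aligned to each row
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing entries, leaving observed values untouched
filled <- toy
missing_rows <- is.na(filled$steps)
filled$steps[missing_rows] <- interval_mean[missing_rows]

# Daily totals before (d2 is NA) and after imputation
before <- tapply(toy$steps, toy$date, sum)
after  <- tapply(filled$steps, filled$date, sum)

mean(before, na.rm = TRUE)  # 60
mean(after)                 # 60
```

The imputed day totals 15 + 45 = 60, which is exactly the mean of the observed daily totals (40 and 80). This holds whenever the missing days are missing entirely, because the sum of the interval means equals the mean of the daily sums.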

Are there differences in activity patterns between weekdays and weekends?

The weekdays() function may be of some help for this part. Use the dataset with the filled-in missing values.

  1. Create a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.
# weekdays() returns locale-dependent day names; the comparison below assumes an English locale
data_new$factordate<-weekdays(as.Date(data_new$date,"%Y-%m-%d"))

data_new$interval2 <- data$interval2

for(i in 1:nrow(data_new)){
        
        if(data_new$factordate[i] == "Saturday" | data_new$factordate[i] == "Sunday"){
                
                data_new$factordate[i] <- "weekend"
                
        }else{
                
                data_new$factordate[i] <- "weekday"
                
        }
        
}

data_new <- transform(data_new, factordate=factor(factordate))
  2. Make a panel plot containing a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).
# Average steps per interval within each day type: one row per interval and day type
data_new_splitted <- aggregate(steps ~ interval2 + factordate, data=data_new, FUN=mean)

library(lattice)

xyplot(steps ~ interval2 | factordate, data=data_new_splitted, layout = c(1, 2), type="l", xlab="Interval", ylab="Avg. number of steps")

[Figure: panel plot of the average number of steps per interval, weekday vs. weekend]
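
The weekday/weekend factor can also be built without a loop and without depending on the locale of `weekdays()`. The sketch below is one possible approach on a few made-up dates (the `dates`, `wday`, and `day_type` names are assumptions for this example). It uses the `wday` field of `as.POSIXlt()`, which numbers days from 0 (Sunday) to 6 (Saturday):

```r
# Toy dates (made up): Friday, Saturday, Sunday, Monday in October 2012
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# Numeric day of week: 0 = Sunday, ..., 6 = Saturday (independent of locale)
wday <- as.POSIXlt(dates)$wday

# Two-level factor with a fixed level order
day_type <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))

print(data.frame(date = dates, day_type = day_type))
```

Fixing the level order in `factor()` keeps the panel order stable even if one of the two day types happens to be absent from a subset.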