The purpose of this project is to demonstrate the ability to collect, work with, and clean a data set. The goal is to prepare tidy data that can be used for later analysis.
The required submissions include the following:
- a tidy data set submitted as a text file,
- a link to a Github repository with a script for performing the analysis (the github repository for this project is found at https://github.com/pjrowe/Get_and_Clean_Data_Week4/),
- a code book called CodeBook.md that describes the variables, the data, and any transformations or work that was performed to clean up the data,
- a README.md in the repo which explains how the scripts work and how they are connected.
The dataset downloaded for cleaning was the Human Activity Recognition Using Smartphones Dataset, Version 1.0, via the link to in the Coursera materials [here].(https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip). The download process was automated in the R script.
There is only one script file - run_analysis.R. It does the following tasks, which are explained in more detail in the headings below:
- Merges the training and the test sets to create one data set.
- Extracts only the measurements on the mean and standard deviation for each measurement.
- Uses descriptive activity names to name the activities in the data set
- Appropriately labels the data set with descriptive variable names.
- From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.
The test and training datasets are spread out over the 6 files bulleted below. My script loads all these files into dataframes using the read.delim() function, adds an index column to the dataframes, and then performs two separate merges on the index columns: first, to connect the ‘y’ and ‘subject’ dataframes, and second, to add on the ‘x’ variables. These two merges are done for both the training and test data sets. We should also note that the X dataframes are given descriptive variable names from the features.txt file.
The dataframes test_data
and train_data
are then combined into
all_data using rbind(). The all_data
dataframe has 10,299 rows x 564
columns (rows = combined rows of test and training data, columns =
index, subject, y_numlabels, 561 feature variables).
-
y_test : 2,947 observations(rows) and 1 column with a number label (1-6) for the activity performed by the experimental subjects. There are 9 subjects (2, 4, 9, 10, 12, 13, 18, 20, and 24) included vs. a total of 30, so 30% of the data were for the test data set and the remaining 21 subjects (70%) were part of the training set.
-
subject_test : 2,947 rows x 1 column with numbered label (1-30) of subject who performed the activity.
-
X_test : 2,947 observations(rows) x 561 columns of feature variables. For each subject, two experiments were run, each of about 163 rows (i.e., 163 rows x 9 subjects x 2 experiments = 2,934 total rows). Due to sampling and how the experiments were manually labeled, the samples or rows are not exactly the same from experiment to experiment. Features are normalized and bounded within [-1,1].
-
y_train : 7,352 rows x 1 column.
-
subject_train : 7,352 rows x 1 column, with numbered label (1-30) of subject who performed the activity.
-
X_train : 7,352 rows x 561 columns. There were 21 subjects included in this dataset, each with 2 experiments, which comes to about 175 samples(or rows) per experiment. Features are normalized and bounded within [-1,1].
An overview of the 561 feature variables in X_train, X_test, and all_data can be found at the bottom of this document, but because most are excluded from the tidy data set and are not transformed in any way, we will not provide a full description of all 561. The names of the 79 variables (46 mean variables and 33 standard deviation variables) that end up in the mean_by_activity tidy data set are are included in the Appendix, however.
The str_detect
function is used to find which of the 561 features in
the X_train and X_test files contain the string “mean” or “std”. Then
the select
function from the plyr library is used to select only those
columns from all_data
, in addition to index, y_numlabel(numbered
activity label), and subject columns. The resulting dataframe is
extracted_data
and has 10,299 rows x 82 columns. We use gsub() command
to remove () characters from the variable names, and substitute ’_" for
hyphens.
index | y_numlabels | subject | 46 mean variables | 33 standard deviation variables |
---|---|---|---|---|
Each subject/person performed six activities during the experiment
(WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING,
LAYING). To transform the number label for each activity (y_numlabell)
into its descriptive label, the script loads the activity_labels.txt
into the dataframe activity. We insert a column named ‘activity’
into the extracted_data dataframe after the ‘y_numlabels’ column and
initialize with a placeholder sequence, and then use the function
label_function
to lookup the descriptive label from the activity
dataframe and transform the activity column of the
extracted_data
dataframe from its initialized value to the correct
descriptive label. We then remove the y_numlabels column, as it is
redundant and no longer needed.
As described in step 1. Merging …, the features.txt file provides
variable names which are sufficiently descriptive.
names(X_test)<-t(features[,2])
assigns the column names of the x_test
and x_train dataframes to those features, which were loaded into the
variable features
. The activity column of extracted_data was
described in 3.
5. Creating a second, independent tidy data set with the average of each variable for each activity and each subject.
A tidy data set, by definition, has one variable per column and one
observation in each row. With 30 test subjects performing 6 activities
each, there should be 180 rows or “observations” for the mean of each of
79 variables of extracted_data. The extracted_data
dataframe is
melted into a long dataframe dmelt of 813,621 rows (79 variables x
10,299 values each) and 4 columns. The dcast()
function is then used
to calculate the mean for each activity/subject pair.
Activity | Subject | Variable | Value |
---|---|---|---|
The extracted tidy data set is stored in the dataframe
mean_by_activity
(180 rows x 81 columns). There is no need for an
index column in the final dataset, so it is removed. The final tidy data
set is saved to “tidy_data.txt” with the write.table()
function.
Activity | Subject | 46 mean variables | 33 standard deviation variables |
---|---|---|---|
1 of 6 descriptive labels | 1-30 | 46 calculated averages for given activity/label | 33 calculated averages for given activity/label |
As stated previously, the consolidated data from X_test and X_train
files contains 561 variables generated from the three-axis accelerometer
and gyroscope data readings of the experiments (Human Activity
Recognition Using Smartphones Dataset, Version 1.0). Only 79 of these
variables were extracted, including 46 mean variables and 33 standard
deviation variables, that end up in the tidy_data.txt file. As
shown under heading 5, the first two columns of the mean_by_activity
dataframe and tidy_data.txt are Activity and Subject variables, so
there are a total of 81 columns.
Mean variables (46)
Appearing in following order as Columns 3-48 in mean_by_activity
dataframe or in tidy_data.txt.
1-16 | 17-32 | 33-46 |
---|---|---|
tBodyAcc_mean_X | tGravityAccMag_mean | fBodyGyro_mean_X |
tBodyAcc_mean_Y | tBodyAccJerkMag_mean | fBodyGyro_mean_Y |
tBodyAcc_mean_Z | tBodyGyroMag_mean | fBodyGyro_mean_Z |
tGravityAcc_mean_X | tBodyGyroJerkMag_mean | fBodyGyro_meanFreq_X |
tGravityAcc_mean_Y | fBodyAcc_mean_X | fBodyGyro_meanFreq_Y |
tGravityAcc_mean_Z | fBodyAcc_mean_Y | fBodyGyro_meanFreq_Z |
tBodyAccJerk_mean_X | fBodyAcc_mean_Z | fBodyAccMag_mean |
tBodyAccJerk_mean_Y | fBodyAcc_meanFreq_X | fBodyAccMag_meanFreq |
tBodyAccJerk_mean_Z | fBodyAcc_meanFreq_Y | fBodyBodyAccJerkMag_mean |
tBodyGyro_mean_X | fBodyAcc_meanFreq_Z | fBodyBodyAccJerkMag_meanFreq |
tBodyGyro_mean_Y | fBodyAccJerk_mean_X | fBodyBodyGyroMag_mean |
tBodyGyro_mean_Z | fBodyAccJerk_mean_Y | fBodyBodyGyroMag_meanFreq |
tBodyGyroJerk_mean_X | fBodyAccJerk_mean_Z | fBodyBodyGyroJerkMag_mean |
tBodyGyroJerk_mean_Y | fBodyAccJerk_meanFreq_X | fBodyBodyGyroJerkMag_meanFreq |
tBodyGyroJerk_mean_Z | fBodyAccJerk_meanFreq_Y | |
tBodyAccMag_mean | fBodyAccJerk_meanFreq_Z |
Standard deviation variables (33)
Appearing in order as Columns 49-81 in mean_by_activity dataframe or in tidy_data.txt.
1-11 | 12-22 | 23-33 |
---|---|---|
tBodyAcc_std_X | tBodyGyro_std_Z | fBodyAcc_std_Z |
tBodyAcc_std_Y | tBodyGyroJerk_std_X | fBodyAccJerk_std_X |
tBodyAcc_std_Z | tBodyGyroJerk_std_Y | fBodyAccJerk_std_Y |
tGravityAcc_std_X | tBodyGyroJerk_std_Z | fBodyAccJerk_std_Z |
tGravityAcc_std_Y | tBodyAccMag_std | fBodyGyro_std_X |
tGravityAcc_std_Z | tGravityAccMag_std | fBodyGyro_std_Y |
tBodyAccJerk_std_X | tBodyAccJerkMag_std | fBodyGyro_std_Z |
tBodyAccJerk_std_Y | tBodyGyroMag_std | fBodyAccMag_std |
tBodyAccJerk_std_Z | tBodyGyroJerkMag_std | fBodyBodyAccJerkMag_std |
tBodyGyro_std_X | fBodyAcc_std_X | fBodyBodyGyroMag_std |
tBodyGyro_std_Y | fBodyAcc_std_Y | fBodyBodyGyroJerkMag_std |