Skip to content
This repository has been archived by the owner on Oct 26, 2019. It is now read-only.

pjrowe/Get_and_Clean_Data_Week4

Repository files navigation

Project Description

The purpose of this project is to demonstrate the ability to collect, work with, and clean a data set. The goal is to prepare tidy data that can be used for later analysis.

The required submissions include the following:

  1. a tidy data set submitted as a text file,
  2. a link to a Github repository with a script for performing the analysis (the github repository for this project is found at https://github.com/pjrowe/Get_and_Clean_Data_Week4/),
  3. a code book called CodeBook.md that describes the variables, the data, and any transformations or work that was performed to clean up the data,
  4. a README.md in the repo which explains how the scripts work and how they are connected.

Data Source

The dataset downloaded for cleaning was the Human Activity Recognition Using Smartphones Dataset, Version 1.0, via the link to in the Coursera materials [here].(https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip). The download process was automated in the R script.

README: How the Script Works

There is only one script file - run_analysis.R. It does the following tasks, which are explained in more detail in the headings below:

  1. Merges the training and the test sets to create one data set.
  2. Extracts only the measurements on the mean and standard deviation for each measurement.
  3. Uses descriptive activity names to name the activities in the data set
  4. Appropriately labels the data set with descriptive variable names.
  5. From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.

1. Merging the training and the test sets to create one data set

The test and training datasets are spread out over the 6 files bulleted below. My script loads all these files into dataframes using the read.delim() function, adds an index column to the dataframes, and then performs two separate merges on the index columns: first, to connect the ‘y’ and ‘subject’ dataframes, and second, to add on the ‘x’ variables. These two merges are done for both the training and test data sets. We should also note that the X dataframes are given descriptive variable names from the features.txt file.

The dataframes test_data and train_data are then combined into all_data using rbind(). The all_data dataframe has 10,299 rows x 564 columns (rows = combined rows of test and training data, columns = index, subject, y_numlabels, 561 feature variables).

  • y_test : 2,947 observations(rows) and 1 column with a number label (1-6) for the activity performed by the experimental subjects. There are 9 subjects (2, 4, 9, 10, 12, 13, 18, 20, and 24) included vs. a total of 30, so 30% of the data were for the test data set and the remaining 21 subjects (70%) were part of the training set.

  • subject_test : 2,947 rows x 1 column with numbered label (1-30) of subject who performed the activity.

  • X_test : 2,947 observations(rows) x 561 columns of feature variables. For each subject, two experiments were run, each of about 163 rows (i.e., 163 rows x 9 subjects x 2 experiments = 2,934 total rows). Due to sampling and how the experiments were manually labeled, the samples or rows are not exactly the same from experiment to experiment. Features are normalized and bounded within [-1,1].

  • y_train : 7,352 rows x 1 column.

  • subject_train : 7,352 rows x 1 column, with numbered label (1-30) of subject who performed the activity.

  • X_train : 7,352 rows x 561 columns. There were 21 subjects included in this dataset, each with 2 experiments, which comes to about 175 samples(or rows) per experiment. Features are normalized and bounded within [-1,1].

An overview of the 561 feature variables in X_train, X_test, and all_data can be found at the bottom of this document, but because most are excluded from the tidy data set and are not transformed in any way, we will not provide a full description of all 561. The names of the 79 variables (46 mean variables and 33 standard deviation variables) that end up in the mean_by_activity tidy data set are are included in the Appendix, however.

2. Extracting only the measurements on the mean and standard deviation for each measurement

The str_detect function is used to find which of the 561 features in the X_train and X_test files contain the string “mean” or “std”. Then the select function from the plyr library is used to select only those columns from all_data, in addition to index, y_numlabel(numbered activity label), and subject columns. The resulting dataframe is extracted_data and has 10,299 rows x 82 columns. We use gsub() command to remove () characters from the variable names, and substitute ’_" for hyphens.

index y_numlabels subject 46 mean variables 33 standard deviation variables

3. Using descriptive activity names to name the activities in the data set

Each subject/person performed six activities during the experiment (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING). To transform the number label for each activity (y_numlabell) into its descriptive label, the script loads the activity_labels.txt into the dataframe activity. We insert a column named ‘activity’ into the extracted_data dataframe after the ‘y_numlabels’ column and initialize with a placeholder sequence, and then use the function label_function to lookup the descriptive label from the activity dataframe and transform the activity column of the extracted_data dataframe from its initialized value to the correct descriptive label. We then remove the y_numlabels column, as it is redundant and no longer needed.

4. Appropriately labeling the data set with descriptive variable names

As described in step 1. Merging …, the features.txt file provides variable names which are sufficiently descriptive. names(X_test)<-t(features[,2]) assigns the column names of the x_test and x_train dataframes to those features, which were loaded into the variable features. The activity column of extracted_data was described in 3.

5. Creating a second, independent tidy data set with the average of each variable for each activity and each subject.

A tidy data set, by definition, has one variable per column and one observation in each row. With 30 test subjects performing 6 activities each, there should be 180 rows or “observations” for the mean of each of 79 variables of extracted_data. The extracted_data dataframe is melted into a long dataframe dmelt of 813,621 rows (79 variables x 10,299 values each) and 4 columns. The dcast() function is then used to calculate the mean for each activity/subject pair.

Activity Subject Variable Value

The extracted tidy data set is stored in the dataframe mean_by_activity (180 rows x 81 columns). There is no need for an index column in the final dataset, so it is removed. The final tidy data set is saved to “tidy_data.txt” with the write.table() function.

Activity Subject 46 mean variables 33 standard deviation variables
1 of 6 descriptive labels 1-30 46 calculated averages for given activity/label 33 calculated averages for given activity/label

APPENDIX

79 Feature Variables Extracted from 561 Feature Variables

As stated previously, the consolidated data from X_test and X_train files contains 561 variables generated from the three-axis accelerometer and gyroscope data readings of the experiments (Human Activity Recognition Using Smartphones Dataset, Version 1.0). Only 79 of these variables were extracted, including 46 mean variables and 33 standard deviation variables, that end up in the tidy_data.txt file. As shown under heading 5, the first two columns of the mean_by_activity dataframe and tidy_data.txt are Activity and Subject variables, so there are a total of 81 columns.

Mean variables (46)

Appearing in following order as Columns 3-48 in mean_by_activity dataframe or in tidy_data.txt.

1-16 17-32 33-46
tBodyAcc_mean_X tGravityAccMag_mean fBodyGyro_mean_X
tBodyAcc_mean_Y tBodyAccJerkMag_mean fBodyGyro_mean_Y
tBodyAcc_mean_Z tBodyGyroMag_mean fBodyGyro_mean_Z
tGravityAcc_mean_X tBodyGyroJerkMag_mean fBodyGyro_meanFreq_X
tGravityAcc_mean_Y fBodyAcc_mean_X fBodyGyro_meanFreq_Y
tGravityAcc_mean_Z fBodyAcc_mean_Y fBodyGyro_meanFreq_Z
tBodyAccJerk_mean_X fBodyAcc_mean_Z fBodyAccMag_mean
tBodyAccJerk_mean_Y fBodyAcc_meanFreq_X fBodyAccMag_meanFreq
tBodyAccJerk_mean_Z fBodyAcc_meanFreq_Y fBodyBodyAccJerkMag_mean
tBodyGyro_mean_X fBodyAcc_meanFreq_Z fBodyBodyAccJerkMag_meanFreq
tBodyGyro_mean_Y fBodyAccJerk_mean_X fBodyBodyGyroMag_mean
tBodyGyro_mean_Z fBodyAccJerk_mean_Y fBodyBodyGyroMag_meanFreq
tBodyGyroJerk_mean_X fBodyAccJerk_mean_Z fBodyBodyGyroJerkMag_mean
tBodyGyroJerk_mean_Y fBodyAccJerk_meanFreq_X fBodyBodyGyroJerkMag_meanFreq
tBodyGyroJerk_mean_Z fBodyAccJerk_meanFreq_Y
tBodyAccMag_mean fBodyAccJerk_meanFreq_Z

Standard deviation variables (33)

Appearing in order as Columns 49-81 in mean_by_activity dataframe or in tidy_data.txt.

1-11 12-22 23-33
tBodyAcc_std_X tBodyGyro_std_Z fBodyAcc_std_Z
tBodyAcc_std_Y tBodyGyroJerk_std_X fBodyAccJerk_std_X
tBodyAcc_std_Z tBodyGyroJerk_std_Y fBodyAccJerk_std_Y
tGravityAcc_std_X tBodyGyroJerk_std_Z fBodyAccJerk_std_Z
tGravityAcc_std_Y tBodyAccMag_std fBodyGyro_std_X
tGravityAcc_std_Z tGravityAccMag_std fBodyGyro_std_Y
tBodyAccJerk_std_X tBodyAccJerkMag_std fBodyGyro_std_Z
tBodyAccJerk_std_Y tBodyGyroMag_std fBodyAccMag_std
tBodyAccJerk_std_Z tBodyGyroJerkMag_std fBodyBodyAccJerkMag_std
tBodyGyro_std_X fBodyAcc_std_X fBodyBodyGyroMag_std
tBodyGyro_std_Y fBodyAcc_std_Y fBodyBodyGyroJerkMag_std

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published