GitHub

Project Description

The purpose of this project is to demonstrate the ability to collect, work with, and clean a data set. The goal is to prepare tidy data that can be used for later analysis.

The required submissions include the following:

a tidy data set submitted as a text file,
a link to a Github repository with a script for performing the analysis (the github repository for this project is found at https://github.com/pjrowe/Get_and_Clean_Data_Week4/),
a code book called CodeBook.md that describes the variables, the data, and any transformations or work that was performed to clean up the data,
a README.md in the repo which explains how the scripts work and how they are connected.

Data Source

The dataset downloaded for cleaning was the Human Activity Recognition Using Smartphones Dataset, Version 1.0, via the link to in the Coursera materials [here].(https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip). The download process was automated in the R script.

README: How the Script Works

There is only one script file - run_analysis.R. It does the following tasks, which are explained in more detail in the headings below:

Merges the training and the test sets to create one data set.
Extracts only the measurements on the mean and standard deviation for each measurement.
Uses descriptive activity names to name the activities in the data set
Appropriately labels the data set with descriptive variable names.
From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.

1. Merging the training and the test sets to create one data set

The test and training datasets are spread out over the 6 files bulleted below. My script loads all these files into dataframes using the read.delim() function, adds an index column to the dataframes, and then performs two separate merges on the index columns: first, to connect the ‘y’ and ‘subject’ dataframes, and second, to add on the ‘x’ variables. These two merges are done for both the training and test data sets. We should also note that the X dataframes are given descriptive variable names from the features.txt file.

The dataframes test_data and train_data are then combined into all_data using rbind(). The all_data dataframe has 10,299 rows x 564 columns (rows = combined rows of test and training data, columns = index, subject, y_numlabels, 561 feature variables).

y_test : 2,947 observations(rows) and 1 column with a number label (1-6) for the activity performed by the experimental subjects. There are 9 subjects (2, 4, 9, 10, 12, 13, 18, 20, and 24) included vs. a total of 30, so 30% of the data were for the test data set and the remaining 21 subjects (70%) were part of the training set.
subject_test : 2,947 rows x 1 column with numbered label (1-30) of subject who performed the activity.
X_test : 2,947 observations(rows) x 561 columns of feature variables. For each subject, two experiments were run, each of about 163 rows (i.e., 163 rows x 9 subjects x 2 experiments = 2,934 total rows). Due to sampling and how the experiments were manually labeled, the samples or rows are not exactly the same from experiment to experiment. Features are normalized and bounded within [-1,1].
y_train : 7,352 rows x 1 column.
subject_train : 7,352 rows x 1 column, with numbered label (1-30) of subject who performed the activity.
X_train : 7,352 rows x 561 columns. There were 21 subjects included in this dataset, each with 2 experiments, which comes to about 175 samples(or rows) per experiment. Features are normalized and bounded within [-1,1].

An overview of the 561 feature variables in X_train, X_test, and all_data can be found at the bottom of this document, but because most are excluded from the tidy data set and are not transformed in any way, we will not provide a full description of all 561. The names of the 79 variables (46 mean variables and 33 standard deviation variables) that end up in the mean_by_activity tidy data set are are included in the Appendix, however.

2. Extracting only the measurements on the mean and standard deviation for each measurement

The str_detect function is used to find which of the 561 features in the X_train and X_test files contain the string “mean” or “std”. Then the select function from the plyr library is used to select only those columns from all_data, in addition to index, y_numlabel(numbered activity label), and subject columns. The resulting dataframe is extracted_data and has 10,299 rows x 82 columns. We use gsub() command to remove () characters from the variable names, and substitute ’_" for hyphens.

index	y_numlabels	subject	46 mean variables	33 standard deviation variables

3. Using descriptive activity names to name the activities in the data set

Each subject/person performed six activities during the experiment (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING). To transform the number label for each activity (y_numlabell) into its descriptive label, the script loads the activity_labels.txt into the dataframe activity. We insert a column named ‘activity’ into the extracted_data dataframe after the ‘y_numlabels’ column and initialize with a placeholder sequence, and then use the function label_function to lookup the descriptive label from the activity dataframe and transform the activity column of the extracted_data dataframe from its initialized value to the correct descriptive label. We then remove the y_numlabels column, as it is redundant and no longer needed.

4. Appropriately labeling the data set with descriptive variable names

As described in step 1. Merging …, the features.txt file provides variable names which are sufficiently descriptive. names(X_test)<-t(features[,2]) assigns the column names of the x_test and x_train dataframes to those features, which were loaded into the variable features. The activity column of extracted_data was described in 3.

5. Creating a second, independent tidy data set with the average of each variable for each activity and each subject.

A tidy data set, by definition, has one variable per column and one observation in each row. With 30 test subjects performing 6 activities each, there should be 180 rows or “observations” for the mean of each of 79 variables of extracted_data. The extracted_data dataframe is melted into a long dataframe dmelt of 813,621 rows (79 variables x 10,299 values each) and 4 columns. The dcast() function is then used to calculate the mean for each activity/subject pair.

Activity	Subject	Variable	Value

The extracted tidy data set is stored in the dataframe mean_by_activity (180 rows x 81 columns). There is no need for an index column in the final dataset, so it is removed. The final tidy data set is saved to “tidy_data.txt” with the write.table() function.

Activity	Subject	46 mean variables	33 standard deviation variables
1 of 6 descriptive labels	1-30	46 calculated averages for given activity/label	33 calculated averages for given activity/label

APPENDIX

79 Feature Variables Extracted from 561 Feature Variables

As stated previously, the consolidated data from X_test and X_train files contains 561 variables generated from the three-axis accelerometer and gyroscope data readings of the experiments (Human Activity Recognition Using Smartphones Dataset, Version 1.0). Only 79 of these variables were extracted, including 46 mean variables and 33 standard deviation variables, that end up in the tidy_data.txt file. As shown under heading 5, the first two columns of the mean_by_activity dataframe and tidy_data.txt are Activity and Subject variables, so there are a total of 81 columns.

Mean variables (46)

Appearing in following order as Columns 3-48 in mean_by_activity dataframe or in tidy_data.txt.

1-16	17-32	33-46
tBodyAcc_mean_X	tGravityAccMag_mean	fBodyGyro_mean_X
tBodyAcc_mean_Y	tBodyAccJerkMag_mean	fBodyGyro_mean_Y
tBodyAcc_mean_Z	tBodyGyroMag_mean	fBodyGyro_mean_Z
tGravityAcc_mean_X	tBodyGyroJerkMag_mean	fBodyGyro_meanFreq_X
tGravityAcc_mean_Y	fBodyAcc_mean_X	fBodyGyro_meanFreq_Y
tGravityAcc_mean_Z	fBodyAcc_mean_Y	fBodyGyro_meanFreq_Z
tBodyAccJerk_mean_X	fBodyAcc_mean_Z	fBodyAccMag_mean
tBodyAccJerk_mean_Y	fBodyAcc_meanFreq_X	fBodyAccMag_meanFreq
tBodyAccJerk_mean_Z	fBodyAcc_meanFreq_Y	fBodyBodyAccJerkMag_mean
tBodyGyro_mean_X	fBodyAcc_meanFreq_Z	fBodyBodyAccJerkMag_meanFreq
tBodyGyro_mean_Y	fBodyAccJerk_mean_X	fBodyBodyGyroMag_mean
tBodyGyro_mean_Z	fBodyAccJerk_mean_Y	fBodyBodyGyroMag_meanFreq
tBodyGyroJerk_mean_X	fBodyAccJerk_mean_Z	fBodyBodyGyroJerkMag_mean
tBodyGyroJerk_mean_Y	fBodyAccJerk_meanFreq_X	fBodyBodyGyroJerkMag_meanFreq
tBodyGyroJerk_mean_Z	fBodyAccJerk_meanFreq_Y
tBodyAccMag_mean	fBodyAccJerk_meanFreq_Z

Standard deviation variables (33)

Appearing in order as Columns 49-81 in mean_by_activity dataframe or in tidy_data.txt.

1-11	12-22	23-33
tBodyAcc_std_X	tBodyGyro_std_Z	fBodyAcc_std_Z
tBodyAcc_std_Y	tBodyGyroJerk_std_X	fBodyAccJerk_std_X
tBodyAcc_std_Z	tBodyGyroJerk_std_Y	fBodyAccJerk_std_Y
tGravityAcc_std_X	tBodyGyroJerk_std_Z	fBodyAccJerk_std_Z
tGravityAcc_std_Y	tBodyAccMag_std	fBodyGyro_std_X
tGravityAcc_std_Z	tGravityAccMag_std	fBodyGyro_std_Y
tBodyAccJerk_std_X	tBodyAccJerkMag_std	fBodyGyro_std_Z
tBodyAccJerk_std_Y	tBodyGyroMag_std	fBodyAccMag_std
tBodyAccJerk_std_Z	tBodyGyroJerkMag_std	fBodyBodyAccJerkMag_std
tBodyGyro_std_X	fBodyAcc_std_X	fBodyBodyGyroMag_std
tBodyGyro_std_Y	fBodyAcc_std_Y	fBodyBodyGyroJerkMag_std

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
CodeBook.Rmd		CodeBook.Rmd
CodeBook.html		CodeBook.html
CodeBook.md		CodeBook.md
Get_and_Clean_Data_Week4.Rproj		Get_and_Clean_Data_Week4.Rproj
README.Rmd		README.Rmd
README.html		README.html
README.md		README.md
run_analysis.R		run_analysis.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Project Description

Data Source

README: How the Script Works

1. Merging the training and the test sets to create one data set

2. Extracting only the measurements on the mean and standard deviation for each measurement

3. Using descriptive activity names to name the activities in the data set

4. Appropriately labeling the data set with descriptive variable names

5. Creating a second, independent tidy data set with the average of each variable for each activity and each subject.

APPENDIX

79 Feature Variables Extracted from 561 Feature Variables

About

Releases

Packages

Languages

pjrowe/Get_and_Clean_Data_Week4

Folders and files

Latest commit

History

Repository files navigation

Project Description

Data Source

README: How the Script Works

1. Merging the training and the test sets to create one data set

2. Extracting only the measurements on the mean and standard deviation for each measurement

3. Using descriptive activity names to name the activities in the data set

4. Appropriately labeling the data set with descriptive variable names

5. Creating a second, independent tidy data set with the average of each variable for each activity and each subject.

APPENDIX

79 Feature Variables Extracted from 561 Feature Variables

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages