GPS Analysis v1.0

Date completed	Feb 5, 2024
Release where first appeared	v2.0
Researcher / Developer	Georgios Efstathiadis

1 – Syntax

import openwillis as ow

hourly, daily, summary = ow.gps_analysis(filepath = '', timezone = 'US/Eastern')

2 – Methods

This function is used for summarizing geolocation information. It requires two inputs. The first is a CSV file containing four columns: timestamps, latitude, longitude, and the accuracy of GPS measurement from the source device. The second is the timezone the data was collected in.

The GPS data are first processed using the Forest library, which includes modules for GPS processing and imputation. Missing data are imputed using the bidirectional imputation algorithm described in “Bidirectional imputation of spatial GPS trajectories with missingness using sparse online Gaussian Process” by G. Liu and J.P. Onnela. The data are then separated into flights and pauses, where a flight “is defined to be a longest straight-line trip of a particle from one location to another without a directional change or pause” (taken from the Forest documentation).

This matrix of flights and pauses is then used to calculate summary statistics of interest at an hourly and a daily level. The timestamps are converted to date and time using the timezone string that corresponds to the location of the device. Valid timezone values are those of the pytz library; the full list is available in pytz.all_timezones. The home location is taken to be the place where the source device spends the most time at night (between 7pm and 9am) across the dataset.
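The home-location rule above can be illustrated with a minimal sketch. This is not the OpenWillis or Forest implementation: the function name `infer_home`, the pause tuple layout, the coordinate rounding used to group nearby points, and the rule that a pause counts as night-time when it starts in the 7pm to 9am window are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def infer_home(pauses, night_start=19, night_end=9, precision=3):
    """Pick the location with the most night-time pause hours.

    pauses: iterable of (lat, lon, start_datetime, duration_hours) tuples
    (a hypothetical layout, not the Forest data structure).
    """
    hours_at = defaultdict(float)
    for lat, lon, start, hours in pauses:
        # Simplification: a pause is "night-time" if it starts at or after
        # 7pm or before 9am local time.
        if start.hour >= night_start or start.hour < night_end:
            # Group nearby coordinates by rounding (an assumption).
            hours_at[(round(lat, precision), round(lon, precision))] += hours
    if not hours_at:
        # Mirrors the documented behaviour: no night data, no home location.
        raise ValueError("no night-time data; home location cannot be inferred")
    return max(hours_at, key=hours_at.get)


# Made-up pauses for illustration only.
example_pauses = [
    (40.6782, -73.9442, datetime(2023, 9, 20, 22, 0), 8.0),  # overnight
    (40.7128, -74.0060, datetime(2023, 9, 20, 10, 0), 7.0),  # daytime, ignored
    (40.7128, -74.0060, datetime(2023, 9, 20, 20, 0), 1.0),  # short evening stop
]
home = infer_home(example_pauses)
```

Here the overnight pause dominates the night-time total, so its rounded coordinates are returned as the home location.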

The CSV file needs to contain at least 60 observations per hour for at least 5% of the hours of analysis in order to provide enough information to process and impute any missing data. Additionally, the dataset needs to contain data in the 7pm to 9am interval; otherwise the home location cannot be inferred and the analysis will fail.
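A rough pre-check of the data-density requirement might look like the sketch below. It is only an approximation of the rule stated above, not the check performed by the library: the function name `has_enough_data` is hypothetical, and "hours of analysis" is assumed to mean every clock hour from the first to the last observation.

```python
from collections import Counter
from datetime import datetime, timedelta


def has_enough_data(timestamps, min_obs_per_hour=60, min_hour_fraction=0.05):
    """Return True if enough clock hours are densely observed."""
    if not timestamps:
        return False
    # Count observations falling in each clock hour.
    per_hour = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    first, last = min(per_hour), max(per_hour)
    # Assumption: the analysis window runs from the first to the last hour.
    total_hours = int((last - first).total_seconds() // 3600) + 1
    dense_hours = sum(1 for n in per_hour.values() if n >= min_obs_per_hour)
    return dense_hours / total_hours >= min_hour_fraction


# One densely observed hour (one fix per minute) plus a single fix 9 hours later.
start = datetime(2023, 9, 20, 8, 0, 0)
dense_hour = [start + timedelta(minutes=i) for i in range(60)]
ok = has_enough_data(dense_hour + [start + timedelta(hours=9)])
```

In this made-up example 1 of 10 hours meets the 60-observation threshold, which is above the 5% cut-off, so the check passes.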

The final output will include summary statistics for each hour or day respectively:

datetime of the summaries
- YYYY-MM-DD for daily
- YYYY-MM-DD HH_00_00 for hourly
observed_time, which indicates how much of the GPS data was observed rather than imputed. At the daily level, observed_time_day and observed_time_night are also included; they split observed_time into daytime (8am to 8pm) and night-time (8pm to 8am).
dist_travelled, which specifies the distance travelled in km
home_time, which is the number of hours spent at home
home_max_dist, which indicates the maximum distance the person was from home
home_mean_dist, which indicates the average distance the person was from home
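To make the two distance-from-home measures concrete, the sketch below computes them with the haversine great-circle formula. The source does not state how the library measures distance or whether the mean is time-weighted, so the haversine choice, the unweighted mean, and the function names `haversine_km` and `home_distance_stats` are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def home_distance_stats(home, points):
    """Return (max, mean) distance in km from home over a list of (lat, lon)."""
    dists = [haversine_km(home[0], home[1], lat, lon) for lat, lon in points]
    # Assumption: a plain unweighted mean over the points.
    return max(dists), sum(dists) / len(dists)


# Made-up coordinates: one point at home, one a degree of longitude away.
max_dist, mean_dist = home_distance_stats((0.0, 0.0), [(0.0, 0.0), (0.0, 1.0)])
```

One degree of longitude at the equator is about 111 km, so the maximum here is roughly 111 km and the unweighted mean is half of that.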

2.1 – Hourly measures

The function’s first output is an hourly level summary. This includes:

datetime of the measure, formatted as “YYYY-MM-DD HH_00_00”.
observed_time, which indicates how much of the GPS data was observed in the input file rather than imputed.
dist_travelled, which specifies the distance travelled in km.
home_time, which is the number of hours spent at home.
home_max_dist, which indicates the maximum distance the person was from home.
home_mean_dist, which indicates the average distance the person was from home.

2.2 – Daily measures

The function’s second output is a daily level summary. This includes:

datetime of the measure, formatted as “YYYY-MM-DD”.
observed_time, which indicates how much of the GPS data was observed in the input file rather than imputed.
observed_time_day, which indicates how much of the GPS data was observed in the input file, rather than imputed, during the day (8am to 8pm).
observed_time_night, which indicates how much of the GPS data was observed in the input file, rather than imputed, during the night (8pm to 8am).
dist_travelled, which specifies the distance travelled in km.
home_time, which is the number of hours spent at home.
home_max_dist, which indicates the maximum distance the person was from home.
home_mean_dist, which indicates the average distance the person was from home.

2.3 – Summary

The summary dataframe compiles file-level information. It mostly includes the mean and standard deviation of the daily measures.

no_days, the number of days analyzed
- length of the daily dataframe
total_observed_time, the total time of observation
- sum of observed_time from the daily summary statistics
mean_move_time, the average time spent moving per day
- mean of move_time from the daily summary statistics
sd_move_time, the standard deviation of time spent moving per day
- std of move_time from the daily summary statistics
mean_pause_time, the average time spent idle per day
- mean of pause_time from the daily summary statistics
sd_pause_time, the standard deviation of time spent idle per day
- std of pause_time from the daily summary statistics
mean_dist_travelled, the average distance traveled per day
- mean of dist_travelled from the daily summary statistics
sd_dist_travelled, the standard deviation of distance traveled per day
- std of dist_travelled from the daily summary statistics
mean_home_time, the average time spent at home per day
- mean of home_time from the daily summary statistics
sd_home_time, the standard deviation of time spent at home per day
- std of home_time from the daily summary statistics
mean_home_max_dist, the average max distance from home per day
- mean of home_max_dist from the daily summary statistics
sd_home_max_dist, the standard deviation of max distance from home per day
- std of home_max_dist from the daily summary statistics
mean_home_mean_dist, the average mean distance from home per day
- mean of home_mean_dist from the daily summary statistics
sd_home_mean_dist, the standard deviation of mean distance from home per day
- std of home_mean_dist from the daily summary statistics
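The aggregation listed above can be sketched in plain Python as follows. This is an illustration rather than the library's code: the function name `summarize_daily` and the list-of-dicts input are hypothetical, the daily values are made up, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the source only says "std".

```python
import statistics

DAILY_MEASURES = [
    "move_time", "pause_time", "dist_travelled",
    "home_time", "home_max_dist", "home_mean_dist",
]


def summarize_daily(daily_rows):
    """Build file-level summary fields from a list of daily measure dicts."""
    summary = {
        "no_days": len(daily_rows),
        "total_observed_time": sum(row["observed_time"] for row in daily_rows),
    }
    for name in DAILY_MEASURES:
        values = [row[name] for row in daily_rows]
        summary[f"mean_{name}"] = statistics.mean(values)
        # Assumption: sample standard deviation; undefined for a single day.
        summary[f"sd_{name}"] = (
            statistics.stdev(values) if len(values) > 1 else float("nan")
        )
    return summary


# Made-up daily values for illustration only.
example_days = [
    {"observed_time": 6.0, "move_time": 2.0, "pause_time": 22.0,
     "dist_travelled": 1.0, "home_time": 18.0,
     "home_max_dist": 1.0, "home_mean_dist": 0.5},
    {"observed_time": 5.0, "move_time": 4.0, "pause_time": 20.0,
     "dist_travelled": 3.0, "home_time": 16.0,
     "home_max_dist": 3.0, "home_mean_dist": 1.5},
]
example_summary = summarize_daily(example_days)
```

The resulting keys follow the field names in the list above, for example `mean_move_time` and `sd_home_mean_dist`.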

3 – Inputs

3.1 – `data_path`

Type	String
Description	Path to CSV that contains GPS data.

3.2 – `timezone`

Type	String
Description	The time zone in which the GPS data were collected. Time zone codes are the same as those used in pytz; the full list is available in pytz.all_timezones.

4 – Outputs

4.1 – `hourly`

Type	pd.DataFrame
Description	Hour-level summary statistics for GPS data.

The data frame is the transpose of the table below:

datetime
observed_time
move_time
pause_time
dist_travelled
home_time
home_max_dist
home_mean_dist

4.2 – `daily`

Type	pd.DataFrame
Description	Day-level summary statistics for GPS data.

The data frame is the transpose of the table below:

datetime
observed_time
observed_time_day
observed_time_night
move_time
pause_time
dist_travelled
home_time
home_max_dist
home_mean_dist

4.3 – `summary`

Type	pd.DataFrame
Description	File-level summary statistics for GPS data.

The data frame is the transpose of the table below:

no_days
total_observed_time
mean_move_time
sd_move_time
mean_pause_time
sd_pause_time
mean_dist_travelled
sd_dist_travelled
mean_home_time
sd_home_time
mean_home_max_dist
sd_home_max_dist
mean_home_mean_dist
sd_home_mean_dist

5 – Example use

hourly, daily, summary = ow.gps_analysis(data_path = 'data.csv', timezone = 'US/Eastern')

daily.head(2)

datetime	observed_time	observed_time_day	observed_time_night	move_time	pause_time	dist_travelled	home_time	home_max_dist
2023-09-20	6.24	3.12	3.12	2.4	21.6	2.12	18.2	1.02
2023-09-21	5.65	2.65	3	8.6	15.4	4.15	17	0.25

6 – Dependencies

Below are dependencies specific to calculation of this measure.

Dependency	License	Justification
Forest	BSD 3-Clause	Used for GPS imputation algorithm and basic GPS trajectory processing

OpenWillis was developed by a small team of clinicians, scientists, and engineers based in Brooklyn, NY.
