
GPS Analysis v1.0

GeorgiosEfstathiadis edited this page Nov 27, 2024 · 6 revisions
Date completed Feb 5, 2024
Release where first appeared v2.0
Researcher / Developer Georgios Efstathiadis

1 – Syntax

import openwillis as ow

hourly, daily, summary = ow.gps_analysis(data_path = '', timezone = 'US/Eastern')

2 – Methods

This function is used for summarizing geolocation information. It requires two inputs. The first is a CSV file containing four columns: timestamps, latitude, longitude, and the accuracy of GPS measurement from the source device. The second is the timezone the data was collected in.
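To make the expected input shape concrete, here is a minimal sketch of a sanity check that could be run on such a CSV before analysis. The header names (`timestamp`, `latitude`, `longitude`, `accuracy`) and the helper `check_gps_rows` are assumptions made for illustration; this page only states that the file has four columns, not what they are called or how timestamps are encoded.

```python
import csv
import io

# Hypothetical header names; the documentation only says the file has four columns.
EXPECTED_COLUMNS = ["timestamp", "latitude", "longitude", "accuracy"]


def check_gps_rows(csv_text):
    """Return a list of problems found in a four-column GPS CSV (illustrative only)."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            lat = float(row["latitude"])
            lon = float(row["longitude"])
            acc = float(row["accuracy"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: non-numeric value")
            continue
        if not -90 <= lat <= 90 or not -180 <= lon <= 180:
            problems.append(f"line {line_no}: coordinates out of range")
        if acc < 0:
            problems.append(f"line {line_no}: negative accuracy")
    return problems


sample = (
    "timestamp,latitude,longitude,accuracy\n"
    "1695211200,42.36,-71.06,12.5\n"
    "1695211260,142.36,-71.06,10.0\n"
)
print(check_gps_rows(sample))
```

The timestamp column is deliberately left unparsed here, since its format is not specified on this page.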

The GPS data are first processed using the Forest library, which includes modules for GPS processing and imputation. Missing data are imputed using the bidirectional imputation algorithm described in “Bidirectional imputation of spatial GPS trajectories with missingness using sparse online Gaussian Process” by G. Liu and J.P. Onnela. The data are then separated into flights and pauses, where a flight “is defined to be a longest straight-line trip of a particle from one location to another without a directional change or pause” (taken from the Forest documentation).

This matrix of flights and pauses is then used to calculate the summary statistics of interest at the hourly and daily levels. Timestamps are converted to local date and time using the timezone string that corresponds to the location of the device. Valid timezone values are those accepted by the pytz library; the full list is available as pytz.all_timezones. The home location is taken to be the place where the source device spends the most time at night, between 7pm and 9am, across the dataset.
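The home-location rule can be illustrated with a small sketch. This is not Forest's implementation: the function `infer_home`, the input layout (a list of pauses with local start times), the rounding of coordinates into grid cells, and the choice to attribute each pause to the hour in which it starts are all simplifying assumptions made here.

```python
from collections import defaultdict
from datetime import datetime


def is_night(hour):
    """True for local hours in the 7pm-to-9am window described above."""
    return hour >= 19 or hour < 9


def infer_home(pauses, precision=3):
    """Pick the rounded (lat, lon) cell with the most night-time pause hours.

    `pauses` is a list of (local_start, duration_hours, lat, lon) tuples.
    Each pause is attributed to the hour in which it starts (a simplification).
    """
    night_hours = defaultdict(float)
    for start, duration, lat, lon in pauses:
        if is_night(start.hour):
            night_hours[(round(lat, precision), round(lon, precision))] += duration
    if not night_hours:
        raise ValueError("no night-time data: home location cannot be inferred")
    return max(night_hours, key=night_hours.get)


pauses = [
    (datetime(2023, 9, 20, 22, 0), 7.0, 42.3601, -71.0589),  # overnight stay
    (datetime(2023, 9, 20, 13, 0), 4.0, 42.3736, -71.1097),  # daytime, ignored
    (datetime(2023, 9, 21, 20, 0), 1.5, 42.3467, -71.0972),  # evening elsewhere
]
print(infer_home(pauses))
```

The sketch raises an error when no pause falls in the night window, mirroring the requirement that the dataset contain night-time data.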

The CSV file needs to contain at least 60 observations per hour for at least 5% of the hours of analysis so that there is enough information to process the data and impute missing values. Additionally, the dataset needs to contain data in the 7pm to 9am interval; otherwise the home location cannot be inferred and the analysis will fail.
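The data-sufficiency requirement can be checked before running the analysis. The sketch below is one possible reading of the rule, not the check the library performs: it assumes timestamps are epoch seconds and that "hours of analysis" means every clock hour from the first to the last observation. The function name `has_enough_data` is hypothetical.

```python
from collections import Counter


def has_enough_data(epoch_seconds, min_obs_per_hour=60, min_hour_fraction=0.05):
    """Check the '60 observations per hour in at least 5% of hours' rule.

    Assumption: 'hours of analysis' spans every clock hour from the first
    to the last timestamp, inclusive.
    """
    if not epoch_seconds:
        return False
    per_hour = Counter(int(t // 3600) for t in epoch_seconds)
    total_hours = max(per_hour) - min(per_hour) + 1
    dense_hours = sum(1 for n in per_hour.values() if n >= min_obs_per_hour)
    return dense_hours / total_hours >= min_hour_fraction


start = 1695211200  # 2023-09-20 12:00:00 UTC
dense = [start + 60 * i for i in range(60)]  # one fix per minute for one hour
late_fix = [start + 9 * 3600]                # a single fix nine hours later

print(has_enough_data(dense + late_fix))       # 1 of 10 hours reaches 60 fixes
print(has_enough_data(dense[:30] + late_fix))  # no hour reaches 60 fixes
```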

The final output includes the following summary statistics for each hour or day, respectively:

  • datetime of the summaries
    • YYYY-MM-DD for daily
    • YYYY-MM-DD HH_00_00 for hourly
  • observed_time, which indicates how much of the GPS data was observed in the input file rather than imputed. At the daily level, observed_time_day and observed_time_night are also included; they split observed_time into the part that falls in the day (8am to 8pm) and the part that falls at night (8pm to 8am).
  • move_time, which is the time spent moving
  • pause_time, which is the time spent idle
  • dist_travelled, which specifies the distance moved in kilometers
  • home_time, which is the number of hours spent at home
  • home_max_dist, which indicates the maximum distance the person was from home
  • home_mean_dist, which indicates the average distance the person was from home

2.1 – Hourly measures

The function’s first output is an hourly level summary. This includes:

  • datetime of the measure, in the format “YYYY-MM-DD HH_00_00”.
  • observed_time, which indicates how much of the GPS data was observed in the input file rather than imputed.
  • move_time, which is the time spent moving.
  • pause_time, which is the time spent idle.
  • dist_travelled, which specifies the distance moved in kilometers.
  • home_time, which is the number of hours spent at home.
  • home_max_dist, which indicates the maximum distance the person was from home.
  • home_mean_dist, which indicates the average distance the person was from home.
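Because the hourly label uses underscores rather than colons in the time part, it does not parse with default ISO parsers. A minimal sketch of converting it back to a datetime with the standard library follows; it assumes the datetime column is stored as a string in exactly the format quoted above, and the helper name is hypothetical.

```python
from datetime import datetime

HOURLY_FORMAT = "%Y-%m-%d %H_%M_%S"  # matches labels such as "2023-09-20 14_00_00"


def parse_hourly_label(label):
    """Convert an hourly label string into a naive datetime object."""
    return datetime.strptime(label, HOURLY_FORMAT)


parsed = parse_hourly_label("2023-09-20 14_00_00")
print(parsed.isoformat())
```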

2.2 – Daily measures

The function’s second output is a daily level summary. This includes:

  • datetime of the measure, in the format “YYYY-MM-DD”.
  • observed_time, which indicates how much of the GPS data was observed in the input file rather than imputed.
  • observed_time_day, which is the part of observed_time that falls in the day (8am to 8pm).
  • observed_time_night, which is the part of observed_time that falls at night (8pm to 8am).
  • move_time, which is the time spent moving.
  • pause_time, which is the time spent idle.
  • dist_travelled, which specifies the distance moved in kilometers.
  • home_time, which is the number of hours spent at home.
  • home_max_dist, which indicates the maximum distance the person was from home.
  • home_mean_dist, which indicates the average distance the person was from home.
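To illustrate what home_max_dist and home_mean_dist measure, the sketch below computes great-circle distances from a home location using the haversine formula. This is an illustration under stated assumptions, not the library's computation: the function names are hypothetical, distances are expressed in kilometers (this page states the unit only for dist_travelled), and the mean is an unweighted average over points rather than a time-weighted one.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def home_distance_stats(home, points):
    """Return (max, mean) distance in km from `home` over a list of (lat, lon) points."""
    distances = [haversine_km(home[0], home[1], lat, lon) for lat, lon in points]
    return max(distances), sum(distances) / len(distances)


home = (0.0, 0.0)
points = [(0.0, 0.0), (0.0, 1.0)]  # at home, then one degree east along the equator
max_dist, mean_dist = home_distance_stats(home, points)
print(round(max_dist, 2), round(mean_dist, 2))
```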

2.3 – Summary

The summary dataframe compiles file-level information. It mostly includes the mean and standard deviation of the daily measures.

  • no_days, the number of days analyzed
    length of daily dataframe
  • total_observed_time, the total time of observation
    sum of observed_time from the daily summary statistics
  • mean_move_time, the average time spent moving per day
    mean of move_time from the daily summary statistics
  • sd_move_time, the standard deviation of time spent moving per day
    std of move_time from the daily summary statistics
  • mean_pause_time, the average time spent idle per day
    mean of pause_time from the daily summary statistics
  • sd_pause_time, the standard deviation of time spent idle per day
    std of pause_time from the daily summary statistics
  • mean_dist_travelled, the average distance traveled per day
    mean of dist_travelled from the daily summary statistics
  • sd_dist_travelled, the standard deviation of distance traveled per day
    std of dist_travelled from the daily summary statistics
  • mean_home_time, the average time spent at home per day
    mean of home_time from the daily summary statistics
  • sd_home_time, the standard deviation of time spent at home per day
    std of home_time from the daily summary statistics
  • mean_home_max_dist, the average max distance from home per day
    mean of home_max_dist from the daily summary statistics
  • sd_home_max_dist, the standard deviation of max distance from home per day
    std of home_max_dist from the daily summary statistics
  • mean_home_mean_dist, the average mean distance from home per day
    mean of home_mean_dist from the daily summary statistics
  • sd_home_mean_dist, the standard deviation of mean distance from home per day
    std of home_mean_dist from the daily summary statistics
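The aggregation described in this list can be sketched with the standard library. The rows below reuse values from the example output in section 5; the function `summarize` is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since this page does not say which estimator is used.

```python
import statistics

# Two daily rows, using values from the example output in section 5.
daily = [
    {"datetime": "2023-09-20", "observed_time": 6.24, "dist_travelled": 2.12, "home_time": 18.2},
    {"datetime": "2023-09-21", "observed_time": 5.65, "dist_travelled": 4.15, "home_time": 17.0},
]


def summarize(daily_rows, measures=("dist_travelled", "home_time")):
    """Build file-level statistics from a list of daily rows (illustrative only)."""
    summary = {
        "no_days": len(daily_rows),
        "total_observed_time": sum(r["observed_time"] for r in daily_rows),
    }
    for m in measures:
        values = [r[m] for r in daily_rows]
        summary[f"mean_{m}"] = statistics.mean(values)
        # Sample standard deviation (n - 1 denominator); an assumption.
        summary[f"sd_{m}"] = statistics.stdev(values) if len(values) > 1 else float("nan")
    return summary


summary = summarize(daily)
for key, value in summary.items():
    print(key, round(value, 3))
```

The same pattern extends to the other daily measures (move_time, pause_time, home_max_dist, home_mean_dist) by adding them to `measures`.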

3 – Inputs

3.1 – data_path

Type String
Description Path to CSV that contains GPS data.

3.2 – timezone

Type String
Description The time zone in which the GPS data were collected. Time zone names are the same as those used by pytz; the full list is available as pytz.all_timezones.

4 – Outputs

4.1 – hourly

Type pd.DataFrame
Description Hour-level summary statistics for GPS data.

The data frame has the following columns:

datetime
observed_time
move_time
pause_time
dist_travelled
home_time
home_max_dist
home_mean_dist

4.2 – daily

Type pd.DataFrame
Description Day-level summary statistics for GPS data.

The data frame has the following columns:

datetime
observed_time
observed_time_day
observed_time_night
move_time
pause_time
dist_travelled
home_time
home_max_dist
home_mean_dist

4.3 – summary

Type pd.DataFrame
Description File-level summary statistics for GPS data.

The data frame has the following columns:

no_days
total_observed_time
mean_move_time
sd_move_time
mean_pause_time
sd_pause_time
mean_dist_travelled
sd_dist_travelled
mean_home_time
sd_home_time
mean_home_max_dist
sd_home_max_dist
mean_home_mean_dist
sd_home_mean_dist

5 – Example use

import openwillis as ow

hourly, daily, summary = ow.gps_analysis(data_path = 'data.csv', timezone = 'US/Eastern')
daily.head(2)  # show the first two days
datetime    observed_time  observed_time_day  observed_time_night  move_time  pause_time  dist_travelled  home_time  home_max_dist
2023-09-20  6.24           3.12               3.12                 2.4        21.6        2.12            18.2       1.02
2023-09-21  5.65           2.65               3                    8.6        15.4        4.15            17         0.25

6 – Dependencies

Below are dependencies specific to calculation of this measure.

Dependency License Justification
Forest BSD 3-Clause Used for GPS imputation algorithm and basic GPS trajectory processing