The sensor toolkit is a collection of R scripts to help manage a low-cost air pollution sensor network. Most of the analysis is designed to run in a "real-time" framework, making use of rolling metrics.
library(devtools)
# load the dev sensortoolkit from GitHub
install_github('gmiskell/sensortoolkit')
library(sensortoolkit)
library(tidyverse) # the examples below also use dplyr, tidyr, and ggplot2
Some example data, example.RData, has been included for demonstration purposes. Note that the pollution data (pol) has been deliberately replaced with random values for proprietary reasons.
load('~/example.RData')
example
# A tibble: 22,362 x 5
date site lat lon pol
<dttm> <chr> <dbl> <dbl> <dbl>
1 2017-09-10 00:00:00 AQY-AA011 -43.5 173. 6.20
2 2017-09-11 00:00:00 AQY-AA011 -43.5 173. 3.19
3 2017-09-12 00:00:00 AQY-AA011 -43.5 173. 5.34
4 2017-09-13 00:00:00 AQY-AA011 -43.5 173. 5.56
5 2017-09-14 00:00:00 AQY-AA011 -43.5 173. 1.53
6 2017-09-15 00:00:00 AQY-AA011 -43.5 173. 3.72
7 2017-09-16 00:00:00 AQY-AA011 -43.5 173. 3.09
8 2017-09-17 00:00:00 AQY-AA011 -43.5 173. 4.28
9 2017-09-18 00:00:00 AQY-AA011 -43.5 173. 5.31
10 2017-09-19 00:00:00 AQY-AA011 -43.5 173. 3.98
# ... with 22,352 more rows
Proxies, signals from an independent site or set of sites with statistical properties similar to the observation, were introduced in this paper. In brief, the paper aimed to minimise the costs, in both expenditure and maintenance, of running an air quality network. It presented a simple statistical framework for automated verification of data reliability that does not need extensive training datasets. The data come from a network of both low-cost and regulatory instruments measuring ground-level ozone, a pollutant with a relatively regular spatiotemporal profile, although significant small-scale variations can occur. The framework takes advantage of these characteristics to detect calibration drift in low-cost sensors by applying running tests against a suitably selected comparison dataset, or 'proxy'. Land use was found to be a good determinant for a proxy, and drift was detected within one week, on average. The results were also used to correct sensor drift, as described in this paper.
Here, the provided code uses only low-cost sensor data; nearby regulatory data has been found to be a good proxy where available. The first proxy option, networkmedianFUN, builds a proxy from the network median at each time step:
networkmedian_example <- example %>%
do(networkmedianFUN(., obs = 'pol', group = 'date', id = 'site')) %>%
rename(date = group, site = id)
networkmedian_example
# A tibble: 22,362 x 4
# Groups: date [1,246]
date obs network.proxy site
<dttm> <dbl> <dbl> <chr>
1 2017-09-09 11:00:00 5.19 5.46 AQY-AA011
2 2017-09-09 11:00:00 1.95 5.46 AQY-AA012
3 2017-09-09 11:00:00 1.95 5.46 AQY-AA013
4 2017-09-09 11:00:00 4.87 5.46 AQY-AA014
5 2017-09-09 11:00:00 3.93 5.46 AQY-AA015
6 2017-09-09 11:00:00 5.51 5.19 AQY-AA016
7 2017-09-09 11:00:00 5.53 5.19 AQY-AA017
8 2017-09-09 11:00:00 3.99 5.46 AQY-AA018
9 2017-09-09 11:00:00 6.05 5.19 AQY-AA019
10 2017-09-09 11:00:00 3.79 5.46 AQY-AA020
# ... with 22,352 more rows
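The network proxy appears to be a leave-one-out median: note in the output above that two sites at the same timestamp can carry different proxy values. A minimal sketch of that idea (not necessarily the package implementation):
# for each row, take the median of all *other* sites at the same time step
example %>%
  group_by(date) %>%
  mutate(network.proxy = sapply(seq_along(pol),
                                function(i) median(pol[-i], na.rm = TRUE))) %>%
  ungroup()
An alternative proxy, nearestsiteFUN, uses the concurrent value at the geographically closest site: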
nearestsite_example <- example %>%
do(nearestsiteFUN(., group = 'site', obs = 'pol', lat = 'lat', lon = 'lon')) %>%
rename(site = group)
as_tibble(nearestsite_example)
# A tibble: 22,362 x 4
date site obs nearest.proxy
<dttm> <chr> <dbl> <dbl>
1 2017-09-10 00:00:00 AQY-AA011 6.20 9.40
2 2017-09-11 00:00:00 AQY-AA011 3.19 4.98
3 2017-09-12 00:00:00 AQY-AA011 5.34 5.09
4 2017-09-13 00:00:00 AQY-AA011 5.56 7.70
5 2017-09-14 00:00:00 AQY-AA011 1.53 3.24
6 2017-09-15 00:00:00 AQY-AA011 3.72 2.71
7 2017-09-16 00:00:00 AQY-AA011 3.09 7.07
8 2017-09-17 00:00:00 AQY-AA011 4.28 2.85
9 2017-09-18 00:00:00 AQY-AA011 5.31 1.22
10 2017-09-19 00:00:00 AQY-AA011 3.98 5.95
# ... with 22,352 more rows
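For intuition, the neighbour selection behind nearestsiteFUN can be sketched with a pairwise distance matrix (a sketch only; geosphere is an assumption here, not a known dependency of the package):
library(geosphere)
sites <- example %>% distinct(site, lat, lon)
d <- distm(as.matrix(sites[, c('lon', 'lat')])) # distm expects (lon, lat)
diag(d) <- Inf # a site is not its own neighbour
sites$nearest <- sites$site[apply(d, 1, which.min)]
nearestNSitesFUN generalises this to the nearest n sites, returning the neighbour list along with distances: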
nearestn_example <- example %>%
do(nearestNSitesFUN(., id = 'site', lat = 'lat', lon = 'lon', n = 3)) %>%
rename(site = id)
as_tibble(nearestn_example)
# A tibble: 60 x 8
site lat lon n_id n_lat n_lon dist order
* <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <int>
1 AQY-AA011 -43.5 173. AQY-AA026 -43.5 173. 1775. 1
2 AQY-AA012 -43.5 173. AQY-AA014 -43.5 173. 2276. 1
3 AQY-AA013 -43.5 173. AQY-AA016 -43.5 173. 260. 1
4 AQY-AA014 -43.5 173. AQY-AA024 -43.5 173. 2256. 1
5 AQY-AA015 -43.5 173. AQY-AA027 -43.5 173. 3594. 1
6 AQY-AA016 -43.5 173. AQY-AA013 -43.5 173. 260. 1
7 AQY-AA017 -43.5 173. AQY-AA030 -43.5 173. 2252. 1
8 AQY-AA018 -43.6 173. AQY-AA022 -43.6 173. 1726. 1
9 AQY-AA019 -43.6 173. AQY-AA029 -43.6 173. 2468. 1
10 AQY-AA020 -43.5 173. AQY-AA028 -43.5 173. 1494. 1
# ... with 50 more rows
Unlike nearestNSitesFUN, which returns a fixed number of nearest sites, the sitesInNKMFUN proxy function identifies only the locations within a specified radius. Following identification, one may choose to use networkmedianFUN, grouping by site and n_id (a direct sketch of this idea follows the output below). A note on the parameter n, which defines the size of the radius: it is specified in degrees, and the ground distance a degree covers changes with where on the globe the measurements are positioned, so an online calculator can be useful for finding an appropriate n. Our dataset sits at around 43.5 degrees south; entering 40.95 and 41 into a calculator's latitude fields (a difference of 0.05 degrees), with the longitude held constant, gives a distance of roughly 6 km, so n = 0.05 will identify the locations within about 6 km of each site.
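The conversion can also be checked directly in R (geosphere is used here for illustration only and is not a package dependency):
library(geosphere)
# distance spanned by a 0.05 degree change in latitude, longitude held fixed
distHaversine(c(173, -41), c(173, -40.95))
# ~5560 m, i.e. roughly the 6 km quoted above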
nearestkm_example <- example %>%
do(sitesInNKMFUN(., id = 'site', lat = 'lat', lon = 'lon', n = 0.05)) %>%
rename(site = id)
as_tibble(nearestkm_example)
# A tibble: 140 x 4
site lat lon n_id
<chr> <dbl> <dbl> <chr>
1 AQY-AA011 -43.5 173. AQY-AA012
2 AQY-AA011 -43.5 173. AQY-AA013
3 AQY-AA011 -43.5 173. AQY-AA014
4 AQY-AA011 -43.5 173. AQY-AA016
5 AQY-AA011 -43.5 173. AQY-AA026
6 AQY-AA012 -43.5 173. AQY-AA011
7 AQY-AA012 -43.5 173. AQY-AA013
8 AQY-AA012 -43.5 173. AQY-AA014
9 AQY-AA012 -43.5 173. AQY-AA016
10 AQY-AA012 -43.5 173. AQY-AA017
# ... with 130 more rows
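As mentioned above, the within-radius list can then feed a localised median: join the n_id pairs back to the data and take the median of each site's neighbours at each time step. A minimal dplyr sketch of the 'group by site and n_id' idea, computing the median directly rather than through networkmedianFUN (radius_example is a hypothetical name):
radius_example <- nearestkm_example %>%
  # attach each neighbour's concurrent observations
  left_join(example %>% select(date, n_id = site, n_pol = pol), by = 'n_id') %>%
  group_by(site, date) %>%
  summarise(radius.proxy = median(n_pol, na.rm = TRUE)) %>%
  ungroup()
With a proxy column in place, the rolling tests below compare each site against its proxy over time to flag drift. The first is a rolling two-sample Kolmogorov-Smirnov test, ksFUN: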
ks_example <- networkmedian_example %>%
group_by(site) %>%
do(ksFUN(., obs = 'obs', date = 'date', proxy = 'network.proxy'))
as_tibble(ks_example)
# A tibble: 22,349 x 6
# Groups: site [20]
site date test statistic warning alarm
<chr> <dttm> <chr> <dbl> <lgl> <lgl>
1 AQY-AA011 2017-09-09 11:00:00 ks test NA NA NA
2 AQY-AA011 2017-09-09 12:00:00 ks test NA NA NA
3 AQY-AA011 2017-09-09 13:00:00 ks test NA NA NA
4 AQY-AA011 2017-09-09 14:00:00 ks test NA NA NA
5 AQY-AA011 2017-09-09 15:00:00 ks test NA NA NA
6 AQY-AA011 2017-09-09 16:00:00 ks test NA NA NA
7 AQY-AA011 2017-09-09 17:00:00 ks test NA NA NA
8 AQY-AA011 2017-09-09 18:00:00 ks test NA NA NA
9 AQY-AA011 2017-09-09 19:00:00 ks test NA NA NA
10 AQY-AA011 2017-09-09 20:00:00 ks test NA NA NA
# ... with 22,339 more rows
ggplot(ks_example, aes(date, statistic)) + geom_line() + facet_wrap(~site)
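The statistic is NA until the first rolling window has filled; the warning and alarm columns flag threshold crossings. The core of each window's comparison can be sketched with base R's ks.test (the 72-hour window here is an illustrative assumption, not the package default):
# a single hypothetical window for one site
window <- networkmedian_example %>%
  ungroup() %>%
  filter(site == 'AQY-AA011') %>%
  arrange(date) %>%
  slice(1:72)
ks.test(window$obs, window$network.proxy)$statistic
mvFUN follows the same rolling pattern, returning mean-variance statistics instead: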
mv_example <- networkmedian_example %>%
group_by(site) %>%
do(mvFUN(., obs = 'obs', date = 'date', proxy = 'network.proxy')) %>%
spread(test, statistic) # if using warnings and alarms, keep the data in long format
as_tibble(mv_example)
# A tibble: 22,362 x 6
# Groups: site [20]
site date warning alarm mv_intercept mv_slope
* <chr> <dttm> <lgl> <lgl> <dbl> <dbl>
1 AQY-AA011 2017-09-09 11:00:00 NA NA NA NA
2 AQY-AA011 2017-09-09 12:00:00 NA NA NA NA
3 AQY-AA011 2017-09-09 13:00:00 NA NA NA NA
4 AQY-AA011 2017-09-09 14:00:00 NA NA NA NA
5 AQY-AA011 2017-09-09 15:00:00 NA NA NA NA
6 AQY-AA011 2017-09-09 16:00:00 NA NA NA NA
7 AQY-AA011 2017-09-09 17:00:00 NA NA NA NA
8 AQY-AA011 2017-09-09 18:00:00 NA NA NA NA
9 AQY-AA011 2017-09-09 19:00:00 NA NA NA NA
10 AQY-AA011 2017-09-09 20:00:00 NA NA NA NA
# ... with 22,352 more rows
ggplot(mv_example, aes(date, mv_intercept)) + geom_line() + facet_wrap(~site)
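The mv_intercept and mv_slope columns describe a rolling mean-variance relationship between each site and its proxy. One common mean-variance formulation, assumed here rather than taken from the package source, matches the first two moments (reusing the hypothetical window from the KS sketch above):
# slope as the ratio of standard deviations, intercept from the means
mv_slope <- sd(window$obs, na.rm = TRUE) / sd(window$network.proxy, na.rm = TRUE)
mv_intercept <- mean(window$obs, na.rm = TRUE) -
  mv_slope * mean(window$network.proxy, na.rm = TRUE)
Finally, two display helpers round out the toolkit: loadMap, which maps the network for a chosen day, and shiny_display, which launches an interactive Shiny view of the selected columns.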
loadMap(example, set.date = '2017-10-15', obs = 'pol', time.zone = 'Pacific/Auckland', id = 'site', lat = 'lat', lon = 'lon', dest = '')
shiny_display(as.data.frame(example), cols.to.display = 'pol')