-
Notifications
You must be signed in to change notification settings - Fork 0
Obis
Observational data for environmental forecasting can come from many sources. For exmple, nearly 2 decades ago Google used web searches to forecast the spread of the flu. There are also systemic scientific surveys where everything in every sample is documented. And then there are citizen science observations; these are not systematic and are obviously dependent not only upon the observers spotting something of interest but also dependent upon a report of the observation being made. There are tradeoffs to using each kind of dataset as they can vary widely in terms of quantity (number of observations) and quality (confidence in record). Sometimes when we forecast we actually use a blend of different observational data sources.
We’ll be using the Ocean Biodiveersity Information System OBIS online database to quickly gather observational records for marine species. You can access the data using the interactive portal, or you can use various software tools for automated download, like the robis R language package.
Here we show code steps with code that you can try on you own species; we don’t try to explain the details of the code itself.
We have written a small set of R functions to make fetching and using the data easier. We’ll use the ocean sunfish (scientific binomial name Mola mola) as our test case.
mola = fetch_obis(scientificname = "Mola mola") |>
print()
## Retrieved 5000 records of approximately 9526 (52%)Retrieved 9526 records of
## approximately 9526 (100%)
## Simple feature collection with 9526 features and 7 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -74.65 ymin: 38.8 xmax: -65.00391 ymax: 45.1333
## Geodetic CRS: WGS 84
## # A tibble: 9,526 × 8
## id basisOfRecord eventDate year month eventTime individualCount
## * <chr> <chr> <date> <dbl> <fct> <chr> <dbl>
## 1 00040fa1-7acd… HumanObserva… 2016-08-07 2016 Aug 16:15:00 1
## 2 000b42de-fbb4… HumanObserva… 2008-07-11 2008 Jul 14:50:38Z 1
## 3 0011e88f-7e7e… HumanObserva… 1980-05-23 1980 May 12:39:00 1
## 4 0025033b-f57a… HumanObserva… 1979-09-16 1979 Sep 14:13:00 1
## 5 00390e97-9607… HumanObserva… 2017-06-15 2017 Jun 10:07:13 1
## 6 003abd48-a98a… PreservedSpe… 2017-11-15 2017 Nov <NA> NA
## 7 00406908-529c… HumanObserva… 2015-06-26 2015 Jun 16:46:47Z 1
## 8 004dbc2b-2236… HumanObserva… 2015-06-27 2015 Jun 16:12:33Z 1
## 9 0050a981-803c… HumanObserva… 2006-09-17 2006 Sep 15:19:16Z 1
## 10 0053298d-2d6a… HumanObserva… 2005-07-21 2005 Jul <NA> 1
## # ℹ 9,516 more rows
## # ℹ 1 more variable: geometry <POINT [°]>
NOTE For each call to
fetch_obis()
data is downloaded and stored in your personal data directory.
ANOTHER NOTE
fetch_obis()
automatically subsets the global data to just the northwest Atlantic. Your can override that subset - see the documentation forfetch_obis()
infunctions/obis.R
.
An important thing to have under your belt is a strong acquaintance with your data. Seriously, emulating sports fanatics as they do sports stats has high value when forecasting. Know thy data! Let’s look a simple summary of the table.
summary(mola)
## id basisOfRecord eventDate year
## Length:9526 Length:9526 Min. :1932-09-15 Min. :1932
## Class :character Class :character 1st Qu.:2003-10-02 1st Qu.:2003
## Mode :character Mode :character Median :2009-07-11 Median :2009
## Mean :2006-10-02 Mean :2006
## 3rd Qu.:2016-11-05 3rd Qu.:2016
## Max. :2021-10-14 Max. :2021
## NA's :7 NA's :7
## month eventTime individualCount geometry
## Aug :2459 Length:9526 Min. : 1.000 POINT :9526
## Jun :2325 Class :character 1st Qu.: 1.000 epsg:4326 : 0
## Jul :2084 Mode :character Median : 1.000 +proj=long...: 0
## Sep : 923 Mean : 1.112
## May : 643 3rd Qu.: 1.000
## (Other):1085 Max. :25.000
## NA's : 7 NA's :318
Summaries are quick, but work best with number-like data and are kind of useless for text-like data. OK, a couple of things to notice…
- each column of the table gets some kind of summation depending upon the data type,
- when data is missing, R substitutes
NA
which stands for “not available”, -
year
covers 1932 through 2021 (and at least 7 dates are missing), - at least once there was an
individualCount
of 25 Mola mola in one observation, and there are 318 missing counts, -
month
is text-like data, but R knew to tally them up - that’s surprising.
Let’s summarize with other functions.
count(mola, basisOfRecord)
## Simple feature collection with 4 features and 2 fields
## Geometry type: GEOMETRY
## Dimension: XY
## Bounding box: xmin: -74.65 ymin: 38.8 xmax: -65.00391 ymax: 45.1333
## Geodetic CRS: WGS 84
## # A tibble: 4 × 3
## basisOfRecord n geometry
## * <chr> <int> <GEOMETRY [°]>
## 1 HumanObservation 9354 MULTIPOINT ((-65.07 42.68), (-65.067 42.65), (-6…
## 2 NomenclaturalChecklist 1 POINT (-65.80602 44.97985)
## 3 Occurrence 1 POINT (-65.2852 42.6243)
## 4 PreservedSpecimen 170 MULTIPOINT ((-67.05534 45.09908), (-66.35 45.133…
For each basisOfRecord
type we get a count, n
, and a geometry which
we’ll ignore for now.
Some observations are not simple in situ observations. Hmmmm, “PreservedSpecimen”? “NomenclaturalChecklist”? “Occurence”? What to think of those? If they are in an observational database, are they considered “presences”? At least they are small in number compared to “HumanObservation”.
Now let’s take a look at the distribution of months.
count(mola, month)
## Simple feature collection with 13 features and 2 fields
## Geometry type: MULTIPOINT
## Dimension: XY
## Bounding box: xmin: -74.65 ymin: 38.8 xmax: -65.00391 ymax: 45.1333
## Geodetic CRS: WGS 84
## # A tibble: 13 × 3
## month n geometry
## * <fct> <int> <MULTIPOINT [°]>
## 1 Jan 4 ((-71.31 41.49), (-72.05151 39.07928), (-71.989 39.668), (-67.79…
## 2 Feb 17 ((-73.0118 40.4872), (-72.97525 38.84837), (-72.7916 38.953), (-…
## 3 Mar 14 ((-66.22929 42.93149), (-72.53252 39.98476), (-72.7545 38.9118),…
## 4 Apr 240 ((-67.78418 42.97251), (-73.506 39.945), (-71.033 41.3), (-70.76…
## 5 May 643 ((-74.35 38.833), (-74.383 38.85), (-73.9 38.85), (-73.533 38.95…
## 6 Jun 2325 ((-65.62051 42.49731), (-65.53577 42.5802), (-65.52837 42.6589),…
## 7 Jul 2084 ((-65.17792 42.54604), (-65.23706 42.38588), (-65.22362 42.42201…
## 8 Aug 2459 ((-65.07 42.68), (-65.067 42.65), (-65.05 42.583), (-65.05 42.6)…
## 9 Sep 923 ((-65.09952 42.86947), (-65.07104 42.81719), (-65.65715 43.20611…
## 10 Oct 416 ((-67.06445 42.91399), (-67.51254 44.12437), (-67.83994 43.18025…
## 11 Nov 338 ((-67.96691 43.92518), (-69.66798 43.24936), (-73.508 39.5686), …
## 12 Dec 56 ((-68.39997 43.31811), (-70.48103 41.76705), (-70.65425 41.5649)…
## 13 <NA> 7 ((-65.80602 44.97985), (-73.54633 41.96333), (-70.9746 42.36032)…
Summer seems to be the time to spot a Mola mola! Those pesky missing
dates (NAs
) are still showing up. Let’s look at that graphically by
making a histogram (aka barplot).
ggplot(data = mola, # create the object
mapping = aes(x = month)) + # specify month along x-axis
geom_bar() + # request a barplot
labs(title = "Monthly Mola mola observations") # add a title
Aha! - definitely the numbers are up during the summer, but why is that?
Do they migrate away in the colder season? Or are the observers not
observing (and reporting)? Or is it a bit of both? And darn those NAs
.
How about over the years?
ggplot(data = mola, # create the object
mapping = aes(x = year)) + # specify year along x-axis
geom_bar() + # request a barplot
labs(title = "Yearly Mola mola observations") # add a title
So, the observations are bunched up in the more recent times. Why the gap?
The obis data is spatial point data, that is each observation is coupled with location information. We can leverage that to make some maps (spatial plots). Let’s plot all of the points to see where the Mola mola occur. For this we might include a coastline.
coast = read_coastline()
ggplot(data = mola) + # create a plot object
geom_sf(alpha = 0.1, # add the points with some tweaking of appearance
size = 0.7) +
geom_sf(data = coast, # add the coastline
col = "orange")
That’s interesting, most of the observations fall within an “outer limit”, and while there are patches in areas they really cover most of the area. Also, some of the observations have a pattern, like straight lines as if done on a survey. Let’s look again but plotting just each month.
coast = read_coastline()
ggplot(data = mola) + # create the plot
geom_sf(alpha = 0.1, size = 0.7) + # add the points
geom_sf(data = coast, # add the coast
col = "orange") +
facet_wrap(~month) # make a montage by month
There’s definitely a seasonal shift to the locations of the observations. But is that about the observer? Or the fish? Or both?
Many datasets have missing data. The remaining information about the
observations still has value, but sometimes doing math-y steps with
missing data can be a pain so it’s best to remove those. Below we show
how to remove the rows of observations that lack a proper eventDate
(and therefore lack a proper year
and month
).
mola = mola |>
filter(!is.na(eventDate))
summary(mola)
## id basisOfRecord eventDate year
## Length:9519 Length:9519 Min. :1932-09-15 Min. :1932
## Class :character Class :character 1st Qu.:2003-10-02 1st Qu.:2003
## Mode :character Mode :character Median :2009-07-11 Median :2009
## Mean :2006-10-02 Mean :2006
## 3rd Qu.:2016-11-05 3rd Qu.:2016
## Max. :2021-10-14 Max. :2021
##
## month eventTime individualCount geometry
## Aug :2459 Length:9519 Min. : 1.000 POINT :9519
## Jun :2325 Class :character 1st Qu.: 1.000 epsg:4326 : 0
## Jul :2084 Mode :character Median : 1.000 +proj=long...: 0
## Sep : 923 Mean : 1.112
## May : 643 3rd Qu.: 1.000
## Oct : 416 Max. :25.000
## (Other): 669 NA's :315
We can see that the 7-rows with missing eventDate
have been removed.
But what did we do? Here’s some pseudo-code…
start with the mola table,
pass it to the filter function,
ask the filter function to keep all of the rows where eventDate is not NA
make a summary of the new mola table