Obis

Observations from OBIS

Observational data for environmental forecasting can come from many sources. For exmple, nearly 2 decades ago Google used web searches to forecast the spread of the flu. There are also systemic scientific surveys where everything in every sample is documented. And then there are citizen science observations; these are not systematic and are obviously dependent not only upon the observers spotting something of interest but also dependent upon a report of the observation being made. There are tradeoffs to using each kind of dataset as they can vary widely in terms of quantity (number of observations) and quality (confidence in record). Sometimes when we forecast we actually use a blend of different observational data sources.

We’ll be using the Ocean Biodiveersity Information System OBIS online database to quickly gather observational records for marine species. You can access the data using the interactive portal, or you can use various software tools for automated download, like the robis R language package.

Here we show code steps with code that you can try on you own species; we don’t try to explain the details of the code itself.

Fetching species data

We have written a small set of R functions to make fetching and using the data easier. We’ll use the ocean sunfish (scientific binomial name Mola mola) as our test case.

mola = fetch_obis(scientificname = "Mola mola") |>
  print()

## Retrieved 5000 records of approximately 9526 (52%)Retrieved 9526 records of
## approximately 9526 (100%)

## Simple feature collection with 9526 features and 7 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -74.65 ymin: 38.8 xmax: -65.00391 ymax: 45.1333
## Geodetic CRS:  WGS 84
## # A tibble: 9,526 × 8
##    id             basisOfRecord eventDate   year month eventTime individualCount
##  * <chr>          <chr>         <date>     <dbl> <fct> <chr>               <dbl>
##  1 00040fa1-7acd… HumanObserva… 2016-08-07  2016 Aug   16:15:00                1
##  2 000b42de-fbb4… HumanObserva… 2008-07-11  2008 Jul   14:50:38Z               1
##  3 0011e88f-7e7e… HumanObserva… 1980-05-23  1980 May   12:39:00                1
##  4 0025033b-f57a… HumanObserva… 1979-09-16  1979 Sep   14:13:00                1
##  5 00390e97-9607… HumanObserva… 2017-06-15  2017 Jun   10:07:13                1
##  6 003abd48-a98a… PreservedSpe… 2017-11-15  2017 Nov   <NA>                   NA
##  7 00406908-529c… HumanObserva… 2015-06-26  2015 Jun   16:46:47Z               1
##  8 004dbc2b-2236… HumanObserva… 2015-06-27  2015 Jun   16:12:33Z               1
##  9 0050a981-803c… HumanObserva… 2006-09-17  2006 Sep   15:19:16Z               1
## 10 0053298d-2d6a… HumanObserva… 2005-07-21  2005 Jul   <NA>                    1
## # ℹ 9,516 more rows
## # ℹ 1 more variable: geometry <POINT [°]>

NOTE For each call to fetch_obis() data is downloaded and stored in your personal data directory.

ANOTHER NOTE fetch_obis() automatically subsets the global data to just the northwest Atlantic. Your can override that subset - see the documentation for fetch_obis() in functions/obis.R.

Getting to know your observations

An important thing to have under your belt is a strong acquaintance with your data. Seriously, emulating sports fanatics as they do sports stats has high value when forecasting. Know thy data! Let’s look a simple summary of the table.

summary(mola)

##       id            basisOfRecord        eventDate               year     
##  Length:9526        Length:9526        Min.   :1932-09-15   Min.   :1932  
##  Class :character   Class :character   1st Qu.:2003-10-02   1st Qu.:2003  
##  Mode  :character   Mode  :character   Median :2009-07-11   Median :2009  
##                                        Mean   :2006-10-02   Mean   :2006  
##                                        3rd Qu.:2016-11-05   3rd Qu.:2016  
##                                        Max.   :2021-10-14   Max.   :2021  
##                                        NA's   :7            NA's   :7     
##      month       eventTime         individualCount           geometry   
##  Aug    :2459   Length:9526        Min.   : 1.000   POINT        :9526  
##  Jun    :2325   Class :character   1st Qu.: 1.000   epsg:4326    :   0  
##  Jul    :2084   Mode  :character   Median : 1.000   +proj=long...:   0  
##  Sep    : 923                      Mean   : 1.112                       
##  May    : 643                      3rd Qu.: 1.000                       
##  (Other):1085                      Max.   :25.000                       
##  NA's   :   7                      NA's   :318

Summaries are quick, but work best with number-like data and are kind of useless for text-like data. OK, a couple of things to notice…

each column of the table gets some kind of summation depending upon the data type,
when data is missing, R substitutes NA which stands for “not available”,
year covers 1932 through 2021 (and at least 7 dates are missing),
at least once there was an individualCount of 25 Mola mola in one observation, and there are 318 missing counts,
month is text-like data, but R knew to tally them up - that’s surprising.

Let’s summarize with other functions.

count(mola, basisOfRecord)

## Simple feature collection with 4 features and 2 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: -74.65 ymin: 38.8 xmax: -65.00391 ymax: 45.1333
## Geodetic CRS:  WGS 84
## # A tibble: 4 × 3
##   basisOfRecord              n                                          geometry
## * <chr>                  <int>                                    <GEOMETRY [°]>
## 1 HumanObservation        9354 MULTIPOINT ((-65.07 42.68), (-65.067 42.65), (-6…
## 2 NomenclaturalChecklist     1                        POINT (-65.80602 44.97985)
## 3 Occurrence                 1                          POINT (-65.2852 42.6243)
## 4 PreservedSpecimen        170 MULTIPOINT ((-67.05534 45.09908), (-66.35 45.133…

For each basisOfRecord type we get a count, n, and a geometry which we’ll ignore for now.

Some observations are not simple in situ observations. Hmmmm, “PreservedSpecimen”? “NomenclaturalChecklist”? “Occurence”? What to think of those? If they are in an observational database, are they considered “presences”? At least they are small in number compared to “HumanObservation”.

Now let’s take a look at the distribution of months.

count(mola, month)

## Simple feature collection with 13 features and 2 fields
## Geometry type: MULTIPOINT
## Dimension:     XY
## Bounding box:  xmin: -74.65 ymin: 38.8 xmax: -65.00391 ymax: 45.1333
## Geodetic CRS:  WGS 84
## # A tibble: 13 × 3
##    month     n                                                          geometry
##  * <fct> <int>                                                  <MULTIPOINT [°]>
##  1 Jan       4 ((-71.31 41.49), (-72.05151 39.07928), (-71.989 39.668), (-67.79…
##  2 Feb      17 ((-73.0118 40.4872), (-72.97525 38.84837), (-72.7916 38.953), (-…
##  3 Mar      14 ((-66.22929 42.93149), (-72.53252 39.98476), (-72.7545 38.9118),…
##  4 Apr     240 ((-67.78418 42.97251), (-73.506 39.945), (-71.033 41.3), (-70.76…
##  5 May     643 ((-74.35 38.833), (-74.383 38.85), (-73.9 38.85), (-73.533 38.95…
##  6 Jun    2325 ((-65.62051 42.49731), (-65.53577 42.5802), (-65.52837 42.6589),…
##  7 Jul    2084 ((-65.17792 42.54604), (-65.23706 42.38588), (-65.22362 42.42201…
##  8 Aug    2459 ((-65.07 42.68), (-65.067 42.65), (-65.05 42.583), (-65.05 42.6)…
##  9 Sep     923 ((-65.09952 42.86947), (-65.07104 42.81719), (-65.65715 43.20611…
## 10 Oct     416 ((-67.06445 42.91399), (-67.51254 44.12437), (-67.83994 43.18025…
## 11 Nov     338 ((-67.96691 43.92518), (-69.66798 43.24936), (-73.508 39.5686), …
## 12 Dec      56 ((-68.39997 43.31811), (-70.48103 41.76705), (-70.65425 41.5649)…
## 13 <NA>      7 ((-65.80602 44.97985), (-73.54633 41.96333), (-70.9746 42.36032)…

Summer seems to be the time to spot a Mola mola! Those pesky missing dates (NAs) are still showing up. Let’s look at that graphically by making a histogram (aka barplot).

Plotting

ggplot(data = mola,                                # create the object
       mapping = aes(x = month)) +                 # specify month along x-axis
  geom_bar() +                                     # request a barplot
  labs(title = "Monthly Mola mola observations")   # add a title

Aha! - definitely the numbers are up during the summer, but why is that? Do they migrate away in the colder season? Or are the observers not observing (and reporting)? Or is it a bit of both? And darn those NAs.

How about over the years?

ggplot(data = mola,                                # create the object
       mapping = aes(x = year)) +                  # specify year along x-axis
  geom_bar() +                                     # request a barplot
  labs(title = "Yearly Mola mola observations")    # add a title

So, the observations are bunched up in the more recent times. Why the gap?

Spatial plotting

The obis data is spatial point data, that is each observation is coupled with location information. We can leverage that to make some maps (spatial plots). Let’s plot all of the points to see where the Mola mola occur. For this we might include a coastline.

coast = read_coastline()
ggplot(data = mola) +       # create a plot object
  geom_sf(alpha = 0.1,      # add the points with some tweaking of appearance
          size = 0.7) +  
  geom_sf(data = coast,     # add the coastline
          col = "orange")

That’s interesting, most of the observations fall within an “outer limit”, and while there are patches in areas they really cover most of the area. Also, some of the observations have a pattern, like straight lines as if done on a survey. Let’s look again but plotting just each month.

coast = read_coastline()
ggplot(data = mola) +                  # create the plot
  geom_sf(alpha = 0.1, size = 0.7) +   # add the points
  geom_sf(data = coast,                # add the coast
          col = "orange") +
  facet_wrap(~month)                   # make a montage by month

There’s definitely a seasonal shift to the locations of the observations. But is that about the observer? Or the fish? Or both?

Filtering missing data

Many datasets have missing data. The remaining information about the observations still has value, but sometimes doing math-y steps with missing data can be a pain so it’s best to remove those. Below we show how to remove the rows of observations that lack a proper eventDate (and therefore lack a proper year and month).

mola = mola |>
  filter(!is.na(eventDate))
summary(mola)

##       id            basisOfRecord        eventDate               year     
##  Length:9519        Length:9519        Min.   :1932-09-15   Min.   :1932  
##  Class :character   Class :character   1st Qu.:2003-10-02   1st Qu.:2003  
##  Mode  :character   Mode  :character   Median :2009-07-11   Median :2009  
##                                        Mean   :2006-10-02   Mean   :2006  
##                                        3rd Qu.:2016-11-05   3rd Qu.:2016  
##                                        Max.   :2021-10-14   Max.   :2021  
##                                                                           
##      month       eventTime         individualCount           geometry   
##  Aug    :2459   Length:9519        Min.   : 1.000   POINT        :9519  
##  Jun    :2325   Class :character   1st Qu.: 1.000   epsg:4326    :   0  
##  Jul    :2084   Mode  :character   Median : 1.000   +proj=long...:   0  
##  Sep    : 923                      Mean   : 1.112                       
##  May    : 643                      3rd Qu.: 1.000                       
##  Oct    : 416                      Max.   :25.000                       
##  (Other): 669                      NA's   :315

We can see that the 7-rows with missing eventDate have been removed. But what did we do? Here’s some pseudo-code…

start with the mola table, 
  pass it to the filter function,
  ask the filter function to keep all of the rows where eventDate is not NA
make a summary of the new mola table