-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathM1_ICES.Rmd
436 lines (316 loc) · 17.4 KB
/
M1_ICES.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
---
title: "M1_ICES_pre-script"
author: "Anders Torstensson <[email protected]>, Swedish Meteorological and Hydrological Institute"
date: '2021-11-25'
params:
ICES_export: "example.csv" # change to the name of your ICES export file
biomass: BMCCONT # choose BMCCONT for biomass (ug/l) or BMCEVOL for biovolume (um3/l)
output: html_document
---
## R Markdown script to tranform an ICES export file
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.
The script reads a phytoplankton ICES export file (downloaded at https://dome.ices.dk/views/Phytoplankton.aspx) and prepares data for the **Helcom candidate indicator script** written by Joanna Calkiewicz & Janina Kownacka according to the document 'Seasonal succession of dominating phytoplankton groups'. The indicator focuses on the phytoplankton groups dinoflagellates (mixotrophic and autotrophic), cyanobacteria, diatoms and the species Mesodinium rubrum.
The script matches taxa names with the World register of marine species (WoRMS; http://www.marinespecies.org/index.php) in order to get higher taxonomic information (i.e. class), as taxa reported to ICES before the implementation of the PEG list lack taxonomic information. The matching with WoRMS is performed step wise at four levels to classify as many taxa as possible; one exact match, one fuzzy match, one freshwater match and one final match on genus level. The taxa that could not be classified after the step wise process are printed in the end of the HTML output.
The script groups taxa based on taxonomic level, and aggregates biomass data (in ug/l or um3/l, choose params above) for diatoms, cyanobacteria, autotrophic/mixotrophic dinoflagellates and for the species Mesodinium rubrum in each sample.
As of 2021, the ICES database lacks trophic type information on data published before ~2008. Pre-2008 dinoflagellate data will therefore be excluded as heterotrophic taxa cannot be discarded from the dataset.
The script outputs are the following:
**M1_ICES.html** - Summary of the script output
**data_all.csv** - Data file that can be read in the Helcom candidate indicator script.
**Plots** - Time series plots for the total phytoplankton biomass, and for each taxonomic group. Can be used to manually select the ref/test period in the Helcom candidate indicator script, and to identify outliers that may need manual attention before data are processed by the indicator script.
**WoRMS_classification_results.txt** - List of all unique taxa in the dataset and their WoRMS classification, including the fuzzy, freshwater and genus results. The list is printed for potential class verification purposes.
## Load necessary libraries
```{r loadlib, include=FALSE}
library(tidyverse) # Used for data wrangling
library(worrms) # Used to get taxonomic information
library(knitr) # For html table reports
library(kableExtra) # For html table reports
```
## Create coelesce function to faciliate table joins
The core operation of coalesce_join will be done by dplyr::coalesce, which replaces NA values in a vector with corresponding non-missing values from another of identical length (or length 1)
```{r coalesce_join, include=FALSE}
coalesce_join <- function(x, y,
by = NULL, suffix = c(".x", ".y"),
join = dplyr::full_join, ...) {
joined <- join(x, y, by = by, suffix = suffix, ...)
# names of desired output
cols <- union(names(x), names(y))
to_coalesce <- names(joined)[!names(joined) %in% cols]
suffix_used <- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
# remove suffixes and deduplicate
to_coalesce <- unique(substr(
to_coalesce,
1,
nchar(to_coalesce) - nchar(suffix_used)
))
coalesced <- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
joined[[paste0(.x, suffix[1])]],
joined[[paste0(.x, suffix[2])]]
))
names(coalesced) <- to_coalesce
dplyr::bind_cols(joined, coalesced)[cols]
}
```
## Read data
Reads a comma-separated ICEs export-file and removes any text within parentheses, as some taxonomic classifications may contain commas. E.g. PEG_Order EUPODISCALES (BIDDULPHIALES, CENTRALES) and PEG_class Tribophyceae (Xanthophyceae, Heterokontae).
```{r readdata, echo = FALSE}
start_time = Sys.time()
data <- read.csv(text = gsub("\\s*\\([^\\)]+\\)","", readLines(params$ICES_export)))
end_time = Sys.time()
end_time - start_time
```
## Clean input data
Filter out potential biomass data in the biovolume dataset and transform biomass/biovolume data to uniform scales (ug/l or um3/l)
```{r cleandata, echo = FALSE}
data <- data %>%
filter(!is.na(Year) & PARAM == params$biomass & !is.na(final_value)) # Removes junk rows (if present) and filters out all biomass data
if(levels(as.factor(data$PARAM)) == "BMCEVOL") {
data <- data %>%
filter(!MUNIT=="ug/l")
}
data$SPECI_name <- str_to_sentence(data$SPECI_name) # Transform taxa names to uniform case
data$STATN[data$STATN==""] <- "missing" # Avoid NAs in station name
data$SPECI_name <- gsub("ã«|ë","e" , data$SPECI_name ,ignore.case = TRUE) # Remove special characters that cause "408-JSON errors" in wm_record
data$final_value <- ifelse(data$MUNIT %in% "pg", data$final_value/10^6, data$final_value) # Convert biomass data if reported in pg/l to ug/l
data$final_value <- ifelse(data$MUNIT %in% "mm3/l", data$final_value*10^9, data$final_value) # Convert biovolume data if reported in mm3/l to um3/l
data$final_value <- ifelse(data$MUNIT %in% "mm3/m3", data$final_value*10^6, data$final_value) # Convert biovolume data if reported in mm3/m3 to um3/l
```
## Plot total biomass time series (ug/l or um3/l)
Manually check plots for outliers, data may have, for instance, been reported on the wrong measuring unit.
```{r plottotal, echo = FALSE}
print(paste("You have selected", paste(params$biomass), "as biomass parameter", sep=" "))
data_total <- aggregate(final_value ~ ï..tblSampleID+DATE+Year+Month+PARAM_desc, data, sum, na.action = na.omit)
colnames(data_total) <- c("sampleid", "date", "year", "month", "parameter", "biomass")
data_total$date <- as.Date(data_total$date,"%d/%m/%Y")
main_title <- paste("Total biomass (per sample),",unique(data$PARAM_desc), sep = " ")
p<-ggplot() +
geom_point(aes(x=date, y=log(biomass+1)), data_total,size=2) +
ggtitle(main_title) + theme(plot.title = element_text(size=10))
p
jpeg ("Total_biomass.jpg")
print (p)
dev.off()
```
## Create list of unique taxa names
```{r createunique, echo = FALSE}
Taxon <- sort(unique(data$SPECI_name))
Taxon <- Taxon[Taxon != ""]
print(paste("Dataset has", length(Taxon), "unique taxa", sep=" "))
```
## Get taxonomic records from WoRMS and create a lookup table
Taxa without a match are processed with the "fuzzy" and "freshwater" search tools later in the script. Taxa that still lack taxonomic classification are matched on genus level later in the script.
```{r getrecords, echo = FALSE, warning = FALSE}
getrecord <- data.frame()
start_time = Sys.time()
for (i in 1:length(Taxon)) {
tryCatch({
record <- wm_record_(name=Taxon[i])
getrecord <- rbind(getrecord, record[[1]])
}, error=function(e){})
}
end_time = Sys.time()
end_time - start_time
Taxon.df <- as.data.frame(Taxon)
names(Taxon.df)<-"scientificname"
Taxon.df$name2worms <- Taxon.df$scientificname
lookup <- Taxon.df %>%
left_join(getrecord, by = "scientificname", na_matches = "never")
lookup$SPECI_name <- lookup$name2worms
print(paste(length(lookup$AphiaID[!is.na(lookup$AphiaID)]), "taxa were matched based on exact names", sep=" "))
```
## Get list of taxa that could not be correctly matched in WoRMS
The script will instead assign classes using the WoRMS "fuzzy search tool". The first non-NA match is selected.
```{r fuzzymatch, echo = FALSE}
missing_id <- filter(lookup,is.na(AphiaID))
print(paste("Dataset has", length(missing_id$scientificname), "taxa without a direct WoRMS match, will continue with a fuzzy search for the selected taxa. See WoRMS_fuzzy_search_results.txt to verify correct matching", sep=" "))
missing_id_fuzzy <- data.frame()
start_time = Sys.time()
for (i in 1:length(missing_id$scientificname)) {
tryCatch({
missing_list <- wm_records_name(name=missing_id$scientificname[i], fuzzy=TRUE, marine_only=TRUE)
missing <- missing_list[1,]
missing$name2worms <- missing_id$scientificname[i]
for (i in 1:length(missing_list$scientificname)) {
if(is.na(missing$scientificname)) {
missing <- missing_list[i+1,]
}
}
missing_id_fuzzy <- rbind(missing_id_fuzzy, missing)
}, error=function(e){})
}
end_time = Sys.time()
end_time - start_time
missing_id_fuzzy$fuzzy <- missing_id_fuzzy$scientificname
print(paste(length(missing_id_fuzzy$scientificname), "additional taxa were matched using the fuzzy search tool", sep=" "))
```
## Get second list of taxa that could not be correctly matched in WoRMS
The script will instead assign classes using the WoRMS "fuzzy search tool" by including freshwater taxa. The first non-NA match is selected.
```{r freshwatermatch, echo = FALSE}
missing_id2 <- lookup %>%
left_join(missing_id_fuzzy, by = "name2worms", na_matches = "never") %>%
filter(is.na(AphiaID.x)) %>%
filter(is.na(AphiaID.y)) %>%
rename("scientificname"=scientificname.x)
print(paste("Dataset still has", length(missing_id2$scientificname), "taxa without a WoRMS match, will continue with a fuzzy freshwater search for the selected taxa.", sep=" "))
missing_id_fuzzy2 <- data.frame()
start_time = Sys.time()
for (i in 1:length(missing_id2$scientificname)) {
tryCatch({
missing_list2 <- wm_records_name(name=missing_id2$scientificname[i], fuzzy=TRUE, marine_only=FALSE)
missing_list2$name2worms <- missing_id2$scientificname[i]
missing2 <- missing_list2[1,]
for (i in 1:length(missing_list2$scientificname)) {
if(is.na(missing2$scientificname)) {
missing2 <- missing_list2[i+1,]
}
}
missing_id_fuzzy2 <- rbind(missing_id_fuzzy2, missing2)
}, error=function(e){})
}
end_time = Sys.time()
end_time - start_time
missing_id_fuzzy2 <- missing_id_fuzzy2 %>%
filter(!is.na(scientificname))
missing_id_fuzzy2$fuzzy2 <- missing_id_fuzzy2$scientificname
print(paste(length(missing_id_fuzzy2$scientificname[!is.na(missing_id_fuzzy2$scientificname)]), "additional taxa were matched using the fuzzy freshwater search tool", sep=" "))
```
## Join lookup table with the fuzzy and freshwater table
The remaining unmatched taxa will be matched on genus level
```{r join, echo = FALSE, warning = FALSE}
worms_all <- lookup %>%
coalesce_join(missing_id_fuzzy, by = "name2worms", na_matches = "never")
worms_all <- worms_all %>%
coalesce_join(missing_id_fuzzy2, by = "name2worms", na_matches = "never") %>%
mutate(fuzzy = coalesce(fuzzy, fuzzy2)) %>%
mutate(scientificname = ifelse(is.na(fuzzy), scientificname, fuzzy))
```
## Match genus name
Continue by matching the genus name of the remaining missing species with WoRMS, to get class information
```{r matchgenus, echo = FALSE, warning = FALSE}
missing_id3 <- worms_all %>%
filter(is.na(class))
missing_id3$scientificname <- word(missing_id3$SPECI_name, 1) # Select first word in species name
missing_id_fuzzy3 <- data.frame()
start_time = Sys.time()
for (i in 1:length(missing_id3$scientificname)) {
tryCatch({
missing_list3 <- wm_records_name(name=missing_id3$scientificname[i], fuzzy=TRUE, marine_only=FALSE)
missing_list3$name2worms <- missing_id3$scientificname[i]
missing3 <- missing_list3[1,]
for (i in 1:length(missing_list3$scientificname)) {
if(is.na(missing3$scientificname)) {
missing3 <- missing_list3[i+1,]
}
}
missing_id_fuzzy3 <- rbind(missing_id_fuzzy3, missing3)
}, error=function(e){})
}
end_time = Sys.time()
end_time - start_time
genus_all <- missing_id3 %>%
select("SPECI_name","scientificname") %>%
left_join(missing_id_fuzzy3, by = "scientificname", na_matches = "never") %>%
filter(!is.na(AphiaID)) %>%
rename("fuzzy3" = name2worms) %>%
distinct()
genus_all$worms_name <- genus_all$scientificname
print(paste(length(genus_all$SPECI_name), "additional taxa were matched by genus name", sep=" "))
```
## Summary of all matched taxa
Write a list of all WoRMS results in WoRMS_classification_results.txt
```{r lookup, echo = FALSE}
worms_all <- worms_all %>%
coalesce_join(genus_all, by = "SPECI_name", na_matches = "never") %>%
mutate(scientificname = ifelse(is.na(worms_name), scientificname, worms_name)) %>%
mutate(name2worms = ifelse(is.na(fuzzy3), name2worms, fuzzy3)) %>%
mutate(scientificname = ifelse(is.na(AphiaID), NA, scientificname)) %>%
select(-fuzzy, -fuzzy2, -fuzzy3, -worms_name) %>%
relocate(SPECI_name, name2worms) %>%
distinct()
classes <- worms_all %>%
select(SPECI_name,class) %>%
rename("PEG_class" = class)
classes %>%
rename("Reported name" = SPECI_name,
"WoRMS class" = PEG_class) %>%
kable %>%
kable_styling("striped", full_width = F) %>%
row_spec(which(is.na(classes$PEG_class)), color = "red") %>%
scroll_box(width = "1000px", height = "1000px")
write.table(worms_all,"WoRMS_classification_results.txt", sep = "\t", row.names = F)
```
## Summary of taxa that could not be classified by the script
Misspelled taxa names may be corrected in the input data file, if needed
```{r noclass, echo = FALSE}
print(paste(length(classes$PEG_class[is.na(classes$PEG_class)]), "taxa could not be assigned a class", sep=" "))
classes %>%
filter(is.na(PEG_class)) %>%
rename("Reported name"=SPECI_name,
"WoRMS class"=PEG_class) %>%
kable %>%
kable_styling("striped", full_width = F) %>%
scroll_box(width = "1000px", height = "1000px")
```
## Join lookup table with ICES export table
```{r merge, echo = FALSE}
data <- data %>%
left_join(classes, data, by = "SPECI_name", na_matches = "never") %>%
mutate(PEG_class = PEG_class.y)
```
## Filter biomass data for selected phytoplankton groups
(e.g. diatoms, cyanobacteria, autotrophic/mixotrophic dinoflagellates and for the species Mesodinium rubrum).
```{r selecttaxa, echo = FALSE}
data <- filter(data, PEG_class %in% c("Bacillariophyceae","Cyanophyceae") |
SPECI_name %in% "Mesodinium rubrum" |
PEG_class %in% "Dinophyceae" & PEG_TRPHY %in% c("AU","MX"))
print(paste("Dataset has", paste(length(data$final_value)), "datapoints with the selected taxa,", paste(params$biomass) ,"data will be aggregated for each sample", sep=" "))
```
## Aggregate biomass data for each sample, class and date
Groups are renamed to names of choice (e.g. Diatoms, Dinoflagellates) and the data table is converted to be compatible with the Helcom candidate indicator script.
```{r aggregate, echo = FALSE}
data_all <- aggregate(final_value ~ ï..tblSampleID + Country + RLABO + STATN + PEG_class + DATE + Year + Month, data, sum, na.action = na.omit)
data_all <- data_all %>%
mutate_all(str_replace_all, "Bacillariophyceae", "Diatoms") %>%
mutate_all(str_replace_all, "Dinophyceae", "Dinoflagellates") %>%
mutate_all(str_replace_all, "Cyanophyceae", "Cyanobacteria") %>%
mutate_all(str_replace_all, "Litostomatea", "Mesodinium_rubrum")
colnames(data_all) <- c("sampleid","country","rlabo","station","taxa","date","year","month","biomass")
data_all <- data_all %>%
mutate_at(vars(biomass, year, month), ~as.numeric(as.character(.)))
```
## Plot biomass time series data for each taxonomic group, for selecting relevant test and reference years
```{r plottaxa, echo = FALSE}
for (i in 1:length(unique(data_all$taxa))) {
data_all_taxa <- filter(data_all, taxa == unique(data_all$taxa)[i])
name_all_taxa <- unique(data_all$taxa)[i]
data_all_taxa$date <- as.Date(data_all_taxa$date,"%d/%m/%Y")
title <- paste(name_all_taxa, "_all_data.jpg", sep = "")
main_title <- paste(name_all_taxa, " biomass (per sample), ",unique(data$PARAM_desc), sep = "")
p <- ggplot() +
geom_point(aes(x=date, y=log(biomass+1)),data_all_taxa,size=2) +
ggtitle(main_title) + theme(plot.title = element_text(size=10))
plot(p)
jpeg (title)
print (p)
dev.off()
}
```
## Save file to output
data_all.csv can be imported in the Helcom candidate indicator script. The full list is also printed below.
```{r savefiles, echo = FALSE}
data_all <- data_all %>% arrange(year, month)
write.table(data_all,"data_all.csv",sep=';', dec=',')
data_all$biomass <- as.character(data_all$biomass)
data_all %>%
kable(digits = 3) %>%
kable_styling("striped", full_width = F) %>%
scroll_box(width = "1000px", height = "1000px")
```
## Reproducibility
Below is a summary of the packages in use and the time and date the output was processed to increase reproducibility.
```{r reproducibility, echo = FALSE}
# Date time
Sys.time()
# Here we store the session info for this script
sessioninfo::session_info()
```