R package for Two-Way Indicator Species Analysis (Hill 1979).
Two-Way Indicator Species Analysis was developed to classify community data tables. It was supposed to work in the same way as a traditional community ecologist in arranging a community data table.
TWINSPAN is available as a self-standing computer program that can be compiled to work on many platforms. This R package uses the same Fortran code, but allows using TWINSPAN from R together with other R functions for community ecology and statistics. The function runs silently – unlike the original TWINSPAN – and most information can be gained analysing the result with support functions provided in this package. Moreover, the package allows using Roleček et al (2009) modification which bases grouping on withing-group heterogeneity instead of cluster level.
You can install the current development version from GitHub using devtools package:
devtools::install_github("jarioksa/twinspan")
You need development tools to install a source package. In particular, you need C and Fortran compilers.
If you cannot install a source package, you can install twinspan from R-Universe:
install.packages('twinspan', repos = c('https://jarioksa.r-universe.dev', 'https://cloud.r-project.org'))
The basic command to run twinspan is – unsurprisingly – twinspan
:
> library(twinspan)
> library(natto) # for data
> data(spurn)
> tw <- twinspan(spurn, cutlevels = 0:6)
The example uses Spurn Point dune scrub data (Shimwell 1971).
TWINSPAN uses basically binary data, and quantitative data are broken into
pseudospecies by species abundances using argument cutlevels
. The default cut levels
c(0, 2, 5, 10, 20)
are suitable for percentage values, but the spurn
data are cover classes.
To keep the original class values, we list all cover classes as cut levels.
The pseudospecies data can be generated with twinsform
transformation:
> colnames(twinsform(spurn, cutlevels=0:6))
[1] "Elaerham1" "Jacovulg1" "Soladulc1" "Rubufrut1" "UrtidioiA1"
[6] "Rumecris1" "ClayperfB1" "StelmediB1" "FestrubrC1" "ElymrepeC1"
... cut ...
[111] "ElymrepeC4" "AmmoarenC4" "Epilangu4" "Elaerham5" "UrtidioiA5"
[116] "ClayperfB5" "StelmediB5" "FestrubrC5" "ElymrepeC5" "AmmoarenC5"
[121] "Elaerham6" "ClayperfB6" "AmmoarenC6" "Elaerham7"
The original data of 40 species are extended to a matrix of 124 pseudospecies.
The names of pseudospecies are formed appending an integer for the cut level
after the species name. Level 1
means that the taxon occurs in the data,
and 7
that it occurs at least the seventh cut level.
The twinspan
result can be inspected with support functions of the package.
The hierarchy of groups can be displayed as a cluster tree with
> plot(tw, main = "Numeric Label")
> plot(tw, binname = TRUE, main = "Binary Label")
twinspan
splits data into two groups and the height of the group shows the
level of hierarchy. The numeric labels are marked within squares, or used
as group names for terminal groups that are no longer divided. The number of
items (N
) is also given for each terminal group. When group binname = TRUE
), binary text labels are used instead of numbers. The first groups
are 0
and 1
, and when these are split 0
or 1
is appended so that the children
of 0
are 00
and 01
.
The summary of division process can be inspected with summary
(with argument
binname = TRUE
binary labels are used instead of numeric):
> summary(tw)
1) eig=0.56: +ElymrepeC1 < 1
2) eig=0.399: -UrtidioiA1 < 0
4) eig=0.355: -Soncarve1 < 0
8) N=1: A19
9) N=4: A1 A2 A3 A20
5) eig=0.263: +Inulcony1 < 1
10) eig=0.229: -Rumecris1 < 0
20) N=1: B16
21) N=4: B4 B5 B10 B12
11) N=2: B17 B18
3) eig=0.23: +AmmoarenC5 +Jacovulg1 -Galiveru1 < 1
6) N=3: C6 C8 X14
7) eig=0.226: -Calysold1 < 0
14) N=3: C7 C9 C15
15) N=2: C11 C13
twinspan
is divisive: it splits data into two parts at each step, and these steps are
described here. The splits are based on the first correspondence analysis axis of the
current subset which is still further polished to make the dichotomy clearer.
twinspan
finds the pseudospecies that best indicate the division based on CA axis,
and summary
shows these indicator. There are pseudospecies names with +
or -
signs.
These are used to calculate indicator scores for each quadrat, adding or subtracting
one for each pseudospecies in the quadrat. If the quadrat score is less than the critical
score, we proceed from group ElymrepeC1
(Elymus repens at class value 1). The quadrat score will be 1 for quadrats
with Elymys repens and 0 without, and with condition summary
gives the size (N
) and lists the names of the member quadrats. Capital letters
A
, B
, C
of the quadrat name give the original classification of Shimwell (1971).
The basic twinspan
hierarchy is based on the level of division and it does not
consider within-group heterogeneity. However, the package can evaluate heterogeneities
and use these for trees and further analyses enabling the modification of Roleček et al
(2009):
plot(tw, height = "chi", main = "Roleček Tree")
The measure of heterogeneity is the sum of all eigenvalues of group as it is
analysed in twinspan
. This is equal to scaled Chi-square, hence the name
"chi"
in the argument.
Group 2 of the first-level division is much more heterogeneous than group 3, and for comparable homogeneity of groups it would be best to combine level-1 group 3 with level-2 groups 4 and 5.
The basic plot
and its underlying function as.hclust
will use groups as terminal
nodes. With as.dendrogram
it is also possible to display quadrats (like they are
called in twinspan
):
plot(as.dendrogram(tw, height="chi"), type = "triangle")
The dendrogram used the Roleček modification. The first letter of the quadrat name gives the original class (Shimwell 1971), and this is fully recovered with three classes with Roleček modification.
You can extract the classification of each quadrat with cut
either for terminal
groups or for a certain level:
> cut(tw)
[1] 9 9 9 21 21 6 14 6 14 21 15 21 15 6 14 20 11 11 8 9
> cut(tw, level=2)
[1] 4 4 4 5 5 6 7 6 7 5 7 5 7 6 7 5 5 5 4 4
The Roleček groups respecting heterogeneity can be extracted with cuth
("cut height")
where you must specify the number of groups:
> cuth(tw, ngroups=3)
[1] 4 4 4 5 5 3 3 3 3 5 3 5 3 3 3 5 5 5 4 4
You can also predict the membership of quadrats based on the indicator pseudospecies
and threshold score. This can also be done with argument newdata
using data set that
contains same species, but is not used in developing the classification.
> predict(tw, level=2)
[1] 4 4 4 5 5 6 7 6 7 5 7 5 7 6 7 5 5 5 4 4
Care is needed with newdata
: TWINSPAN will predict a class also when the
newdata
is completely unrelated to the original data. If there are no indicator
species, the predicted class will be the one with indicator scores always 0, or group 21
in this example.
TWINSPAN classification is based
on the polished ordination, and the indicator pseudospecies only indicate this
division, and predict
based on pseudospecies can give different classificatin than
the actual TWINSPAN that split the data by polished axis of correspondence analysis.
Function misclassified
will list the quadrats where the pseudospecies-based and
actual TWINSPAN classification differ. In this simple data set there are no such conflicts.
The data can be tabulated with:
> twintable(tw)
00000000000011111111
00000111111100011111
011110000011 00011
01111
A AB BBBB X CCC
1AAA21BB1111CC1CC111
91230645027868479513
00000 ClayperfB -----6566443--------
00000 StelmediB -----2322543--------
00000 GeasfornB ------22-222--------
00001 CerafontB -----2--2233--------
00001 CirsvulgB -----2----2---------
00001 Heraspho ----------2---------
00001 CardhirsB ----------22--------
0001 Inulcony ----2-----23--------
0001 Bracruta --23--2-2222--------
001000 Hyporadi 2-------------------
001000 Arrhelat 3-------------------
001001 UrtidioiA 25334---------------
001001 Soncaspe ----2---------------
001001 EurhpraeA 33233---------------
001001 Lophhete ----2---------------
00101 Sambnigr ----22--------------
0011 Soladulc 2442332232-22-22--2-
0011 Rubufrut 223-4223222---2-----
0011 Epilangu 4---22--23----2-----
0011 Hypncupr -222---22-----2-----
01 Elaerham 77777777777756644555
01 Jacovulg 33232222---2--222322
01 Rumecris 322222-----2222-2-22
10 Soncarve 2-------2---222-223-
10 Calysold --------22-2-33222--
10 Agrostol ----2-----2--22-22--
10 Verocham 2-------2-----2-2---
11000 FestrubrC -------2----33322345
11000 BracalbiC ------------2-2-3-2-
11001 ElymrepeC ------------22445222
11001 AmmoarenC ------------44262653
11001 PoapratC -------------22-2232
1101 RanubulbC --------------2-2-22
1101 PlanlancC --------------2-2323
1101 Cladranf -------------------2
111 Ononspin ------------3-------
111 Galiveru ------------2-2-----
111 Bryuincl ------------2-2-----
111 Syntrura -------------22-----
111 Bovinigr --------------2-----
20 sites, 40 species
The binary labels before species and above quadrats specify the
grouping. Species names ending in capital letter A
, B
or C
where regarded as diagnostic for these quadrat groups
(Shimwell 1971). The table can be large, but you can limit the
number of species or only list "leading species" for each group
(most abundant and frequent in the species group), or species
used as indicators (see summary
) or both of these, or you
can subset quadrats.
A compact visual summary of classification can be displayed with
> image(tw, height="chi", leadingspecies=TRUE, reorder=TRUE)
TWINSPAN stands for TWo-way INdicator SPecies ANalysis, and most of
the functions described above can be used to display species. For instance,
summary(tw, "species")
will show the steps of grouping species.
Hill, M.O. (1979) TWINSPAN: A FORTRAN program for arranging multivariate data in an ordered two-way table by classification of the individuals and attributes. Ecology and Systematics, Cornell University, Ithaca, NY.
Roleček, J, Tichý, L., Zelený, D. & Chytrý, M. (2009). Modified TWINSPAN classification in which the hierarchy respects cluster heterogeneity. J Veg Sci 20: 596-602.
Shimwell, D. W. (1971) Description and Classification of Vegetation. Sidgwick & Jackson.