ecostructure is an R package for clustering and visualizing structure in ecological assemblages using Mixed Membership or Grade of Membership (GoM) models (see reference). The package can incorporate data on different spatial scales - assemblages can reflect species abundances in local communities (e.g., data in community ecology and microbiomics) or presence and absence of species across continent(s) (e.g., macroecology and biogeography). ecostructure also allows phylogenetic, geographic, and trait data to be incorporated in the clustering framework, producing clusters along different axes of bio-diversity, which we call motifs.
This tutorial focuses on the analysis of bird assemblages at global and local scales, showcasing the use of these tools and methods to assess the turnover of species across samples and illustrate that turnover with visualizations.
The main contributions of the ecostructure package are as follows
Functions to cluster species abundance and species presence/absence data by fitting GoM models.
New methods for visualizing the structure of species assemblages (e.g., communities or biotas), leveraging the output from the appropriate GoM model.
Tools to process species level data using a phylogeny, a trait dataset, or a GIS dataset in order to examine the contribution of history, function, and geography to species level structure.
The analytical framework provided by ecostructure is employed in our accompanying paper (see citation below). The paper focuses on a survey of bird species abundances at 38 local sites across the all elevations in the eastern and western Himalaya, the data from which is provided as part of this package.
If you find our package useful or want to known more about the underlying methods, please check out our paper:
White, Alexander E. and Dey, Kushal K. and Mohan, Dhananjai and Stephens, Matthew and Price, Trevor D. Regional influences on community structure across the tropical-temperate divide. Nature Communications. 2019. 10 (1). 2646. 10.1038/s41467-019-10253-6
ecostructure requires core geospatial libraries supporting the use of GIS data in R - GDAL (\(>= 2.0.1\)), GEOS (\(>= 3.4.0\)), and PROJ.4 (\(>= 4.8.0\)). Typically these libraries get automatically installed as part of the dependency packages to ecostructure.
ecostructure requires access to the “gfortran” library. Mac OS X users may encounter the error “library not found for -lgfortran” when installing. To fix this error, please follow the instructions at this link.
ecostructure requires R version \(>= 3.4.0\)
To install ecostructure, first, install the dependencies.
source("https://bioconductor.org/biocLite.R")
biocLite("Biobase")
install.packages("devtools")
devtools::install_github("kkdey/methClust");
devtools::install_github("kkdey/CountClust")
devtools::install_github("TaddyLab/maptpx")
Next, load the dependencies
library(Biobase)
library(methClust)
library(CountClust)
library(maptpx)
Next, install ecostructure.
devtools::install_github("kkdey/ecostructure")
Finally, load the ecostructure package.
library(ecostructure)
A comprehensive list of all the functions in the ecostructure package can be found by running the following command.
help(package = "ecostructure")
We present the abundances of birds in 38 five ha forest patches across elevations in both the eastern and western Himalayas, together with the site metadata. There are 304 species in the dataset, and species morphological trait data (species means) are also presented along with a time-calibrated phylogeny of those species. For methods related to measuring traits and building the phylogeny, see this paper for details).
The data is saved as an ExpressionSet object and can be read as follows
data("himalayan_birds")
One can extract the species abundance matrix as follows
species_counts <- t(exprs(himalayan_birds))
t(species_counts)[1:10,1:5]
## A2 A3 A4 A6 A7
## Macropygia_unchall 0 0 0 0 0
## Streptopelia_chinensis 0 0 0 0 0
## Streptopelia_senegalensis 0 0 0 0 0
## Columba_pulchricollis 0 0 0 0 2
## Streptopelia_orientalis 0 0 0 0 0
## Chalcophaps_indica 0 0 0 0 0
## Treron_curvirostra 2 0 0 0 0
## Treron_apicauda 0 2 2 0 0
## Treron_sphenurus 0 0 0 0 0
## Treron_phoenicopterus 0 0 0 0 0
The site metadata for the Himalayan sites - describing elevation, east/west, latitude, and longitude - can be extracted as follows.
site_metadata <- pData(phenoData(himalayan_birds))
head(site_metadata)
## Elevation North East WorE
## A2 198.25 26.97898 92.92198 E
## A3 734.25 27.00627 92.40457 E
## A4 1243.25 27.02750 92.41041 E
## A6 2629.00 27.14773 92.45938 E
## A7 2340.25 27.09198 92.40857 E
## A8 300.00 26.96138 93.01216 E
Finally, the species metadata, comprising the means of the bill traits, wing size, tarsus and mass of each species, can be read as follows
species_metadata <- pData(featureData(himalayan_birds))
head(species_metadata, 4)
## bill_length bill_width bill_depth wing tarsus
## Macropygia_unchall 11.080 4.260 4.970 197.550 26.305
## Streptopelia_chinensis 10.765 3.505 3.870 140.000 22.620
## Streptopelia_senegalensis 9.230 2.880 3.270 130.225 20.350
## Columba_pulchricollis 12.983 5.595 5.678 203.075 25.373
## mass
## Macropygia_unchall 168.0
## Streptopelia_chinensis 159.0
## Streptopelia_senegalensis 83.9
## Columba_pulchricollis 330.0
Besides the survey data, ecostructure also provides the presence absence matrix for all breeding birds in southeast Asia, encompassing the region of the survey sites in the above data. This data has been prepared from publicly available shapefiles (.shp) from BirdLife International.
data("indian_birds_pa")
The ecos_fit
function fits Grade of Membership models either on binary presence-absence data or species abundance counts data. We illustrate each case below.
We use ecos_fit
to evaluate structure in our Himalayan birds dataset. These data are counts of species in sites, and so the function fits a Multinomial model to generate species abundance motifs. The data structure (abundance or presence/absence) is evaluated by the function automatically.
set.seed(1000)
fit <- ecos_fit(species_counts, K = 2, tol = 0.1, num_trials = 10)
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 18315.4, 83.3, 88.4, 0.6, done.
## log BF( 2 ) = 2175.01
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 2937.3, 10.8, 0.5, done.
## log BF( 2 ) = 1772.92
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 20698.2, 63.5, 12.4, 13.4, 6.6, 0.6, done.
## log BF( 2 ) = 1791.65
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 15913.2, 73.6, 0.7, done.
## log BF( 2 ) = 1922.38
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 2131.8, 61.4, 2, done.
## log BF( 2 ) = 1951.69
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 3565.7, 4, done.
## log BF( 2 ) = 1874.04
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 14118, 120, 45.1, 0.5, 0.3, 0.1, done.
## log BF( 2 ) = 1953.93
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 2411.6, 47.5, 2.1, done.
## log BF( 2 ) = 1950.96
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 20280.2, 88.5, 14.4, 4.2, 14.1, done.
## log BF( 2 ) = 2186.57
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 23791.2, 31, 10.2, 25.1, 0.7, done.
## log BF( 2 ) = 1936.48
The function outputs a lower rank representation of the data in species_counts
, according to the value of K chosen by the user. Above, we chose to examine the primary partition of the data into 2 motifs by setting K
= 2. Increasing num_trials
will run more burn-in trials for the model fit, and decreasing tol
will make the underlying algorithm run longer before convergence. These may need to be modified if the input data matrix is small.
The main components of the output from ecos_fit
are:
omega
: A site \(\times\) K matrix, K representing the number of clusters or species abundance motifs. The row sums equal 1 and the entries are positive. These represent the contribution of the different motifs to the composition (i.e., species abundance distribution) of each site.
theta
: A species \(\times\) K matrix, where the column sums equal 1. These values represent the contributions of each species to each motif k.
We can take the output from the model fit and visualize the motif contributions matrix omega
using a stacked bar chart.
east_west_dir <- factor(site_metadata$WorE, levels = c("W", "E"))
elevation_metadata <- site_metadata$Elevation
ecos_blocks(fit$omega,
blocker_metadata = east_west_dir,
order_metadata = elevation_metadata)
Because we are interested in species turnover across elevational and geographic locations, we block the two elevational gradients (western or eastern) and plot them side by side.
This blocking structure (blocker_metadata
) can be used for any environmental gradient of interest, allowing replicates or experimental treatments (in our case, geographic location) to be easily visualized together. order_metadata
determines the order of the stacked bars in each block. Other parameters can be customized to modify the axes, titles, etc.
See ecos_plot_pie
below to plot the data in a spatially explicit context (i.e., on a map).
A central component of this framework is assessing the species contributions to each motif. One may be tempted to merely examine the theta
matrix component of the model fit, and identify the the species that differentially drive the clustering.
We use the function ExtractTopFeatures
from the R package CountClust to compute and record the top contributing species to each species abundance motif.
features <- CountClust::ExtractTopFeatures(fit$theta, top_features = 5,
method = "poisson", options = "max")
t(apply(features$indices, c(1,2), function(x) return(rownames(fit$theta)[x])))
## [,1] [,2]
## [1,] "Phylloscopus_chloronotus" "Phylloscopus_xanthoschistos"
## [2,] "Phylloscopus_reguloides" "Cyornis_rubeculoides"
## [3,] "Seicercus_whistleri" "Macronous_gularis"
## [4,] "Phylloscopus_occipitalis" "Orthotomus_sutorius"
## [5,] "Aethopyga_nipalensis" "Copsychus_malabaricus"
One may also want to compare the clustering of the original data to that of randomly generated null matrices of species abundances. We provide a way to compare the observed model fit to null model fits of various types. Randomized matrices of are generated internally using the randomizeMatrix
function in the package picante.
out <- ecos_nullmodel(species_counts, K=2, null.model = "richness",
iter_randomized= 10, option = "BF")
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 20927.8, 10.5, 19.1, 4.6, 17.5, 5.2, 0.3, 0.9, 3.2, 0.5, 0.5, 0.2, done.
## log BF( 2 ) = 1353.48
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 1845.7, 6.7, 3.1, 16.9, 7.2, 3.5, 3.4, 0.6, done.
## log BF( 2 ) = 1440.07
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 18432.1, 11.1, 21.8, 1.1, 9.4, 0.4, 0.5, 2.1, 2, 0.2, 1, 0.2, done.
## log BF( 2 ) = 1455.82
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 1707.1, 46.3, 8.3, 5.7, 0.4, 0.3, 1.5, 2.8, 1, 0.1, done.
## log BF( 2 ) = 1324.43
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 1820, 14.1, 4.9, 3.6, 1.9, 3.3, 0.4, 1.4, 0.7, 0.9, 2.7, 0.4, done.
## log BF( 2 ) = 1473.37
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 19254.9, 24.9, 9, 3.5, 0.8, 9.2, 2, 3.4, 0.4, 0.8, done.
## log BF( 2 ) = 1270.9
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 1844.4, 28.1, 9.4, 6.8, 1.1, 1.6, done.
## log BF( 2 ) = 1243.54
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 1842.1, 18, 6.3, 11.9, 7.4, 1.3, 1.4, 8.2, 1.3, 0.5, done.
## log BF( 2 ) = 1360.54
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 20732.8, 41, 18.9, 5.5, 7.1, 0.5, 0.2, done.
## log BF( 2 ) = 1386.47
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 19997.9, 14.6, 47.5, 4.4, 0.7, 0.6, 0.6, 16.3, 1.8, 1.1, 0, done.
## log BF( 2 ) = 1279.4
##
## Estimating on a 38 samples collection.
## Fit and Bayes Factor Estimation for K = 2
## log posterior increase: 14985.7, 110.4, 43.3, 2.5, 0.1, done.
## log BF( 2 ) = 1761.84
Here we describe how to cluster species using a presence-absence data matrix of species across the globe or sub-regions of the globe.
The matrix may represent be local data or macroecological data (species in a 1\(^\circ\) \(\times\) 1\(^\circ\) map cell) - for functions to generate these data using GIS data sources (shapfiles or geodatabase files) see other sections below.
We again use ecos_fit
to fit Grade of Membership models on binary presence-absence data using a Binomial/Bernoulli model. We refer to the clusters/motifs obtained by this procedure as species motifs.
data("indian_birds_pa")
pres_ab_fit <- ecos_fit(indian_birds_pa, K = 2, tol = 0.1)
The above code takes around 4 minutes to run on a single machine.
Again, the function outputs a lower (specified by K) rank representation of the data indian_birds_pa
. The output pres_ab_fit
consists of omega
and theta
components and these matrices can be understood as:
omega
: A site \(\times\) K matrix, K representing the number of clusters or species motifs. The row sums equal 1 and the entries are positive. These represent the contribution of the different motifs to the species composition of each location.
theta
: A species \(\times\) K matrix, where each species has a probability of being found in the kth motif. These probabilities are not constrained to sum to 1 as in the multinomial case we illustrated earlier.
We have an example model fit of the above function run as part of the ecostructure package.
data("pres_ab_fit")
We can take the output from the model fit and again visualize the membership proportions omega
. However, as the data may consist of thousands of spatially explicit locations, we do not use ecos_blocks
as in the previous example, but instead use the function ecos_plot_pie
, which incorporates latitudinal and longitudinal coordinates of the original sites. We plot the omega
matrix in a spatial context as follows.
ecos_plot_pie(omega = pres_ab_fit$omega,
lat_lim = c(5,50),
long_lim = c(55,120),
path = "geostructure_plot.png",
color= c("red","blue"))
## reading background map shapefile from inst/extdata/ne_110m_coastline
## folder
This function will create a file geostructure_plot.png
in the current folder (getwd()
in R to figure out).
One can easily apply this to any region of interest. We provide an example of visualizing species presence-absence structure as generated by applying the binomial GoM model with \(K=6\) on the bird species presence absence matrix from Australia. Again these data have been processed from BirdLife International
data("australia_birds")
australia_model <- ecos_fit(australia_birds$pr_ab_data,
K=6, tol = 0.1)
The results from this fit, which takes < 2 minutes, are saved as australia_model
and can be visualized as follows
data("australia_model")
ecos_plot_pie(omega = australia_model$omega,
coords = australia_birds$latlong,
long_lim = c(110,160),
lat_lim = c(-50,-10),
path = "geostructure_plot_aus.png",
color= c("orange", "red", "yellow",
"deepskyblue", "chartreuse",
"blue"))
The plot should look as follows
ecostructure offers tools for users to examine how local patterns of community structure are related to potential geographic sources. One way to interpret the geographic inputs into a given site is to generate a site’s dispersion field by overlapping the geographic distributions of the species present at the site.
For a given dataset of local species presences (can be abundances but dispersion fields are generated based on presence), we can generate dispersion fields for each site and again fit a GoM model. These methods require some types of GIS data source, either shapefiles of species distributions, a geodatabase of species distributions (as in the .gdb of global bird species ditrbutions provided by BirdLife International), or even point occurences of species presences in a region.
We generate dispersion fields for our 38 sites using the function dsp_create_from_survey
.
## If GIS data are in the form of separate shapefiles (.shp)
local_him_disp <- dsp_create_from_survey(local_data = species_counts,
gis_data_type = "shp",
shp_dir = {PATH TO directory with species shapefiles})
## If GIS data are in the form of a geodatabase (.gdb)
local_him_disp <- dsp_create_from_survey(local_data = species_counts,
gis_data_type = "gdb",
gdb_dir = {PATH To Directory with Species Geodatabase})
dsp_create_from_survey
takes a local data matrix and a specified GIS data type to produce dispersion fields in a specified regions. The function reads in shapefiles for each species (or a .gdb for all species) and rasterizes those distributions for a given resolution. There are many arguments to customize so be sure to examine the documentation for this function. The output ‘local_him_disp’ consists of 3 objects.
raster
: A list of rasters representing the dispersion fields for each site (length is the number of rows (sites) in local data
) at the specified resolution.
precise
: A list of rasters representing the dispersion fields for each site (length is the number of rows (sites) in local data
) at the resolution of the argument precise
. These could be used for plotting high resolution dispersion fields.
matrix
: A list of the same rasters is raster
but in matrix form. These are used for easily processing the dispersion fields into the structure required to fit a GoM model.
We can generate maps for all of the disperison fields using the function dsp_plot_map
. We provide the dispersion fields for our 38 Himalyan community surveys, dispersion_field_ex
, as a list of matrices.
data("dispersion_field_ex")
dsp_plots <- dsp_plot_map(dispersion_field_ex,
raster_latlim = c(5, 50), raster_longlim = c(50, 120),
scale = "percentage")
dsp_plots[[1]]
dsp_plots
is now a list of plots, each of which visualizes the dispersion fields from our local sites.
We can easily process the dispersion fields into one data matrix by using dsp_to_matrix
dsp_fields_matrix <- dsp_to_matrix(dispersion_field_ex)
The output from dsp_to_matrix
is a data matrix with rows equivalent to each site’s dispersion field as a vector.
dim(dsp_fields_matrix)
## [1] 38 201600
Each column is a map cell in the region of interest, with the value being a count from each site of how many species from that site overlap the cell.
As these data are counts, we can now fit a multinomial GoM (using to examine geographic motifs. In this case, omega
is a matrix of motif contributions to dispersion fields, and the theta
matrix represents each map cells contrbution to each of the motifs. We can plot the theta
s on a map, showing the contribution of each map cell to the motif. Some biologists may think of these motifs as analagous to the regional inputs into local communities.
The model fit is done using ecos_fit
as above. Because this geographic data matrix may be particularly large, depending on the resolution chosen to produce the dispersion fields, the fit may take longer than the fit on the original data matrix.
fit <- ecos_fit(dsp_fields_matrix, K = 2, tol = 0.1, num_trials = 10)
We can plot the theta distributions of each geographical motif using the function dsp_motif
. We provide an example theta matrix with the package called example_theta
.
data("example_theta")
himalayan_geo_motifs <- dsp_motif(example_theta,
color_ramp = c("black", "darkseagreen3",
"orange","red"),
dsp_fld_res = 8)
himalayan_geo_motifs$motif_maps[[1]]
We call this color palette the Ghostbusters theme !! :)
The relationships between species, either phylogenetically or functionally, are an important component in evaluating ecological structure. We provide two functions, one for phylogenies and one for trait data, that collapse species into either clades or functional groups based on a desired cut off.
We load the time-calibrated phylogenetic tree phylo_tree
of the 304 Himalayan birds in species_counts
. We then collapse the species into clades to based on certain time point cut off on the tree collapse_at
. Below we cut the tree off at 10 million years ago.
data("phylo_tree")
phylo_counts <- ecos_prepare_by_phylo(species_counts, phylo_tree, collapse_at = 10)
dim(phylo_counts$outdat)
## [1] 38 63
We use morphological trait metadata from himalayan_birds
data and then use the distance metric based on these traits to collapse the bird species as above. As in the case of the time component above, one can set prop_div
to a collapse the tips of trait dendrogram to produce a specific number of functional groups (the proportion of the original diversity). Below we create \(91\) functional groups out of the original 304 species.
bill_traits <- as.matrix(dist(scale(species_metadata[,c(1:3)])))
counts_bill_traits <- ecos_prepare_by_trait(counts = species_counts,
traits = bill_traits,
prop_div=0.3)
dim(counts_bill_traits)
## [1] 38 91
Both phylo_counts
and counts_bill_traits
can now be passed to ecos_fit
to generate either phylogenetic motifs or trait motifs.
There are a number of ways to generate species presence absence data to evaluate species motifs. Conservation organizations (IUCN, BirdLife International, etc.) often distribute shapefiles and geodatabase that contain geographic distributions in the form of polygons. Point occurances can also be overlaid on a raster to produce a rough geographic distribution (this may be suitable for fitting GoMs at a large scale).
We provide a function dsp_create_from_gdb
that creates a presence absence matrix using these GIS data types. This function is slow on large datasets. We illustrate how it might be used below on the geodatabase of global bird distrbutions provided by BirdLife International.
## Request and fetch geodatabase (gdb) from BirdLife International : http://datazone.birdlife.org/species/requestdis or
## say we call one such gdb - example_birdlife.gdb
gdb_global_birds <- st_read("example_birdlife.gdb")
## Then run the following commands to generate the presence absence matrix
disp_from_scratch <- dsp_create_from_gdb(gdb_object = gdb_global_birds,
raster_resolution = 1,thresh = 3,
raster_latlim = c(-90,90),
raster_longlim = c(-180,180),
species_feature = "SCINAME")
The presence absence data is output as disp_from_scratch$pres_ab
on which ecos_fit
can be run.
fit <- ecos_fit(disp_from_scratch$pres_ab, K = 5,
tol = 0.1, num_trials = 10)
dsp_create_from_gdb
also returns the dispersion fields similar to dispersion fields create.
One may request the BirdLife dataset from here. Presence/absence data illustrated above was generated using this data source.
Users may also look to other packages that specifically focus on generating presence absence data from GIS data types. letsR is an incredibly useful package to this end. Presence-absence matrices from letsR can certainly be used in tandem with the ecostructure functions. Our goal is to contribute to that body of work by creating a tool that employs the speed and utility of the packages sf and fasterize (see velox as well).
ecostructure has been developed by Kushal K Dey and Alexander E White, in the labs of Matthew Stephens and Trevor Price labs at the University of Chicago.
The local Himalayan bird surveys were conducted by Trevor Price and Dhananjai Mohan.
For any queries or concerns related to the software, you can open an issue here.
Or you can contact us directly: Kushal K Dey - kshldey@gmail.com or Alex White - aewhite100@gmail.com.