
Search for an Optimal Multi-Regime (Shift) Configuration on a Phylogeny
Source:R/searchOptimalConfiguration.R
searchOptimalConfiguration.RdGreedy, stepwise search for evolutionary regime shifts on a SIMMAP-style phylogeny
using multivariate mvgls fits from mvMORPH. The routine:
builds one-shift candidate trees for all internal nodes meeting a tip-size threshold (via
generatePaintedTrees),fits each candidate in parallel and ranks them by improvement in the chosen information criterion (IC;
GICorBIC),iteratively adds shifts that pass a user-defined acceptance threshold,
optionally revisits accepted shifts to prune overfitting using a small IC tolerance window,
optionally computes per-shift IC weights by refitting the model with each shift removed.
Models are fitted directly in multivariate trait space (no PCA), assuming a multi-rate
Brownian Motion with proportional VCV scaling across regimes. Extra arguments in ...
are forwarded to mvgls (e.g., method = "LL" or
method = "PL-LOOCV", penalty, error = TRUE, etc.).
Usage
searchOptimalConfiguration(
baseline_tree,
trait_data,
formula = "trait_data ~ 1",
min_descendant_tips,
num_cores = 2,
ic_uncertainty_threshold = 1,
shift_acceptance_threshold = 1,
uncertaintyweights = FALSE,
uncertaintyweights_par = FALSE,
plot = FALSE,
IC = "GIC",
store_model_fit_history = TRUE,
verbose = FALSE,
...
)Arguments
- baseline_tree
A rooted SIMMAP/
phyloobject representing the baseline (single-regime) tree. If not SIMMAP-initialized, it should already be painted to a single baseline state and have tip order matchingtrait_data.- trait_data
A
matrixordata.frameof continuous trait values with row names matchingbaseline_tree$tip.label(same order). For the defaultformula = "trait_data ~ 1",trait_datais typically supplied as a numeric matrix so that the multivariate response is interpreted correctly bymvgls(). When using more general formulas (e.g., pGLS-style models), adata.framewith named columns can be used instead.- formula
Character formula passed to
mvgls. Defaults to"trait_data ~ 1", which fits an intercept-only model treating the supplied multivariate trait matrix as the response. This is the appropriate choice for most morphometric data where there are no predictor variables. For more general models,formulacan reference subsets oftrait_dataexplicitly, for example"trait_data[, 1:5] ~ 1"to treat columns 1–5 as a multivariate response, or"trait_data[, 1:5] ~ trait_data[, 6]"to fit a multivariate pGLS with column 6 as a predictor.- min_descendant_tips
Integer (\(\ge\)1). Minimum number of tips required for an internal node to be considered as a candidate shift (forwarded to
generatePaintedTrees). Larger values reduce the number of candidate shifts by excluding very small clades. For empirical datasets, values around10are a reasonable starting choice and can be tuned in sensitivity analyses.- num_cores
Integer. Number of workers for parallel candidate scoring. Uses
future::plan(multicore)on Unix outsideRStudio; otherwise usesfuture::plan(multisession). During the parallel candidate-scoring blocks, BLAS/OpenMP threads are capped to 1 (per worker) to avoid CPU oversubscription.- ic_uncertainty_threshold
Numeric (\(\ge\)0). Reserved for future development in post-search pruning and uncertainty analysis; currently not used by
searchOptimalConfiguration().- shift_acceptance_threshold
Numeric (\(\ge\)0). Minimum IC improvement (baseline - new) required to accept a candidate shift during the forward search. Larger values yield more conservative models. For analyses based on the Generalized Information Criterion (
"GIC"), a threshold on the order of20units is a conservative choice that tends to admit only strongly supported shifts. Simulation studies (Berv et al., in preparation) suggest that this choice yields good balanced accuracy between detecting true shifts and avoiding false positives, but users should explore alternative thresholds in sensitivity analyses for their own datasets.- uncertaintyweights
Logical. If
TRUE, compute per-shift IC weights serially by refitting the optimized model with each shift removed in turn. Exactly one ofuncertaintyweightsoruncertaintyweights_parmust beTRUEto trigger IC-weight calculations; setting both toTRUEwill result in an error. When enabled, the per-shift weights are returned in the$ic_weightscomponent of the result.- uncertaintyweights_par
Logical. As above, but compute per-shift IC weights in parallel using future.apply. Exactly one of
uncertaintyweightsoruncertaintyweights_parmust beTRUEto trigger IC-weight calculations.- plot
Logical. If
TRUE, draw/update a SIMMAP plot as the search proceeds (requires phytools).- IC
Character. Which information criterion to use, one of
"GIC"or"BIC"(case-sensitive).- store_model_fit_history
Logical. If
TRUE, store a per-iteration record of fitted models, acceptance decisions, and IC values. To keep memory usage low during the search, per-iteration results are written to a temporary directory (tempdir()) and read back into memory at the end of the run.- verbose
Logical. If
TRUE, report progress during candidate generation and model fitting. By default, progress is emitted viamessage(). Whenplot = TRUEin an interactiveRStudiosession, progress is written viacat()so it remains visible while plots are updating. Set toFALSEto run quietly (default). UsesuppressMessages()(andcapture.output()if needed) to silence or capture output.- ...
Additional arguments passed to
mvgls(e.g.,method,penalty,target,error, etc.).
Value
A named list with (at minimum):
user_input: captured call (as a list) for reproducibility.tree_no_uncertainty_transformed: SIMMAP tree from the optimal (no-uncertainty) model on the transformed scale used internally bymvgls.tree_no_uncertainty_untransformed: same topology with original edge lengths restored.model_no_uncertainty: the finalmvglsmodel object.shift_nodes_no_uncertainty: integer vector of accepted shift nodes.optimal_ic: final IC value;baseline_ic: baseline IC.IC_used:"GIC"or"BIC";num_candidates: count of candidate one-shift models evaluated.model_fit_history: ifstore_model_fit_history = TRUE, a list of per-iteration fits (loaded from temporary files written during the run) and anic_acceptance_matrix(IC value and acceptance flag per step).VCVs: named list of regime-specific VCV matrices extracted from the final model (penalized-likelihood estimates if PL was used).
Additional components appear conditionally:
ic_weights: adata.frameof per-shift IC weights and evidence ratios whenuncertaintyweightsoruncertaintyweights_parisTRUE.warnings: character vector of warnings/errors encountered during fitting (if any).
Details
Input requirements.
Tree:
baseline_treeshould be a rootedphylo(or SIMMAP-style) tree with branch lengths interpreted in units of time. An ultrametric tree is not required.Trait data alignment:
rownames(trait_data)must matchbaseline_tree$tip.labelin both names and order; any tips without data should be pruned beforehand.Data type:
trait_datais typically a numeric matrix of continuous traits; high-dimensional settings (p \(\ge\) n) are supported via penalized-likelihoodmvgls()fits.
Search outline.
Baseline: Fit
mvglson the baseline tree (single regime) to obtain the baseline IC.Candidates: Build one-shift trees for eligible internal nodes (
generatePaintedTrees); fit each withfitMvglsAndExtractGIC.formulaorfitMvglsAndExtractBIC.formula(internal helpers; not exported) and rank by \(\Delta\)IC.Greedy add: Add the top candidate, refit, and accept if \(\Delta\)IC \(\ge\)
shift_acceptance_threshold; continue down the ranked list.Optional IC weights: If
uncertaintyweights(oruncertaintyweights_par) isTRUE, compute an IC weight for each accepted shift by refitting the final model with that shift removed and comparing the two ICs viaaicw.
Parallelization. Candidate sub-model fits are distributed with future + future.apply.
On Unix, multicore is used; on Windows, multisession. A sequential plan is restored afterward.
Plotting. If plot = TRUE, trees are rendered with
plotSimmap(); shift IDs are labeled with nodelabels().
Regime VCVs. The returned $VCVs are extracted from the fitted multi-regime model via
extractRegimeVCVs and reflect regime-specific covariance
estimates (when mvgls is fitted under a PL/ML method).
For high-dimensional trait datasets (p \(\ge\) n), penalized-likelihood settings in
mvgls() are often required for stable estimation. In practice, methods such as
method = "LL" or method = "H&L" combined with appropriate penalties (e.g.,
ridge-type penalties) have proven effective for intercept-only multivariate Brownian
motion models, as illustrated in the package vignettes. Users should consult the
mvMORPH documentation for details on available methods and penalties and
tune these choices to the structure of their data.
Note
Internally, this routine coordinates multiple unexported helper functions:
generatePaintedTrees, fitMvglsAndExtractGIC.formula,
fitMvglsAndExtractBIC.formula, addShiftToModel,
removeShiftFromTree, and extractRegimeVCVs. Through these,
it may also invoke lower-level utilities such as paintSubTree_mod
and paintSubTree_removeShift. These helpers are internal
implementation details and are not part of the public API.
Convergence and robustness
The search is greedy and may converge to a local optimum. Use a stricter
shift_acceptance_threshold to reduce overfitting, and re-run the search
with different min_descendant_tips and IC choices ("GIC" vs "BIC")
to assess stability of the inferred shifts. For a given run, the optional IC-weight
calculations (uncertaintyweights or uncertaintyweights_par) can be used
to quantify support for individual shifts. It is often helpful to repeat the analysis
under slightly different settings (e.g., thresholds or candidate-size constraints) and
compare the resulting sets of inferred shifts.
See also
mvgls, GIC, BIC,
plot_ic_acceptance_matrix for visualizing IC trajectories and shift
acceptance decisions, and generateViridisColorScale for mapping
regime-specific rates or parameters to a viridis color scale when plotting trees;
packages: mvMORPH, future, future.apply, phytools, ape.
Examples
library(ape)
library(phytools)
#> Loading required package: maps
#>
#> Attaching package: ‘maps’
#> The following object is masked from ‘package:viridis’:
#>
#> unemp
library(mvMORPH)
#> Loading required package: corpcor
#> Loading required package: subplex
#> ##
#> ## mvMORPH package (1.2.1)
#> ## Multivariate evolutionary models
#> ##
#> ## See the tutorials: browseVignettes("mvMORPH")
#> ##
#> ## To cite package 'mvMORPH': citation("mvMORPH")
#> ##
set.seed(1)
# Simulate a tree
tr <- pbtree(n = 50, scale = 1)
# Define two regimes: "0" (baseline) and "1" (high-rate) on a subset of tips
states <- setNames(rep("0", Ntip(tr)), tr$tip.label)
high_clade_tips <- tr$tip.label[1:20]
states[high_clade_tips] <- "1"
# Make a SIMMAP tree for the BMM simulation
simmap <- phytools::make.simmap(tr, states, model = "ER", nsim = 1)
#> make.simmap is sampling character histories conditioned on
#> the transition matrix
#>
#> Q =
#> 0 1
#> 0 -0.3147857 0.3147857
#> 1 0.3147857 -0.3147857
#> (estimated using likelihood);
#> and (mean) root node prior probabilities
#> pi =
#> 0 1
#> 0.5 0.5
#> Done.
# Simulate traits under a BMM model with ~10x higher rate in regime "1"
sigma <- list(
"0" = diag(0.1, 2),
"1" = diag(1.0, 2)
)
theta <- c(0, 0)
sim <- mvMORPH::mvSIM(
tree = simmap,
nsim = 1,
model = "BMM",
param = list(
ntraits = 2,
sigma = sigma,
theta = theta
)
)
# mvSIM returns either a matrix or a list of matrices depending on mvMORPH version
X <- if (is.list(sim)) sim[[1]] else sim
rownames(X) <- simmap$tip.label
# Run the search on the unpainted tree (single baseline regime)
res <- searchOptimalConfiguration(
baseline_tree = as.phylo(simmap),
trait_data = X,
formula = "trait_data ~ 1",
min_descendant_tips = 10,
num_cores = 1, # keep it simple / CRAN-safe
shift_acceptance_threshold = 20, # conservative GIC threshold
IC = "GIC",
plot = FALSE,
store_model_fit_history = FALSE,
verbose = FALSE
)
res$shift_nodes_no_uncertainty
#> [1] 54
res$optimal_ic - res$baseline_ic
#> [1] -38.3827
str(res$VCVs)
#> List of 2
#> $ 0: num [1:2, 1:2] 0.129 0 0 0.129
#> $ 1: num [1:2, 1:2] 0.991 0 0 0.991