Skip to contents

Greedy, stepwise search for evolutionary regime shifts on a SIMMAP-style phylogeny using multivariate mvgls fits from mvMORPH. The routine:

  1. builds one-shift candidate trees for all internal nodes meeting a tip-size threshold (via generatePaintedTrees),

  2. fits each candidate in parallel and ranks them by improvement in the chosen information criterion (IC; GIC or BIC),

  3. iteratively adds shifts that pass a user-defined acceptance threshold,

  4. optionally revisits accepted shifts to prune overfitting using a small IC tolerance window,

  5. optionally computes per-shift IC weights by refitting the model with each shift removed.

Models are fitted directly in multivariate trait space (no PCA), assuming a multi-rate Brownian Motion with proportional VCV scaling across regimes. Extra arguments in ... are forwarded to mvgls (e.g., method = "LL" or method = "PL-LOOCV", penalty, error = TRUE, etc.).

Usage

searchOptimalConfiguration(
  baseline_tree,
  trait_data,
  formula = "trait_data ~ 1",
  min_descendant_tips,
  num_cores = 2,
  ic_uncertainty_threshold = 1,
  shift_acceptance_threshold = 1,
  uncertaintyweights = FALSE,
  uncertaintyweights_par = FALSE,
  plot = FALSE,
  IC = "GIC",
  store_model_fit_history = TRUE,
  verbose = FALSE,
  ...
)

Arguments

baseline_tree

A rooted SIMMAP/phylo object representing the baseline (single-regime) tree. If not SIMMAP-initialized, it should already be painted to a single baseline state and have tip order matching trait_data.

trait_data

A matrix or data.frame of continuous trait values with row names matching baseline_tree$tip.label (same order). For the default formula = "trait_data ~ 1", trait_data is typically supplied as a numeric matrix so that the multivariate response is interpreted correctly by mvgls(). When using more general formulas (e.g., pGLS-style models), a data.frame with named columns can be used instead.

formula

Character formula passed to mvgls. Defaults to "trait_data ~ 1", which fits an intercept-only model treating the supplied multivariate trait matrix as the response. This is the appropriate choice for most morphometric data where there are no predictor variables. For more general models, formula can reference subsets of trait_data explicitly, for example "trait_data[, 1:5] ~ 1" to treat columns 1–5 as a multivariate response, or "trait_data[, 1:5] ~ trait_data[, 6]" to fit a multivariate pGLS with column 6 as a predictor.

min_descendant_tips

Integer (\(\ge\)1). Minimum number of tips required for an internal node to be considered as a candidate shift (forwarded to generatePaintedTrees). Larger values reduce the number of candidate shifts by excluding very small clades. For empirical datasets, values around 10 are a reasonable starting choice and can be tuned in sensitivity analyses.

num_cores

Integer. Number of workers for parallel candidate scoring. Uses future::plan(multicore) on Unix outside RStudio; otherwise uses future::plan(multisession). During the parallel candidate-scoring blocks, BLAS/OpenMP threads are capped to 1 (per worker) to avoid CPU oversubscription.

ic_uncertainty_threshold

Numeric (\(\ge\)0). Reserved for future development in post-search pruning and uncertainty analysis; currently not used by searchOptimalConfiguration().

shift_acceptance_threshold

Numeric (\(\ge\)0). Minimum IC improvement (baseline - new) required to accept a candidate shift during the forward search. Larger values yield more conservative models. For analyses based on the Generalized Information Criterion ("GIC"), a threshold on the order of 20 units is a conservative choice that tends to admit only strongly supported shifts. Simulation studies (Berv et al., in preparation) suggest that this choice yields good balanced accuracy between detecting true shifts and avoiding false positives, but users should explore alternative thresholds in sensitivity analyses for their own datasets.

uncertaintyweights

Logical. If TRUE, compute per-shift IC weights serially by refitting the optimized model with each shift removed in turn. Exactly one of uncertaintyweights or uncertaintyweights_par must be TRUE to trigger IC-weight calculations; setting both to TRUE will result in an error. When enabled, the per-shift weights are returned in the $ic_weights component of the result.

uncertaintyweights_par

Logical. As above, but compute per-shift IC weights in parallel using future.apply. Exactly one of uncertaintyweights or uncertaintyweights_par must be TRUE to trigger IC-weight calculations.

plot

Logical. If TRUE, draw/update a SIMMAP plot as the search proceeds (requires phytools).

IC

Character. Which information criterion to use, one of "GIC" or "BIC" (case-sensitive).

store_model_fit_history

Logical. If TRUE, store a per-iteration record of fitted models, acceptance decisions, and IC values. To keep memory usage low during the search, per-iteration results are written to a temporary directory (tempdir()) and read back into memory at the end of the run.

verbose

Logical. If TRUE, report progress during candidate generation and model fitting. By default, progress is emitted via message(). When plot = TRUE in an interactive RStudio session, progress is written via cat() so it remains visible while plots are updating. Set to FALSE to run quietly (default). Use suppressMessages() (and capture.output() if needed) to silence or capture output.

...

Additional arguments passed to mvgls (e.g., method, penalty, target, error, etc.).

Value

A named list with (at minimum):

  • user_input: captured call (as a list) for reproducibility.

  • tree_no_uncertainty_transformed: SIMMAP tree from the optimal (no-uncertainty) model on the transformed scale used internally by mvgls.

  • tree_no_uncertainty_untransformed: same topology with original edge lengths restored.

  • model_no_uncertainty: the final mvgls model object.

  • shift_nodes_no_uncertainty: integer vector of accepted shift nodes.

  • optimal_ic: final IC value; baseline_ic: baseline IC.

  • IC_used: "GIC" or "BIC"; num_candidates: count of candidate one-shift models evaluated.

  • model_fit_history: if store_model_fit_history = TRUE, a list of per-iteration fits (loaded from temporary files written during the run) and an ic_acceptance_matrix (IC value and acceptance flag per step).

  • VCVs: named list of regime-specific VCV matrices extracted from the final model (penalized-likelihood estimates if PL was used).

Additional components appear conditionally:

  • ic_weights: a data.frame of per-shift IC weights and evidence ratios when uncertaintyweights or uncertaintyweights_par is TRUE.

  • warnings: character vector of warnings/errors encountered during fitting (if any).

Details

Input requirements.

  • Tree: baseline_tree should be a rooted phylo (or SIMMAP-style) tree with branch lengths interpreted in units of time. An ultrametric tree is not required.

  • Trait data alignment: rownames(trait_data) must match baseline_tree$tip.label in both names and order; any tips without data should be pruned beforehand.

  • Data type: trait_data is typically a numeric matrix of continuous traits; high-dimensional settings (p \(\ge\) n) are supported via penalized-likelihood mvgls() fits.

Search outline.

  1. Baseline: Fit mvgls on the baseline tree (single regime) to obtain the baseline IC.

  2. Candidates: Build one-shift trees for eligible internal nodes (generatePaintedTrees); fit each with fitMvglsAndExtractGIC.formula or fitMvglsAndExtractBIC.formula (internal helpers; not exported) and rank by \(\Delta\)IC.

  3. Greedy add: Add the top candidate, refit, and accept if \(\Delta\)IC \(\ge\) shift_acceptance_threshold; continue down the ranked list.

  4. Optional IC weights: If uncertaintyweights (or uncertaintyweights_par) is TRUE, compute an IC weight for each accepted shift by refitting the final model with that shift removed and comparing the two ICs via aicw.

Parallelization. Candidate sub-model fits are distributed with future + future.apply. On Unix, multicore is used; on Windows, multisession. A sequential plan is restored afterward.

Plotting. If plot = TRUE, trees are rendered with plotSimmap(); shift IDs are labeled with nodelabels().

Regime VCVs. The returned $VCVs are extracted from the fitted multi-regime model via extractRegimeVCVs and reflect regime-specific covariance estimates (when mvgls is fitted under a PL/ML method).

For high-dimensional trait datasets (p \(\ge\) n), penalized-likelihood settings in mvgls() are often required for stable estimation. In practice, methods such as method = "LL" or method = "H&L" combined with appropriate penalties (e.g., ridge-type penalties) have proven effective for intercept-only multivariate Brownian motion models, as illustrated in the package vignettes. Users should consult the mvMORPH documentation for details on available methods and penalties and tune these choices to the structure of their data.

Note

Internally, this routine coordinates multiple unexported helper functions: generatePaintedTrees, fitMvglsAndExtractGIC.formula, fitMvglsAndExtractBIC.formula, addShiftToModel, removeShiftFromTree, and extractRegimeVCVs. Through these, it may also invoke lower-level utilities such as paintSubTree_mod and paintSubTree_removeShift. These helpers are internal implementation details and are not part of the public API.

Convergence and robustness

The search is greedy and may converge to a local optimum. Use a stricter shift_acceptance_threshold to reduce overfitting, and re-run the search with different min_descendant_tips and IC choices ("GIC" vs "BIC") to assess stability of the inferred shifts. For a given run, the optional IC-weight calculations (uncertaintyweights or uncertaintyweights_par) can be used to quantify support for individual shifts. It is often helpful to repeat the analysis under slightly different settings (e.g., thresholds or candidate-size constraints) and compare the resulting sets of inferred shifts.

See also

mvgls, GIC, BIC, plot_ic_acceptance_matrix for visualizing IC trajectories and shift acceptance decisions, and generateViridisColorScale for mapping regime-specific rates or parameters to a viridis color scale when plotting trees; packages: mvMORPH, future, future.apply, phytools, ape.

Examples

library(ape)
library(phytools)
#> Loading required package: maps
#> 
#> Attaching package: ‘maps’
#> The following object is masked from ‘package:viridis’:
#> 
#>     unemp
library(mvMORPH)
#> Loading required package: corpcor
#> Loading required package: subplex
#> ##
#> ## mvMORPH package (1.2.1)
#> ## Multivariate evolutionary models
#> ##
#> ## See the tutorials: browseVignettes("mvMORPH")
#> ##
#> ## To cite package 'mvMORPH': citation("mvMORPH")
#> ##
set.seed(1)

# Simulate a tree
tr <- pbtree(n = 50, scale = 1)

# Define two regimes: "0" (baseline) and "1" (high-rate) on a subset of tips
states <- setNames(rep("0", Ntip(tr)), tr$tip.label)
high_clade_tips <- tr$tip.label[1:20]
states[high_clade_tips] <- "1"

# Make a SIMMAP tree for the BMM simulation
simmap <- phytools::make.simmap(tr, states, model = "ER", nsim = 1)
#> make.simmap is sampling character histories conditioned on
#> the transition matrix
#> 
#> Q =
#>            0          1
#> 0 -0.3147857  0.3147857
#> 1  0.3147857 -0.3147857
#> (estimated using likelihood);
#> and (mean) root node prior probabilities
#> pi =
#>   0   1 
#> 0.5 0.5 
#> Done.

# Simulate traits under a BMM model with ~10x higher rate in regime "1"
sigma <- list(
  "0" = diag(0.1, 2),
  "1" = diag(1.0, 2)
)
theta <- c(0, 0)

sim <- mvMORPH::mvSIM(
  tree  = simmap,
  nsim  = 1,
  model = "BMM",
  param = list(
    ntraits = 2,
    sigma   = sigma,
    theta   = theta
  )
)

# mvSIM returns either a matrix or a list of matrices depending on mvMORPH version
X <- if (is.list(sim)) sim[[1]] else sim
rownames(X) <- simmap$tip.label

# Run the search on the unpainted tree (single baseline regime)
res <- searchOptimalConfiguration(
  baseline_tree              = as.phylo(simmap),
  trait_data                 = X,
  formula                    = "trait_data ~ 1",
  min_descendant_tips        = 10,
  num_cores                  = 1,   # keep it simple / CRAN-safe
  shift_acceptance_threshold = 20,  # conservative GIC threshold
  IC                         = "GIC",
  plot                       = FALSE,
  store_model_fit_history    = FALSE,
  verbose                    = FALSE
)

res$shift_nodes_no_uncertainty
#> [1] 54
res$optimal_ic - res$baseline_ic
#> [1] -38.3827
str(res$VCVs)
#> List of 2
#>  $ 0: num [1:2, 1:2] 0.129 0 0 0.129
#>  $ 1: num [1:2, 1:2] 0.991 0 0 0.991