Reproducing Tay et al. (2023)

Overview

This guide documents an end-to-end workflow for reproducing the main Episodic analysis pattern used in Tay et al. (2023) with episodic.

This guide focuses on the Fixed Local Clock (FLC) Shared-Stem model, which was the main model used in Tay et al. (2023), and compares it against strict and relaxed clock models using marginal likelihood estimation and Bayes factors.

Reference

Tay et al. 2023, Molecular Biology and Evolution, 40(10): msad212
https://academic.oup.com/mbe/article/40/10/msad212/7280106

Inputs

The input data is available in tests/data/Tay2023.afa and tests/data/Tay2023.yaml (config file).

Analysis setup

We aim to reproduce Figures 9 and 10 from Tay et al. (2023) by:

defining foreground lineage groups
running strict, relaxed, and FLC clock models
generating model-comparison and tree outputs

We analyze the Tay et al. dataset in a Bayesian phylogenetic framework using BEAST 1.10 through the episodic workflow, comparing three molecular clock classes: strict clock (SC), uncorrelated gamma-distributed relaxed clock (UCGD), and fixed local clock (FLC). In this guide, the focal local-clock model is FLC stem, with strict and relaxed clocks included as comparators for model fit and interpretation.

The analysis is conducted on a shared tree with group-defined foreground lineages (B.1.1.7, B.1.351, P.1, B.1.617.2) and a background lineage set defined by all remaining taxa. For FLC models, evolutionary rates are estimated separately for background and foreground branches according to the selected local-clock parameterization. This allows direct quantification of episodic rate shifts relative to the background process.

The substitution and tree models follow the default Episodic BEAST template used in this workflow (GTR with among-site rate heterogeneity and an exponential-growth coalescent prior). Proper priors are specified for molecular clock parameters, including gamma-distributed priors on clock-rate terms using the configured shape and scale (rate_gamma_prior_shape, rate_gamma_prior_scale). In relaxed-clock analyses, the UCGD mean and shape parameters are estimated from the posterior.

Posterior inference is performed by MCMC with user-defined chain length and sampling frequency (beast.chain_length, beast.samples), and duplicate runs are used to assess stability across independent chains. Posterior traces and summaries are then inspected for adequate mixing and effective sample size.

To compare molecular clock models, marginal likelihoods are estimated using generalized stepping-stone/path-sampling settings configured under marginal_likelihood (for example, number of path steps, per-path chain length, and duplicate estimators). These estimates are summarized into Bayes factor-style model comparisons, alongside complementary effect-size outputs from local-clock rate contrasts.

Finally, the workflow produces MCC trees, posterior summary tables, and clock-specific diagnostic plots to connect model fit with biologically interpretable branch-rate patterns.

Run

Create a run config (for example tests/data/Tay2023.yaml). The configure controls every aspect of the workflow, however you can also override settings on the command line (for example, to run marginal likelihood estimation separately).

# Clocks to compare
clock:
- strict
- relaxed
- flc-shared-stem
# Gamma prior settings for clock rates (applies to all clocks, including background and foreground rates in FLC models)
rate_gamma_prior_scale: 0.1
rate_gamma_prior_shape: 0.5
# BEAST MCMC settings
beast:
  chain_length: 10000000
  samples: 1000
  duplicates: 2 # run in duplicate to check convergence
  threads: 4
# We expect the format of fasta headers to be ID@GROUP@DATE, 
# e.g. [email protected]@2020.11 (date is in decimal years)
date_delimiter: '@'
date_index: -1
# These groups will be forced monophyletic and assigned to the foreground in the FLC models
group: 
- B.1.1.7
- B.1.351
- P.1
- B.1.617.2
# Marginal likelihood estimation settings
marginal_likelihood:
  chain_length: 1000000
  duplicates: 5
  log_every: 1000
  path_steps: 100
output:
  dir: Tay2023

Execute workflow

First we can run a dry run to check the workflow is configured correctly:

episodic run --config tests/data/Tay2023.yaml --alignment tests/data/Tay2023.afa --dry

We can generate a DAG to visualise the workflow structure:

episodic run --config tests/data/Tay2023.yaml --alignment tests/data/Tay2023.afa --dag Tay2023_dag.pdf

Finally, we can run the full analysis:

episodic run --config tests/data/Tay2023.yaml --alignment tests/data/Tay2023.afa

To perform the marginal likelihood estimation (no run by default), add --marginal-likelihood-estimate to the command above.

episodic run --config tests/data/Tay2023.yaml --alignment tests/data/Tay2023.afa --marginal-likelihood-estimate

Info

By default, episodic will run using all available CPU cores. You can control this with the --cores flag to limit to total number of cores.

Key outputs

The y-axis is the evolutionary rate and the x-axis is for the background and foreground branches. The densities on the right corresponds to the prior on the evolutionary rates, which is a distribution. The log Bayes factor on effect size is inf (overwhelming support), and it is calculated as the log ratio of the posterior and prior odds of foreground branch rate being higher than that the background. Maximum clade credibility (MCC) trees are generated for each clock and duplicate run, with node heights summarised according to the heights option (for example, mean). Here we can see the elevated foreground rate on the stem branches leading into the VOC clades in the FLC shared-stem model highlighted in yellow.

Here we can see marginal likelihood estimates (MLE) for each clock model, with the FLC shared-stem model having the highest marginal likelihood and strong support (mean log Bayes factor >2.8) over the strict and relaxed clock models.

Relaxed clock MCC tree