| Title: | Tests of Multispecies Coalescent Gene Tree Simulator Output |
|---|---|
| Description: | Statistical tests for validating multispecies coalescent gene tree simulators, using pairwise distances and rooted triple counts. See Allman ES, Baños HD, Rhodes JA 2023. Testing multispecies coalescent simulators using summary statistics, IEEE/ACM Trans Comput Biol Bioinformat, 20(2):1613–1618. <doi:10.1109/TCBB.2022.3177956>. |
| Authors: | Elizabeth Allman [aut, cre, cph], Hector Banos [aut, cph], John Rhodes [aut, cph] |
| Maintainer: | Elizabeth Allman <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.1 |
| Built: | 2026-05-31 07:35:46 UTC |
| Source: | https://github.com/cran/MSCsimtester |
Takes as input theoretical pairwise distance densities under the MSC and
empirical pairwise distances from gene trees in a sample, as returned by
the function pairwiseDist. Uses the package kSamples to perform
either one test on the entire dataset or multiple tests on subsamples.
ADtest(distanceDensities, subsampleSize = FALSE)ADtest(distanceDensities, subsampleSize = FALSE)
distanceDensities |
A list containing values needed for performing Anderson-Darling
test(s) on a gene tree sample and species tree, as output by |
subsampleSize |
A positive integer to perform multiple tests on subsamples,
or |
The Anderson-Darling test compares the empirical distance distribution for a supplied gene tree
sample to a sample drawn from the theoretical distribution. The output, passed from the kSamples package,
thus says that 2 samples are being compared, to test a null-hypothesis that they come from the same distribution.
See kSamples documentation for function ad.test for more details.
Repeated runs of this function will give different results, since the sample from the theoretical distribution will vary. Under the null hypothesis p-values for different runs should be approximately uniformly distributed.
Numerical issues may result in poor performance of Anderson-Darling tests when the sample size
is very large, so
an optional parameter subsampleSize can be set to create subsamples of smaller size.
If subsampleSize is a positive integer,
Anderson-Darling tests are performed on each subset, comparing them to
a random sample of the same size from the theoretical distribution. Good fit is indicated by an approximately uniform
distribution of the subsample p-values.
An object of type ADtestOutput including a sample $Sample from the theoretical distance distribution of
the same size as the empirical one, and $ADtest which is of type kSamples and
has all output from the Anderson-Darling test if only
one test was performed, or the number of tests if tests were performed on subsamples.
stree=read.tree(text="((((a:10000,b:10000):10000,c:20000):10000,d:30000):10000,e:40000);") pops=c(15000,25000,10000,1,1,1,1,1,12000) gts=read.tree(file=system.file("extdata","genetreeSample",package="MSCsimtester")) distDen=pairwiseDist(stree,pops,gts,"a","b") ADtest(distDen) ADtest(distDen,1000)stree=read.tree(text="((((a:10000,b:10000):10000,c:20000):10000,d:30000):10000,e:40000);") pops=c(15000,25000,10000,1,1,1,1,1,12000) gts=read.tree(file=system.file("extdata","genetreeSample",package="MSCsimtester")) distDen=pairwiseDist(stree,pops,gts,"a","b") ADtest(distDen) ADtest(distDen,1000)
A dataset of 10,000 gene trees on 5 taxa simulated under the MSC on a species tree.
A text file with 10,000 metric Newick gene trees on the taxa a,b,c,d,e
This simulated dataset was produced by SimPhy Mallo D, De Oliveira Martins L, Posada D (2016). “SimPhy: Phylogenomic Simulation of Gene, Locus, and Species Trees.” Systematic Biology, 65(2), 334-344. doi:10.1093/sysbio/syv082., using the species tree
((((a:10000,b:10000):10000,c:20000):10000,d:30000):10000,e:40000);
with population sizes
c(15000,25000,10000,1,1,1,1,1,12000)
and edges ordered by the ape function read.tree.
File is accessed as system.file("extdata","genetreeSample",package="MSCsimtester"), for example
using the ape command:
gts=read.tree(file=system.file("extdata","genetreeSample",package="MSCsimtester") )
Mallo D, De Oliveira Martins L, Posada D (2016). “SimPhy: Phylogenomic Simulation of Gene, Locus, and Species Trees.” Systematic Biology, 65(2), 334-344. doi:10.1093/sysbio/syv082.
The package performs comparisons of certain summary statistics for simulated gene tree samples to
theoretical predictions under the multispecies coalescent model.
The primary functions are rootedTriple for comparison of frequencies of topological rooted triples
on gene trees, and pairwiseDist and ADtest for comparison of the distributions of pairwise
distances between taxa on gene trees.
Required input is a collection of gene trees, stored as a multiPhylo object by the ape package,
and a rooted species tree, as a Phylo object,
with edge lengths in generations, together with constant population sizes for each edge.
MSCsimtester builds on the packages ape and kSamples.
For further examples of use and citation purposes, see Allman ES, Baños HD, Rhodes JA (2023). “Testing Multispecies Coalescent Simulators Using Summary Statistics.” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 20(02), 1613-1618. ISSN 1557-9964. doi:10.1109/TCBB.2022.3177956..
Maintainer: Elizabeth Allman [email protected] [copyright holder]
Authors:
Hector Banos [email protected] [copyright holder]
John Rhodes [email protected] [copyright holder]
Allman ES, Baños HD, Rhodes JA (2023). “Testing Multispecies Coalescent Simulators Using Summary Statistics.” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 20(02), 1613-1618. ISSN 1557-9964. doi:10.1109/TCBB.2022.3177956.
Computes theoretical pairwise distance densities under the MSC on a species tree and empirical pairwise distances from gene trees in a sample. A histogram of empirical values is plotted over the theoretical pdf.
pairwiseDist( stree, popSizes, gtSample, taxon1, taxon2, numSteps = 1000, tailProb = 0.01, numBreaks = 40 )pairwiseDist( stree, popSizes, gtSample, taxon1, taxon2, numSteps = 1000, tailProb = 0.01, numBreaks = 40 )
stree |
An object of class |
popSizes |
A vector containing constant population sizes, one entry for each
edge/population in the species tree, for a haploid population. Sizes should be doubled for diploids.
If |
gtSample |
An object of class |
taxon1 |
A string specifying one taxon on |
taxon2 |
A string specifying a second taxon on |
numSteps |
A positive integer number of values to be computed for
graphing the theoretical pairwise distance density. Default is |
tailProb |
A cutoff value, between 0 and 1, for the theoretical density, with a default of 0.01. |
numBreaks |
Number of breaks in histogram. Default is |
A list of items needed for Anderson-Darling test(s), for use by ADtest,
returned invisibly. See function code for more details.
plotEdgeOrder, plotPops, ADtest
stree=read.tree(text="((((a:10000,b:10000):10000,c:20000):10000,d:30000):10000,e:40000);") pops=c(15000,25000,10000,1,1,1,1,1,12000) gts=read.tree(file=system.file("extdata","genetreeSample",package="MSCsimtester")) pairwiseDist(stree,pops,gts,"a","b")stree=read.tree(text="((((a:10000,b:10000):10000,c:20000):10000,d:30000):10000,e:40000);") pops=c(15000,25000,10000,1,1,1,1,1,12000) gts=read.tree(file=system.file("extdata","genetreeSample",package="MSCsimtester")) pairwiseDist(stree,pops,gts,"a","b")
Under the MSC, each edge in the species tree must be assigned a population size. This function displays the species tree with the edges numbered, to aid the user in entering constant population sizes as an appropriately ordered list.
plotEdgeOrder(stree)plotEdgeOrder(stree)
stree |
An object of class |
NONE
pairwiseDist, rootedTriple, plotPops
stree=read.tree(text="(((a:10000,b:10000):10000,c:20000):10000,d:30000);") plotEdgeOrder(stree) pops=c(30000,20000,1,1,1,1,10000) plotPops(stree,pops)stree=read.tree(text="(((a:10000,b:10000):10000,c:20000):10000,d:30000);") plotEdgeOrder(stree) pops=c(30000,20000,1,1,1,1,10000) plotPops(stree,pops)
Plot species tree, with population sizes on edges.
plotPops(stree, populations)plotPops(stree, populations)
stree |
An object of class |
populations |
A vector containing constant population sizes, one entry for each edge/population in the species tree, with last entry for the population ancestral to the root. |
NONE
pairwiseDist, rootedTriple, plotEdgeOrder
stree=read.tree(text="(((a:10000,b:10000):10000,c:20000):10000,d:30000);") plotEdgeOrder(stree) pops=c(30000,20000,1,1,1,1,10000) plotPops(stree,pops)stree=read.tree(text="(((a:10000,b:10000):10000,c:20000):10000,d:30000);") plotEdgeOrder(stree) pops=c(30000,20000,1,1,1,1,10000) plotPops(stree,pops)
ADtestOutput.Print function for objects of class ADtestOutput.
## S3 method for class 'ADtestOutput' print(x, ...)## S3 method for class 'ADtestOutput' print(x, ...)
x |
An object of class |
... |
Further arguments to be passed to or from other methods. |
NONE
rootedTripleOutput.Print function for objects of class rootedTripleOutput.
## S3 method for class 'rootedTripleOutput' print(x, ...)## S3 method for class 'rootedTripleOutput' print(x, ...)
x |
An object of class |
... |
Further arguments to be passed to or from other methods. |
Print function for objects of class rootedTripleOutput.
NONE
For a given species tree with population sizes, compares the expected frequencies of rooted triples to empirical frequencies in a sample of gene trees, using Chi-squared tests with 2 d.f. The exact and estimated internal branch length (in coalescent units) of the rooted triple in the species tree are also computed for comparison. A single test can be performed on the entire gene tree sample, or multiple tests on subsamples.
rootedTriple( stree, popSizes, gtSample, taxon1, taxon2, taxon3, subsampleSize = FALSE )rootedTriple( stree, popSizes, gtSample, taxon1, taxon2, taxon3, subsampleSize = FALSE )
stree |
An object of class |
popSizes |
An ordered list containing constant population sizes for each species tree edge, for a haploid organism.
Sizes should be doubled for diploids. If |
gtSample |
An object of class |
taxon1 |
A string specifying one taxon on |
taxon2 |
A string specifying a second taxon on |
taxon3 |
A string specifying a third taxon on |
subsampleSize |
A positive integer or |
When subsampleSize is FALSE the Chi-squared test is performed using all
gene trees in gtSample. Results are reported in tabular form in the console.
When subsampleSize is positive, the N trees in gtSample
will be partitioned into N/subsampleSize subsamples, with a Chi-squared test
performed for each. Histograms are plotted for (1) the p-values for the Chi-squared tests on
subsamples, and (2) subsample estimates of the internal branch length for the rooted triple on the
species tree, with the true value marked.
Three distinct taxon names must be supplied, all of which must occur on stree
and in each of the gene trees in the sample.
If subsampleSize is FALSE, returns an object of type rootedTripleOutput
which contains a table $TripletCounts of empirical and expected rooted triple counts,
a p-value $pv from the Chi-squared test, and a column $InternalEdge of estimated and exact internal edge lengths.
If subsampleSize is TRUE, returns NULL but produces several plots.
stree=read.tree(text="((((a:10000,b:10000):10000,c:20000):10000,d:30000):10000,e:40000);") pops=c(15000,25000,10000,1,1,1,1,1,12000) gts=read.tree(file=system.file("extdata","genetreeSample",package="MSCsimtester")) rootedTriple(stree,pops,gts,"a","b","c") rootedTriple(stree,pops,gts,"a","b","c",1000)stree=read.tree(text="((((a:10000,b:10000):10000,c:20000):10000,d:30000):10000,e:40000);") pops=c(15000,25000,10000,1,1,1,1,1,12000) gts=read.tree(file=system.file("extdata","genetreeSample",package="MSCsimtester")) rootedTriple(stree,pops,gts,"a","b","c") rootedTriple(stree,pops,gts,"a","b","c",1000)