Aphid
This is a C implementation of the "Aphid" method, following its specifications within Nicolas Galtier's draft paper Phylogenetic conflicts: distinguishing gene flow from incomplete lineage sorting (2023), and the original C code.
Sources available at: https://gitlab.mbb.cnrs.fr/ibonnici/aphid/.
Build instructions
With Apptainer/Singularity
Aphid can be run from an apptainer provided you have Apptainer (aka. Singularity) installed.
First, build the container image as root:
# Get definition file.
curl -L \
https://gitlab.mbb.univ-montp2.fr/ibonnici/aphid/-/jobs/artifacts/main/raw/singularity.def?job=container-recipes \
> aphid.def # (or download by hand)
# Build image.
sudo apptainer build aphid.sif aphid.def
Then run the container as regular user:
./aphid.sif <my_input_files>
With Docker
Aphid can be run from a docker container, provided you have Docker installed.
First, build the container image:
docker buildx build -t aphid \
https://gitlab.mbb.univ-montp2.fr/ibonnici/aphid/-/jobs/artifacts/main/raw/Dockerfile?job=container-recipes
Then run the container:
docker run --rm -it -v ${PWD}:/home/aphid aphid <my_input_files>
Manual compilation
The program is still fairly easy to compile yet, provided you have git and gcc installed.
First clone and compile the project:
git clone --recursive https://gitlab.mbb.univ-montp2.fr/ibonnici/aphid/
cd aphid
gcc -lm aphid.*.c -o aphid
Then run the binary obtained:
./aphid <my_input_files>
Running Aphid
Input
Aphid reads data from three distinct files (see examples below):
- A gene-tree file.
- A taxon file.
- A control file.
The Gene-Tree File
The gene-tree file includes one line per gene tree. Each line contains three tab-separated entries:
- Gene tree.
- Locus length.
- Locus name.
The gene tree must be in Newick format, rooted,
with branch lengths expressed in per site number of mutation.
The locus length corresponds to the number of sites of the alignment
from which the gene tree was reconstructed,
such that the product of branch length by locus length
is the total number of mutations having occurred in the considered branch
at the considered locus.
The locus name is an unconstrained string supposed to identify the tree.
The Taxon File
The taxon file includes:
- A line defining the focal
triplet
(mandatory). - A line defining the
outgroup
(mandatory). - An optional line defining
other
relevant species according to the following format:
triplet: species_A, species_B, species_C
outgroup: <comma-separated species list>
other: <comma-separated species list>
The triplet
line defines the three focal species of the analysis.
Order matters: the third species of the list is the earliest branching one,
i.e. it corresponds to species C
of the assumed ((A,B),C)
species tree.
The outgroup
and other
lines
define the species used to estimate locus-specific mutation rates.
Gene trees not including the three species of the triplet
and at least one outgroup
species are not analyzed.
Species not listed in the taxon file, if any, are not considered.
The other
species should branch somewhere
beween the outgroup
and the focal triplet
in the species tree.
Their inclusion is intended to increase the accuracy
of relative mutation rate estimation,
especially when the outgroup
includes few species.
Consider for instance the following species tree:
(out,(other1,(other2,(other3,(other4,(C,(B,A)))))));
If we only specify the triplet A, B, C
and the outgroup out
in the taxon file,
the relative mutation rate of a locus will be estimated
based on these four species.
If, in the taxon file we add the following line:
other = other1, other2, other3, other4
Then the relative mutation rate of a locus will be estimated
based on up to eight species,
depending on how many of the other
species
are present in the considered gene tree.
The Control File
Options are passed via the control file
using the <option_name> = <option_value>
syntax (see example below):
-
require_triplet_other_monophyly
: If set to1
(default), only gene trees in which thetriplet
species andother
species (if any) form a monophyletic group will be analyzed. -
require_outgroup_monophyly
: If set to1
(default), only gene trees in which theoutgroup
species form a monophyletic group will be analyzed. -
max_alphai
: Gene trees with an estimated relative mutation rate higher than this value, or lower than 1 / this value, will not be analyzed (default =10
). -
max_clock_ratio
: Gene trees in which the ratio oftriplet
estimated mutation rate tooutgroup
+other
estimated mutation rate exceeds this value, or is below 1 / this value, will not be analyzed (default =2
). -
nb_GF_times
: Number of assumed gene flow times (default =2
). -
GF_time_coeff
: Comma-separated list ofnb_GF_times
coefficients specifying the gene flow times, expressed as fractions of theA|B
speciation time. Each time must be between0
and1
. (default =1.0, 0.5
). -
min_d
: If internal branch length times locus length is below this value, then the gene tree will be considered topologically unresolved (default =0.5
). -
CI_param
: If set to1
, confidence intervals around parameter estimates will be calculated (default =0
). -
CI_ILS_GF
: If set to1
, confidence intervals around the estimated contribution of ILS and GF to the conflict will be calculated (default =0
). -
verbose
: If set to1
(default), results are output in a human-readable way. If set to0
results are output as a single line (see below). -
fixed_tau1
: If set to a positive value, parameter τ_1 will not be optimized but rather fixed to that value (default =-1
). -
fixed_tau
,fixed_theta
,fixed_pab
,fixed_pac
,fixed_pbc
,fixed_panc
: Same as above for parameters τ_2, θ, p_ab, p_ac, p_bc, and p_a. Ifnb_GF_times
is set to1
, thenfixed_panc
is automatically set to1.0
.
Lines starting with #
are ignored
and can be used for inserting comments in the control file.
Output
Standard output
Parameter estimates,
ILS/GF estimated contributions,
discordant topology imbalance,
and various descriptors of the data and analysis
are written to the standard output (terminal).
If verbose=1
, the output is detailed and human-readable.
If verbose=0
, the output is less detailed and written as a single line,
in the following format:
nb_gene, \
ntopo0, \
ntopo1, \
ntopo2, \
ntopo3, \
av_lg, \
tau1, \
tau2, \
theta, \
pab, \
pac, \
pbc, \
pa, \
no_event, \
noconflict_ILS, \
noconflict_GF, \
conflict_ILS, \
conflict_GF, \
imbalance_ILS, \
dominant_ILS, \
imbalance_GF, \
dominant_GF, \
max_lnL
With:
-
nb_gene
: Number of analyzed gene trees after filtering (detailed output gives more information on filtering steps). -
ntopo0 ntopo1, ntopo2, ntopo3
: Number of analyzed gene trees with an((A,B),C)
,((A,C),B)
,((B,C),A)
and(A,B,C)=unresolved
topology, respectively. -
av_lg
: Mean locus length of analyzed gene trees. -
tau1, tau2, theta, pab, pac, pbc, pa
: Parameter estimates. -
no_event
: Estimated contribution of the no_event scenario. -
no_conflict_ILS
: Estimated contribution of the ILS scenario with an((A,B),C)
topology. -
no_conflict_GF
: Estimated contribution of GF scenarios with an((A,B),C)
topology. -
conflict_ILS
: Estimated contribution of ILS scenarios with a topology different from((A,B),C)
. -
conflict_GF
: Estimated contribution of GF scenarios with a topology different from((A,B),C)
. -
imbalance_ILS
: Estimated discordant topology imbalance associated with ILS. -
dominant_ILS
: The most prevalent discordant topology associated with ILS. -
imbalance_GF
: The estimated discordant topology imbalance associated with GF. -
dominant_ILS
: The most prevalent discordant topology associated with GF. -
max_lnL
: The maximum likelihood.
The no_event
, no_conflict_ILS
and no_conflict_GF
posterior estimates
are provided.
But it is unclear how reliable these estimates generally are.
Their sum correspond to the overall estimated prevalence
of the canonical ((A,B),C)
topology among gene trees.
The discordant topology imbalance is a number
between 0.5 (no imbalance) and 1 (maximal imbalance)
indicating whether one of the two discordant topologies
((A,C),B)
and ((B,C),A)
dominates over the other one.
Which one dominates is specified by dominant_ILS
and dominant_GF
.
Output File
In addition to the information displayed on standard output,
Aphid generates a .csv
file including one line per analyzed gene tree.
This file provides gene tree-specific descriptors
(name, topology, branch lengths, locus length)
and Aphid annotations
— specifically, the posterior probabilities
of each scenario for each gene tree.
Example Input Files
Gene-Tree File
exemple.in
((out1:0.07,out2:0.07):0.03,(oth1:0.06,((A:0.02,B:0.02):0.01,C:0.03):0.03):0.04); 1000 canonical
((out1:0.068,out2:0.066):0.028,(oth1:0.059,((A:0.042,C:0.044):0.002,B:0.045):0.017):0.036); 500 discordant_tall
((out1:0.068,out2:0.066):0.028,(oth1:0.059,((B:0.042,C:0.044):0.002,A:0.045):0.017):0.036); 500 other_discordant_tall
((out1:0.064,out2:0.075):0.027,(oth1:0.058,((B:0.01,C:0.008):0.01,A:0.02):0.027):0.036); 500 discordant_short
((out1:0.064,out2:0.075):0.027,(oth1:0.058,((B:0.019,C:0.022):0.01,A:0.028):0.037):0.036); 555 ancient_GF
((out1:0.07,out2:0.07):0.03,(oth1:0.06,((A:0.035,B:0.037):0.003,C:0.041):0.02):0.04); 1212 canonical_ILS
((out1:0.04,out2:0.03):0.02,(oth1:0.04,((A:0.035,B:0.038):0.026,C:0.062):0.052):0.05); 666 non_clock_like
((out1:0.07,out2:0.07):0.03,(oth1:0.06,((A:0.008,B:0.009):0.02,C:0.03):0.03):0.04); 421 canonical_GF
((out1:0.07,out2:0.07):0.03,((A:0.02,B:0.02):0.01,C:0.03):0.07); 1500 missing_other=OK
((out1:0.07,out2:0.07):0.03,(oth1:0.06,(A:0.03,C:0.03):0.03):0.04); 1001 incomplete_triplet=notOK
(out2:0.1,(oth1:0.06,((A:0.02,B:0.02):0.01,C:0.03):0.03):0.04); 1234 missing_1outgroup=OK
(out1:0.1,((A:0.02,B:0.02):0.01,C:0.03):0.07); 100 minimal_sampling
(out2:0.15,((A:0.03,B:0.02):0.015,C:0.025):0.06); 1200 other_minimal_sampling
(oth1:0.06,((A:0.02,B:0.02):0.01,C:0.03):0.03); 600 missing_all_outgroup=notOK
((out1:0.07,out2:0.07):0.03,((oth1:0.04,additional1:0.04):0.02,((A:0.02,B:0.02):0.01,C:0.03):0.03):0.04); 321 additional_species=OK
((additional2:0.04,additional3:0.04):0.06,((out1:0.07,out2:0.07):0.03,((oth1:0.04,additional1:0.04):0.02,((A:0.02,B:0.02):0.01,C:0.03):0.03):0.04):0.03); 1111 more_additional_species=stillOK
((additional2:0.04,additional3:0.04):0.06,((out1:0.07,(out2:0.06,additional4:0.06):0.01):0.03,((oth1:0.04,additional1:0.04):0.02,((A:0.02,B:0.02):0.01,(C:0.02,additional5:0.02):0.01):0.03):0.04):0.03); 200 even_more_additional_species=stillOK
(additional1:0.1,(oth1:0.06,((A:0.02,B:0.02):0.01,C:0.03):0.03):0.04); 2000 missing_all_outgroup=notOK_even_if_additional
(out1:0.2,(out2:0.14,(oth1:0.12,((A:0.04,B:0.04):0.02,C:0.06):0.06):0.08):0.06); 500 paraphyletic_outgroup=OK_or_not_depending_options
(other1:0.1,((out1:0.07,out2:0.07):0.03,((A:0.02,B:0.02):0.01,C:0.03):0.04):0.01); 500 paraphyletic_triplet+other=OK_or_not_depending_options
Taxon file
exemple.tax
triplet: A, B, C
outgroup: out1, out2
other: oth1
Control File
exemple.opt
# Gene tree filtering.
require_triplet_other_monophyly = 1
require_outgroup_monophyly = 1
max_alphai = 10
max_clock_ratio = 2
# Likelihood calculation.
nb_HGT_times = 2
HGT_time_coeff = 1.0, 0.5
min_d = 0.5
# fixed_pac = 0
# (uncomment the above line to calculate the likelihood
# assuming no GF between A and C)
# Output.
CI_param = 0
CI_ILS_GF = 0
verbose = 1