From 91ac08c5745495796539a9e1a84e755ccf8c05d7 Mon Sep 17 00:00:00 2001 From: Iago Bonnici <iago.bonnici@umontpellier.fr> Date: Mon, 26 Jun 2023 16:55:50 +0200 Subject: [PATCH] Import original documentation in README. --- README.md | 318 +++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 316 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index ba363b1..a0b0e8b 100644 --- a/README.md +++ b/README.md @@ -6,6 +6,8 @@ following its specifications within Nicolas Galtier's draft paper distinguishing gene flow from incomplete lineage sorting* (2023), and the original C code. +Sources available at: https://gitlab.mbb.cnrs.fr/ibonnici/aphid/. + ## Build instructions ### With Apptainer/Singularity @@ -31,6 +33,8 @@ Then run the container as regular user: ./aphid.sif <my_input_files> ``` +[Apptainer]: https://apptainer.org/ + ### With Docker Aphid can be run from a docker container, @@ -47,6 +51,7 @@ Then run the container: docker run --rm -it -v ${PWD}:/home/aphid aphid <my_input_files> ``` +[Docker]: https://www.docker.com/ ### Manual compilation @@ -67,7 +72,316 @@ Then run the binary obtained: ./aphid <my_input_files> ``` -[Apptainer]: https://apptainer.org/ -[Docker]: https://www.docker.com/ [git]: https://git-scm.com/ [gcc]: https://gcc.gnu.org/ + +## Running Aphid + +## Input + +Aphid reads data from three distinct files (see examples below): +- A __gene-tree__ file. +- A __taxon__ file. +- A __control__ file. + +### The Gene-Tree File + +The gene-tree file includes one line per gene tree. +Each line contains three tab-separated entries: +- Gene tree. +- Locus length. +- Locus name. + +The *gene tree* must be in Newick format, rooted, +with branch lengths expressed in per site number of mutation. +The *locus length* corresponds to the number of sites of the alignment +from which the gene tree was reconstructed, +such that the product of branch length by locus length +is the total number of mutations having occurred in the considered branch +at the considered locus. +The *locus name* is an unconstrained string supposed to identify the tree. + +### The Taxon File + +The taxon file includes: +- A line defining the focal `triplet` (mandatory). +- A line defining the `outgroup` (mandatory). +- An optional line defining `other` relevant species + according to the following format: + +``` +triplet: species_A, species_B, species_C +outgroup: <comma-separated species list> +other: <comma-separated species list> +``` + +The `triplet` line defines the three focal species of the analysis. +Order matters: the third species of the list is the earliest branching one, +*i.e.* it corresponds to species `C` of the assumed `((A,B),C)` species tree. +The `outgroup` and `other` lines +define the species used to estimate locus-specific mutation rates. +Gene trees not including the three species of the `triplet` +and at least one `outgroup` species are not analyzed. +Species not listed in the taxon file, if any, are not considered. + +The `other` species should branch somewhere +beween the `outgroup` and the focal `triplet` in the species tree. +Their inclusion is intended to increase the accuracy +of relative mutation rate estimation, +especially when the `outgroup` includes few species. +Consider for instance the following species tree: + +``` +(out,(other1,(other2,(other3,(other4,(C,(B,A))))))); +``` + +If we only specify the triplet `A, B, C` +and the outgroup `out` in the taxon file, +the relative mutation rate of a locus will be estimated +based on these four species. +If, in the taxon file we add the following line: +``` +other = other1, other2, other3, other4 +``` + +Then the relative mutation rate of a locus will be estimated +based on up to eight species, +depending on how many of the `other` species +are present in the considered gene tree. + +### The Control File + +Options are passed *via* the control file +using the `<option_name> = <option_value>` syntax (see example below): + +- `require_triplet_other_monophyly`: + If set to `1` (default), + only gene trees in which the `triplet` species and `other` species (if any) + form a monophyletic group will be analyzed. + +- `require_outgroup_monophyly`: + If set to `1` (default), + only gene trees in which the `outgroup` species form a monophyletic group + will be analyzed. + +- `max_alphai`: + Gene trees with an estimated relative mutation rate higher than this value, + or lower than 1 / this value, will not be analyzed (default = `10`). + +- `max_clock_ratio`: + Gene trees in which the ratio of `triplet` estimated mutation rate to + `outgroup` + `other` estimated mutation rate exceeds this value, + or is below 1 / this value, will not be analyzed (default = `2`). + +- `nb_GF_times`: + Number of assumed gene flow times (default = `2`). + +- `GF_time_coeff`: + Comma-separated list of `nb_GF_times` coefficients + specifying the gene flow times, + expressed as fractions of the `A|B` speciation time. + Each time must be between `0` and `1`. + (default = `1.0, 0.5`). + +- `min_d`: + If internal branch length times locus length is below this value, + then the gene tree will be considered topologically unresolved + (default = `0.5`). + +- `CI_param`: + If set to `1`, + confidence intervals around parameter estimates will be calculated + (default = `0`). + +- `CI_ILS_GF`: + If set to `1`, + confidence intervals around the estimated contribution of ILS and GF + to the conflict will be calculated (default = `0`). + +- `verbose`: + If set to `1` (default), + results are output in a human-readable way. + If set to `0` results are output as a single line (see below). + +- `fixed_tau1`: + If set to a positive value, + parameter τ_1 will not be optimized but rather fixed to that value + (default = `-1`). + +- `fixed_tau`, `fixed_theta`, `fixed_pab`, `fixed_pac`, `fixed_pbc`, `fixed_panc`: + Same as above for parameters τ_2, θ, p_ab, p_ac, p_bc, and p_a. + If `nb_GF_times` is set to `1`, + then `fixed_panc` is automatically set to `1.0`. + +Lines starting with `#` are ignored +and can be used for inserting comments in the control file. + +## Output + +### Standard output + +Parameter estimates, +ILS/GF estimated contributions, +discordant topology imbalance, +and various descriptors of the data and analysis +are written to the standard output (terminal). +If `verbose=1`, the output is detailed and human-readable. +If `verbose=0`, the output is less detailed and written as a single line, +in the following format: + +``` +nb_gene, \ +ntopo0, \ +ntopo1, \ +ntopo2, \ +ntopo3, \ +av_lg, \ +tau1, \ +tau2, \ +theta, \ +pab, \ +pac, \ +pbc, \ +pa, \ +no_event, \ +noconflict_ILS, \ +noconflict_GF, \ +conflict_ILS, \ +conflict_GF, \ +imbalance_ILS, \ +dominant_ILS, \ +imbalance_GF, \ +dominant_GF, \ +max_lnL +``` + +With: + +- `nb_gene`: + Number of analyzed gene trees after filtering + (detailed output gives more information on filtering steps). + +- `ntopo0 ntopo1, ntopo2, ntopo3`: + Number of analyzed gene trees with + an `((A,B),C)`, `((A,C),B)`, `((B,C),A)` and `(A,B,C)=unresolved` topology, + respectively. + +- `av_lg`: + Mean locus length of analyzed gene trees. + +- `tau1, tau2, theta, pab, pac, pbc, pa`: + Parameter estimates. + +- `no_event`: + Estimated contribution of the no_event scenario. + +- `no_conflict_ILS`: + Estimated contribution of the ILS scenario with an `((A,B),C)` topology. + +- `no_conflict_GF`: + Estimated contribution of GF scenarios with an `((A,B),C)` topology. + +- `conflict_ILS`: + Estimated contribution of ILS scenarios + with a topology different from `((A,B),C)`. + +- `conflict_GF`: + Estimated contribution of GF scenarios + with a topology different from `((A,B),C)`. + +- `imbalance_ILS`: + Estimated discordant topology imbalance associated with ILS. + +- `dominant_ILS`: + The most prevalent discordant topology associated with ILS. + +- `imbalance_GF`: + The estimated discordant topology imbalance associated with GF. + +- `dominant_ILS`: + The most prevalent discordant topology associated with GF. + +- `max_lnL`: + The maximum likelihood. + +The `no_event`, `no_conflict_ILS` and `no_conflict_GF` posterior estimates +are provided. +But it is unclear how reliable these estimates generally are. +Their sum correspond to the overall estimated prevalence +of the canonical `((A,B),C)` topology among gene trees. +The discordant topology imbalance is a number +between 0.5 (no imbalance) and 1 (maximal imbalance) +indicating whether one of the two discordant topologies +`((A,C),B)` and `((B,C),A)` dominates over the other one. +Which one dominates is specified by `dominant_ILS` and `dominant_GF`. + +### Output File + +In addition to the information displayed on standard output, +Aphid generates a `.csv` file including one line per analyzed gene tree. +This file provides gene tree-specific descriptors +(name, topology, branch lengths, locus length) +and Aphid annotations +— specifically, the posterior probabilities +of each scenario for each gene tree. + +## Example Input Files + +### Gene-Tree File + +`exemple.in` +``` +((out1:0.07,out2:0.07):0.03,(oth1:0.06,((A:0.02,B:0.02):0.01,C:0.03):0.03):0.04); 1000 canonical +((out1:0.068,out2:0.066):0.028,(oth1:0.059,((A:0.042,C:0.044):0.002,B:0.045):0.017):0.036); 500 discordant_tall +((out1:0.068,out2:0.066):0.028,(oth1:0.059,((B:0.042,C:0.044):0.002,A:0.045):0.017):0.036); 500 other_discordant_tall +((out1:0.064,out2:0.075):0.027,(oth1:0.058,((B:0.01,C:0.008):0.01,A:0.02):0.027):0.036); 500 discordant_short +((out1:0.064,out2:0.075):0.027,(oth1:0.058,((B:0.019,C:0.022):0.01,A:0.028):0.037):0.036); 555 ancient_GF +((out1:0.07,out2:0.07):0.03,(oth1:0.06,((A:0.035,B:0.037):0.003,C:0.041):0.02):0.04); 1212 canonical_ILS +((out1:0.04,out2:0.03):0.02,(oth1:0.04,((A:0.035,B:0.038):0.026,C:0.062):0.052):0.05); 666 non_clock_like +((out1:0.07,out2:0.07):0.03,(oth1:0.06,((A:0.008,B:0.009):0.02,C:0.03):0.03):0.04); 421 canonical_GF +((out1:0.07,out2:0.07):0.03,((A:0.02,B:0.02):0.01,C:0.03):0.07); 1500 missing_other=OK +((out1:0.07,out2:0.07):0.03,(oth1:0.06,(A:0.03,C:0.03):0.03):0.04); 1001 incomplete_triplet=notOK +(out2:0.1,(oth1:0.06,((A:0.02,B:0.02):0.01,C:0.03):0.03):0.04); 1234 missing_1outgroup=OK +(out1:0.1,((A:0.02,B:0.02):0.01,C:0.03):0.07); 100 minimal_sampling +(out2:0.15,((A:0.03,B:0.02):0.015,C:0.025):0.06); 1200 other_minimal_sampling +(oth1:0.06,((A:0.02,B:0.02):0.01,C:0.03):0.03); 600 missing_all_outgroup=notOK +((out1:0.07,out2:0.07):0.03,((oth1:0.04,additional1:0.04):0.02,((A:0.02,B:0.02):0.01,C:0.03):0.03):0.04); 321 additional_species=OK +((additional2:0.04,additional3:0.04):0.06,((out1:0.07,out2:0.07):0.03,((oth1:0.04,additional1:0.04):0.02,((A:0.02,B:0.02):0.01,C:0.03):0.03):0.04):0.03); 1111 more_additional_species=stillOK +((additional2:0.04,additional3:0.04):0.06,((out1:0.07,(out2:0.06,additional4:0.06):0.01):0.03,((oth1:0.04,additional1:0.04):0.02,((A:0.02,B:0.02):0.01,(C:0.02,additional5:0.02):0.01):0.03):0.04):0.03); 200 even_more_additional_species=stillOK +(additional1:0.1,(oth1:0.06,((A:0.02,B:0.02):0.01,C:0.03):0.03):0.04); 2000 missing_all_outgroup=notOK_even_if_additional +(out1:0.2,(out2:0.14,(oth1:0.12,((A:0.04,B:0.04):0.02,C:0.06):0.06):0.08):0.06); 500 paraphyletic_outgroup=OK_or_not_depending_options +(other1:0.1,((out1:0.07,out2:0.07):0.03,((A:0.02,B:0.02):0.01,C:0.03):0.04):0.01); 500 paraphyletic_triplet+other=OK_or_not_depending_options +``` + +### Taxon file + +`exemple.tax` +``` +triplet: A, B, C +outgroup: out1, out2 +other: oth1 +``` + +### Control File + +`exemple.opt` +``` +# Gene tree filtering. +require_triplet_other_monophyly = 1 +require_outgroup_monophyly = 1 +max_alphai = 10 +max_clock_ratio = 2 + +# Likelihood calculation. +nb_HGT_times = 2 +HGT_time_coeff = 1.0, 0.5 +min_d = 0.5 +# fixed_pac = 0 +# (uncomment the above line to calculate the likelihood +# assuming no GF between A and C) + +# Output. +CI_param = 0 +CI_ILS_GF = 0 +verbose = 1 +``` -- GitLab