


OrthoFinder — Accurate inference of orthogroups, orthologues, gene trees and rooted species tree made easy!


What does OrthoFinder do?
OrthoFinder is a fast, accurate and comprehensive analysis tool for comparative genomics. It finds orthologues and orthogroups, infers gene trees for all orthogroups and infers a species tree for the species being analysed. OrthoFinder also identifies the root of the species tree and provides summary statistics for comparative genomic analyses. OrthoFinder is simple to use: all you need to run it is a set of protein sequence files (one per species) in FASTA format.
For more details see the OrthoFinder paper below.
Emms, D.M. and Kelly, S. (2015) OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy, Genome Biology 16:157
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0721-2
http://www.stevekellylab.com/software/orthofinder
https://github.com/davidemms/OrthoFinder

What's New
Sep. 2016: OrthoFinder now infers the gene trees for the orthogroups, the rooted species tree, all orthologues between all species and calculates summary statistics.
Jul. 2016: OrthoFinder now outputs summary statistics for the orthogroups produced. Statistics are in the files Statistics_Overall.csv, Statistics_PerSpecies.csv and Orthogroups_SpeciesOverlaps.csv.
Jul. 2016: Provided standalone binaries for those without access to python (download the package from OrthoFinder's GitHub releases tab).
Jun. 2016: Parallelised the remainder of the OrthoFinder algorithm.
Jan. 2016: Added the ability to add and remove species.
Sep. 2015: Added the trees_from_MSA.py utility to automatically calculate multiple sequence alignments and gene trees for the orthogroups calculated using OrthoFinder.

Usage
OrthoFinder runs as a single command. It takes as input a directory of FASTA files of proteomes (amino acid sequences), one per species, and outputs a file containing the orthogroups of genes from these species, a gene tree for each orthogroup, the rooted species tree and all orthologues between all the species:
python orthofinder.py -f fasta_directory -t number_of_processes
For example, to run it using 16 processors in parallel on the example dataset, move to the directory containing orthofinder.py and call:
python orthofinder.py -f ExampleDataset -t 16
Once the run is complete, your results will be in ExampleDataset/Results_<date>/
See below for details on:

adding extra species to a previous analysis
quickly running OrthoFinder on a subset of species from a previous analysis
running OrthoFinder from pre-computed BLAST search results
preparing files in the format required by OrthoFinder so you can run the BLAST searches yourself

###Standalone Binaries
If you do not have access to Python 2.X, you can use the standalone binaries in the bin directory instead, e.g.:
bin/orthofinder -f ExampleDataset -t 16
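If you want to launch the commands above from your own Python script, the following is a minimal sketch of one way to do it. The function names and the `launcher` parameter are hypothetical helpers for this example and are not part of OrthoFinder; only the `-f` and `-t` options shown above are used.

```python
import subprocess


def build_command(fasta_dir, n_processes, launcher=("python", "orthofinder.py")):
    """Build the argument list for an OrthoFinder run.

    `launcher` is a hypothetical parameter of this sketch: pass
    ("bin/orthofinder",) to use the standalone binary instead.
    """
    if n_processes < 1:
        raise ValueError("n_processes must be at least 1")
    # -f: directory of FASTA files, -t: number of parallel processes
    return list(launcher) + ["-f", str(fasta_dir), "-t", str(n_processes)]


def run_orthofinder(fasta_dir, n_processes, launcher=("python", "orthofinder.py")):
    """Run OrthoFinder; raises subprocess.CalledProcessError on a non-zero exit."""
    subprocess.check_call(build_command(fasta_dir, n_processes, launcher))
```

Calling `run_orthofinder("ExampleDataset", 16)` from the directory containing orthofinder.py should then be equivalent to the example command in the Usage section.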

Output File Format
###Orthogroups
An orthogroup is the set of genes that are descended from a single gene in the last common ancestor of the species being analysed. Orthogroups are like gene families, but are constructed via the application of robust phylogenetic criteria.
OrthoFinder generates six output files for orthogroups:
1) Orthogroups.csv is a tab-separated text file. Each row comprises a single orthogroup and contains all the genes that belong to that orthogroup. The genes are organized into separate columns, where each column corresponds to a single species.
2) Orthogroups.txt is a tab-separated text file that is identical in format to the output file from OrthoMCL. This enables OrthoFinder to easily slot into existing bioinformatic pipelines.
3) Orthogroups_UnassignedGenes.csv is a tab-separated text file that is identical in format to Orthogroups.csv but contains all of the genes that were not assigned to any orthogroup.
4) Statistics_Overall.csv is a tab-separated text file giving statistics for the orthogroups.
5) Statistics_PerSpecies.csv is a tab-separated text file giving statistics for the orthogroups on a species-by-species basis.
6) Orthogroups_SpeciesOverlaps.csv is a tab-separated text file containing a matrix of the number of orthogroups shared by each species-pair (i.e. the number of orthogroups which contain at least one gene from each species in the pair).
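As a rough illustration of how Orthogroups.csv could be read in Python, and how a species-overlap matrix like the one in Orthogroups_SpeciesOverlaps.csv could be derived from it, here is a minimal sketch. It assumes details that are not stated above: that the first row is a header of species names, that the first column holds the orthogroup identifier, and that multiple genes within one cell are separated by commas. The function names and the example data are made up for this sketch; check them against your own output files before relying on it.

```python
import csv
import io


def read_orthogroups(handle):
    """Parse Orthogroups.csv-style text into (species, {orthogroup: {species: [genes]}}).

    Assumed layout: header row of species names (first cell blank), then one
    row per orthogroup with its identifier first and one cell per species.
    """
    reader = csv.reader(handle, delimiter="\t")
    species = next(reader)[1:]
    orthogroups = {}
    for row in reader:
        if not row:
            continue
        og_id, cells = row[0], row[1:]
        # Pad short rows so that every species gets an entry.
        cells += [""] * (len(species) - len(cells))
        orthogroups[og_id] = {
            sp: [g.strip() for g in cell.split(",") if g.strip()]
            for sp, cell in zip(species, cells)
        }
    return species, orthogroups


def species_overlaps(orthogroups, species):
    """Count orthogroups containing at least one gene from both species of each pair."""
    counts = {(a, b): 0 for a in species for b in species}
    for genes_by_species in orthogroups.values():
        present = [sp for sp in species if genes_by_species.get(sp)]
        for a in present:
            for b in present:
                counts[(a, b)] += 1
    return counts


# Made-up example data in the assumed layout.
EXAMPLE = (
    "\tSpeciesA\tSpeciesB\n"
    "OG0000000\tA1, A2\tB1\n"
    "OG0000001\tA3\t\n"
)
species, orthogroups = read_orthogroups(io.StringIO(EXAMPLE))
overlaps = species_overlaps(orthogroups, species)
```

For a real run you would pass an open file handle for Orthogroups.csv instead of the `io.StringIO` example.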
###Statistics Files
Most of the terms in the files Statistics_Overall.csv and Statistics_PerSpecies.csv are self-explanatory; the remainder are defined below:

Species-specific orthogroup: An orthogroup that consists entirely of genes from one species.
G50: The number of genes in the orthogroup such that 50% of genes are in orthogroups of that size or larger.
O50: The smallest number of orthogroups such that 50% of genes are in orthogroups of that size or larger.
Single-copy orthogroup: An orthogroup with exactly one gene (and no more) from each species. These orthogroups are ideal for inferring a species tree. Note that trees for all orthogroups can be generated using the trees_from_MSA.py script.
Unassigned gene: A gene that has not been put into an orthogroup with any other genes.
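The definitions above can be made concrete with a small Python sketch. It is only an illustration of the definitions, not OrthoFinder's own code: the function names are hypothetical, an orthogroup is represented as a mapping from species name to a list of genes, and the 50% threshold is taken over the genes in the orthogroups passed in (whether unassigned genes are included is up to the caller).

```python
def g50_o50(sizes):
    """Return (G50, O50) for a list of orthogroup sizes (genes per orthogroup)."""
    if not sizes:
        raise ValueError("need at least one orthogroup")
    ordered = sorted(sizes, reverse=True)
    half = sum(ordered) / 2.0
    running = 0
    # Walk from the largest orthogroup down until half of all genes are covered.
    for n_orthogroups, size in enumerate(ordered, start=1):
        running += size
        if running >= half:
            return size, n_orthogroups


def is_single_copy(genes_by_species):
    """True if the orthogroup has exactly one gene from each species."""
    return bool(genes_by_species) and all(
        len(genes) == 1 for genes in genes_by_species.values()
    )


def is_species_specific(genes_by_species):
    """True if all genes in the orthogroup come from one species."""
    return sum(1 for genes in genes_by_species.values() if genes) == 1


# Sizes 10, 6, 4, 2, 2 hold 24 genes; the two largest orthogroups hold 16 of them.
g50, o50 = g50_o50([10, 6, 4, 2, 2])  # g50 == 6, o50 == 2
```

In this example half of the 24 genes is 12. The largest orthogroup alone holds only 10 genes, so the second one is needed: G50 is its size (6) and O50 is the number of orthogroups counted so far (2).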

###Orthologues, Gene Trees & Rooted Species Tree
The orthologues, gene trees and rooted species tree are in a sub-directory called Orthologues_<date>.

Installing Dependencies