OrthoFinder — Accurate inference of orthogroups, orthologues, gene trees and rooted species tree made easy!
Figure 1: Automatic OrthoFinder analysis
What does OrthoFinder do?
OrthoFinder is a fast, accurate and comprehensive analysis tool for comparative genomics. It finds orthologues and orthogroups, infers gene trees for all orthogroups and infers a rooted species tree for the species being analysed. OrthoFinder also provides comprehensive statistics for comparative genomic analyses. OrthoFinder is simple to use: all you need to run it is a set of protein sequence files (one per species) in FASTA format.
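Since the only input is a directory of protein FASTA files, one per species, it can be worth sanity-checking that directory before starting a run. The following is a minimal sketch and not part of OrthoFinder: the function names, the list of file extensions and the "at least two species" rule are all assumptions made for illustration.

```python
from pathlib import Path

# Assumption: an illustrative set of FASTA extensions, not OrthoFinder's definitive list.
FASTA_EXTENSIONS = {".fa", ".faa", ".fasta", ".fas", ".pep"}


def summarise_proteomes(directory):
    """Return {file_name: number_of_sequences} for each FASTA file in `directory`."""
    counts = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() in FASTA_EXTENSIONS:
            with path.open() as handle:
                # Every FASTA record starts with a header line beginning with '>'.
                counts[path.name] = sum(1 for line in handle if line.startswith(">"))
    return counts


def check_proteomes(directory):
    """Raise ValueError if the directory does not look like one proteome per species."""
    counts = summarise_proteomes(directory)
    # Assumption: a comparative analysis needs at least two species files.
    if len(counts) < 2:
        raise ValueError("expected at least two species files, found %d" % len(counts))
    empty = [name for name, n in counts.items() if n == 0]
    if empty:
        raise ValueError("no FASTA records in: " + ", ".join(empty))
    return counts
```

Calling `check_proteomes("my_proteomes")` (a hypothetical directory name) returns the per-file sequence counts, or raises an error describing what looks wrong.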
For more details see the OrthoFinder paper below.
Emms, D.M. and Kelly, S. (2015) OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy, Genome Biology 16:157
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0721-2
http://www.stevekellylab.com/software/orthofinder
https://github.com/davidemms/OrthoFinder
What's New
Oct. 2016: Check out the new PDF Manual!
Sep. 2016: OrthoFinder now infers the gene trees for the orthogroups, the rooted species tree and all orthologues between all species, and calculates summary statistics.
Jul. 2016: OrthoFinder now outputs summary statistics for the orthogroups produced. Statistics are in the files Statistics_Overall.csv, Statistics_PerSpecies.csv and Orthogroups_SpeciesOverlaps.csv.
Jul. 2016: Provided standalone binaries for those without access to Python (download the package from OrthoFinder's GitHub releases tab).
Jun. 2016: Parallelised the remainder of the OrthoFinder algorithm.
Jan. 2016: Added the ability to add and remove species.
Sep. 2015: Added the trees_from_MSA utility to automatically calculate multiple sequence alignments and gene trees for the orthogroups calculated using OrthoFinder.
Orthogroups, Orthologues & Paralogues
'Orthologue' is a term that applies to genes from two species. Orthologues are pairs of genes that descended from a single gene in the last common ancestor (LCA) of two species (Figure 2A & B). An orthogroup is the natural extension of the concept of orthology to groups of species. An orthogroup is the group of genes descended from a single gene in the LCA of a group of species (Figure 2A). When looking at the gene tree, the first divergence between the genes in an orthogroup is a speciation event and the same is true for orthologues.
As a result of gene duplication events, a single species can contribute multiple genes to a set of orthologues or to an orthogroup. In the example (Figure 2A & B), the human gene HuA has two genes that are orthologues of it in chicken, ChA1 and ChA2. Looking again at the orthogroup, we see that there are two chicken genes (Figure 2A) but only one gene from mouse and human. Some authors refer to the genes ChA1 and ChA2 as co-orthologues of HuA to emphasise the fact that there are multiple orthologues. These genes are nevertheless still orthologues and so we will usually just use this broader term. In fact, gene duplication events are so common that in addition to the one-to-many relationship implied by the term 'co-orthologues', there are frequently many-to-many relationships between orthologues. All of these relationships are identified by an OrthoFinder analysis.
Gene duplication events give rise to paralogues. Paralogues are pairs of genes that diverged from a single gene at a gene duplication event. The two chicken genes ChA1 and ChA2 are paralogues (Figure 2A & C). Two genes from different species can also be paralogues if they diverged from one another at a gene duplication event, although there are no examples of this in Figure 2. Since all branching events in a gene tree are either speciation events (that give rise to orthologues) or duplication events (that give rise to paralogues), any genes in the same orthogroup that are not orthologues must necessarily be paralogues.
Figure 2: Orthologues, Orthogroups & Paralogues
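The definitions above can be turned into a small worked example: two genes are orthologues if the node at which they diverged in the gene tree is a speciation event, and paralogues if it is a duplication event. The sketch below is illustrative only and is not OrthoFinder's implementation. The tree topology and event labels are assumptions reconstructed from the description of Figure 2, and `MoA` is a hypothetical name for the mouse gene.

```python
# Gene tree written as nested tuples: (event, left_subtree, right_subtree).
# A leaf is just a gene name. Topology and labels are assumed from the text;
# "MoA" is a hypothetical name for the single mouse gene.
GENE_TREE = (
    "speciation",
    ("speciation", "HuA", "MoA"),
    ("duplication", "ChA1", "ChA2"),
)


def leaves(tree):
    """Return the gene names below a node."""
    if isinstance(tree, str):
        return [tree]
    _, left, right = tree
    return leaves(left) + leaves(right)


def classify_pairs(tree, relations=None):
    """Label each gene pair by the event at the node where the two genes diverged."""
    if relations is None:
        relations = {}
    if isinstance(tree, str):
        return relations
    event, left, right = tree
    label = "orthologues" if event == "speciation" else "paralogues"
    # Genes on opposite sides of this node diverged at this node.
    for a in leaves(left):
        for b in leaves(right):
            relations[frozenset((a, b))] = label
    classify_pairs(left, relations)
    classify_pairs(right, relations)
    return relations


relations = classify_pairs(GENE_TREE)
for pair in sorted(relations, key=sorted):
    print(" / ".join(sorted(pair)), "->", relations[pair])
```

Under these assumptions HuA is reported as an orthologue of both ChA1 and ChA2 (the one-to-many case described above), while ChA1 and ChA2 are reported as paralogues of each other.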
Why Orthogroups
If you followed the explanations above it will be clear that an orthogroup is just a gene family/clade of genes defined at a specific taxonomic level—namely, those genes descended from a single gene at the time of the LCA. Some may regard this definition of an orthogroup as unsatisfactory since an orthogroup can contain genes that are paralogues of one another (ChA1 is a paralogue of ChA2 in Figure 2). However, this definition of an orthogroup is the only logically consistent way of extending the concept of orthology to multiple species. If there have been gene duplication events it is not possible to create a group of genes containing all orthologues and only orthologues—try it with the example above!
One can still identify orthologues between the genes in each pair of species, but the orthogroup is the correct unit of comparison when considering the group of species as a whole. In fact, one use for orthogroups is for identifying orthologues: the canonical way to identify orthologues is using a gene tree, and an orthogroup is exactly the set of genes that need to be in the gene tree in order to identify all orthologues. This is the method used by OrthoFinder.
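The claim that, after a gene duplication, no group of genes can contain all orthologues and only orthologues can be checked by brute force on the Figure 2 example. The sketch below is an illustration rather than anything OrthoFinder does. The pairwise relationships are written out by hand from the text, `MoA` is a hypothetical name for the mouse gene, and "all orthologues" is interpreted as "every orthologue of every member is also in the group".

```python
from itertools import combinations

# Genes in the example orthogroup; "MoA" is a hypothetical name for the mouse gene.
GENES = ["HuA", "MoA", "ChA1", "ChA2"]

# Orthologous pairs in the example. ChA1 / ChA2 is the only paralogous pair.
ORTHOLOGUES = {
    frozenset(pair)
    for pair in [
        ("HuA", "MoA"),
        ("HuA", "ChA1"),
        ("HuA", "ChA2"),
        ("MoA", "ChA1"),
        ("MoA", "ChA2"),
    ]
}


def orthologues_of(gene):
    """Return the set of genes that are orthologues of `gene`."""
    return {other for other in GENES if frozenset((gene, other)) in ORTHOLOGUES}


def has_all_orthologues(group):
    """True if every orthologue of every member is also in the group."""
    return all(orthologues_of(gene) <= set(group) for gene in group)


def has_only_orthologues(group):
    """True if every pair of genes in the group is an orthologous pair."""
    return all(frozenset(pair) in ORTHOLOGUES for pair in combinations(group, 2))


def valid_groups():
    """Return every non-empty group that satisfies both conditions."""
    found = []
    for size in range(1, len(GENES) + 1):
        for group in combinations(GENES, size):
            if has_all_orthologues(group) and has_only_orthologues(group):
                found.append(group)
    return found


print(valid_groups())
```

The search finds no such group: the only group that contains all orthologues of its members is the full orthogroup, and that group also contains the paralogous pair ChA1 and ChA2.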
Setting Up OrthoFinder
OrthoFinder runs on Linux and Mac. Setup instructions are given below.
Set Up
- Download the latest release from GitHub: https://github.com/davidemms/OrthoFinder/releases (for this example we will assume it is OrthoFinder-1.0.6.tar.gz; change this as appropriate).
- In a terminal, 'cd' to where you downloaded the package.
- Extract the files:
  tar xzf OrthoFinder-1.0.6.tar.gz
- Test you can run OrthoFinder:
  OrthoFinder-1.0.6/orthofinder -h
  OrthoFinder should print its 'help' text.
To perform an analysis OrthoFinder requires some dependencies to be installed and in the system path (only the first two are needed to infer orthogroups and all four are needed to infer orthologues and gene trees as well):