benchmark_pipelines
Assessment of metabarcoding data pipelines: the goal of this study is to compare the performance of several bioinformatic tools for the analysis of fish eDNA data, in order to build the most efficient pipeline. Seven analysis steps were identified, from the assembly of paired-end reads to the taxonomic assignment. For each of these steps, the programs to compare were identified, as shown below:
For each step, all the programs are compared, while the start and the end of the pipeline are standardized with a reference pipeline (obitools_reference). This pipeline is based on ObiTools, a set of Python programs designed to analyse next-generation sequencing (Illumina) outputs in the context of DNA metabarcoding.
The optimal pipeline obtained will then be compared to existing complete pipelines (QIIME2 and BARQUE).
Dependencies
Installation
Install from source code
To install all the programs used in this study, please follow the instructions on their installation pages: ObiTools, VSEARCH, PEAR, FLASH, CASPER, fastq-join, cutadapt, Prinseq, SWARM and SINTAX.
The installation guidelines for the complete pipelines can be found here: QIIME2 and BARQUE.
Singularity containers
Alternatively, all the programs have been installed in Singularity containers, which you can access as follows:
First, you need to install Singularity.
We provide ready-to-run versions of the Singularity containers.
Our complete collection of Singularity recipes is available here.
To download the ObiTools container:
singularity pull --name obitools.simg shub://Grelot/bioinfo_singularity_recipes:obitools
To download the container with VSEARCH, PEAR, FLASH, CASPER, cutadapt, fastq-join, Tally, Prinseq, USEARCH (SINTAX), SWARM and Flexbar:
singularity pull --name ednatools.simg shub://Grelot/bioinfo_singularity_recipes:ednatools
Data requirement
Input FASTQ files and Reference database
The dataset used for this study, containing forward and reverse reads from a 12S mitochondrial gene fragment of fish, was simulated with Grinder. For the full simulation protocol, please visit https://github.com/lmathon/metabarcoding_data_simulation.
The dataset is stored on MEGA. To download, decrypt and unzip forward_reverse_reads from our MEGA cloud, run:
bash 99_utils/mega_download/download_input_data.sh
The forward_reverse_reads folder will be created at 00_Input_data/forward_reverse_reads.
The reference database needed for the taxonomic assignment step is also stored on MEGA and will be downloaded at the same time as the input FASTQ files. The reference_database folder will be created at 00_Input_data/reference_database.
Sample description file
The sample description file, containing the primers and the tags associated with each sample, is available here.
Abundance data
Because the dataset is simulated, we know the exact relative abundance of each species in each sample and replicate. These data can be found here: species_abundance_per_sample. They will be compared to the output of each pipeline tested to assess its efficiency.
(Note that the input FASTQ files and the abundance data change each time you run a Grinder simulation. The files given here correspond to the Grinder simulation used to obtain the data for our program comparison.)
Performance measures
To assess the efficiency of each program, we measure the execution time (among other metrics).
Each time you test a different program for a given analysis step, you can record the time, memory usage and CPU usage of the command by prefixing it with /usr/bin/time:
/usr/bin/time command [-options]
This produces a report in the standard error file, where:
%elapsed = elapsed (wall-clock) time in hours:minutes:seconds
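For example, to time a step script and keep the full report in a file (a minimal sketch assuming GNU time; the script path and report file name are placeholders, not files from this repository):
# run a step script under GNU time and write a verbose report (time, memory, CPU) to a log file
/usr/bin/time -v -o time_report.log bash my_step_script.sh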
Other performance metrics will be calculated for each pipeline tested: sensitivity and the F-measure will be computed from the numbers of true positives, false positives and false negatives output by each pipeline. The relative abundances output by each pipeline are compared to the expected abundances (species_abundance_per_sample).
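Assuming the standard definitions, with TP, FP and FN the numbers of true positives, false positives and false negatives:
sensitivity = TP / (TP + FN)
precision = TP / (TP + FP)
F-measure = 2 × precision × sensitivity / (precision + sensitivity)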
To run any script on an SGE cluster, use this command line:
bash 99_utils/submitjob_sge_cluster/bash2sge.sh SCRIPT.sh
then submit the generated job file with qsub.
Analysis steps
For simplicity, the pipelines are separated into folders corresponding to the analysis steps: 01_merging, 02_demultiplex, 03_dereplication, 04_quality-filter, 05_error-removal, 06_chimera-removal and 07_taxonomic-assignation.
1 - Merging paired-end reads
The first step consists of assembling the forward and reverse reads of the sequences. We tested several assemblers, with no specific parameters.
01_merging contains the scripts to run each of these programs.
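As an illustration, merging with VSEARCH could look like the following (a minimal sketch with placeholder file names; see the scripts in 01_merging for the exact commands used in the benchmark):
# merge forward and reverse reads with VSEARCH
vsearch --fastq_mergepairs forward.fastq --reverse reverse.fastq --fastqout merged.fastq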
2 - Demultiplexing
Once the reads are assembled, the primers are removed (max. 2 mismatches allowed per primer). The tags are also removed (no mismatch allowed) and each read is assigned to the sample it comes from.
02_demultiplex contains the scripts to run each program used at this step.
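In the ObiTools reference pipeline, for instance, this step relies on ngsfilter and the sample description file (a sketch with placeholder file names; see 02_demultiplex for the actual scripts):
# remove tags and primers and assign each read to its sample
ngsfilter -t sample_description.txt -u unassigned.fastq merged.fastq > demultiplexed.fastq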
3 - Dereplicating
The reads are then dereplicated: identical reads are merged into a single record and their count is saved.
All the scripts to run the different programs are in 03_dereplication.
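For example, dereplication with obiuniq (ObiTools) keeps one record per unique sequence together with its per-sample counts (a sketch with placeholder file names):
# merge strictly identical reads, keeping the read count per sample
obiuniq -m sample demultiplexed.fastq > dereplicated.fasta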
4 - Quality filtering
Reads are then filtered on quality: only sequences longer than 20 bp and containing no ambiguous bases are kept.
The scripts to run the different programs are in 04_quality-filter.
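With obigrep, for instance, this filter could be expressed as follows (a sketch with placeholder file names; see 04_quality-filter for the exact commands):
# keep sequences of at least 20 bp composed only of A, C, G or T
obigrep -l 20 -s '^[ACGT]+$' dereplicated.fasta > filtered.fasta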
5 - PCR / Sequencing error removal
Each program or pipeline offers different tools to remove PCR or sequencing errors. For ObiTools, the program obiclean keeps only the head sequences for each PCR replicate.
The scripts for the different programs are in 05_error-removal.
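A minimal sketch of such an obiclean call (placeholder file names; the -r ratio shown here is only illustrative):
# label sequence variants within each sample and keep only the head sequences
obiclean -s merged_sample -r 0.05 -H filtered.fasta > cleaned.fasta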
6 - Chimera removal
Several programs are specialized in identifying and removing chimeras.
06_chimera-removal contains the scripts to run these programs.
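For example, de novo chimera detection with VSEARCH could look like this (a sketch with placeholder file names; see 06_chimera-removal for the exact scripts):
# detect chimeras de novo and keep only the non-chimeric sequences
vsearch --uchime_denovo cleaned.fasta --nonchimeras nonchimeric.fasta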
7 - Taxonomic assignation
The last step of the analysis is to assign each sequence to a taxon. In our case, we use a homemade reference database. To be assigned at the species level, a query sequence must be at least 98% similar to the reference sequence.
07_taxonomic-assignation contains the scripts to run the different assigning programs.
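As an illustration, a global alignment of the queries against the reference database at a 98% identity threshold with VSEARCH could look like this (a sketch with placeholder file names; the scripts in 07_taxonomic-assignation give the exact commands for each assignment program):
# report the best reference hits with at least 98% identity for each query
vsearch --usearch_global nonchimeric.fasta --db reference_database.fasta --id 0.98 --top_hits_only --blast6out assignments.tsv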
Outputs
Taxa/sample tables
The outputs of each pipeline tested can be found in the Outputs folder, in the folder corresponding to the step tested, under the name of the program tested.
The main folder contains all the intermediate files produced by the pipeline. The final folder contains the taxonomic table.
For example, the results of the pipeline testing FLASH for the merging step can be found in: 01_merging/Outputs/02_flash/final/merging_flash.csv
Time and memory reports
Time reports for each program compared are stored here.
Optimal pipeline
The optimal pipeline, built with the best-performing program for each step, is available here.