Skip to content
Snippets Groups Projects
Bastien Mace's avatar
Bastien Macé authored
2dbc889e
History

swarm_and_obitools

Bastien Macé, 2020


Table of contents


Introduction

This project is based on the idea that gathering similar sequences allows to faithfully study them by eliminating sequences generated from PCR or NGS errors.

For that, we will use the OBITools commands and swarm.

  • OBITools are commands written in python
  • swarm is a command written in C++ and which can be used with a Unix shell

In this example, two datasets are used because the study analyzes the result of a pair-end sequencing (Example of filtrated eDNA from aquarium seawater).

Installation

Preliminary steps for OBITools

  • First you need to have Anaconda installed

If it's not the case, click on this link and download it.

Install the download in your shell :

bash Anaconda3-2020.07-Linux-x86_64.sh

Then, close your shell and reopen it. Verify conda is correctly installed. It should be here :

~/anaconda3/bin/conda

Write the following line :

conda config --set auto_activate_base false
  • Create your new environment obitools from your root in your corresponding path. For example :
ENVYAML=./dada2_and_obitools/obitools_env_conda.yaml
conda env create -f $ENVYAML

Now you can activate your environment :

conda activate obitools

And deactivate it :

conda deactivate

Preliminary steps for swarm

  • Get the compressed packaged on the creator GitHub in your downloads folder and install it :
git clone https://github.com/torognes/swarm.git
cd swarm/
make
  • Copy the binary to make the command accessible for all users :
cp -r ./bin/swarm /usr/local/bin

STEP 1 : Pair-ended merging

First, unzip your data in your shell if you need :

unzip mullus_surmuletus_data.zip

Activate your environment in your shell :

conda activate obitools

Use the command illuminapairedend to make the pair-ended merging from the forward and reverse strands of the sequences you have in your data. The command aligns the complementary strands in order to get a longer sequence. In fact, after PCR, the last bases are rarely correctly sequenced. So having the forward and the reverse strands allows to lenghten the sequence, thanks to the beginning of the reverse strand, which is usually correctly sequenced.

illuminapairedend --score-min=40 -r mullus_surmuletus_data/Aquarium_2_R1.fastq mullus_surmuletus_data/Aquarium_2_R2.fastq > Aquarium_2.fastq
# a new .fastq file is created, it contains the sequences after the merging of forward and reverse strands
# alignments which have a quality score higher than 40 (-- score-min=40) are merged and annotated "aligned", while alignemnts with a lower quality score are concatenated and annotated "joined"

To only conserve the sequences which have been merged, use obigrep :

obigrep -p 'mode!="joined"' Aquarium_2.fastq > Aquarium_2.ali.fastq
# -p requires a python expression
# python creates a new dataset (.ali.fastq) which only contains the sequences annotated "aligned"

STEP 2 : Demultiplexing

A .txt file assigns each sequence to its sample thanks to its tag, because each tag corresponds to a reverse or a forward sequence from a sample.

To compare the sequences next, you need to remove the tags and the primers, by using the ngsfilter command :

ngsfilter -t mullus_surmuletus_data/Med_corr_tags.txt -u Aquarium_2.unidentified.fastq Aquarium_2.ali.fastq > Aquarium_2.ali.assigned.fastq
# the command creates new files :
# ".unidentified.fastq" file contains the sequences that were not assigned whith a correct tag
# ".ali.assigned.fastq" file contains the sequences that were assigned with a correct tag, so they contain only the barcode sequences

Then, separate your .ali.assigned.fastq files depending on their samples in placing them in a dedicated folder (useful for next steps) :

mkdir samples
# create the folder
mv -t samples Aquarium_2.ali.assigned.fastq
# place the latests ".fastq" files in the folder
cd samples
obisplit -t samples --fastq sample/Aquarium_2.ali.assigned.fastq
# separate the files depending on their samples
mv -t ./swarm_and_obitools Aquarium_2.ali.assigned.fastq
# remove the original files from the folder

Now you have as many files as samples, containing merged pair-ended and demultiplexed sequences.

STEP 3 : Dereplication

Now that you have the sequences corresponding to the barcode you want to study, dereplicate them to only conserve the amplicons with their abundance stored in the header with obiuniq :

obiuniq Aquarium_2.fastq > Aquarium_2.uniq.fasta

This command also transforms fastq files into fasta format.

STEP 4 : Filtering

The obigrep command filters the sequences according to different criteria which you can chose, such as the sequence length, or the abundance of the amplicons :

obigrep -l 20 -p 'count>=10' Aquarium_2.uniq.fasta > Aquarium_2.grep.fasta
# "-l 20" option eliminates sequences with a length shorter than 20 bp
# "-p 'count>=10'" option eliminates sequences with an abundance inferior to 10

STEP 5 : Elimination of PCR errors

obiclean is a command which eliminates punctual errors caused during PCR. The algorithm makes parwise alignments for all the amplicons. It counts the number of dissimilarities between the amplicons, and calculates the ratio between the abundance of the two amplicons aligned. If there is only one dissimilarity (parameter by default, can be modified) and if the ratio is lower than a chosen threshold, the less abundant amplicon is considered as a variant of the most abundant one.

Sequences which are at the origin of variants without being considered as one are tagged "head". The variants are tagged "internal". The other sequences are tagged "singleton".

obiclean -r 0.05 -H Aquarium_2.grep.fasta > Aquarium_2.clean.fasta
# here, the command returns only the sequences tagged "head" by the algorithm, and the chosen ratio is 0.05

STEP 6 : Taxonomic assignment

ecotag is a command which permits to assign each head amplicon to its corresponding taxon. It requires to first having used ecoPCR with your primers used to amplify your sequences. This command have given a file containing the taxons which can potentially be amplified by the selected primers. ecotag permits to assign your sequences to one of these taxons, with a minimum similarity score fixed at a chosen value, and so to be sure that your sequences come from the correct taxon. However, this step is optionnal is your primers are specific enough.

ecotag -m 0.5 -d ./only_obitools/embl_std -R ./only_obitools/base_ref_finale_formated.fasta Aquarium_2.clean.fasta > Aquarium_2.tag.fasta
# only the sequences with a similarity score higher than 0.95 are annotated to their corresponding taxon

Then, after a selection of the amplicons corresponding to your studied taxon, you can eliminate the non-interesting attributes. Here, we only conserve the amplicons abundance :

obiannotate -k count Aquarium_2.tag.fasta > Aquarium_2.tag_1.fasta
# only the attribute "count" is conserved

STEP 7 : Gathering in OTUs (swarm)

swarm gathers the sequences in OTU thanks to this algorithm :

  • First, sequences are pairwise aligned to count the number of dissimilarities between them
  • A threshold d is chosen, when the number of dissimilarities is inferior or equal to d, both sequences are gathered in a same OTU
  • This process is repeated to add iteratively the sequences to an OTU
  • The most abundant sequence of each OTU is chosen to represent the OTU
  • The abundance of the OTU is constituted by adding the abundances of each sequence included in the OTU
swarm -z -d 1 -o stats_Aquarium_2.txt -w Aquarium_2.end.fasta < Aquarium_2.tag_1.fasta
# "-z" option permits to accept the abundance in the header, provided that there is no space in the header and that the value is preceded by "size="
# "-d" is the maximal number of differences tolerated between 2 sequences to be gathered in the same OTU
# "-o" option returns a ".txt" file in which each line corresponds to an OTU with all the amplicons belonging to this OTU
# "-w" option gives a "fasta" file with the representative sequence of each OTU

An option called fastidious can be added, with -f, in order to integrate small OTUs in larger related OTUs. We don't use it here because it doesn't change the output at all for this example.

STEP 8 : Analyse your results

Then, you can filtrate your OTUs and the amplicons present in it.

For example, Elbrecht et al. recommend in their publication to eliminate :

  • OTUs with an abundance inferior to 0.01%
  • Amplicons with an abundance inferior to 0.003%
  • Amplicons with a relative abundance inferior to 5% in their OTU

Now you can make a statistical analysis to evaluate your filtering quality, after comparing the OTUs returned by the pipeline with your reference dataset.