STACKS2 using SNAKEMAKE Workflow

RADseq workflow using STACKS2. It was designed to process RADseq data from the RESERVEBENEFIT project.

Table of contents

  1. Introduction
  2. Installation
    1. Prerequisite
    2. Data Files
    3. Set up
  3. Reporting bugs
  4. Running the pipeline
    1. Initialisation
    2. Configuration
    3. Run the pipeline into a single command
    4. Run the pipeline step by step

1. Introduction

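This workflow processes paired-end RADseq reads in five steps: demultiplexing with process_radtags, clone filtering with clone_filter, alignment of the reads onto a reference genome with BWA, then gstacks and populations. Each step writes its results to a dedicated folder (see the Set Up section).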

2. Installation

2.1 Prerequisite

You must install the following software and packages:

  • SNAKEMAKE 5.3.0

    • Check the version and that the program is correctly installed by typing:
    snakemake --version
    ## should give you the output
    5.3.0
  • STACKS 2.2

    • Check the version and that the programs are correctly installed by typing:
    process_radtags --version
    clone_filter --version
    gstacks --version
    populations --version
    ## should give you the output
    2.2
  • BWA 0.7.17

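    tar -xvf bwa-x.x.x.tar.bz2
    cd bwa-x.x.x
    ## BWA ships a plain Makefile: there is no configure script and no install target
    make
    ## copy the resulting binary to a directory listed in your PATH
    cp bwa /where/to/install/bin/
    • Check the version and that the program is correctly installed by typing: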
    bwa
    ## should give you the output
    Program: bwa (alignment via Burrows-Wheeler transformation)
    Version: 0.7.17-r1188
    ...
  • SAMTOOLS 1.9

    cd htslib-1.x
    ./configure --prefix=/where/to/install
    make
    make install
    cd ..
    ## and similarly for samtools :
    cd samtools-1.x
    ./configure --prefix=/where/to/install
    make
    make install
    • Check the version and that the programs are correctly installed by typing:
    samtools --version
    ## should give you the output
    samtools 1.9
    Using htslib 1.9
    Copyright (C) 2018 Genome Research Ltd.
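
The manual checks above can be gathered into one script. The sketch below is only an illustration: check_version and check_prerequisites are hypothetical helper names that are not part of the pipeline, and the expected strings are the ones quoted in this section.

```shell
# Hypothetical helper (not part of the pipeline): run a command and check that
# its output contains the expected version string.
# Usage: check_version LABEL EXPECTED COMMAND [ARGS...]
check_version() (
    label=$1
    expected=$2
    shift 2
    # bwa prints its banner on stderr and exits with a non-zero status when
    # called without arguments, so capture both streams and ignore the status.
    output=$("$@" 2>&1) || true
    # Fixed-string substring match: a loose check, "2.2" also matches "2.2.1".
    if printf '%s\n' "$output" | grep -qF -- "$expected"; then
        echo "OK       $label ($expected)"
    else
        echo "MISMATCH $label (expected $expected)"
        exit 1
    fi
)

# Check every tool listed in this section and report how many checks failed.
check_prerequisites() (
    failures=0
    check_version snakemake 5.3.0 snakemake --version || failures=$((failures + 1))
    for tool in process_radtags clone_filter gstacks populations; do
        check_version "$tool" 2.2 "$tool" --version || failures=$((failures + 1))
    done
    check_version bwa 0.7.17 bwa || failures=$((failures + 1))
    check_version samtools "samtools 1.9" samtools --version || failures=$((failures + 1))
    echo "$failures check(s) failed"
    [ "$failures" -eq 0 ]
)
```

After sourcing these definitions, calling check_prerequisites prints one line per tool and returns a non-zero status if any tool is missing or reports another version.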

2.2 Data Files

The wildcards below are used throughout this document:

  • {run}: any sequencing run
  • {pool}: any pool within a run
  • {species}: any species

The included data files are:

  • config.yaml: defines a dictionary of configuration parameters and their values, used by the commands of each pipeline step.
  • barcodes.txt: file containing the barcodes used for {pool} in {run}.
  • {species}_infos.csv: .csv information table related to {species}. Each row is a sample and there are 4 columns: run, pool, barcode, ID.
  • {species}_populations_map.txt: .tsv information table related to {species}. Each row is a sample and there are 2 columns: ID, population. This file can be generated by the pipeline (see the Configuration section). However, we strongly recommend that you write it manually.
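
Since the two information tables are written by hand, a quick column check before running the pipeline can catch malformed rows. The sketch below assumes only what is stated above (4 comma-separated columns for the .csv, 2 tab-separated columns for the population map). The function names are hypothetical, and whether the files carry a header row is not specified here, so a header is treated like any other row.

```shell
# Hypothetical helper: check that every row of FILE has exactly NFIELDS
# non-empty fields separated by SEP.
# Usage: check_table FILE SEP NFIELDS
check_table() {
    awk -F"$2" -v expected="$3" '
        BEGIN { bad = 0 }
        NF != expected {
            printf "%s line %d: expected %d fields, found %d\n", FILENAME, NR, expected, NF
            bad = 1
            next
        }
        {
            for (i = 1; i <= NF; i++)
                if ($i == "") {
                    printf "%s line %d: field %d is empty\n", FILENAME, NR, i
                    bad = 1
                }
        }
        END { exit bad }
    ' "$1"
}

# {species}_infos.csv: run,pool,barcode,ID
check_infos_csv() { check_table "$1" ',' 4; }

# {species}_populations_map.txt: ID<TAB>population
check_population_map() { check_table "$1" '\t' 2; }
```

Both functions print one message per faulty row and return a non-zero status if any row is malformed. Blank lines are reported as rows with 0 fields.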

2.3 Set Up

Clone the project and switch to its main folder, which is your working directory:

git clone http://gitlab.mbb.univ-montp2.fr/reservebenefit/snakemake_stacks2.git
cd snakemake_stacks2

You will see the following folders :

  • 00-scripts: contains all the required scripts to run the whole pipeline
  • 01-info_files: contains all the required data files (see the Data Files section above)
  • 02-raw: must contain your data from paired-end Illumina sequencing runs. The data must be stored this way:
    02-raw/
        runA/
            poolA1/
                {poolA1}_R1_001.fastq.gz
                {poolA1}_R2_001.fastq.gz
            poolA2/
                {poolA2}_R1_001.fastq.gz
                {poolA2}_R2_001.fastq.gz
            ...
        runB/
            poolB1/
                {poolB1}_R1_001.fastq.gz
                {poolB1}_R2_001.fastq.gz
            ...
        ...        
  • 03-samples: stores the results of demultiplexing with process_radtags and of clone filtering with clone_filter. The data are stored this way:
     03-samples/
         runA/
             poolA1/
                 sample_{barcode1}.1.fq.gz
                 sample_{barcode1}.2.fq.gz
                 sample_{barcode2}.1.fq.gz
                 sample_{barcode2}.2.fq.gz
                 sample_{barcode3}.1.fq.gz
                 sample_{barcode3}.2.fq.gz
                 ...
             poolA1_clone_filtered/
                 sample_{barcode1}.1.1.fq.gz
                 sample_{barcode1}.2.2.fq.gz
                 sample_{barcode2}.1.1.fq.gz
                 sample_{barcode2}.2.2.fq.gz
                 sample_{barcode3}.1.1.fq.gz
                 sample_{barcode3}.2.2.fq.gz
                 ...
             poolA2/
                 sample_{barcode1}.1.fq.gz
                 sample_{barcode1}.2.fq.gz
                 ...
             poolA2_clone_filtered/
                 sample_{barcode1}.1.1.fq.gz
                 sample_{barcode1}.2.2.fq.gz
                 ...
             ...
         runB/
             poolB1/
                 sample_{barcode1}.1.fq.gz
                 sample_{barcode1}.2.fq.gz
                 ...
             poolB1_clone_filtered/
                 sample_{barcode1}.1.1.fq.gz
                 sample_{barcode1}.2.2.fq.gz
                 ...
             ...
         ...        
  • 04-all_samples: paired-end fastq.gz files are renamed according to the information in {species}_infos.csv. Reads are then aligned onto the reference genome sequences stored in 08-genomes. This folder contains the renamed fastq files and the corresponding .bam alignment files. .sorted.bam files are sorted alignments and .sorted.bam.bai files are their indexes. The data are stored this way:
    04-all_samples/
        speciesA/
           {sampleA1}.1.fq.gz
           {sampleA1}.2.fq.gz
           {sampleA1}.bam
           {sampleA1}.sorted.bam
           {sampleA1}.sorted.bam.bai
           {sampleA2}.1.fq.gz
           {sampleA2}.2.fq.gz
           {sampleA2}.bam
           {sampleA2}.sorted.bam
           {sampleA2}.sorted.bam.bai
           ...
        speciesB/
           {sampleB1}.1.fq.gz
           {sampleB1}.2.fq.gz
           {sampleB1}.bam
           {sampleB1}.sorted.bam
           {sampleB1}.sorted.bam.bai
           ...
        ...        
  • 05-stacks : outputs from gstacks
  • 06-populations : outputs from populations
  • 08-genomes: reference genome of each species {species} used for the analysis. The .fasta file is mandatory and stores all the scaffold sequences of the {species} genome assembly. The .amb, .ann, .bwt, .pac and .sa files are the index files required by BWA 0.7.17; they are generated automatically if absent. The data must be stored this way:
    08-genomes/
          {species}_genome.amb
          {species}_genome.ann
          {species}_genome.bwt
          {species}_genome.fasta
          {species}_genome.pac
          {species}_genome.sa       
  • 10-logs : log files generated by every command
    • process_radtags
    • clone_filter
    • genome_alignment
    • gstacks
    • populations
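
Because 02-raw is the only folder you fill in yourself, it can be worth checking its layout before starting. The sketch below assumes the run/pool nesting and the _R1_001.fastq.gz / _R2_001.fastq.gz suffixes shown above. check_raw_layout is a hypothetical helper, not a script shipped in 00-scripts.

```shell
# Hypothetical helper: report every pool folder under a raw-data directory
# (default: 02-raw) that lacks an R1 or an R2 fastq.gz file.
# Usage: check_raw_layout [RAW_DIR]
check_raw_layout() (
    raw_dir=${1:-02-raw}
    status=0
    for pool_dir in "$raw_dir"/*/*/; do
        # An unmatched glob is left as a literal pattern: skip it.
        [ -d "$pool_dir" ] || continue
        for mate in R1 R2; do
            # Expand the glob into the positional parameters and test the first match.
            set -- "$pool_dir"*_"$mate"_001.fastq.gz
            if [ ! -e "$1" ]; then
                echo "missing $mate file in $pool_dir"
                status=1
            fi
        done
    done
    exit "$status"
)
```

Run from the working directory, check_raw_layout prints one line per missing mate file and returns a non-zero status if any pool is incomplete.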

3. Reporting bugs

If you're sure you've found a bug, e.g. if one of my programs crashes with an obscure error message or if the resulting file is missing part of the original data, then by all means submit a bug report.

I use GitLab's issue system as my bug database. You can submit your bug reports there. Please be as verbose as possible, e.g. include the command line you ran.

4. Running the pipeline

4.1 Initialisation

  • Open a shell
  • Make a folder with a name of your choice (here it is named workdir)
mkdir workdir
cd workdir
  • Clone the project and switch to its main folder, which is your working directory
git clone http://gitlab.mbb.univ-montp2.fr/reservebenefit/snakemake_stacks2.git
cd snakemake_stacks2

4.2 Configuration

WORK IN PROGRESS !!!!

4.3 Run the pipeline into a single command

Once you have finished the Initialisation and Configuration steps, you can run the whole pipeline by simply typing:

## number of CPU cores available for running the pipeline (for instance, 64 cores here)
N_CORES=64
## run the pipeline with a single command
bash main.sh "$N_CORES"
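
If you would rather not hard-code the core count, it can be detected at run time. This is a sketch under assumptions: detect_cores and run_pipeline are hypothetical names, nproc is a GNU coreutils command that may be absent on some systems (hence the fallbacks), and the log file name is an arbitrary choice. The only thing taken from this README is that main.sh receives the number of cores as its single argument.

```shell
# Hypothetical helper: print the number of available CPU cores, falling back
# to getconf and then to 1 when nproc is not available.
detect_cores() (
    n=$(nproc 2>/dev/null) || n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    echo "$n"
)

# Hypothetical wrapper: run main.sh with the given (or detected) core count
# and keep everything it prints in a time-stamped log file.
# Usage: run_pipeline [N_CORES]
run_pipeline() (
    n_cores=${1:-$(detect_cores)}
    log_file="main_$(date +%Y%m%d_%H%M%S).log"
    echo "running main.sh with $n_cores cores, output in $log_file"
    bash main.sh "$n_cores" > "$log_file" 2>&1
)
```

Calling run_pipeline without argument uses every detected core; run_pipeline 16 limits the run to 16 cores. The return status is the one of main.sh.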

4.4 Run the pipeline step by step

WORK IN PROGRESS !!!!

That's it! The pipeline is now running and crunching your data. Once it has finished, look at the output folders and at the log files in 10-logs.