README installation update

bd1ba812 · peguerin · 4021254b · bd1ba812
Commit bd1ba812 authored 6 years ago by peguerin
--- a/README.md
+++ b/README.md
@@ -9,15 +9,15 @@ This was designed to process RADseq data from [RESERVEBENEFIT](https://www.biodi

 1. [Introduction](#1-introduction)
 2. [Installation](#2-installation)
-  1. [Prerequisite](#21-prerequisite)
-  2. [Data Files](#22-data-files)
-  3. [Set up](#23-set-up)
+    1. [Prerequisite](#21-prerequisite)
+    2. [Data Files](#22-data-files)
+    3. [Set up](#23-set-up)
 3. [Reporting bugs](#3-reporting-bugs)
 4. [Running the pipeline](#4-running-the-pipeline)
-  1. [Initialisation](#41-initialisation)
-  2. [Configuration](#42-configuration)
-  3. [Run the pipeline into a single command](#43-run-the-pipeline-into-a-single-command)
-  4. [Run the pipeline step by step](#44-run-the-pipeline-step-by-step)
+    1. [Initialisation](#41-initialisation)
+    2. [Configuration](#42-configuration)
+    3. [Run the pipeline into a single command](#43-run-the-pipeline-into-a-single-command)
+    4. [Run the pipeline step by step](#44-run-the-pipeline-step-by-step)


 # 1. Introduction
@@ -40,7 +40,7 @@ You must install the following softwares and packages :
    5.3.0
    ```

- [STACKS 2.0b](http://catchenlab.life.illinois.edu/stacks/)
+- [STACKS 2.2](http://catchenlab.life.illinois.edu/stacks/)
   * Check version and if programs are correctly installed by typing :

    ```
@@ -49,7 +49,7 @@ You must install the following softwares and packages :
    gstacks --version
    populations --version
    ## should give you the output
-    2.0b
+    2.2
    ```

 - [BWA 0.7.17](https://icb.med.cornell.edu/wiki/index.php/Elementolab/BWA_tutorial)
@@ -99,11 +99,15 @@ You must install the following softwares and packages :

 ## 2.2 Data Files
 The included data files are :
+Let's define some wildcards:
+- `{run}` : any run
+- `{pool}` : any pool within a run
+- `{species}` : any species

 * [config.yaml](01-info_files/config.yaml) :
-* [barcodes.txt](01-info_files/barcodes.txt) :
-* [infos.csv](01-info_files) :
-* [populations_map.txt](01-info_files) :
+* [barcodes.txt](01-info_files/barcodes.txt) : file containing the barcodes used for {pool} within {run}
+* [{species}_infos.csv](01-info_files) : `.csv` information table for {species}. Each row is a sample, with 4 columns: run, pool, barcode, ID.
+* [{species}_populations_map.txt](01-info_files) : `.tsv` information table for {species}. Each row is a sample, with 2 columns: ID, population. This file can be generated by the pipeline (see the [Configuration](#42-configuration) section). However, we strongly recommend that you create it manually.
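
The two per-species tables described above can be sanity-checked before running the pipeline. The following is a minimal sketch, not part of the pipeline itself: the function names are hypothetical, and it assumes that `{species}_infos.csv` has no header row, is comma-separated in the order run,pool,barcode,ID, and that `{species}_populations_map.txt` is tab-separated.

```python
import csv
import io

# Column order taken from the README; the absence of a header row is an assumption.
INFO_COLUMNS = ["run", "pool", "barcode", "ID"]


def read_infos(text):
    """Parse the text of a {species}_infos.csv file into a list of row dicts."""
    rows = []
    for line_no, fields in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(INFO_COLUMNS):
            raise ValueError(
                f"line {line_no}: expected {len(INFO_COLUMNS)} columns, got {len(fields)}"
            )
        rows.append(dict(zip(INFO_COLUMNS, (field.strip() for field in fields))))
    return rows


def read_populations_map(text):
    """Parse the text of a {species}_populations_map.txt file into {ID: population}."""
    populations = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(fields)}")
        sample_id, population = (field.strip() for field in fields)
        populations[sample_id] = population
    return populations


def missing_from_map(infos, populations):
    """Return the sample IDs listed in the infos table but absent from the map."""
    return sorted({row["ID"] for row in infos} - set(populations))
```

In practice the text would come from the files in `01-info_files`, for example by reading `{species}_infos.csv` with `Path(...).read_text()`. An empty list returned by `missing_from_map` means every sample has been assigned a population.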

 ## 2.3 Set Up

@@ -114,7 +118,109 @@ cd snakemake_stacks2
 ```
 You will see the following folders :

-
+* [00-scripts](00-scripts): contains all the required scripts to run the whole pipeline
+* [01-info_files](01-info_files) : contains all the required data files (see the [Data Files](#22-data-files) section above)
+* [02-raw](02-raw) : must contain your data from paired-end Illumina sequencing runs. The data must be stored this way :
+    ```
+    02-raw/
+        runA/
+            poolA1/
+                {poolA1}_R1_001.fastq.gz
+                {poolA1}_R2_001.fastq.gz
+            poolA2/
+                {poolA2}_R1_001.fastq.gz
+                {poolA2}_R2_001.fastq.gz
+            ...
+        runB/
+            poolB1/
+                {poolB1}_R1_001.fastq.gz
+                {poolB1}_R2_001.fastq.gz
+            ...
+        ...        
+    ```
+* [03-samples](03-samples): will store the results generated by demultiplexing with [process_radtags](http://catchenlab.life.illinois.edu/stacks/comp/process_radtags.php) and clone filtering with [clone_filter](http://catchenlab.life.illinois.edu/stacks/comp/clone_filter.php). The data are stored this way :
+    ```
+    03-samples/
+        runA/
+            poolA1/
+                sample_{barcode1}.1.fq.gz
+                sample_{barcode1}.2.fq.gz
+                sample_{barcode2}.1.fq.gz
+                sample_{barcode2}.2.fq.gz
+                sample_{barcode3}.1.fq.gz
+                sample_{barcode3}.2.fq.gz
+                ...
+            poolA1_clone_filtered/
+                sample_{barcode1}.1.1.fq.gz
+                sample_{barcode1}.2.2.fq.gz
+                sample_{barcode2}.1.1.fq.gz
+                sample_{barcode2}.2.2.fq.gz
+                sample_{barcode3}.1.1.fq.gz
+                sample_{barcode3}.2.2.fq.gz
+                ...
+            poolA2/
+                sample_{barcode1}.1.fq.gz
+                sample_{barcode1}.2.fq.gz
+                ...
+            poolA2_clone_filtered/
+                sample_{barcode1}.1.1.fq.gz
+                sample_{barcode1}.2.2.fq.gz
+                ...
+            ...
+        runB/
+            poolB1/
+                sample_{barcode1}.1.fq.gz
+                sample_{barcode1}.2.fq.gz
+                ...
+            poolB1_clone_filtered/
+                sample_{barcode1}.1.1.fq.gz
+                sample_{barcode1}.2.2.fq.gz
+                ...
+            ...
+        ...        
+    ```
+* [04-all_samples](04-all_samples): paired-end `fastq.gz` files are renamed according to the information in [{species}_infos.csv](01-info_files). Reads are then aligned onto the reference genome sequences stored in [08-genomes](08-genomes). This folder contains the renamed fastq files and the corresponding `.bam` alignment files. `.sorted.bam` files are sorted alignments and `.sorted.bam.bai` files are their indexes. The data are stored this way :
+    ```
+    04-all_samples/
+        speciesA/
+           {sampleA1}.1.fq.gz
+           {sampleA1}.2.fq.gz
+           {sampleA1}.bam
+           {sampleA1}.sorted.bam
+           {sampleA1}.sorted.bam.bai
+           {sampleA2}.1.fq.gz
+           {sampleA2}.2.fq.gz
+           {sampleA2}.bam
+           {sampleA2}.sorted.bam
+           {sampleA2}.sorted.bam.bai
+           ...
+        speciesB/
+           {sampleB1}.1.fq.gz
+           {sampleB1}.2.fq.gz
+           {sampleB1}.bam
+           {sampleB1}.sorted.bam
+           {sampleB1}.sorted.bam.bai
+           ...
+        ...        
+    ```
+* [05-stacks](05-stacks) : outputs from [gstacks](http://catchenlab.life.illinois.edu/stacks/comp/gstacks.php)
+* [06-populations](06-populations) : outputs from [populations](http://catchenlab.life.illinois.edu/stacks/comp/populations.php)
+* [08-genomes](08-genomes) : reference genome of each species {species} used for the analysis. The `.fasta` file is mandatory and stores all the scaffold sequences of the {species} genome assembly. `.amb`, `.ann`, `.bwt`, `.pac`, `.sa` are index files required by [BWA 0.7.17](https://icb.med.cornell.edu/wiki/index.php/Elementolab/BWA_tutorial). They will be generated automatically if absent. The data must be stored this way :
+    ```
+    08-genomes/
+          {species}_genome.amb
+          {species}_genome.ann
+          {species}_genome.bwt
+          {species}_genome.fasta
+          {species}_genome.pac
+          {species}_genome.sa       
+    ```
+* [10-logs](10-logs) : log files generated by every command
+    - process_radtags
+    - clone_filter
+    - genome_alignment
+    - gstacks
+    - populations
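
The `02-raw` layout listed above is the one part of this tree that the user has to prepare by hand, so it may be worth checking before launching the pipeline. The sketch below is an illustration only and is not shipped in `00-scripts`: the function name is hypothetical, and it assumes that the `{pool}` prefix of each fastq file is identical to the name of the pool directory that contains it.

```python
from pathlib import Path

# File suffixes taken from the 02-raw tree shown in the README.
READ_SUFFIXES = ("_R1_001.fastq.gz", "_R2_001.fastq.gz")


def find_missing_reads(raw_dir):
    """Return the expected paired-end files that are absent under raw_dir.

    Expected layout: raw_dir/<run>/<pool>/<pool>_R1_001.fastq.gz (and _R2_).
    Assumption: the file prefix equals the pool directory name.
    """
    missing = []
    for run_dir in sorted(p for p in Path(raw_dir).iterdir() if p.is_dir()):
        for pool_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
            for suffix in READ_SUFFIXES:
                expected = pool_dir / f"{pool_dir.name}{suffix}"
                if not expected.is_file():
                    missing.append(expected)
    return missing
```

Calling `find_missing_reads("02-raw")` from the repository root returns an empty list when every pool has both read files, and otherwise the list of paths that still need to be provided.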

 # 3. Reporting bugs

@@ -122,7 +228,7 @@ If you're sure you've found a bug — e.g. if one of my programs crashes
 with an obscure error message, or if the resulting file is missing part
 of the original data, then by all means submit a bug report.

-I use [GitLab's issue system](https://gitlab.com/reservebenefit/snakemake_stacks2/issues)
+I use [GitLab's issue system](http://gitlab.mbb.univ-montp2.fr/reservebenefit/snakemake_stacks2/issues)
 as my bug database. You can submit your bug reports there. Please be as
 verbose as possible, e.g. include the command line, etc.