======================================================================
Your First Workflow on Google Cloud Virtual Machine (Snakemake RNA-seq)
======================================================================

This Snakemake `RNA-seq <https://www.technologynetworks.com/genomics/articles/rna-seq-basics-applications-and-protocol-299461#:~:text=RNA%2Dseq%20(RNA%2Dsequencing,patterns%20encoded%20within%20our%20RNA.>`_ workflow maps read-pairs to a reference genome and quantify the produce to produce a transcript. `Snakemake <https://snakemake.readthedocs.io/en/stable/>`_ workflows are Python based, extended by declarative code which defines rules. The rules indicate how the output files are created from the input files.

Requirements:

-  `Anaconda/Miniconda <https://conda.io/projects/conda/en/latest/user-guide/install/index.html>`_


.. note:: The requirements above are crucial to running this workflow. Please make sure you have them installed properly prior to running this workflow.


Download this tutorial:
::

  $ sudo add-apt-repository universe
  $ sudo apt update
  $ sudo apt install subversion

  #cloning this tutorial
  $ svn checkout https://github.com/isb-cgc/RunningWorkflows-on-the-GoogleCloud/trunk/Snakemake-RNAseq

Starting folder:


::

   Snakemake-RNAseq
   |_________data
   |           |_________ggal_gut_1.fq
   |           |_________ggal_gut_2.fq
   |           |_________ggal.gtf
   |           |_________reference.fa
   |_________step1
   |           |_________indexing.smk
   |           |_________Snakefile
   |_________step2
   |           |_________therest.smk
   |           |_________Snakefile
   |_________environment.yml


Setting up your Conda environment using the environment.yml file
================================================================
The file **environment.yml** contains a list of all the softwares required to run this tutorial
::

  $ cd Snakemake-RNAseq
  $ conda env create --name tutorial --file environment.yml
  $ conda activate tutorial

.. note:: Syntax: conda env create --name <yourEnvironmentName> --file environment.yml. For this tutorial <yourEnvironmentName> = tutorial

It should look like this:

.. image:: images/YourFirstWorkflow_1.png
   :align: left

Congratulations! You are now ready to run your first workflow using snakemake.

Step 1: Creating Index files from reference genome using hisat2-build
=====================================================================


From folder rnaseq-snake:
::

  $ cd step1

Let's take a look at the 2 files in this folder: **Snakefile** and **indexing.smk**.


::

  |_________step1
             |_________indexing.smk
             |_________Snakefile


Snakefile is the default name for snakemake's script. The "snakemake"
command will automatically find a file name Snakefile and execute it;  **indexing.smk** is a helper file that will be called within the Snakefile.

Let's run it by using:

::

  $ snakemake


Snakemake will run and use hisat2 to build index files from file rnaseq-snake/data/**reference.fa**. All the index files will be moved to rnaseq-snake/step1/**reference**.

Let's take a quick look at the code:

::

  #Snakefile
  rule targets:
    input:
        "index.done"
  include: "indexing.smk"


The first rule's input of a Snakefile (rule targets) will determine what's the output of that workflow. In this case, index.done is the target file that snakemake will look for. **Include: "indexing.smk"** will call the following helper script:

.. note:: Only the very first Snakefile rule's input behaves like this; other input will be the actual inputs of your workflow.

::

  #Indexing.smk
  rule indexing:
    input:
        "../data/reference.fa"
    output:
        touch('index.done')
    shell:
        """
        mkdir reference
        hisat2-build {input} index
        mv index.* reference/
        """

In **Indexing.smk** file we have an actual input "../data/reference.fa" and the output section tells snakemake to create an empty file "index.done", which is the file that the first rule will check to make sure that this helper script actually run. Then the shell script is executed as follows: a folder called reference got created, then Hisat2 created index files from the fasta file, and then all the index files got moved to the reference folder.

After **Step 1**:

::

   rnaseq-snakemake
   |_________data
   |           |_________ggal_gut_1.fq
   |           |_________ggal_gut_2.fq
   |           |_________ggal.gtf
   |           |_________reference.fa
   |_________step1
   |           |_________indexing.smk
   |           |_________Snakefile
   |           |_________[index.done]
   |           |_________[reference]
   |                           |_________[index.1.ht2]
   |                           |              [(2-7)]
   |                           |_________[index.8.ht2]
   |_________step2
   |           |_________therest.smk
   |           |_________Snakefile
   |_________environment.yml


Step 2: Creating the BAM file and the Transcript from reads and index files
==========================================================================

Step 2 is similar to Step 1.

From folder step1, to run step 2:

::

   $ cd ..
   $ cd step2
   $ snakemake

After **Step 2**:

::

   rnaseq-snakemake
   |_________data
   |           |_________ggal_gut_1.fq
   |           |_________ggal_gut_2.fq
   |           |_________ggal.gtf
   |           |_________reference.fa
   |_________step1
   |           |_________indexing.smk
   |           |_________Snakefile
   |           |_________index.done
   |           |_________reference
   |                           |_________index.1.ht2
   |                           |              (2-7)
   |                           |_________index.8.ht2
   |_________step2
   |           |_________therest.smk
   |           |_________Snakefile
   |           |_________[ggal_gut.cutadapt.sam]
   |           |_________[e2t.ctab]
   |           |_________[i_data.ctab]
   |           |_________[i2t.ctab]
   |           |_________[t_data.ctab]
   |           |_________[e_data.ctab]
   |           |_________[ggal_gut.tsv]
   |           |_________[ggal_gut_ref.gtf]
   |           |_________[ggal_gut_transcript.gtf]
   |           |_________[ggal_gut.cutadapt.bam]
   |           |_________[ggal_gut.cutadapt.bam.bai]
   |_________environment.yml

The script will call `hisat2 <http://daehwankimlab.github.io/hisat2/>`_ , `samtools <http://www.htslib.org/>`_, and `stringtie <https://ccb.jhu.edu/software/stringtie/>`_ to do the work.
The same principle as step 1 applies to this step, the **Snakefile** will contains the final outputs, and call to the helper script **therest.smk**. **ggal_gut.cutadapt.sam** file will contains the sequence alignment data produced by mapping reads to the reference genome. **ggal_gut.cutadapt.bam** file will contain the compressed binary data from Sam. More description on ctab files, gtf outputs, and tsv of stringtie can be found `here <http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual>`_. The **ggal_gut_transcript.gtf** contains details of the transcripts that StringTie assembles from RNA-Seq data, while **ggal_gut.tsv** contains gene abundances.


Creating a visualization for your workflow
==========================================

In the step2 folder:
::

 $ snakemake --dag | dot -Tsvg > visual.svg

A file named **visual.svg** will be created in the same folder; it can be downloaded and open with any web browser. It should look like this:


.. image:: images/YourFirstWorkflow_2.jpg
   :align: center



About environment.yml
---------------------

::

 channels:
  - conda-forge
  - bioconda
  - main
  - r
 dependencies:
  #snakemake and python will be included
  - snakemake-minimal =5.10.0
  - python =3.7.6
  #all other bioinformatics tools
  - samtools =1.9
  - bowtie2 =2.3.5.1
  - hisat2 =2.2.0
  - stringtie =2.1.2
  - gffread =0.11.7
  #visualization tool
  - graphviz =2.42.3e

To see the result of this workflow, you can check it `here <https://github.com/isb-cgc/RunningWorkflows-on-the-GoogleCloud/tree/master/Results/RNAseq>`_.