Running GeneFlow RNA-seq

This GeneFlow RNA-seq workflow maps read pairs to a reference genome with HISAT2, then assembles and quantifies transcripts with StringTie.

GeneFlow enables modular and reproducible scientific workflows by leveraging reusable, containerized steps. Custom workflow steps can be implemented using either Docker or Singularity containers. Additional documentation for GeneFlow can be found here.

Requirements

  • Docker

  • Python 3

  • GeneFlow

To install Docker, Python 3, and GeneFlow, see our VM Workflow Tools Installation Cheatsheet for instructions.

Note

This workflow will not run without the requirements above. Make sure they are installed correctly before continuing.
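
As a quick sanity check before installing the workflow, the following minimal Python sketch reports which required command-line tools are missing from your PATH. It assumes the Docker client is installed as docker and the GeneFlow command is gf (the command used later in this guide); the helper name missing_tools is hypothetical and not part of GeneFlow.

```python
import shutil
import sys

# Command names assumed for this check: the Docker client and the GeneFlow CLI.
REQUIRED_TOOLS = ["docker", "gf"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from the list that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

If gf is reported missing, check that the GeneFlow Python virtual environment is activated. Finding a tool on PATH does not prove it works, for example that the Docker daemon is running.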

Install the Workflow

$ sudo add-apt-repository universe
$ sudo apt update
$ sudo apt install subversion

# clone the workflow
$ svn checkout https://github.com/isb-cgc/RunningWorkflows-on-the-GoogleCloud/trunk/GeneFlow-RNAseq

# install the workflow's apps
$ gf install-workflow --make-apps GeneFlow-RNAseq

Running GeneFlow

You should have a GeneFlow-RNAseq directory with the following contents:

GeneFlow-RNAseq
├── apps
│   ├── bam-sort
│   ├── hisat2-align
│   ├── hisat2-build
│   └── stringtie
├── data
│   ├── gtf
│   │   └── sample.gtf
│   ├── reads
│   │   ├── sample_1.fq
│   │   └── sample_2.fq
│   └── reference
│       └── sample.fa
└── workflow.yaml

The file workflow.yaml contains the workflow definition, and the apps folder contains definitions for each of the workflow steps.

Let’s take a look at the contents of the workflow.yaml file. The first block contains metadata, including workflow name, description, source repository, and version. The final_output block lists all steps whose output should be copied to the final output directory.

%YAML 1.1
---
gfVersion: v2.0
class: workflow

# metadata
name: HISAT2 StringTie Workflow
description: RNAseq workflow using HISAT2 and StringTie
git: https://github.com/isb-cgc/RunningWorkflows-on-the-GoogleCloud/GeneFlow-RNAseq
version: '0.1'

final_output:
- sort
- quantify

# inputs
inputs:
  reads:
    label: Input Directory
    description: Input directory containing FASTQ files
    type: Directory
    default: ./data/reads
    enable: true
    visible: true
  gtf:
    label: Input GTF
    description: GTF file describing transcriptome
    type: File
    default: ./data/gtf/sample.gtf
    enable: true
    visible: true
  reference:
    label: Reference Sequence FASTA
    description: Reference sequence FASTA file
    type: File
    default: ./data/reference/sample.fa
    enable: true
    visible: true

# parameters
parameters:
  threads:
    label: CPU Threads
    description: Number of CPU threads for alignment
    type: int
    default: 2
    enable: false
    visible: true

# apps
apps:
  hisat2-build:
    git: https://github.com/geneflow-apps/hisat2-build-gf2.git
    version: '2.2.1-01'
  hisat2-align:
    git: https://github.com/geneflow-apps/hisat2-align-gf2.git
    version: '2.2.1-01'
  bam-sort:
    git: https://github.com/geneflow-apps/bam-sort-gf2.git
    version: '1.10-07'
  stringtie:
    git: https://github.com/geneflow-apps/stringtie-gf2.git
    version: '2.1.6-01'

# steps
steps:
  build:
    app: hisat2-build
    depend: []
    template:
      reference: ${workflow->reference}
      output: reference

  align:
    app: hisat2-align
    depend: [ "build" ]
    map:
      uri: ${workflow->reads}
      regex: (.*)_(R|)1(.*)\.((fastq|fq)(|\.gz))$
    template:
      input: ${workflow->reads}/${1}_${2}1${3}.${4}
      pair: ${workflow->reads}/${1}_${2}2${3}.${4}
      reference: ${build->output}/reference
      threads: ${workflow->threads}
      output: ${1}.sam

  sort:
    app: bam-sort
    depend: [ "align" ]
    map:
      uri: ${align->output}
      regex: (.*).sam
    template:
      input: ${align->output}/${1}.sam
      output: ${1}.bam

  quantify:
    app: stringtie
    depend: [ "sort" ]
    map:
      uri: ${sort->output}
      regex: (.*).bam
    template:
      bam: ${sort->output}/${1}.bam
      gtf: ${workflow->gtf}
      output: ${1}
...
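
To make the map and template blocks of the align step above more concrete, here is a minimal Python sketch of how the regex could select the first read of each pair and how the numbered placeholders could expand into file names. It assumes the regex is applied to each file name with Python-style matching; that is an assumption about GeneFlow's behavior, and the helper names expand and plan_align_jobs are hypothetical. The ${workflow->reads} directory prefix is left out for brevity.

```python
import re

# Regex copied from the align step's map block in workflow.yaml.
READ1_REGEX = r"(.*)_(R|)1(.*)\.((fastq|fq)(|\.gz))$"


def expand(template, match):
    """Replace ${1}, ${2}, ... in the template with the regex capture groups."""
    return re.sub(r"\$\{(\d+)\}", lambda m: match.group(int(m.group(1))), template)


def plan_align_jobs(filenames):
    """Build one job description per file name that matches the read-1 regex."""
    jobs = []
    for name in sorted(filenames):
        match = re.match(READ1_REGEX, name)
        if match is None:
            continue  # read-2 files and unrelated files are skipped
        jobs.append({
            "input": expand("${1}_${2}1${3}.${4}", match),
            "pair": expand("${1}_${2}2${3}.${4}", match),
            "output": expand("${1}.sam", match),
        })
    return jobs


if __name__ == "__main__":
    for job in plan_align_jobs(["sample_1.fq", "sample_2.fq"]):
        print(job)
    # prints {'input': 'sample_1.fq', 'pair': 'sample_2.fq', 'output': 'sample.sam'}
```

Only sample_1.fq matches the regex, so the sample data yields a single alignment job whose pair file name is derived from the same capture groups.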

The inputs and parameters blocks define the inputs and parameters that are passed to the workflow upon execution. In this workflow, the three inputs (reads, gtf, and reference) and the threads parameter all have default values that point to the bundled sample data.

The apps block lists all apps used by the workflow and links to the reusable source repository for each app. To learn more about how each app works, follow the Git repository link given in the git field of each app entry.

The steps block defines the order of app execution as well as step dependencies for each app. It also defines how apps are chained together via their inputs and outputs.
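
The way the depend lists determine execution order can be sketched with a topological sort. This is only an illustration using Python's standard graphlib module (Python 3.9 or later), not GeneFlow's actual scheduler; the DEPENDS dictionary is transcribed by hand from the depend fields in workflow.yaml, and the function name execution_order is hypothetical.

```python
from graphlib import TopologicalSorter

# Step name -> steps it depends on, transcribed from the steps block.
DEPENDS = {
    "build": [],
    "align": ["build"],
    "sort": ["align"],
    "quantify": ["sort"],
}


def execution_order(depends):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(depends).static_order())


if __name__ == "__main__":
    print(execution_order(DEPENDS))
    # prints ['build', 'align', 'sort', 'quantify']
```

Because each step depends only on the one before it, this workflow has exactly one valid order: build, align, sort, quantify.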

To run the workflow:

# assuming the GeneFlow Python virtual environment has been activated, view the command line help
$ gf help GeneFlow-RNAseq

# run the workflow
$ cd GeneFlow-RNAseq
$ gf run . -o output

After the workflow completes, the output folder should look similar to this:

output
└── geneflow-job-095ba2fe
    ├── quantify
    │   ├── _log
    │   │   ├── gf-0-quantify-sample.err
    │   │   ├── gf-0-quantify-sample.out
    │   │   ├── sample-stringtie.stderr
    │   │   └── sample-stringtie.stdout
    │   └── sample
    │       ├── e2t.ctab
    │       ├── e_data.ctab
    │       ├── i2t.ctab
    │       ├── i_data.ctab
    │       ├── sample.tsv
    │       ├── sample_final_reference.gtf
    │       ├── sample_final_transcript.gtf
    │       └── t_data.ctab
    └── sort
        ├── _log
        │   ├── gf-0-sort-sample-bam.err
        │   ├── gf-0-sort-sample-bam.out
        │   └── sample.bam-samtools-sort.stderr
        └── sample.bam

The workflow runs Docker containers for HISAT2, samtools, and StringTie to do the work. sample.bam contains the sorted sequence alignment data produced by mapping the reads to the reference genome sample.fa. sample_final_transcript.gtf contains details of the transcripts that StringTie assembles from the RNA-seq data, while sample.tsv contains gene abundances. Additional information about the gtf and tsv outputs of StringTie can be found here.
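
For a first look at the gene abundances, a minimal Python sketch like the one below could list the most abundant genes in sample.tsv. It assumes the file is tab-separated with a header row containing "Gene ID" and "TPM" columns, as in StringTie's gene abundance format; check the header of your own file before relying on these names. The function name top_genes is hypothetical.

```python
import csv
import glob


def top_genes(tsv_file, n=5, column="TPM"):
    """Return the n (gene ID, abundance) pairs with the highest abundance."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    rows = [(row["Gene ID"], float(row[column])) for row in reader]
    return sorted(rows, key=lambda pair: pair[1], reverse=True)[:n]


if __name__ == "__main__":
    # The job directory name (geneflow-job-...) differs between runs, so glob for it.
    for path in glob.glob("output/geneflow-job-*/quantify/sample/sample.tsv"):
        with open(path, newline="") as handle:
            for gene_id, tpm in top_genes(handle):
                print(f"{gene_id}\t{tpm:.2f}")
```

Run the script from the GeneFlow-RNAseq directory so that the relative output path resolves.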

View the results of this workflow here.

