Running CWL RNA-seq

This Common Workflow Language (CWL) RNA-seq workflow maps read-pairs to a reference genome and produces a transcript.

CWL enables the user to connect command line tools to create workflows; it is a specification and is therefore portable across platforms that support CWL.

Requirements:

  • CWLtool

  • Docker

Note

The requirements above are crucial to running this workflow. Please make sure you have them installed properly prior to running this workflow.

Download this tutorial:

$ sudo add-apt-repository universe
$ sudo apt update
$ sudo apt install subversion

#cloning this tutorial
$ svn checkout https://github.com/isb-cgc/RunningWorkflows-on-the-GoogleCloud/trunk/CWL-RNAseq

To install Docker and CWL, see our VM Workflow Tools Installation Cheatsheet for instructions.

Starting folder CWL-RNAseq should look like this:

.
└── CWL-RNAseq
    ├── create_bam.cwl
    ├── create_transcript.cwl
    ├── CWL-RNAseq.cwl
    ├── CWL-RNAseq.yml
    ├── data
    │   ├── sample_1.fq
    │   ├── sample_2.fq
    │   ├── sample.fa
    │   └── sample.gtf
    ├── hisat2_align.cwl
    └── index_build.cwl

An overview of the main CWL files:

  • CWL-RNAseq.cwl is the main cwl file that connects all other cwl tools and yml file together.

  • CWL-RNAseq.yml is the file that contains all the inputs that are necessary to run the pipeline.

  • index_build.cwl builds index files from a Fasta file, using Hisat2-build.

  • hisat2_align.cwl builds a sam file from forward and reverse reads, and the indices built from previous step, using Hisat2.

  • create_bam.cwl builds a bam file from the newly built sam file, using Samtools.

  • create_transcript.cwl creates transcript from the bam file from previous step, using Stringtie.

Let’s take a look at some example of the main file CWL-RNAseq.cwl: The first block describe what is required to run this workflow, more information on this CWL requirements can be found here. Docker usage is also described in the hints section.

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: Workflow
requirements:
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
  ShellCommandRequirement: {}

hints:
  DockerRequirement:
    dockerPull: kathrinklee/rna-seq-pipeline-hisat2

Below are the inputs, outputs, and steps blocks that come after. In step1 the script index_build.cwl will be called, and its inputs (in) are taken from inputs section, the output (out) will be caught by the outputs (step1/ht and step1/log in this case). This declaration is important as it will decide which outputs to keep at the end of the step.

inputs:
  fasta_file: File
  out_name: string

outputs:
  index_files:
    type: Directory
    outputSource: step1/ht
  index_log:
    type: File
    outputSource: step1/log

steps:
  step1:
    run: index_build.cwl
    in:
      fasta_file: fasta_file
      out_name: out_name
    out:
      [ht, log]

Let’s run it by using:

$ cwltool CWL-RNAseq.cwl CWL-RNAseq.yml

If you receive this error: “docker: Got permission denied while trying to connect to the Docker daemon socket at unix”

Try:

$ sudo groupadd docker
$ sudo usermod -aG docker ${USER}
close and reopen VM then run the script again

Let’s take a look at the folder after cwltool finishes:

.
└── CWL-RNAseq
    ├── create_bam.cwl
    ├── create_transcript.cwl
    ├── CWL-RNAseq.cwl
    ├── CWL-RNAseq.yml
    ├── data
    │   ├── sample_1.fq
    │   ├── sample_2.fq
    │   ├── sample.fa
    │   └── sample.gtf
    ├── [final_ref.gtf]
    ├── [final_transcript.gtf]
    ├── [final.tsv]
    ├── hisat2_align.cwl
    ├── [hisat2_align_out]
    │   ├── [hisat2_align_out.log]
    │   └── [sample.sam]
    ├── [hisat2_build.log]
    ├── index_build.cwl
    ├── [sample]
    │   ├── [index.1.ht2]
    │   ├── [index.2.ht2]
    │   ├── [index.3.ht2]
    │   ├── [index.4.ht2]
    │   ├── [index.5.ht2]
    │   ├── [index.6.ht2]
    │   ├── [index.7.ht2]
    │   └── [index.8.ht2]
    └── [sample.bam]

The script will call hisat2 , samtools, and stringtie to do the work. sample.sam file will contains the sequence alignment data produced by mapping reads to the reference genome, sample.bam file will contains the compressed binary data from Sam. More description on gtf outputs, and tsv of stringtie can be found here. The final_transcript.gtf contains details of the transcripts that StringTie assembles from RNA-Seq data, while final.tsv contains gene abundances.

To see the result of this workflow, you can check it here.


Have feedback or corrections? Please email us at feedback@isb-cgc.org. Follow us on BlueSky and X!