Running CWL RNA-seq¶
This Common Workflow Language (CWL) RNA-seq workflow maps read-pairs to a reference genome and produces a transcript.
CWL enables the user to connect command line tools to create workflows; it is a specification and is therefore portable across platforms that support CWL.
Requirements:¶
CWLtool
Docker
Note
The requirements above are crucial to running this workflow. Please make sure you have them installed properly prior to running this workflow.
Download this tutorial:
$ sudo add-apt-repository universe
$ sudo apt update
$ sudo apt install subversion
#cloning this tutorial
$ svn checkout https://github.com/isb-cgc/RunningWorkflows-on-the-GoogleCloud/trunk/CWL-RNAseq
To install Docker and CWL, see our VM Workflow Tools Installation Cheatsheet for instructions.
Starting folder CWL-RNAseq should look like this:
.
└── CWL-RNAseq
├── create_bam.cwl
├── create_transcript.cwl
├── CWL-RNAseq.cwl
├── CWL-RNAseq.yml
├── data
│ ├── sample_1.fq
│ ├── sample_2.fq
│ ├── sample.fa
│ └── sample.gtf
├── hisat2_align.cwl
└── index_build.cwl
An overview of the main CWL files:
CWL-RNAseq.cwl is the main cwl file that connects all other cwl tools and yml file together.
CWL-RNAseq.yml is the file that contains all the inputs that are necessary to run the pipeline.
index_build.cwl builds index files from a Fasta file, using Hisat2-build.
hisat2_align.cwl builds a sam file from forward and reverse reads, and the indices built from previous step, using Hisat2.
create_bam.cwl builds a bam file from the newly built sam file, using Samtools.
create_transcript.cwl creates transcript from the bam file from previous step, using Stringtie.
Let’s take a look at some example of the main file CWL-RNAseq.cwl: The first block describe what is required to run this workflow, more information on this CWL requirements can be found here. Docker usage is also described in the hints section.
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow
requirements:
SubworkflowFeatureRequirement: {}
StepInputExpressionRequirement: {}
InlineJavascriptRequirement: {}
ShellCommandRequirement: {}
hints:
DockerRequirement:
dockerPull: kathrinklee/rna-seq-pipeline-hisat2
Below are the inputs, outputs, and steps blocks that come after. In step1 the script index_build.cwl will be called, and its inputs (in) are taken from inputs section, the output (out) will be caught by the outputs (step1/ht and step1/log in this case). This declaration is important as it will decide which outputs to keep at the end of the step.
inputs:
fasta_file: File
out_name: string
outputs:
index_files:
type: Directory
outputSource: step1/ht
index_log:
type: File
outputSource: step1/log
steps:
step1:
run: index_build.cwl
in:
fasta_file: fasta_file
out_name: out_name
out:
[ht, log]
Let’s run it by using:
$ cwltool CWL-RNAseq.cwl CWL-RNAseq.yml
If you receive this error: “docker: Got permission denied while trying to connect to the Docker daemon socket at unix”
Try:
$ sudo groupadd docker
$ sudo usermod -aG docker ${USER}
close and reopen VM then run the script again
Let’s take a look at the folder after cwltool finishes:
.
└── CWL-RNAseq
├── create_bam.cwl
├── create_transcript.cwl
├── CWL-RNAseq.cwl
├── CWL-RNAseq.yml
├── data
│ ├── sample_1.fq
│ ├── sample_2.fq
│ ├── sample.fa
│ └── sample.gtf
├── [final_ref.gtf]
├── [final_transcript.gtf]
├── [final.tsv]
├── hisat2_align.cwl
├── [hisat2_align_out]
│ ├── [hisat2_align_out.log]
│ └── [sample.sam]
├── [hisat2_build.log]
├── index_build.cwl
├── [sample]
│ ├── [index.1.ht2]
│ ├── [index.2.ht2]
│ ├── [index.3.ht2]
│ ├── [index.4.ht2]
│ ├── [index.5.ht2]
│ ├── [index.6.ht2]
│ ├── [index.7.ht2]
│ └── [index.8.ht2]
└── [sample.bam]
The script will call hisat2 , samtools, and stringtie to do the work. sample.sam file will contains the sequence alignment data produced by mapping reads to the reference genome, sample.bam file will contains the compressed binary data from Sam. More description on gtf outputs, and tsv of stringtie can be found here. The final_transcript.gtf contains details of the transcripts that StringTie assembles from RNA-Seq data, while final.tsv contains gene abundances.
To see the result of this workflow, you can check it here.