================== Running CWL RNA-seq ================== This Common Workflow Language (`CWL `_) `RNA-seq `_ workflow maps read-pairs to a reference genome and produces a transcript. CWL enables the user to connect command line tools to create workflows; it is a specification and is therefore portable across platforms that support CWL. Requirements: ============= - CWLtool - Docker .. note:: The requirements above are crucial to running this workflow. Please make sure you have them installed properly prior to running this workflow. Download this tutorial: :: $ sudo add-apt-repository universe $ sudo apt update $ sudo apt install subversion #cloning this tutorial $ svn checkout https://github.com/isb-cgc/RunningWorkflows-on-the-GoogleCloud/trunk/CWL-RNAseq To install Docker and CWL, see our `VM Workflow Tools Installation Cheatsheet `_ for instructions. Starting folder **CWL-RNAseq** should look like this: :: . └── CWL-RNAseq ├── create_bam.cwl ├── create_transcript.cwl ├── CWL-RNAseq.cwl ├── CWL-RNAseq.yml ├── data │   ├── sample_1.fq │   ├── sample_2.fq │   ├── sample.fa │   └── sample.gtf ├── hisat2_align.cwl └── index_build.cwl An overview of the main CWL files: - **CWL-RNAseq.cwl** is the main cwl file that connects all other cwl tools and yml file together. - **CWL-RNAseq.yml** is the file that contains all the inputs that are necessary to run the pipeline. - **index_build.cwl** builds index files from a Fasta file, using Hisat2-build. - **hisat2_align.cwl** builds a sam file from forward and reverse reads, and the indices built from previous step, using Hisat2. - **create_bam.cwl** builds a bam file from the newly built sam file, using Samtools. - **create_transcript.cwl** creates transcript from the bam file from previous step, using Stringtie. Let's take a look at some example of the main file **CWL-RNAseq.cwl**: The first block describe what is required to run this workflow, more information on this CWL requirements can be found `here `_. Docker usage is also described in the **hints** section. :: #!/usr/bin/env cwl-runner cwlVersion: v1.0 class: Workflow requirements: SubworkflowFeatureRequirement: {} StepInputExpressionRequirement: {} InlineJavascriptRequirement: {} ShellCommandRequirement: {} hints: DockerRequirement: dockerPull: kathrinklee/rna-seq-pipeline-hisat2 Below are the **inputs**, **outputs**, and **steps** blocks that come after. In **step1** the script **index_build.cwl** will be called, and its inputs (**in**) are taken from **inputs** section, the output (**out**) will be caught by the **outputs** (**step1/ht** and **step1/log** in this case). This declaration is important as it will decide which outputs to keep at the end of the step. :: inputs: fasta_file: File out_name: string outputs: index_files: type: Directory outputSource: step1/ht index_log: type: File outputSource: step1/log steps: step1: run: index_build.cwl in: fasta_file: fasta_file out_name: out_name out: [ht, log] Let's run it by using: :: $ cwltool CWL-RNAseq.cwl CWL-RNAseq.yml If you receive this error: "docker: Got permission denied while trying to connect to the Docker daemon socket at unix" Try: :: $ sudo groupadd docker $ sudo usermod -aG docker ${USER} close and reopen VM then run the script again Let's take a look at the folder after cwltool finishes: :: . └── CWL-RNAseq ├── create_bam.cwl ├── create_transcript.cwl ├── CWL-RNAseq.cwl ├── CWL-RNAseq.yml ├── data │   ├── sample_1.fq │   ├── sample_2.fq │   ├── sample.fa │   └── sample.gtf ├── [final_ref.gtf] ├── [final_transcript.gtf] ├── [final.tsv] ├── hisat2_align.cwl ├── [hisat2_align_out] │   ├── [hisat2_align_out.log] │   └── [sample.sam] ├── [hisat2_build.log] ├── index_build.cwl ├── [sample] │   ├── [index.1.ht2] │   ├── [index.2.ht2] │   ├── [index.3.ht2] │   ├── [index.4.ht2] │   ├── [index.5.ht2] │   ├── [index.6.ht2] │   ├── [index.7.ht2] │   └── [index.8.ht2] └── [sample.bam] The script will call `hisat2 `_ , `samtools `_, and `stringtie `_ to do the work. **sample.sam** file will contains the sequence alignment data produced by mapping reads to the reference genome, **sample.bam** file will contains the compressed binary data from Sam. More description on gtf outputs, and tsv of stringtie can be found `here `_. The **final_transcript.gtf** contains details of the transcripts that StringTie assembles from RNA-Seq data, while **final.tsv** contains gene abundances. To see the result of this workflow, you can check it `here `_.