See here for CyVerse Guide to Launching Atmosphere Instances
echo http://$(hostname -i):8787/
echo 'export PATH=$PATH:/opt/miniconda3/bin' >> ~/.bashrc
source ~/.bashrc
which snakemake
It should show the absolute path of snakemake as '/opt/miniconda3/bin/snakemake'.
which singularity
It should show the absolute path of singularity as '/usr/local/bin/singularity'.
mkdir data
cd data/
curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz
curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz
curl -L https://osf.io/2wvn3/download -o ERR458495.fastq.gz
curl -L https://osf.io/xju4a/download -o ERR458500.fastq.gz
curl -L https://osf.io/nmqe6/download -o ERR458501.fastq.gz
curl -L https://osf.io/qfsze/download -o ERR458502.fastq.gz
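Before moving on, it can be worth confirming that all six downloads arrived. The following is a minimal Python sketch of such a check; missing_downloads and EXPECTED_SAMPLES are hypothetical names chosen for this example, and the check only looks at whether each file exists and is non-empty, not whether its contents are valid.

```python
from pathlib import Path

# Sample names taken from the curl commands above.
EXPECTED_SAMPLES = ["ERR458493", "ERR458494", "ERR458495",
                    "ERR458500", "ERR458501", "ERR458502"]


def missing_downloads(data_dir="."):
    """Return the sample names whose .fastq.gz file is absent or empty."""
    missing = []
    for sample in EXPECTED_SAMPLES:
        path = Path(data_dir) / f"{sample}.fastq.gz"
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(sample)
    return missing


if __name__ == "__main__":
    # Run from inside the data/ directory created above.
    print("Missing or empty:", missing_downloads() or "none")
```

If any sample is listed, rerunning the matching curl command should fetch it again.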
The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. It orchestrates and keeps track of all the different steps of workflows that have been run so you don’t have to! It has a lot of wonderful features that can be invoked for different applications, making it very flexible while maintaining human interpretability.
There are many different tools that researchers use to automate computational workflows. We selected snakemake for the following reasons:
Like other workflow management systems, Snakemake allows you to:
Our goal is to automate the first two steps (FastQC and MultiQC) of our example workflow using snakemake!
Snakemake workflows are built around rules. The diagram below shows the anatomy of a snakemake rule:
Let’s make a rule to run fastqc on one of our samples below. We’ll put this rule in a file called Snakefile.
# This rule will run fastqc on the specified input file.
rule fastqc_raw:
    input: "data/ERR458493.fastq.gz"
    output:
        "fastqc_raw/ERR458493_fastqc.html",
        "fastqc_raw/ERR458493_fastqc.zip"
    shell:'''
    fastqc -o fastqc_raw {input}
    '''
Let’s try and run our Snakefile! Return to the command line and run snakemake.
snakemake
You should see output that starts like this:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1       fastqc_raw
    1

[Tue Jul 2 19:10:26 2019]
rule fastqc_raw:
    input: data/ERR458493.fastq.gz
    output: fastqc_raw/ERR458493_fastqc.html, fastqc_raw/ERR458493_fastqc.zip
    jobid: 0
Let’s check that the output file is there:
ls fastqc_raw/*fastqc*
Yay! Snakemake ran the thing!
We can also keep our results organized. The -o flag tells fastqc to write its results to a separate output folder, fastqc_raw:
# This rule will run fastqc on the specified input file
# (replace the prior fastqc_raw rule with this new rule)
rule fastqc_raw:
    input: "data/ERR458493.fastq.gz"
    output:
        "fastqc_raw/ERR458493_fastqc.html",
        "fastqc_raw/ERR458493_fastqc.zip"
    shell:'''
    fastqc -o fastqc_raw {input}
    '''
If we look in our directory, we should now see a fastqc_raw directory, even though we didn’t create it:
ls
Snakemake created this directory for us. We can look inside it to see if it really ran our command:
ls fastqc_raw
We told snakemake to do something, and it did it. Let’s add another rule to our Snakefile telling snakemake to do something else. This time, we’ll run multiqc.
# Run fastqc on the specified input file
rule fastqc_raw:
    input: "data/ERR458493.fastq.gz"
    output:
        "fastqc_raw/ERR458493_fastqc.html",
        "fastqc_raw/ERR458493_fastqc.zip"
    shell:'''
    fastqc -o fastqc_raw {input}
    '''

# Run multiqc on the results of the fastqc_raw rule
rule multiqc_raw:
    input: "fastqc_raw/ERR458493_fastqc.zip"
    output: "fastqc_raw/multiqc_report.html"
    shell:'''
    multiqc -o fastqc_raw fastqc_raw
    '''
When we run snakemake again, we see output like this:
Building DAG of jobs...
Nothing to be done.
Complete log: /Users/tr/2019_angus/.snakemake/log/2019-07-02T191640.002744.snakemake.log
However, when we look at the output directory fastqc_raw, we see that our multiqc file does not exist! Bad Snakemake! Bad!!
By default, snakemake treats the first rule in the Snakefile as its target, and the outputs of our first rule, fastqc_raw, already exist, so there was nothing to do. By convention, the first rule is called rule all, and its input lists the final file(s) the workflow needs to produce. Once this rule is defined, snakemake goes back through all other rules to find the ordered sequence of rules that will produce all of the files necessary to get the final file(s) specified in rule all. At this point in our workflow, the final file is our multiqc report. Let’s add a rule all.
rule all:
    input:
        "fastqc_raw/multiqc_report.html"

rule fastqc_raw:
    input: "data/ERR458493.fastq.gz"
    output:
        "fastqc_raw/ERR458493_fastqc.html",
        "fastqc_raw/ERR458493_fastqc.zip"
    shell:'''
    fastqc -o fastqc_raw {input}
    '''

rule multiqc_raw:
    input: "fastqc_raw/ERR458493_fastqc.html"
    output: "fastqc_raw/multiqc_report.html"
    shell:'''
    multiqc -o fastqc_raw fastqc_raw
    '''
And it worked! Now we see output like this:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1       all
    1       multiqc_raw
    2
Snakemake now has two processes it’s keeping track of.
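To make the idea of working backwards from rule all concrete, here is a minimal Python sketch of that resolution step. It is not Snakemake's actual implementation: the RULES table and the plan_jobs function are hypothetical names invented for illustration, file existence is simulated with a set rather than checked on disk, and file timestamps are ignored.

```python
# Hypothetical, simplified model of the three rules in our Snakefile.
# Each rule name maps to (input files, output files).
RULES = {
    "all": (["fastqc_raw/multiqc_report.html"], []),
    "fastqc_raw": (
        ["data/ERR458493.fastq.gz"],
        ["fastqc_raw/ERR458493_fastqc.html", "fastqc_raw/ERR458493_fastqc.zip"],
    ),
    "multiqc_raw": (
        ["fastqc_raw/ERR458493_fastqc.html"],
        ["fastqc_raw/multiqc_report.html"],
    ),
}


def plan_jobs(target_rule, existing_files):
    """Return the rule names, in execution order, needed for target_rule.

    existing_files is a set that simulates what is already on disk.
    """
    plan = []

    def visit(rule_name):
        inputs, outputs = RULES[rule_name]
        for path in inputs:
            if path in existing_files:
                continue  # input is already present
            # Find the rule whose output list contains the missing file.
            producer = next(
                (name for name, (_, outs) in RULES.items() if path in outs), None
            )
            if producer is None:
                raise FileNotFoundError(f"no rule produces {path}")
            visit(producer)
        # Schedule a rule if it has no outputs (a pure target rule such as
        # all) or if at least one of its outputs is missing.
        needs_run = not outputs or any(p not in existing_files for p in outputs)
        if needs_run and rule_name not in plan:
            plan.append(rule_name)

    visit(target_rule)
    return plan


# The fastqc outputs already exist from our earlier run.
on_disk = {
    "data/ERR458493.fastq.gz",
    "fastqc_raw/ERR458493_fastqc.html",
    "fastqc_raw/ERR458493_fastqc.zip",
}
print(plan_jobs("all", on_disk))         # ['multiqc_raw', 'all']
print(plan_jobs("fastqc_raw", on_disk))  # []
```

The first call mirrors the job counts above (multiqc_raw and all). The second mirrors the earlier "Nothing to be done." message, when fastqc_raw was the first rule and its outputs were already present.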
So far we’ve been using snakemake to process one sample. However, we have 6! Snakemake can be flexibly extended to more samples using wildcards.
We already saw something very similar to wildcards. When we wrote {input} in the shell command, {input} acted as a placeholder: snakemake replaced it with the value we specified under input.
rule fastqc_raw:
    input: "data/ERR458493.fastq.gz"
    output:
        "fastqc_raw/ERR458493_fastqc.html",
        "fastqc_raw/ERR458493_fastqc.zip"
    shell:'''
    fastqc -o fastqc_raw {input}
    '''
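As a rough illustration of what happens to {input} before the command runs, the substitution behaves much like Python string formatting. This is only a sketch using plain str.format; Snakemake's own formatting is richer (for example, it handles lists of files and named inputs), and the variable names here are made up for the example.

```python
# The shell template from the fastqc_raw rule.
shell_template = "fastqc -o fastqc_raw {input}"

# The value we gave under `input:` in the rule.
rule_input = "data/ERR458493.fastq.gz"

# Fill the placeholder, roughly as snakemake does before calling the shell.
command = shell_template.format(input=rule_input)
print(command)  # fastqc -o fastqc_raw data/ERR458493.fastq.gz
```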
We can create our own wildcard too. This is really handy for running our workflow on all of our samples.
# Create a list of strings containing all of our sample names
SAMPLES=['ERR458493', 'ERR458494', 'ERR458495', 'ERR458500', 'ERR458501',
         'ERR458502']

rule all:
    input:
        "fastqc_raw/multiqc_report.html"

rule fastqc_raw:
    input: "data/{sample}.fastq.gz"
    output:
        "fastqc_raw/{sample}_fastqc.html",
        "fastqc_raw/{sample}_fastqc.zip"
    shell:'''
    fastqc -o fastqc_raw {input}
    '''

rule multiqc_raw:
    input: expand("fastqc_raw/{sample}_fastqc.html", sample = SAMPLES)
    output: "fastqc_raw/multiqc_report.html"
    shell:'''
    multiqc -o fastqc_raw fastqc_raw
    '''
We can run this again at the terminal.
snakemake
And we have now run these rules for each of our samples!
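One way to picture how snakemake works out the value of {sample} is pattern matching: a requested file name is compared against a rule's output pattern, and whatever text fills the {sample} slot becomes the wildcard value. The sketch below imitates this with a regular expression. match_wildcards is a hypothetical helper, not part of Snakemake, and it handles only simple patterns in which each {name} appears once.

```python
import re


def match_wildcards(pattern, filename):
    """Return a dict of wildcard values if filename fits pattern, else None."""
    # Split on {name} placeholders; odd-indexed parts are wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)    # literal text must match exactly
        else:
            regex += f"(?P<{part}>.+)"  # wildcard slot captures any text
    match = re.fullmatch(regex, filename)
    return match.groupdict() if match else None


output_pattern = "fastqc_raw/{sample}_fastqc.html"
wildcards = match_wildcards(output_pattern, "fastqc_raw/ERR458494_fastqc.html")
print(wildcards)  # {'sample': 'ERR458494'}

# The same value is then substituted into the rule's input pattern.
print("data/{sample}.fastq.gz".format(**wildcards))  # data/ERR458494.fastq.gz
```

A file that does not fit the pattern, such as fastqc_raw/multiqc_report.html, returns None here, which corresponds to the fastqc_raw rule not being a candidate for producing that file.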
Note that we added new syntax here as well. We define a variable at the top of the Snakefile called SAMPLES. Snakemake resolves the values of the wildcard {sample} at the last rule in which it sees that wildcard. There, we need to expand the wildcard using the expand function and tell snakemake in which variable to look for the values.
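For intuition, expand behaves roughly like filling a pattern once for every value in a list. The following is a simplified stand-in written in plain Python, not Snakemake's implementation; expand_sketch is a hypothetical name, and it assumes every keyword argument is a list of values.

```python
from itertools import product

SAMPLES = ['ERR458493', 'ERR458494', 'ERR458495',
           'ERR458500', 'ERR458501', 'ERR458502']


def expand_sketch(pattern, **wildcards):
    """Fill pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


files = expand_sketch("fastqc_raw/{sample}_fastqc.html", sample=SAMPLES)
print(len(files))  # 6
print(files[0])    # fastqc_raw/ERR458493_fastqc.html
```

The resulting list of six file names is what the multiqc_raw rule receives as its input, which in turn is why snakemake schedules one fastqc_raw job per sample.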
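A few useful snakemake options:
Dry run (show what would be executed without running anything):
snakemake -n
Print the shell commands as they are executed:
snakemake -p
Print the reason each rule is executed:
snakemake -r
Run on n cores:
snakemake --cores n
Submit jobs to a cluster (here via sbatch, reading settings from cluster.yml):
snakemake --cluster-config cluster.yml --cluster \
"sbatch -A {cluster.account} -t {cluster.time}"
Draw the workflow as a directed acyclic graph (DAG):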
snakemake --dag | dot -Tpng > dag.png
The DAG png file should look something like the figure shown above.
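The text that snakemake --dag sends to dot is a graph description in the DOT language. As a rough picture of its shape for our workflow, the Python sketch below writes a comparable graph by hand: one fastqc_raw job per sample feeding multiqc_raw, which feeds all. workflow_dot is a hypothetical helper, and the real snakemake output differs in detail (it uses numeric node ids and adds styling).

```python
SAMPLES = ['ERR458493', 'ERR458494', 'ERR458495',
           'ERR458500', 'ERR458501', 'ERR458502']


def workflow_dot(samples):
    """Build DOT text: one fastqc_raw job per sample -> multiqc_raw -> all."""
    lines = ["digraph workflow {"]
    for sample in samples:
        # The \n inside the label is interpreted by dot as a line break.
        lines.append(f'    "fastqc_raw\\nsample: {sample}" -> "multiqc_raw";')
    lines.append('    "multiqc_raw" -> "all";')
    lines.append("}")
    return "\n".join(lines)


print(workflow_dot(SAMPLES))
```

Saving this output to a file and passing it to dot -Tpng would render a similar picture: six fastqc_raw nodes converging on multiqc_raw, then all.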
Snakemake can automatically generate detailed self-contained HTML reports that encompass runtime statistics, provenance information, workflow topology and results.
To create the report, run
snakemake --report report.html
# Create a list of strings containing all of our sample names
SAMPLES=['ERR458493', 'ERR458494', 'ERR458495', 'ERR458500', 'ERR458501',
         'ERR458502']

rule all:
    input:
        "fastqc_raw/multiqc_report.html"

rule fastqc_raw:
    input: "data/{sample}.fastq.gz"
    output:
        "fastqc_raw/{sample}_fastqc.html",
        "fastqc_raw/{sample}_fastqc.zip"
    singularity:
        "docker://sateeshperi/fastqc"
    shell:'''
    fastqc -o fastqc_raw {input}
    '''

rule multiqc_raw:
    input: expand("fastqc_raw/{sample}_fastqc.html", sample = SAMPLES)
    output: "fastqc_raw/multiqc_report.html"
    singularity:
        "docker://sateeshperi/multiqc"
    shell:'''
    multiqc -o fastqc_raw fastqc_raw
    '''
Save the file as Snakefile and execute snakemake in your terminal with:
snakemake --use-singularity
With this flag, snakemake will execute each job within a Singularity container that is spawned from the given image. Allowed image URLs include everything supported by Singularity (e.g., shub:// and docker://).
You can specify software on a per-rule basis! This is really helpful when you have incompatible software requirements for different rules, or want to run on a cluster, or want to make your workflow reproducible.
For example, if you create a file env_fastqc.yml with the following content:
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - fastqc==0.11.8
and then change the fastqc rule to look like this:
rule fastqc_raw:
    input: "data/{sample}.fastq.gz"
    output:
        "fastqc_raw/{sample}_fastqc.html",
        "fastqc_raw/{sample}_fastqc.zip"
    conda:
        "env_fastqc.yml"
    shell:'''
    fastqc -o fastqc_raw {input}
    '''
you can now run snakemake like so,
snakemake --use-conda
and for that rule, snakemake will install the specified software and its dependencies in their own environment, with the specified version.
This aids in reproducibility, in addition to the practical advantages of isolating software installs from each other.
Note: It is advisable to delete your instance if you are not planning to use it in the future, to save valuable resources. However, if you want to use it again later, you can suspend it instead. See Instance Maintenance for more info.
Snakemake2019 v1.0 Atmosphere Image Specifications