Under the Hood¶
elvers runs snakemake rules in workflows that are highly customizable. The user chooses a tool or workflow, and elvers aggregates all of the default parameters for that tool or workflow, builds end-stage target files, and passes this information into the Snakefile. Snakemake then builds a DAG of the workflow and runs it automatically.
Program Rules¶
The unit at the heart of all workflows is a standalone snakemake rule for a program function. Rules are all placed within subdirectories of the rules directory. For example, the trinity rule is found at rules/trinity/trinity.rule. Utility rules are instead found at, for example, rules/utils/get_data.rule.
Each rule needs several files:
- rulename.rule: the snakemake rule(s) for this program.
- rulename-env.yaml: a conda environment file for this program. This is passed into the conda snakemake directive in the rule.
- (optional) rulename-wrapper.py or rulename-script.R: a script that is used to more easily and reproducibly run the program.
- rulename_params.yaml: a parameter file to direct program output and set default program parameters.
The last file (_params) provides program defaults that can be overridden by user input from the main config file (via a nested dictionary update). The parameter names in the config file do need to match exactly, however. The --print_params and --build_config options in elvers are designed to facilitate this.
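To make the "nested dictionary update" concrete, here is a minimal sketch of how defaults and user overrides could be merged. The helper name update_nested and the example dictionaries are our own illustration, not the code elvers actually uses.

```python
import copy
from collections.abc import Mapping


def update_nested(defaults, overrides):
    """Return a copy of `defaults` with `overrides` merged in recursively.

    Hypothetical helper for illustration; not taken from the elvers source.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            # Both sides are dictionaries: descend instead of replacing.
            merged[key] = update_nested(merged[key], value)
        else:
            # Leaf value (or a new key): the override wins.
            merged[key] = copy.deepcopy(value)
    return merged


# Defaults as they might look after loading a rule's _params file.
defaults = {
    "trinity": {
        "program_params": {
            "input_kmer_trimmed": True,
            "max_memory": "30G",
            "seqtype": "fq",
            "extra": "",
        }
    }
}

# A user config that overrides a single value.
user_config = {"trinity": {"program_params": {"max_memory": "60G"}}}

merged = update_nested(defaults, user_config)
print(merged["trinity"]["program_params"]["max_memory"])  # 60G
print(merged["trinity"]["program_params"]["seqtype"])     # fq
```

In a merge like this, a misspelled key (say, max_mem) is added as a new entry and leaves the default untouched, which is why exact names matter.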
Rule Params¶
The rule params files (above) have two components: elvers_params and program_params. program_params are exposed to the user when displaying parameters or building config files. These are relatively safe to modify, and most get passed into the program rule itself. On the other hand, elvers_params are used internally to build the names of output files when they need to be passed to the Snakefile as "targets". These are never exposed to the user and should be set just once when a rule is built.
For example, let's look at the trinity.yaml file:
trinity:
  elvers_params:
    outdir: assembly
    extensions:
      assembly_extensions: # use this extension only for all output of this assembler
        - _trinity
      base:
        - .fasta
        - .fasta.gene_trans_map
  program_params:
    # input kmer-trimmed reads
    input_kmer_trimmed: True
    # input trimmed-reads
    input_trimmomatic_trimmed: False
    # do we want to assemble the single reads with pe reads?
    add_single_to_paired: False
    max_memory: 30G
    seqtype: fq
    extra: ''
The program_params allow the user to choose quality-trimmed or kmer-trimmed reads as input to trinity, and to pass parameters for memory and sequence type. Wherever possible, program params also include an extra parameter that enables the user to pass any number of parameters directly to the command line of that program.
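As a rough illustration of how an extra string can end up on a command line, here is a small sketch. The function build_command is hypothetical, and the real trinity rule may assemble its command quite differently. The flags --seqType, --max_memory and --min_contig_length are Trinity command-line options, but their use here is only an example.

```python
import shlex


def build_command(program_params):
    """Assemble an argument list from program_params (illustrative only)."""
    cmd = [
        "Trinity",
        "--seqType", program_params["seqtype"],
        "--max_memory", program_params["max_memory"],
    ]
    # `extra` is free-form text, so split it the way a shell would
    # and append the pieces unchanged.
    cmd += shlex.split(program_params.get("extra", ""))
    return cmd


params = {"seqtype": "fq", "max_memory": "30G", "extra": "--min_contig_length 300"}
print(build_command(params))
# ['Trinity', '--seqType', 'fq', '--max_memory', '30G', '--min_contig_length', '300']
```

With the default extra: '' nothing is appended, so the command is built from the named parameters alone.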
The elvers_params specify that trinity produces two files, with the assembly_extension '_trinity'. These will be BASENAME_trinity.fasta and BASENAME_trinity.fasta.gene_trans_map. The assembly_extension allows us to have downstream steps that work on different assemblies, but trinity is an assembler, and should only ever produce an assembly with the extension _trinity. Assembly extensions are used downstream as well. For example, paladin only works on protein assemblies, and thus will only map to assemblies with the extension _plass.
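The way these file names come together can be sketched in a few lines. This is not the code in elvers. The function build_targets is hypothetical, and we assume here that outdir is simply prepended to each name.

```python
from itertools import product


def build_targets(basename, elvers_params):
    """Combine basename, assembly extension and base extension into file names."""
    outdir = elvers_params["outdir"]
    extensions = elvers_params["extensions"]
    # Rules without an assembly extension fall back to an empty string.
    assembly_exts = extensions.get("assembly_extensions", [""])
    return [
        f"{outdir}/{basename}{assembly_ext}{base_ext}"
        for assembly_ext, base_ext in product(assembly_exts, extensions["base"])
    ]


trinity_elvers_params = {
    "outdir": "assembly",
    "extensions": {
        "assembly_extensions": ["_trinity"],
        "base": [".fasta", ".fasta.gene_trans_map"],
    },
}

print(build_targets("BASENAME", trinity_elvers_params))
# ['assembly/BASENAME_trinity.fasta', 'assembly/BASENAME_trinity.fasta.gene_trans_map']
```

A downstream rule that lists several assembly extensions would get one set of targets per assembly from the same loop.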
Assembly extensions are the most important of this type of parameter, but the differential expression steps also pass in an unusual parameter of their own: contrast. Target files are generated in ep_utils/generate_all_targets.py. Have a look if you'd like to see how we build targets to pass into the Snakefile, and look for the handling of assembly_extensions and contrasts.
Workflows¶
Rules are aggregated into Subworkflows and Workflows that include all the rules necessary to run a larger analysis. For example, the default eel pond protocol conducts de novo transcriptome assembly, annotation, and quick differential expression analysis.
For now, workflows are specified in ep_utils/pipeline_defaults.yaml. Each workflow (or subworkflow, a smaller subset of a workflow) needs two pieces of information:
- include: rules to include
- targets: the endpoints of workflows that are not used to generate additional files. These are the "targets" we build and pass to the Snakefile.
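To show how these two pieces of information might be used together, here is a small sketch. The workflow name assemble, its rule list, and the function select_workflow are all hypothetical, and the check that every target is also an included rule is our assumption rather than documented elvers behavior.

```python
# A hypothetical workflow entry, written as the dictionary that a
# YAML file like pipeline_defaults.yaml would load into.
workflows = {
    "assemble": {
        "include": ["get_data", "trimmomatic", "khmer", "trinity"],
        "targets": ["trinity"],
    }
}


def select_workflow(name, workflows):
    """Return (rules to include, target rules) for one workflow."""
    workflow = workflows[name]
    include = list(workflow["include"])
    targets = list(workflow["targets"])
    # Assumption: a target can only be built if its rule is included.
    missing = [rule for rule in targets if rule not in include]
    if missing:
        raise ValueError(f"targets not in include: {missing}")
    return include, targets


include, targets = select_workflow("assemble", workflows)
print(include)  # ['get_data', 'trimmomatic', 'khmer', 'trinity']
print(targets)  # ['trinity']
```

Only the target rules need their output file names built; the other included rules run because snakemake works backwards from those targets through the DAG.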