Understanding and Configuring Eelpond Workflows¶
elvers is designed to facilitate running standard workflows and analyses for sequence data. It integrates snakemake rules for commonly used tools, and provides several end-to-end protocols for analyzing RNAseq data. Workflows are highly customizable, and all command-line options for each tool are available for modification via the configuration file.
elvers uses the yaml format (YAML, originally short for "Yet Another Markup Language") to specify data inputs and modify run parameters. The only required information for this file is the location of the data inputs (reads, assembly, or both).
Specifying Input Data¶
Let's start with the input data. There are two types of data that can go into elvers: read data (gzipped fastq files) and assembly data (fasta files).
Read Input¶
To specify input data, we need to build a tab-separated samples file, e.g. my-samples.tsv. This file gives elvers a name for each sample, and provides a location for the fastq files (local file path, or downloadable link).
If we had our test data within the data folder in the main elvers directory, the file would look like this:
sample unit fq1 fq2 condition
0Hour 001 data/0Hour_ATCACG_L002_R1_001.extract.fastq.gz data/0Hour_ATCACG_L002_R2_001.extract.fastq.gz time0
0Hour 002 data/0Hour_ATCACG_L002_R1_002.extract.fastq.gz data/0Hour_ATCACG_L002_R2_002.extract.fastq.gz time0
6Hour 001 data/6Hour_CGATGT_L002_R1_001.extract.fastq.gz data/6Hour_CGATGT_L002_R2_001.extract.fastq.gz time6
6Hour 002 data/6Hour_CGATGT_L002_R1_002.extract.fastq.gz data/6Hour_CGATGT_L002_R2_002.extract.fastq.gz time6
If we want to download the test data instead:
sample unit fq1 fq2 condition
0Hour 001 https://osf.io/vw4dt/download https://osf.io/b47s2/download time0
0Hour 002 https://osf.io/92jr6/download https://osf.io/qzea8/download time0
6Hour 001 https://osf.io/ft3am/download https://osf.io/jqmsx/download time6
6Hour 002 https://osf.io/rnkq3/download https://osf.io/bt9vh/download time6
Note that proper formatting means all columns must be separated by tabs, not spaces!
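One way to avoid mixing up tabs and spaces is to generate the samples file from a script instead of typing it by hand. The following is a minimal sketch using Python's standard csv module. The helper name format_samples_tsv is hypothetical (it is not part of elvers), and the rows repeat two of the local-path entries from the example above.

```python
import csv
import io

# Column headers copied from the example samples file above.
HEADERS = ["sample", "unit", "fq1", "fq2", "condition"]


def format_samples_tsv(rows):
    """Return the text of a tab-separated samples file for the given rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADERS)
    for row in rows:
        if len(row) != len(HEADERS):
            raise ValueError(f"expected {len(HEADERS)} columns, got {len(row)}: {row}")
        writer.writerow(row)
    return buffer.getvalue()


rows = [
    ["0Hour", "001",
     "data/0Hour_ATCACG_L002_R1_001.extract.fastq.gz",
     "data/0Hour_ATCACG_L002_R2_001.extract.fastq.gz", "time0"],
    ["6Hour", "001",
     "data/6Hour_CGATGT_L002_R1_001.extract.fastq.gz",
     "data/6Hour_CGATGT_L002_R2_001.extract.fastq.gz", "time6"],
]

samples_text = format_samples_tsv(rows)
print(samples_text, end="")
```

To save the result, write samples_text to a file such as my-samples.tsv with open("my-samples.tsv", "w").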
Now, you need to provide the name and location of this file to elvers. We do this in the config.yaml file, like so:
get_data:
  samples: /path/to/my-samples.tsv
The default functionality is to link data from another location on your computer into elvers's output directory. However, if you want to download data instead, you'll need to provide links in the my-samples.tsv file, and add a few lines to your config.yaml:
get_data:
  samples: /path/to/my-samples.tsv
  download_data: True
  use_ftp: False # set to true if you want to use FTP instead of HTTP
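Because the download settings have to agree with what is in the samples file, a quick consistency check can help. The sketch below is only an illustration: the function suggest_get_data_settings is hypothetical, and the rule "suggest use_ftp only when every remote entry is an ftp:// link" is an assumption of this example, not documented elvers behavior.

```python
from urllib.parse import urlparse

# URL schemes treated as "needs downloading" in this sketch.
REMOTE_SCHEMES = {"http", "https", "ftp"}


def suggest_get_data_settings(fq_entries):
    """Suggest get_data values from the fq1/fq2 entries of a samples file."""
    schemes = {urlparse(entry).scheme for entry in fq_entries if entry}
    remote = schemes & REMOTE_SCHEMES
    return {
        "download_data": bool(remote),   # any link means data must be downloaded
        "use_ftp": remote == {"ftp"},    # assumption: only when all links are ftp://
    }


links = ["https://osf.io/vw4dt/download", "https://osf.io/b47s2/download"]
local = ["data/0Hour_ATCACG_L002_R1_001.extract.fastq.gz"]

link_settings = suggest_get_data_settings(links)
local_settings = suggest_get_data_settings(local)
print(link_settings)
print(local_settings)
```

The returned values correspond to the download_data and use_ftp keys under get_data in the config file.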
About the TSV and Input Reads:
- The sample names provided here are used to name file outputs throughout the workflow.
- The "unit" column is designed to facilitate combining samples that were sequenced over multiple lanes or in different batches. If you do not have "unit" information, please add a short placeholder, such as "a".
- For now, the column headers must be: sample, unit, fq1, fq2, condition. Additional headers are not a problem, but will not be used.
- If you have single-end samples, please be sure to include the fq2 column, but leave the column blank.
- The condition column is used for differential expression comparisons in DESeq2. The "condition" values can be anything, but they need to correspond with the "contrast" information you pass into DESeq2. If using the diffexp workflow, see the diffexp documentation and be sure to add the differential expression information into your yaml configuration file.
- Formatting the tsv can be a bit annoying. It's slightly easier if you start from a working version by copying the sample data to a new file (see below).
- At the moment, elvers assumes all input data is gzipped, so please input gzipped data!
If you'd like to start from a working version, copy (and then modify) the sample data:
cp examples/nema.samples.tsv my_samples.tsv
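The formatting rules listed above can be checked before starting a run. This is a rough sketch, not an elvers feature: check_samples_tsv is a hypothetical helper, and testing for a ".gz" file extension on local paths is only a heuristic for "the data is gzipped" (download links are skipped because their URLs need not end in ".gz").

```python
import csv
import io

REQUIRED_HEADERS = ["sample", "unit", "fq1", "fq2", "condition"]


def check_samples_tsv(text):
    """Return a list of human-readable problems found in a samples file."""
    problems = []
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = [h for h in REQUIRED_HEADERS if h not in (reader.fieldnames or [])]
    if missing:
        # A header line separated by spaces shows up as one long column name.
        problems.append(f"missing headers (are columns tab-separated?): {missing}")
        return problems
    for line_number, row in enumerate(reader, start=2):
        sample, unit = row["sample"] or "", row["unit"] or ""
        fq1, fq2 = row["fq1"] or "", row["fq2"] or ""
        if not sample:
            problems.append(f"line {line_number}: empty sample name")
        if not unit:
            problems.append(f"line {line_number}: empty unit (use a placeholder such as 'a')")
        if not fq1:
            problems.append(f"line {line_number}: fq1 is required")
        # fq2 may be blank for single-end samples, so only non-empty entries are checked.
        for column, entry in (("fq1", fq1), ("fq2", fq2)):
            if entry and "://" not in entry and not entry.endswith(".gz"):
                problems.append(f"line {line_number}: {column} does not look gzipped: {entry}")
    return problems


good = "sample\tunit\tfq1\tfq2\tcondition\n0Hour\t001\tdata/r1.fastq.gz\t\ttime0\n"
bad = "sample unit fq1 fq2 condition\n0Hour 001 data/r1.fastq.gz  time0\n"
print(check_samples_tsv(good))
print(check_samples_tsv(bad))
```

To check a real file, pass its contents, e.g. check_samples_tsv(open("my_samples.tsv").read()).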
To run any read-based workflow, we need to get reads (and any assemblies already generated) into the right format for elvers. We can either (1) link the data from another location on your machine, or (2) download the data via http or ftp. This is done through a utility rule called get_data.
Reference Input¶
If you're starting a new elvers run using a reference file (even if you have previously built a de novo assembly via elvers), you need to help elvers find that assembly.
Scenario 1: You're starting from your own reference file:
You need to provide the reference in your config.yaml file:
get_reference:
  reference:  # REQUIRED: input reference fasta
  gene_trans_map:  # OPTIONAL: provide a gene to transcript map for input transcriptome
  reference_extension: '_input'  # OPTIONAL, changes naming
  download_ref:  # the reference entry above is a link that needs to be downloaded
  use_ftp:  # download via ftp instead of http
Once you add this to your configfile (e.g. my_config.yaml), you can run a reference/assembly-based workflow such as annotate.
elvers my_config.yaml annotate
For more details on reference specification, see the get_reference documentation. For annotation configuration, see below.
Scenario 2: You've previously run an assembly program via elvers:
You can provide the built reference in the same manner as above. However, if you're running more workflows in the same directory, you can also just specify the name of the assembly program that you used to generate the assembly. This will not rerun the assembly (unless you provide new input files). Instead, it tells elvers where to look for your previously-generated reference file. Because we enable multiple reference generation programs, we don't want to assume which reference you'd like to use for downstream steps. In fact, if you provide multiple references, elvers will run the downstream steps on all of them, assuming you provide unique reference_extension parameters so that the references are uniquely named.
Example: You've previously run the trinity assembly, and want to annotate it.
elvers examples/nema.yaml assemble annotate
Here, the assemble workflow just enables elvers to locate your assembly file for annotate.
Choosing and running a workflow¶
We offer a number of workflows, including end-to-end workflows that conduct assembly through differential expression analysis. The default workflow is the Eel Pond RNAseq workflow, which conducts de novo transcriptome assembly, annotation, and quick differential expression analysis on a set of short-read Illumina data using a single command.
Available Workflows¶
Currently, all workflows require a properly-formatted read inputs tsv file as input. Some workflows, e.g. annotation, can work either on a de novo transcriptome generated by elvers, or on previously-generated assemblies. To add an assembly as input, specify it via get_reference in the yaml config file, as described above.
workflows
- preprocess: Read Quality Trimming and Filtering (fastqc, trimmomatic)
- kmer_trim: Kmer Trimming and/or Digital Normalization (khmer)
- assemble: Transcriptome Assembly (trinity)
- get_reference: Specify assembly for downstream steps
- annotate: Annotate the transcriptome (dammit)
- sourmash_compute: Build sourmash signatures for the reads and assembly (sourmash)
- quantify: Quantify transcripts (salmon)
- diffexp: Conduct differential expression (DESeq2)
- plass_assemble: assemble at the protein level with PLASS
- paladin_map: map to a protein assembly using paladin
end-to-end workflows:
- default: preprocess, kmer_trim, assemble, annotate, quantify
- protein assembly: preprocess, kmer_trim, plass_assemble, paladin_map
You can see the available workflows (and which programs they run) by using the --print_workflows flag:
elvers examples/nema.yaml --print_workflows
Each included tool can also be run independently, if appropriate input files are provided. This is not always intuitive, so please see our documentation for running each tool for details (described as "Advanced Usage"). To see all available tools, run:
elvers examples/nema.yaml --print_rules
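If you drive elvers from a script, the command lines shown so far all follow the same pattern: the program name, a config file, then any workflow names or a flag. The sketch below only assembles and prints those commands. The helper elvers_command is hypothetical, and nothing is executed, so it works without elvers installed.

```python
import shlex


def elvers_command(config, *targets, flag=None):
    """Build an elvers command line as a list of arguments."""
    command = ["elvers", config, *targets]
    if flag is not None:
        command.append(flag)
    return command


commands = [
    elvers_command("examples/nema.yaml", flag="--print_workflows"),
    elvers_command("examples/nema.yaml", flag="--print_rules"),
    elvers_command("examples/nema.yaml", "assemble", "annotate"),
]

for command in commands:
    # shlex.join quotes arguments safely for display in a shell.
    print(shlex.join(command))
```

With elvers installed, any of these lists could be passed to subprocess.run(command, check=True) to actually run it.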
Configuring Parameters for a workflow¶
For any workflow, we need to provide a configuration file that specifies the path to your samples file or get_reference (discussed above).
We can generate this file either by:
- Adding just the desired parameters
- Allowing elvers to build a (long) full configfile for us, and modifying as desired.
Option 1: Adding just the required parameters
The configuration file primarily provides the location of the input data and/or input assembly. The simplest config file contains just this information.
get_data:
  samples: samples.tsv
or
get_reference:
  reference: assembly.fasta
There are a few other options we can add to customize the name of the output directory and files.
- basename: NAME : helps determine file names and output directory (by default: BASENAME_out)
- experiment: EXPERIMENT : some additional "experiment" info to add to the output directory name (outdir: BASENAME_EXPERIMENT_out)
- out_path: /full/path : if you want to redirect the output to some location not under the elvers directory.
- Finally, to specify a specific set of workflows or tools to use, add workflows to your yaml file. In this case, we just want to run fastqc and trimmomatic:
workflows:
- fastqc
- trimmomatic
Now, if you'd like to run any particular program with non-default parameters, or you're running differential expression analysis, you'll need to add some info to the config. For each program, use this format to see its parameters:
elvers config.yaml <PROGRAM> --print_params
For example, for deseq2:
elvers config.yaml deseq2 --print_params
Then copy and paste the parameters that show up in your terminal into your config file. Please see each program's documentation for additional info on what (and how) to modify each program. For example, if running an assembly, we definitely recommend modifying the max_memory parameter of Trinity.
Option 2: Use elvers to add all parameters for our workflow of choice
To get a modifiable configfile for the default "eel pond" workflow, run:
elvers my_workflow.yaml --build_config
The output should be a yaml configfile. At the top, you should see:
#
# elvers workflow configuration
#
basename: elvers
experiment: _experiment1
First, change the samples.tsv entry (under get_data) to point to your my_samples.tsv file. Then, modify the basename and any experiment info you'd like to add. The default output directory will be: basename_experiment_out within the main elvers directory. If you'd like the output to go somewhere other than the elvers directory, you can add one more parameter to the top section: out_path: OUTPUT_PATH. Note that the basename and experiment are still used to determine the output directory name.
Customizing program parameters:
Below this section, you should see some parameters for each program run in this workflow. For example, here are the first few programs: a utility to download or link your data, quality trimming with trimmomatic, and quality assessment with fastqc. Program parameters do not always show up in order in this file, and their order in this file does not affect program run order.
get_data:
  samples: samples.tsv
  download_data: false
  use_ftp: false
trimmomatic:
  adapter_file:
    pe_path: ep_utils/TruSeq3-PE-2.fa
    se_path: ep_utils/TruSeq3-SE.fa
  trim_cmd: ILLUMINACLIP:{}:2:40:15 LEADING:2 TRAILING:2 SLIDINGWINDOW:4:15 MINLEN:25
  extra: ''
fastqc:
  extra: ''
Override default params for any program by modifying the values of any of the parameters under that program name. For example, if you'd like to download data instead of linking it from another location on your machine, change download_data to True under the get_data program. We provide an extra parameter wherever possible to give you access to additional command-line parameters for each program that we have not specifically enabled changes for.
For example, under the trimmomatic section, you can modify the "extra" param to pass any extra trimmomatic parameters, e.g.:
trimmomatic:
  extra: 'HEADCROP:5' # to remove the first 5 bases at the front of the read.
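The idea that values you set replace the defaults, while anything you leave out keeps its default value, can be pictured as a recursive dictionary merge. The sketch below is an illustration only: override_params is a hypothetical function, and elvers' own merging logic may differ. The default values are copied from the config excerpt above.

```python
import copy


def override_params(defaults, overrides):
    """Return a copy of defaults with values from overrides applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend into nested sections so sibling keys keep their defaults.
            merged[key] = override_params(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


defaults = {
    "get_data": {"samples": "samples.tsv", "download_data": False, "use_ftp": False},
    "trimmomatic": {
        "trim_cmd": "ILLUMINACLIP:{}:2:40:15 LEADING:2 TRAILING:2 SLIDINGWINDOW:4:15 MINLEN:25",
        "extra": "",
    },
}

user_config = {
    "get_data": {"download_data": True},
    "trimmomatic": {"extra": "HEADCROP:5"},
}

merged = override_params(defaults, user_config)
print(merged["get_data"])
print(merged["trimmomatic"]["extra"])
```

Here only download_data and extra change. The samples path, use_ftp, and trim_cmd keep their default values, and the defaults dictionary itself is left unmodified.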
For more on what parameters are available, see the docs for each specific program or utility rule under the "Available Workflows" --> "Programs Used" navigation tab.