# Annotating de novo transcriptomes with dammit
dammit is an annotation pipeline written by Camille Scott. dammit runs a relatively standard annotation protocol for transcriptomes: it begins by building gene models with TransDecoder, and then uses the following databases as evidence for annotation:
- Swiss-Prot (manually reviewed and curated)
- Pfam-A
- Rfam
- OrthoDB
- UniRef90 (optional, enabled with `--full`)
- nr (optional, enabled with `--nr`)
If a protein dataset is available, this can also be supplied to the `dammit` pipeline with `--user-databases` as optional evidence for annotation.
In addition, BUSCO v3 is run, which will compare the gene content in your transcriptome with a lineage-specific dataset. The output is the proportion of your transcriptome that matches the dataset, which can be used as an estimate of the completeness of your transcriptome based on evolutionary expectation (Simão et al. 2015). There are several lineage-specific datasets available from the authors of BUSCO. We will use the `metazoa` dataset for this transcriptome.
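For reference, BUSCO reports these proportions in a compact notation in its `short_summary` output file. A rough illustration (the percentages below are made up, not real results):

```
# illustrative BUSCO short_summary line -- values are hypothetical:
#   C:80.0%[S:70.0%,D:10.0%],F:10.0%,M:10.0%,n:978
# C = complete, S = complete single-copy, D = complete duplicated,
# F = fragmented, M = missing, n = number of BUSCO groups searched
```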
## Computational Requirements
For the standard pipeline, dammit needs ~18GB of storage space to store its prepared databases, plus a few hundred MB per BUSCO database. For the standard annotation pipeline, we recommend at least 16GB of RAM. This can be reduced by editing LAST parameters via a custom configuration file.
The full pipeline, which uses UniRef90, needs several hundred GB of space and considerable RAM to prepare the databases. You'll also want either a fat internet connection or a big cup of patience to download UniRef90.
For some species, we have found that the amount of RAM required can be proportional to the size of the transcriptome being annotated.
While dammit runs, it will print which tasks it's running to the terminal. dammit is written with a library called pydoit, a Python workflow library similar to GNU Make and Snakemake. This not only helps organize the underlying workflow, but also means that if we interrupt it, it should properly resume! Caveat: if your job dies without properly shutting down, Snakemake will leave your directory "locked" (a safety feature to prevent two runs/programs from editing the same file simultaneously). If this happens, you'll need to run elvers with the `--unlock` flag.
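For example, to unlock and resume an interrupted run (a sketch, assuming the `examples/nema.yaml` configfile and `annotation` target used later in this section):

```
# unlock the working directory after a killed run, then resume
elvers examples/nema.yaml annotation --unlock
elvers examples/nema.yaml annotation
```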
## Dammit Commands
dammit has two major subcommands: `dammit databases` and `dammit annotate`. `databases` checks that the databases are installed and prepared, and if run with the `--install` flag, will perform that installation and preparation. `annotate` then performs the annotation using installed tools and databases. These commands are automated and integrated into a single rule in elvers. At some point, these may be split into separate `databases` and `annotate` rules if that functionality is desired, but it's not in our immediate plans.
By default, databases are placed at `elvers/databases`.
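Run without `--install`, `dammit databases` should only check whether the databases are present and prepared. For example, using the `metazoa` BUSCO group from this tutorial:

```
# check (but don't install) the prepared databases
dammit databases --busco-group metazoa
```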
Both `databases` and `annotate` have a `--quick` option that only installs or runs a "quick" version of the pipeline, centered on TransDecoder gene prediction; as described below, `--quick` omits OrthoDB, UniRef, Pfam, and Rfam.
On the command line, the commands we would run are as follows:
### Databases
```
# install databases
dammit databases --install --busco-group metazoa --busco-group eukaryota
```
Note: if you have limited space on your instance, you can also install these databases in a different location (e.g. on an external volume) by adding `--database-dir /path/to/databases`.
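For example (`/path/to/databases` is a placeholder for your alternate location):

```
# install databases on an external volume instead of the default location
dammit databases --install --busco-group metazoa --database-dir /path/to/databases
```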
You can also use a custom protein database for your species. If your critter is a non-model organism, you will likely need to create your own with proteins from closely-related species. This will rely on your knowledge of your system!
### Annotation
```
dammit annotate trinity.nema.fasta --busco-group metazoa --user-databases <your-database-here> --n_threads 6
```
If you want to run a quick version of the pipeline, add the `--quick` parameter to omit OrthoDB, UniRef, Pfam, and Rfam. A "full" run will take longer to install and run, but you'll have access to the full annotation pipeline.
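For example, a quick version of the annotation command above would be:

```
dammit annotate trinity.nema.fasta --busco-group metazoa --n_threads 6 --quick
```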
## Running Dammit
Run Dammit via the "default" Eel Pond workflow or via the annotate subworkflow. To run Dammit as a standalone program, see the "Advanced Usage" section below.
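For example (a sketch, assuming the `examples/nema.yaml` configfile used elsewhere in these docs; the `annotation` target name is taken from the rerun example in the Dammit Output section below):

```
# run just the annotation portion of the Eel Pond workflow
elvers examples/nema.yaml annotation
```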
## Modifying Params for Dammit
Be sure to set up your sample info and build a configfile first (see Understanding and Configuring Workflows).
To see the available parameters for the `dammit` rule, run:

```
elvers config dammit --print_params
```
This will print the following:
```
  ####################  dammit  ####################
dammit:
  busco_group:            # specify all busco groups below here
    - metazoa
    - eukaryota
  db_dir: databases       # specify location for databases (or previously installed databases)
  db_install_only: False  # just install databases, don't run annotation
  db_extra: ''
  annot_extra: ' --quick '
  #####################################################
```
In addition to changing parameters we've specifically enabled, you can modify the `extra` param to pass any extra parameters. In dammit, both `databases` and `annotation` take an `extra` param:
```
db_extra: '--someflag someparam --someotherflag thatotherparam'
annot_extra: '--someflag someparam --someotherflag thatotherparam'
```
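For example, a hypothetical `annot_extra` that passes a custom protein database and a thread count through to `dammit annotate` (the same flags shown in the Annotation section above) might look like:

```
annot_extra: ' --user-databases <your-database-here> --n_threads 6 '
```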
Within the "default" Eel Pond workflow, this will annotate the Trinity assembly. Since Dammit
requires an assembly as input, see the annotate subworkflow instructions for how to run dammit as a standalone rule. See the dammit documentation for all params you can pass into Dammit.
## Dammit Output
Your main output directory will be determined by your config file: by default it is `BASENAME_out` (you specify BASENAME).
Dammit will output files in the `annotation` subdirectory of this output directory. The annotated fasta file will be `ASSEMBLY.dammit.fasta` and the annotation gff3 file will be `ASSEMBLY.dammit.gff3`. Dammit will also produce a number of intermediate files that will be contained within an `ASSEMBLY.fasta.dammit` folder.
After a successful run, you'll have a new directory called `BASENAME.fasta.dammit`. If you look inside, you'll see a lot of files. For example, for a transcriptome with basename `trinity.nema`, the folder `trinity.nema.fasta.dammit` should contain the following files after a standard (not `--quick`) run:
```
ls trinity.nema.fasta.dammit/

annotate.doit.db                              trinity.nema.fasta.dammit.namemap.csv  trinity.nema.fasta.transdecoder.pep
dammit.log                                    trinity.nema.fasta.dammit.stats.json   trinity.nema.fasta.x.nema.reference.prot.faa.crbl.csv
run_trinity.nema.fasta.metazoa.busco.results  trinity.nema.fasta.transdecoder.bed    trinity.nema.fasta.x.nema.reference.prot.faa.crbl.gff3
tmp                                           trinity.nema.fasta.transdecoder.cds    trinity.nema.fasta.x.nema.reference.prot.faa.crbl.model.csv
trinity.nema.fasta                            trinity.nema.fasta.transdecoder_dir    trinity.nema.fasta.x.nema.reference.prot.faa.crbl.model.plot.pdf
trinity.nema.fasta.dammit.fasta               trinity.nema.fasta.transdecoder.gff3
trinity.nema.fasta.dammit.gff3                trinity.nema.fasta.transdecoder.mRNA
```
As part of elvers, we copy the two most important files, `trinity.nema.fasta.dammit.fasta` and `trinity.nema.fasta.dammit.gff3`, into the main `annotation` directory. `trinity.nema.fasta.dammit.stats.json` also gives summary stats that are quite useful.
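If you want a quick look at those stats, one option (using Python's built-in `json.tool` module; adjust the path to match your run) is:

```
python -m json.tool trinity.nema.fasta.dammit/trinity.nema.fasta.dammit.stats.json
```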
If the above `dammit` command is run again, there will be a message:

```
**Pipeline is already completed!**
```

If you'd like to rerun the dammit pipeline, you'll need to use the `--forceall` flag, like so:
```
elvers examples/nema.yaml annotation --forceall
```
## Additional Notes (non-elvers): Parsing Dammit GFF3 files
Camille wrote dammit in Python, and it includes a library to parse dammit's gff3 output. If you want to work with this gff3 downstream, use this parsing library.
First, activate a dammit environment. You can find the one elvers uses, but it might also be easier to just make a new one using the dammit environment file:
```
# from within the elvers directory
conda env create -n dammit --file rules/dammit/dammit-env.yaml
source activate dammit
```
Remember, you can exit your conda environments with `source deactivate`.
Then:
```
cd trinity.nema.fasta.dammit
python
```
Use the gff3 parsing library to output a list of gene IDs:
```
import pandas as pd
from dammit.fileio.gff3 import GFF3Parser

gff_file = "trinity.nema.fasta.dammit.gff3"

# parse the dammit gff3 into a pandas DataFrame
annotations = GFF3Parser(filename=gff_file).read()

# for each transcript (seqid), keep the best-scoring annotation
# with an e-value below 1e-05
names = (annotations.sort_values(by=['seqid', 'score'], ascending=True)
                    .query('score < 1e-05')
                    .drop_duplicates(subset='seqid')[['seqid', 'Name']])

# drop rows that are entirely empty, preview, and write out
new_file = names.dropna(axis=0, how='all')
new_file.head()
new_file.to_csv("nema_gene_name_id.csv")
exit()
```
This will output a table of genes with 'seqid' and 'Name' in a .csv file: `nema_gene_name_id.csv`. Let's take a look at that file:

```
less nema_gene_name_id.csv
```
Notice there are multiple transcripts per gene model prediction. This `.csv` file can be used in `tximport` in downstream DE analysis.
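As a minimal sketch (assuming pandas and the `nema_gene_name_id.csv` written above), you could reload this table in Python to inspect the transcript-to-name mapping before passing it to downstream tools:

```
import pandas as pd

# reload the table written above; the first (unnamed) column is the saved index
tx2gene = pd.read_csv("nema_gene_name_id.csv", index_col=0)

# seqid = transcript ID, Name = best-scoring annotation for that transcript
print(tx2gene.head())
```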