About

This page goes a little more in depth on the software and its goals.

Motivations

Several different factors motivated dammit's development. The first of these was the sea lamprey transcriptome project, which had annotation as a primary goal. Many of dammit's core features were already implemented there, and it seemed a shame not share that work with others in a usable format. Related to this was a lack of workable and easy-to-use existing solutions; in particular, most are meant to be used as protocols and haven't been packaged in an automated format. Licensing was also a big concern -- software used for science should be open source, easily accessible, remixable, and free.

Implicit to these motivations is some idea of what a good annotator should look like, in the author's opinion:

It should be easy to install and upgrade
It should only use Free software
It should make use of standard databases
It should output in reasonable formats
It should be relatively fast
It should try to be correct, insofar as any computational approach can be "correct"
It should give the user some measure of confidence for its results.

The Obligatory Flowchart

The Workflow

Software Used

TransDecoder
BUSCO
HMMER
Infernal
LAST
crb-blast (for now)
pydoit (under the hood)

All of these are Free Software, as in freedom and beer

Databases

Pfam-A
Rfam
OrthoDB
BUSCO databases
Uniref90
User-supplied protein databases

The last one is important, and sometimes ignored.

Conditional Reciprocal Best LAST

Building off Richard and co's work on Conditional Reciprocal Best BLAST, I've implemented a new version with Python and LAST -- CRBL. The original lives here.

Why??

BLAST is too slooooooow
Ruby is yet another dependency to have users install
With Python and scikit learn, I have freedom to toy with models (and learn stuff)

And, of course, some of these databases are BIG. Doing blastx and tblastn between a reasonably sized transcriptome and Uniref90 is not an experience you want to have.

ie, practical concerns.

A brief intro to CRBB

Reciprocal Best Hits (RBH) is a standard method for ortholog detection
Transcriptomes have multiple multiple transcript isoforms, which confounds RBH
CRBB uses machine learning to get at this problem

CRBB attempts to associate those isoforms with appropriate annotations by learning an appropriate e-value cutoff for different transcript lengths.

CRBB

from http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004365#s5

CRBL

For CRBL, instead of fitting a linear model, we train a model.

SVM
Naive bayes

One limitation is that LAST has no equivalent to tblastn. So, we find the RBHs using the TransDecoder ORFs, and then use the model on the translated transcriptome versus database hits.