Some examples

Extracting unassembled reads

If we want to extract unassembled reads, we can do the following:

  • load the reads in & tag them;
  • load the assembly in and label all the matching tags;
  • walk over the reads and extract any read that doesn’t have an unlabeled tag.

This is implemented in the khmer script sandbox/extract-unassembled-reads.py.


We probably want to work with low-error reads, so, first pre-process the reads to eliminate errors:

trim-low-abund.py -C 5 -Z 20 -M 1e7 data/mixed-reads.fa.gz

Then, run extract-unassembled-reads.py; here, we’ll run it with an “assembly” that is missing the low-abundance genome (1/10th of the data).

../khmer/sandbox/extract-unassembled-reads.py data/mixed-species-1.fa.gz mixed-reads.fa.gz.abundtrim
# output: 1116 left out of assembly, of 4286 reads

In this case (simulated data) we get 0 reads if we use the true genome:

../khmer/sandbox/extract-unassembled-reads.py data/mixed-species.fa.gz mixed-reads.fa.gz.abundtrim
# output: 0 left out of assembly, of 4286 reads

If we use the un-trimmed reads, then a bunch of reads don’t match b/c of errors:

sandbox/extract-unassembled-reads.py data/mixed-species.fa.gz data/mixed-reads.fa.gz
# output: 568 left out of assembly, of 5508 reads

Characterizing the unassembled reads

Next steps could include:

  • running extract-compact-dbg.py on it & visualizing it;
  • partitioning & examining the partitions;
  • looking at the abundance distribution of the whole thing, or each individual partition;