Taxonomic Discovery with Sourmash

Until now, we've performed general pre-processing steps on our sequencing data; sequence quality analysis and trimming usually occur at the start of any sequencing data analysis pipeline. Now we will begin performing analysis that makes sense for metagenomic sequencing data.

We are working with publicly-available data, but let's pretend that this is a brand new sample that we just got back from our sequencing core. One of the first things we often want to do with new metagenome sequencing samples is figure out their approximate species composition. This allows us to tap into all of the information known about these species and relate our community to the existing literature.

We can determine the approximate composition of our sample using sourmash.

Introduction to sourmash

Please read this tutorial for an introduction to how sourmash works.

tl;dr (but actually please read it): sourmash breaks nucleotide sequences down into k-mers, systematically subsamples those k-mers into a representative "signature", and then enables searches for those k-mers in databases. This makes it really fast to make comparisons. Here, we will compare our metagenome sample against a pre-prepared database that contains all microbial sequences in GenBank.
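To build some intuition for the k-mer decomposition step, here is a toy bash sketch (not part of sourmash; the sequence is made up) that enumerates every k-mer of length 5 in a short string:

# toy example: print every 5-mer in a short sequence
seq="GGCATTACGGACCTTAGG"
k=5
for ((i=0; i<=${#seq}-k; i++)); do
    echo "${seq:$i:$k}"
done

sourmash then hashes each k-mer to a number and keeps only a scaled subsample of those hashes to form the signature.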

Workspace Setup

If you're starting a new work session on FARM, be sure to follow the instructions here. You only need the part about entering a tmux session, since we'll be using a larger srun session than usual.

Starting with sourmash

Sourmash doesn't have a big memory or CPU footprint, and can be run on most laptops. Below is a recommended srun command to start an interactive session in which to run the sourmash commands.

# -p bmh: partition; -J: job name; -t: 24-hour time limit; --mem: memory; -c: CPU count
srun -p bmh -J sourmash24 -t 24:00:00 --mem=16gb -c 1 --pty bash

Install sourmash

Be sure you've set up conda channels properly, as in the Install Conda section.

conda activate nsurp-env
conda install -y sourmash
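To double-check that the installation worked, you can ask sourmash to report its version and environment (the info subcommand is available in sourmash 3.x):

sourmash info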

Next, let's create a directory in which to store our sourmash signatures:

cd ~/2020-NSURP
mkdir -p sourmash
cd sourmash

What data to use?

We could run sourmash with our adapter-trimmed or k-mer-trimmed data. In fact, doing so would make sourmash faster, because there would be fewer k-mers in the sample.

We are currently comparing our sample against a database of trusted DNA sequences, so any k-mers in our sample that contain adapter sequence or errors will not match the trusted reference sequences in the database. However, even though we very lightly trimmed our reads, there is a chance that we removed a very low-abundance organism that was truly present in the sample. Given this trade-off, we often use raw reads for reference-data comparisons, and quality-controlled reads for all other comparisons.

Generate a sourmash signature

Next, let's make sourmash signatures from our reads.

Remember from the Quick Insights from Sequencing Data with sourmash tutorial that matches at a k-mer size of 21 are approximately genus-level specific, at 31 approximately species-level, and at 51 approximately strain-level. We will calculate our signature with all three k-mer sizes so we can choose which one we want to use later.

# --merge: combine both read files into one signature named CSM7KOJE
# --scaled 2000: keep ~1/2000 of the k-mer hashes as the representative subsample
# -k 21,31,51: compute signatures at all three k-mer sizes
# --track-abundance: record how many times each k-mer appears
sourmash compute -o CSM7KOJE.raw.sig --merge CSM7KOJE --scaled 2000 -k 21,31,51 --track-abundance ~/2020-NSURP/raw_data/CSM7KOJE_*fastq.gz

You should see output that looks like this:

== This is sourmash version 3.4.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: /home/ntpierce/2020-NSURP/raw_data/CSM7KOJE_R1.fastq.gz, /home/ntpierce/2020-NSURP/raw_data/CSM7KOJE_R2.fastq.gz
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
Tracking abundance of input k-mers.
... reading sequences from /home/ntpierce/2020-NSURP/raw_data/CSM7KOJE_R1.fastq.gz
... /home/ntpierce/2020-NSURP/raw_data/CSM7KOJE_R1.fastq.gz 9704045 sequences
... reading sequences from /home/ntpierce/2020-NSURP/raw_data/CSM7KOJE_R2.fastq.gz
... /home/ntpierce/2020-NSURP/raw_data/CSM7KOJE_R2.fastq.gz 9704045 sequences
calculated 1 signatures for 19408090 sequences taken from 2 files
saved signature(s) to CSM7KOJE.raw.sig. Note: signature license is CC0.

The output file, CSM7KOJE.raw.sig, holds a representative subset of k-mers from our original sample, as well as their abundance information. The k-mers are "hashed", or transformed, into numbers to make selecting, storing, and looking up the k-mers more efficient.
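If you'd like to inspect what ended up in the signature file, sourmash's sig describe subcommand (available in sourmash 3.x) summarizes each signature's k-mer size, scaled value, and whether abundances were tracked:

sourmash sig describe CSM7KOJE.raw.sig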

Sourmash gather

sourmash gather is a method for estimating the taxonomic composition of known sequences in a metagenome.

Please go read through the sourmash documentation on Breaking down metagenomic samples with gather and lca. Check out Appendices A and B in that documentation for a good overview of how sourmash gather works.

Running gather on our IBD samples can give us an idea of the microbes present in each sample. gather provides strain-level specificity in its output -- e.g. all strains that match any sequences (above a threshold) in your metagenome will be reported, along with the percent of each strain that matches. This is useful both for estimating how much of the metagenome sample is known, and for estimating the closest strain relative of each organism in your metagenome.

Download the database (sourmash can search the compressed .sbt.zip file directly, so there is no need to unzip it):

mkdir -p ~/2020-NSURP/databases/
cd ~/2020-NSURP/databases/
curl -L https://osf.io/jgu93/download -o genbank-k31.sbt.zip
cd ~/2020-NSURP/sourmash
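This database is large, so it's worth a quick sanity check that the download completed; the exact size depends on the database build, but a truncated download will be obviously small:

ls -lh ~/2020-NSURP/databases/genbank-k31.sbt.zip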

Run sourmash gather

First, let's run a very quick search:

sourmash gather --num-results 10 CSM7KOJE.raw.sig ~/2020-NSURP/databases/genbank-k31.sbt.zip
  • --num-results 10 is a way of shortening the search: here, we ask for only the top 10 results.

We see an output that looks like this:

== This is sourmash version 3.4.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting default query k=31.
loaded query: CSM7KOJE... (k=31, DNA)
loaded 1 databases.


overlap     p_query p_match avg_abund
---------   ------- ------- ---------
6.4 Mbp        9.5%   63.6%      17.4    CBWF010000001.1 Klebsiella pneumoniae...
5.3 Mbp       25.6%   90.6%      56.5    KB851045.1 Clostridium clostridioform...
5.1 Mbp        3.6%   74.4%       8.5    GG668320.1 Clostridium hathewayi DSM ...
4.4 Mbp        2.0%   83.9%       5.4    LBDB01000001.1 Vibrio parahaemolyticu...
3.2 Mbp        5.0%   80.5%      18.6    JTBP01000001.1 Proteus mirabilis stra...
4.8 Mbp        1.0%   33.6%       4.2    JRSL01000930.1 Escherichia coli strai...
2.7 Mbp        0.8%   59.5%       3.5    FUNQ01000052.1 Clostridioides diffici...
2.5 Mbp        3.5%   33.9%      16.3    CZAT01000001.1 Flavonifractor plautii...
4.9 Mbp        2.3%   45.1%      11.7    KQ087951.1 Escherichia coli strain BI...
2.2 Mbp        3.3%   64.6%      18.1    FCEY01000001.1 Clostridium sp. AT5 ge...

found 10 matches total;
(truncated gather because --num-results=10)
the recovered matches hit 56.8% of the query

The shortened search will be quite quick.

The two columns to pay attention to are p_query and p_match. p_query is the percent of the metagenome sample that is (estimated to be) from the named organism, while p_match is the percent of the database match that is found in the query. For example, in the first row above, an estimated 9.5% of our metagenome comes from the Klebsiella pneumoniae match, and 63.6% of that genome is covered by our sample. These metrics are affected both by evolutionary distance from the reference and by low coverage of the organism's genome in the sample (e.g. low sequencing coverage).

Now, let's run the full gather analysis. This will take a long time to run. Sourmash will also output a CSV with all of the results information, which we will use later to visualize our results.

sourmash gather -o CSM7KOJE_x_genbank-k31.gather.csv CSM7KOJE.raw.sig ~/2020-NSURP/databases/genbank-k31.sbt.zip

When sourmash is finished running, it tells us the percent of our sequence that was unclassified, i.e. that doesn't match any sequence in the database.
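Once the run finishes, you can peek at the results CSV straight from the shell. The exact column names vary between sourmash versions, so treat this as a sketch:

# show the CSV header and the first match row
head -n 2 CSM7KOJE_x_genbank-k31.gather.csv

# count how many genome matches were reported (excluding the header line)
tail -n +2 CSM7KOJE_x_genbank-k31.gather.csv | wc -l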

In a later module, we may use additional steps prior to gather to improve the percent of sequence in the metagenome that is classifiable. These include, for example, using bbduk to remove additional human genome k-mers, or using assembly-style programs such as megahit or spacegraphcats to build longer contiguous sequences.

Other Methods for Taxonomic Discovery and Classification

There are many tools, such as Kraken and Kaiju, that can do taxonomic classification of individual reads from metagenomes. These seem to perform well (albeit with high false positive rates) in situations where you don’t necessarily have the genome sequences that are in the metagenome. Sourmash, by contrast, can estimate which known genomes are actually present, so that you can extract them and map/align to them. It seems to have a very low false positive rate and is quite sensitive to strains.

Detecting contamination or incorrect data

sourmash gather taxonomic discovery can help uncover contamination or errors in your sequencing samples. We recommend doing sourmash gather immediately after receiving your data from the sequencing facility. If your environmental metagenome has a tremendous amount of mouse sequence in it... maybe the sequencing facility sent you the wrong data?

Challenge: sourmash gather

Gather with trimmed data

Above, we ran sourmash gather on our untrimmed data; 44% of the sample did not match sequence in any GenBank assembly. A substantial proportion of this sequence could be due to k-mers with errors. Run sourmash gather again on the adapter/k-mer-trimmed data, as sketched below. How much less of the sequence is unclassifiable once the errors and adapters are removed? How many species are no longer detected after k-mer and error trimming?
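Here is one way to set that up; the trimmed-read path and output names below are placeholders, so substitute the file names from your own trimming step:

# placeholder paths -- adjust to match your trimmed-read files
sourmash compute -o CSM7KOJE.trim.sig --merge CSM7KOJE-trim --scaled 2000 -k 21,31,51 --track-abundance ~/2020-NSURP/trim/CSM7KOJE_*fastq.gz
sourmash gather -o CSM7KOJE-trim_x_genbank-k31.gather.csv CSM7KOJE.trim.sig ~/2020-NSURP/databases/genbank-k31.sbt.zip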

Gather at different ksizes

The GenBank reference databases for signatures of k-mer size k=21 and k=51 are available for download.

k=21

cd ~/2020-NSURP/databases/
curl -L https://osf.io/dm7n4/download -o genbank-k21.sbt.zip

k=51

cd ~/2020-NSURP/databases/
curl -L https://osf.io/2uvsc/download -o genbank-k51.sbt.zip
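To search one of these, point gather at the matching database and select the corresponding k-mer size from our multi-k signature (gather accepts a -k/--ksize option in sourmash 3.x):

# example for k=21; swap in k=51 and genbank-k51.sbt.zip for the other search
sourmash gather -k 21 -o CSM7KOJE_x_genbank-k21.gather.csv CSM7KOJE.raw.sig ~/2020-NSURP/databases/genbank-k21.sbt.zip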

How do you expect the gather results to differ for each? Why?

Test Gather parameters

By running sourmash gather --help, you can see all the options for the gather program.

scaled

The scaled option lets you downsample the query to the specified scaled factor.

  --scaled FLOAT        downsample query to the specified scaled factor

Try running gather with a scaled value of 50000. How do the results change, and why?
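Concretely, that looks like the following; a larger scaled value keeps fewer hashes, so the search runs faster but loses sensitivity to small matches:

# downsample the query from scaled=2000 to scaled=50000 on the fly
sourmash gather --scaled 50000 CSM7KOJE.raw.sig ~/2020-NSURP/databases/genbank-k31.sbt.zip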

base pair threshold for matches

The threshold-bp option reports only matches that have at least this many base pairs in common with the query (the default is 50,000 bp).

Increasing the threshold makes gather quicker (at the expense of losing some of the smaller matches):

--threshold-bp 10000000

What happens if you run gather with the threshold above?
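In full, that command would look something like this:

# report only matches sharing at least 10 Mbp with the query
sourmash gather --threshold-bp 10000000 CSM7KOJE.raw.sig ~/2020-NSURP/databases/genbank-k31.sbt.zip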

Decreasing the threshold will take more time, but be more thorough. A threshold of 0 bp does an exhaustive search for all matches:

--threshold-bp 0

The full gather took quite a long time on our samples, so there's no need to run this one! But do keep it in mind as a way to make sure we get absolutely all of the matches we can get using gather.