Comparing Samples with Sourmash

Many metagenomics projects are designed to assess the differences between microorganism composition between samples. There are many ways to get at this question, but we can start by using k-mer profiles of the reads to quickly compare samples using sourmash compare.

Workspace Setup

If you're starting a new work session on FARM, be sure to follow the instructions here.

First, let's make a directory that we will be working in:

cd ~/2020-NSURP
mkdir -p sourmash-compare
cd sourmash-compare

Note: we made a directory called sourmash for the taxonomic discovery module. It helps to name files and folders with details that will help you remember what results are contained therein. How could the prior module's folder name be changed to be more informative?

Calculate sourmash signatures

Now we can calculate signatures for each of the files. This will take 5 or 10 minutes to run

for infile in ~/2020-NSURP/kmer-trim/*.kmertrim.fq.gz
do
    name=$(basename ${infile} .kmertrim.fq.gz)
    echo $name
    sourmash compute -k 21,31,51 --scaled 2000 --track-abundance --merge ${name} -o ${name}.kmertrim.sig ${infile}
done

Note: Here we used bash for loop to compute signatures on each file with a single set of commands. Go through this tutorial to learn about loops!

Compare sample signatures

Using these signatures, we can compare our samples.

sourmash compare -k 31 -o IBD.kmertrim.compare.np --csv IBD.kmertrim.compare.csv --ignore-abundance *sig

Now let's plot! Sourmash has a built in plot utility that we can take advantage of. The output is a heatmap.

Visualize the comparison using sourmash plot

sourmash plot --labels IBD.kmertrim.compare.np

This command produces three png files:

IBD.kmertrim.compare.np.hist.png
IBD.kmertrim.compare.np.dendro.png
IBD.kmertrim.compare.np.matrix.png

As usual, these files can be downloaded to your local computer with scp

scp -P 2022  -i /path/to/key/file username@farm.cse.ucdavis.edu:~/2020-NSURP/sourmash-compare/*.png ./

If you're on a mac using zsh, you may need to replace the scp with noglob scp in the command above.

If you're on windows, you may need to move the the files from the download location on your Linux shell over to the windows side of your computer before opening.

Once the files are on your local computer, double click to open each file. The .matrix.png is the heatmap file, which will show the pattern of similarity between samples

It should look like this:

sourmash compare matrix .

What does this heatmap tell you about your samples? For example, does it provide any information about which samples are from IBD patients, and which are from non-IBD patients?

Visualize the comparison in an MDS plot

We can use this output to make a Multidimensional Scaling plot. MDS plots are commonly used in visualize similarities and differences between samples. Here the strength is we used the k-mer content of all of our reads to calculate similarity.

Install the R packages ggplot2 and ggrepel

Since this is conda, it will recognize that it needs to install R alongside these, and take care of that for you! Usually you'll want to be careful of which version of R you're installing, but since we're just doing this one R command, we'll be a little lax about it.

conda install r-ggplot2 r-ggrepel

Download an R script to make the MDS plot

The script source is here if you are interested!

wget https://raw.githubusercontent.com/dib-lab/2020-NSURP/master/scripts/mds_plot.R

Run the R script

Rscript mds_plot.R IBD.kmertrim.compare.csv IBD.kmertrim.compare.mds.pdf

This outputs a file IBD.kmertrim.compare.mds.pdf. You can see that file by downloading to your computer. It should look something like this:

mds plot of sourmash compare .

How do the samples cluster? How does this compare to our heatmap, generated by sourmash plot, above?