Getting started with Sourmash: a tutorial

Let's go through a sourmash tutorial.

Commands (as of 07/31/2020) reproduced here. If doing this at a later date and these commands don't work, run the tutorial using the link above instead!!

Making signatures, comparing, and searching

7/31/20 You'll need about 5 GB of free disk space, and about 5 GB of RAM to search GenBank.

First activate your conda envionment:

conda activate tutorial

Because we installed sourmash into our tutorial environment, you should now be able to use the sourmash command:

sourmash info

Download the data and put it in a folder

Lets make some folders to keep our project organized:

mkdir smtut
cd smtut

mkdir data
cd data

Now we can download some data into our data folder

wget https://s3.amazonaws.com/public.ged.msu.edu/ecoli_ref-5m.fastq.gz
wget https://s3.amazonaws.com/public.ged.msu.edu/ecoliMG1655.fa.gz

OR

wget https://bit.ly/2CXj13R -O ecoli_ref-5m.fastq.gz
wget https://bit.ly/2PdRbCJ -O ecoliMG1655.fa.gz

The file that ends in .fastq.gz is a zipped file with DNA sequences in fastq format

Lets take a look at our data:

first we can unzip the fastq file,

gunzip --keep ecoli_ref-5m.fastq.gz

and use ls -lh to compare the sizes the zipped file should be smaller than the unzipped file

we can look at the unzipped file using

less ecoli_ref-5m.fastq

it will have this fastq format:

Computing a sourmash signature

Compute a scaled signature from our reads:

first, lets make a folder to keep our signatures in

mkdir ~/smtut/sigs
cd ~/smtut/sigs

Compare reads to assemblies

how much of the read content is contained in the reference genome?

Build a signature for the E. coli reads with sourmash compute,

sourmash compute \
--scaled 10000 \
~/smtut/data/ecoli_ref*.fastq.gz \
-o ~/smtut/sigs/ecoli-reads.sig \
-k 31

Next build a signature for the E. coli genome with sourmash compute,

sourmash compute \
--scaled 10000 \
~/smtut/data/ecoliMG1655.fa.gz  \
-o ~/smtut/sigs/ecoli-genome.sig \
-k 31

this command will call the software to make a kmer signature, keep only 1 in 10000 kmers from the sequence, store the signature where we ask it to, and use a k size of 31

Now evaluate containment, that is, what fraction of the read content is contained in the genome:

sourmash search -k 31 ecoli-reads.sig ecoli-genome.sig --containment

and you should see:

# running sourmash subcommand: search
loaded query: /home/ubuntu/data/ecoli_ref-5m... (k=31, DNA)
loaded 1 signatures from ecoli-genome.sig
1 matches:
similarity   match
----------   -----
 10.6%       /home/ubuntu/data/ecoliMG1655.fa.gz

Try the reverse - why is it bigger?

sourmash search -k 31 ecoli-genome.sig ecoli-reads.sig --containment