Getting started with Sourmash: a tutorial
Let's go through a sourmash tutorial.
Commands (as of 07/31/2020) reproduced here. If doing this at a later date and these commands don't work, run the tutorial using the link above instead!!
Making signatures, comparing, and searching
7/31/20 You'll need about 5 GB of free disk space, and about 5 GB of RAM to search GenBank.
First activate your conda envionment:
conda activate tutorial
Because we installed sourmash into our tutorial
environment, you should now be able to use the sourmash
command:
sourmash info
Download the data and put it in a folder
Lets make some folders to keep our project organized:
mkdir smtut
cd smtut
mkdir data
cd data
Now we can download some data into our data
folder
wget https://s3.amazonaws.com/public.ged.msu.edu/ecoli_ref-5m.fastq.gz
wget https://s3.amazonaws.com/public.ged.msu.edu/ecoliMG1655.fa.gz
OR
wget https://bit.ly/2CXj13R -O ecoli_ref-5m.fastq.gz
wget https://bit.ly/2PdRbCJ -O ecoliMG1655.fa.gz
The file that ends in .fastq.gz
is a zipped file with DNA sequences in fastq format
Lets take a look at our data:
first we can unzip the fastq file,
gunzip --keep ecoli_ref-5m.fastq.gz
and use ls -lh
to compare the sizes
the zipped file should be smaller than the unzipped file
we can look at the unzipped file using
less ecoli_ref-5m.fastq
it will have this fastq format:
Computing a sourmash signature
Compute a scaled signature from our reads:
first, lets make a folder to keep our signatures in
mkdir ~/smtut/sigs
cd ~/smtut/sigs
Compare reads to assemblies
how much of the read content is contained in the reference genome?
Build a signature for the E. coli reads with sourmash compute
,
sourmash compute \
--scaled 10000 \
~/smtut/data/ecoli_ref*.fastq.gz \
-o ~/smtut/sigs/ecoli-reads.sig \
-k 31
Next build a signature for the E. coli genome with sourmash compute
,
sourmash compute \
--scaled 10000 \
~/smtut/data/ecoliMG1655.fa.gz \
-o ~/smtut/sigs/ecoli-genome.sig \
-k 31
this command will call the software to make a kmer signature, keep only 1 in 10000 kmers from the sequence, store the signature where we ask it to, and use a k size of 31
Now evaluate containment, that is, what fraction of the read content is contained in the genome:
sourmash search -k 31 ecoli-reads.sig ecoli-genome.sig --containment
and you should see:
# running sourmash subcommand: search
loaded query: /home/ubuntu/data/ecoli_ref-5m... (k=31, DNA)
loaded 1 signatures from ecoli-genome.sig
1 matches:
similarity match
---------- -----
10.6% /home/ubuntu/data/ecoliMG1655.fa.gz
Try the reverse - why is it bigger?
sourmash search -k 31 ecoli-genome.sig ecoli-reads.sig --containment