Experiment Challenge
Thus far, we've run through a set of commands with six metagenome samples. These have been from two patients, one with Crohn's disease, one without. But there's not much we can say with just two patients (other than "they look different!").
Now, we'll add samples from more patients and try to understand the differences between samples.
Workspace Setup
If you're starting a new work session on FARM, be sure to follow the instructions here.
Download Additional Files
Move into the raw data folder
cd ~/2020-NSURP/raw_data
Download the files
wget https://ibdmdb.org/downloads/raw/HMP2/MGX/2018-05-04/CSM7KOJO.tar
wget https://ibdmdb.org/downloads/raw/HMP2/MGX/2018-05-04/HSMA33R1.tar
wget https://ibdmdb.org/downloads/raw/HMP2/MGX/2018-05-04/HSMA33R5.tar
wget https://ibdmdb.org/downloads/raw/HMP2/MGX/2018-05-04/MSM6J2QP.tar
wget https://ibdmdb.org/downloads/raw/HMP2/MGX/2018-05-04/MSM6J2QF.tar
wget https://ibdmdb.org/downloads/raw/HMP2/MGX/2018-05-04/MSM6J2QH.tar
Untar each read set
tar xf HSMA33S4.tar
Trim and compute sourmash signatures for these files
Using your HackMD notes, run the commands for trimming (both adapter and k-mer trimming) on these samples.
For reference, the Quality Control contains code for running fastp
and khmer
trimming; the Comparing Samples with Sourmash contains code for computing sourmash signatures.
Run Sourmash Compare
Run sourmash compare
and sourmash plot
(as in Comparing Samples with Sourmash).
What do you notice about the sourmash comparison heatmap?
Which samples are more similar to each other? Can you guess which patients have Crohn's disease or no IBD by comparing them to your prior samples? How do samples from the same patient compare to samples from different patients?
Assess Taxonomic Diversity
Run sourmash gather
with the genbank-k31
database on these new samples.
Count the total number of species found in each sample. Does it differ between Crohn's disease and non-IBD patients?
Look at the sample metadata
What additional information can you glean from looking at the metadata (the data about the data)?
As usual, let's start by creating a directory for this
mkdir -p ~/2020-NSURP/metadata
cd ~/2020-NSURP/metadata
All information about this project can be found here.
Download the metadata file here.
This file contains information for the metagenomics sequencing (which we looked at), but also a number of other assessments.
This file is a spreadsheet that can be opened in Google docs or viewed with less
.
For example, view this file with less like so:
less -S hmp2_metadata_2018-08-20.csv
This is a very large file. You can get information about a specific sample by searching out the specific sample id's we used. For example:
grep HSMA33S4 hmp2_metadata_2018-08-20.csv
That's still a lot of info - let's get only the info for metagenomics samples:
grep metagenomics hmp2_metadata_2018-08-20.csv | grep HSMA33S4
The formatting is still a litle ugly.
Let's direct the output to a file, and then open it with less -S
:
grep metagenomics hmp2_metadata_2018-08-20.csv | grep HSMA33S4 > HSMA33S4.csv
less -S HSMA33S4.csv