Experiment Challenge

Thus far, we've run through a set of commands with six metagenome samples. These samples came from two patients: one with Crohn's disease and one without. But there's not much we can say with just two patients (other than "they look different!").

Now, we'll add samples from more patients and try to understand the differences between samples.

Workspace Setup

If you're starting a new work session on FARM, be sure to follow the instructions here.

Download Additional Files

Move into the raw data folder

cd ~/2020-NSURP/raw_data

Download the files

wget https://ibdmdb.org/downloads/raw/HMP2/MGX/2018-05-04/CSM7KOJO.tar
wget https://ibdmdb.org/downloads/raw/HMP2/MGX/2018-05-04/HSMA33R1.tar
wget https://ibdmdb.org/downloads/raw/HMP2/MGX/2018-05-04/HSMA33R5.tar

wget https://ibdmdb.org/downloads/raw/HMP2/MGX/2018-05-04/MSM6J2QP.tar
wget https://ibdmdb.org/downloads/raw/HMP2/MGX/2018-05-04/MSM6J2QF.tar
wget https://ibdmdb.org/downloads/raw/HMP2/MGX/2018-05-04/MSM6J2QH.tar

Untar each read set, for example:

tar xf CSM7KOJO.tar
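With six tarballs to extract, a loop saves retyping the same command. The following is a minimal sketch, not part of the original lesson: `untar_all` is a hypothetical helper name, and it assumes you are in the folder where the six `.tar` files listed above were downloaded.

```shell
# Extract every tarball named on the command line, skipping any that are missing.
untar_all() {
    for tarball in "$@"; do
        if [ -f "$tarball" ]; then
            tar xf "$tarball"
        else
            echo "skipping missing file: $tarball" >&2
        fi
    done
}

# The six read sets downloaded above.
untar_all CSM7KOJO.tar HSMA33R1.tar HSMA33R5.tar \
          MSM6J2QP.tar MSM6J2QF.tar MSM6J2QH.tar
```

Running `ls` afterwards is a quick way to confirm that each sample produced its read files.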

Trim and compute sourmash signatures for these files

Using your HackMD notes, run the commands for trimming (both adapter and k-mer trimming) on these samples.

For reference, the Quality Control lesson contains code for running fastp and khmer trimming, and the Comparing Samples with Sourmash lesson contains code for computing sourmash signatures.

Run Sourmash Compare

Run sourmash compare and sourmash plot (as in Comparing Samples with Sourmash).

What do you notice about the sourmash comparison heatmap?

Which samples are more similar to each other? Can you guess which patients have Crohn's disease or no IBD by comparing them to your prior samples? How do samples from the same patient compare to samples from different patients?

Assess Taxonomic Diversity

Run sourmash gather with the genbank-k31 database on these new samples.

Count the total number of species found in each sample. Does it differ between Crohn's disease and non-IBD patients?
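One way to get these counts, assuming you saved each gather result as a CSV file with the `-o` option: every row after the header is one matched genome, so the row count is a rough stand-in for the number of species. The `*.gather.csv` file naming and the `count_matches` helper below are assumptions made for illustration. Several strains of the same species would each be counted separately.

```shell
# Count data rows (matches) in a sourmash gather CSV, excluding the header line.
count_matches() {
    awk 'END { print (NR > 0 ? NR - 1 : 0) }' "$1"
}

# Print one "sample<TAB>count" line per gather CSV in the current directory.
# The *.gather.csv naming is an assumption; adjust it to your own file names.
for csv in *.gather.csv; do
    [ -f "$csv" ] || continue
    printf '%s\t%s\n' "${csv%.gather.csv}" "$(count_matches "$csv")"
done
```

Lining these counts up next to each patient's diagnosis makes it easier to see whether the two groups differ.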

Look at the sample metadata

What additional information can you glean from looking at the metadata (the data about the data)?

As usual, let's start by creating a directory for the metadata:

mkdir -p ~/2020-NSURP/metadata
cd ~/2020-NSURP/metadata

All information about this project can be found here.

Download the metadata file here.

This file contains information for the metagenomics sequencing (which we looked at), but also a number of other assessments.

This file is a spreadsheet in CSV format that can be opened in a spreadsheet program such as Google Sheets or viewed on the command line with less.

For example, view this file with less like so:

less -S hmp2_metadata_2018-08-20.csv

This is a very large file. You can get information about a specific sample by searching for the sample IDs we used. For example:

grep HSMA33S4 hmp2_metadata_2018-08-20.csv

That's still a lot of info - let's get only the info for metagenomics samples:

grep metagenomics hmp2_metadata_2018-08-20.csv | grep HSMA33S4

The formatting is still a little ugly. Let's direct the output to a file, and then open it with less -S:

grep metagenomics hmp2_metadata_2018-08-20.csv | grep HSMA33S4 > HSMA33S4.csv
less -S HSMA33S4.csv
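The grep output drops the header row, so the columns in the per-sample file are unlabeled. A small sketch that keeps the header and repeats the search for each of the new samples is shown here. It assumes the first line of the metadata file holds the column names, and `sample_metadata` is a hypothetical helper name.

```shell
# Write the CSV header plus the metagenomics rows for one sample ID to <sample>.csv.
sample_metadata() {
    meta_file=$1
    sample_id=$2
    head -n 1 "$meta_file" > "${sample_id}.csv"
    grep metagenomics "$meta_file" | grep "$sample_id" >> "${sample_id}.csv"
}

# Run the search for each newly downloaded sample, if the metadata file is present.
metadata=hmp2_metadata_2018-08-20.csv
if [ -f "$metadata" ]; then
    for sample in CSM7KOJO HSMA33R1 HSMA33R5 MSM6J2QP MSM6J2QF MSM6J2QH; do
        sample_metadata "$metadata" "$sample"
    done
fi
```

Each resulting file can then be viewed with `less -S`, just like HSMA33S4.csv above.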