Workshop: 2022-05-10 CFDE/iHMP walkthrough

dib-lab/ribbity-workshops#3

CFDE/iHMP walkthrough - 2022 May CFDE Hackathon

Permanent link: github.com/ctb/2022-may-cfde-demo/DEMO.md

Using the portal to find and export file information

(See Jessica and Amanda's tutorial for more detail!)

Create a new personal collection

Go to the CFDE data portal at app.nih-cfde.org/.

Under your username (upper right), create a new personal collection.

For name, you can use "Tuesday demo" or anything else. You can leave description blank.

Find some files

Go back to the CFDE data portal main page.

Select "File" (upper left).

Use the facets on the left to select:

* Common Fund Program: HMP
* Project: "Longitudinal multi 'omics"
* "has persistent ID" - True
* Uncompressed size in Bytes - 50000000 to 60000000 (50 MB to 60 MB).

With these selections, the first result should have "Filename" of SRR5935743_1.fastq.

Add files to your personal collection

Click on the first result to get a detail view. Then add it to your personal collection:

* Scroll down to "Part of personal collection" and click "Link records."
* Select your personal collection, click "Link" (upper right).

Click the back button in your browser to return to your filtered search.

Repeat the linking step with the third result (Filename: SRR5950647_1.fastq), adding it to the same personal collection.

:::info
For today, here are the direct links to the two files we'll be using:

* file 1, SRR5935743_1.fastq
* file 2, SRR5950647_1.fastq
:::

(Note: you could select any files you like, but these are small enough to work with and I know what the results will be, so they're good for today's demo. I suggest trying new/different files as a Thursday exercise!)

Export your personal collection

Go to your collection, and select "export" and choose "NCPI manifest format."

:::info
What is NCPI?

"NCPI" stands for "NIH Cloud Platform Interoperability", an effort by the NIH to convene around interoperation for cloud workbenches.
:::

Examine the NCPI manifest file

You should now have a CSV file in your Downloads that, when examined with a spreadsheet program, looks like this:

The key piece of information in here is the drs_uri column, which provides a Data Repository Service location from which to download the files.
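As a rough sketch of how the manifest could be used from Python: the snippet below pulls the drs_uri column out of a manifest and, for hostname-based DRS URIs, builds the metadata URL defined by the GA4GH DRS v1 specification. Only the drs_uri column name comes from this walkthrough; the `name` column, the host `drs.example.org`, and the object IDs are placeholders, and the helper names are made up for illustration. Compact-identifier DRS URIs need a resolver and are not handled here.

```python
import csv
import io
from urllib.parse import urlparse

# Placeholder manifest: only the "drs_uri" column name is taken from the
# walkthrough. The "name" column, host, and IDs are invented for illustration.
SAMPLE_MANIFEST = """name,drs_uri
example_1.fastq,drs://drs.example.org/abc123
example_2.fastq,drs://drs.example.org/def456
"""


def read_drs_uris(handle):
    """Return the non-empty drs_uri values from an open manifest CSV."""
    return [row["drs_uri"] for row in csv.DictReader(handle) if row.get("drs_uri")]


def drs_to_https(drs_uri):
    """Map a hostname-based DRS URI to its DRS v1 object metadata URL."""
    parsed = urlparse(drs_uri)
    object_id = parsed.path.lstrip("/")
    if parsed.scheme != "drs" or not parsed.netloc or not object_id:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"


# With a real export, replace io.StringIO(...) with open("your-manifest.csv").
uris = read_drs_uris(io.StringIO(SAMPLE_MANIFEST))
urls = [drs_to_https(uri) for uri in uris]
for url in urls:
    print(url)
```

The metadata URL returns a JSON description of the object, including its access methods; the actual download step depends on the server and is left out of this sketch.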

Connect to your AWS instance/on JupyterHub

We've installed some software for you on AWS, and you should receive a link to your AWS instance's web site in Zoom chat (during today's demo, at least ;).

Log in to JupyterHub

You should see:

Use username cfde. (We'll give you the sekret password in Zoom chat.)

You should now be at a JupyterHub console.

Open Jupyter notebook

Go to the demo/ folder and open demo-walkthrough.ipynb by clicking on it.

You can also see a static version of the walkthrough.

For the exercise at the end of the Jupyter notebook:

This will run through sample SRR5935743_1!

# run a sample against a database with sourmash gather
# take exactly 5.000 minutes
!sourmash gather SRR5935743_1.fastq.sig \
    gtdb-rs207.genomic-reps.dna.k31.zip \
      -o SRR5935743_1.gather.csv

# run the metacoder taxonomy plot
!./sourmash_gather_to_metacoder_plot.R \
    SRR5935743_1.gather.csv \
    SRR5935743_1.tax.png
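Once the gather step finishes, the CSV it writes can also be inspected directly in a notebook cell. The sketch below assumes the output has `name` and `f_unique_weighted` columns, as recent sourmash releases produce; check the header of your own file before relying on this. The rows shown are invented so the example runs on its own, and they are not results for SRR5935743_1.

```python
import csv
import io

# Invented rows for illustration only; NOT real results from the demo sample.
# The column names are assumed to match sourmash gather's CSV output.
SAMPLE_GATHER_CSV = """name,f_unique_weighted
genome A,0.40
genome C,0.05
genome B,0.25
"""


def top_matches(handle, n=2):
    """Return the n matches with the largest abundance-weighted fraction."""
    rows = list(csv.DictReader(handle))
    rows.sort(key=lambda row: float(row["f_unique_weighted"]), reverse=True)
    return [(row["name"], float(row["f_unique_weighted"])) for row in rows[:n]]


# With real output, replace io.StringIO(...) with open("SRR5935743_1.gather.csv").
matches = top_matches(io.StringIO(SAMPLE_GATHER_CSV))
for name, fraction in matches:
    print(f"{fraction:6.1%}  {name}")
```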

Ideas for Thursday sessions

During the Thursday co-working sessions, we will provide AWS instances with all the software pre-installed.

Here are some ideas for things to do during that session:

* re-do this analysis!
* do the download and taxonomic analysis with new/different/bigger samples!
* tackle other kinds of metagenome analyses - happy to chat about your interests!
* do the same analysis with the larger (complete) GTDB database
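For trying other samples, one possible approach is to generate the two commands from the exercise for each sample name. This is only a sketch: the helper names are made up, the file naming simply follows the pattern used for SRR5935743_1, and it assumes a `.fastq.sig` signature already exists for every sample listed.

```python
import shlex

# Database file name as used in the exercise; swap in another one as needed.
DATABASE = "gtdb-rs207.genomic-reps.dna.k31.zip"


def gather_command(sample, database=DATABASE):
    """Build the sourmash gather command for one sample (hypothetical helper)."""
    return [
        "sourmash", "gather", f"{sample}.fastq.sig", database,
        "-o", f"{sample}.gather.csv",
    ]


def plot_command(sample):
    """Build the metacoder plot command for one sample (hypothetical helper)."""
    return [
        "./sourmash_gather_to_metacoder_plot.R",
        f"{sample}.gather.csv",
        f"{sample}.tax.png",
    ]


# Print the commands rather than running them, so they can be checked first.
for sample in ["SRR5935743_1", "SRR5950647_1"]:
    print(shlex.join(gather_command(sample)))
    print(shlex.join(plot_command(sample)))
```

The printed lines can be pasted into a notebook cell with a leading `!`, or the argument lists can be passed to `subprocess.run` once they look right.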
