A genome-grist quickstart
Installation
We suggest installing in an isolated conda environment. The following will create a new environment, activate it, and install the latest version of genome-grist from PyPI (which is ).
Run:
conda create -y -n grist python=3.9 pip
conda activate grist
python -m pip install genome-grist
Note: genome-grist should run in Python 3.8 onwards (as of Sep 2022).
Running genome-grist
We currently recommend running genome-grist in its own directory, for several reasons; in particular, genome-grist uses snakemake and conda to install software under the working directory, and it's nice to have all of files be shared.
Within the current working directory, genome-grist will create a
genbank_cache/
subdir, and any outputs.NAME/
subdirectories
requested by the configuration. We recommend always running
genome-grist in this directory and naming the output directories after
the different projects using genome-grist.
So, create a subdirectory and change into it:
mkdir grist/
cd grist/
Note, genome-grist works entirely within the current working directory and temp directories.
Download a small example database
Download the GTDB R07 rs207 set of ~48,000 guide genomes, in a pre-prepared sourmash database format:
curl -LO https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs207/gtdb-rs207.genomic-reps.dna.k31.zip
You can use any sourmash database with Genbank identifiers with genome-grist; see available databases for more info. You can also use private databases; see the configuration docs for more info.
Make a configuration file
Put the following in a config file named conf-tutorial.yml
:
samples:
- SRR5950647
outdir: outputs.tutorial/
sourmash_databases:
- gtdb-rs207.genomic-reps.dna.k31.zip
Do your first real run!
Execute:
genome-grist run conf-tutorial.yml summarize_gather summarize_mapping
This will perform the following steps:
- download the SRR5950647 metagenome from the Sequence Read Archive (target
download_reads
). - preprocess it to remove adapters and low-quality reads (target
trim_reads
). - build a sourmash signature from the preprocess reads. (target
smash_reads
). - perform a
sourmash gather
against the specified database (targetgather_reads
). - download the matching genomes from GenBank into
genbank_cache/
(targetdownload_matching_genomes
). - map the metagenome reads to the various genomes (target
map_reads
). - produce two summary notebooks (targets
summarize_gather
andsummarize_mapping
).
You can put one or more targets on the command line as above with summarize_gather
and summarize_mapping
.
Output files
Some key output files under the outputs directory are:
gather/{sample}.gather.out
- human-readable output from sourmash gather.gather/{sample}.gather.csv
- sourmash gather CSV output.gather/genomes/
- all of the genomes found across all of the samples.gather/{sample}.genomes.info.csv
- information about the matching genomes from genbank.mapping/{sample}.summary.csv
- summary information about mapped readsreports/report-{sample}.html
- a summary report.trim/{sample}.trim.fq.gz
- trimmed and preprocessed reads.sigs/{sample}.trim.sig.zip
- sourmash signature for the preprocessed reads.
Note that genome-grist run <config.yml> zip
will create a file named <output_dir>.zip
with the above files in it.
Please see the guide to genome-grist output files for more information!