A genome-grist quickstart

Installation

We suggest installing in an isolated conda environment. The following will create a new environment, activate it, and install the latest version of genome-grist from PyPI.

Run:

conda create -y -n grist python=3.12 pip
conda activate grist
python -m pip install genome-grist

Note: genome-grist should run on Python 3.11 and later (as of June 2025).

Running genome-grist

We currently recommend running genome-grist in its own directory, for several reasons; in particular, genome-grist uses snakemake and conda to install software under the working directory, and it's convenient to have all of those files shared across projects.

Within the current working directory, genome-grist will create a genbank_cache/ subdirectory and any outputs.NAME/ subdirectories requested by the configuration. We recommend always running genome-grist in this directory and naming the output directories after the different projects that use genome-grist.

So, create a subdirectory and change into it:

mkdir grist/
cd grist/

Note: genome-grist works entirely within the current working directory and temporary directories.

Download a small example database

Download the GTDB rs226 set of 143,614 species-level genomes, in a pre-prepared sourmash database format:

curl -LO https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs226/gtdb-rs226-reps.k31.sig.zip

You can use any sourmash database that has GenBank identifiers with genome-grist; see available databases for more info. You can also use private databases; see the configuration docs for details.

Make a configuration file

Put the following in a config file named conf-tutorial.yml:

samples:
- SRR5950647
outdir: outputs.tutorial/

sourmash_databases:
- gtdb-rs226-reps.k31.sig.zip

Do your first real run!

Execute:

genome-grist run conf-tutorial.yml summarize_gather summarize_mapping

This will perform the following steps:

download the SRR5950647 metagenome from the Sequence Read Archive (target download_reads).
preprocess it to remove adapters and low-quality reads (target trim_reads).
build a sourmash signature from the preprocessed reads (target smash_reads).
perform a sourmash gather against the specified database (target gather_reads).
download the matching genomes from GenBank into genbank_cache/ (target download_matching_genomes).
map the metagenome reads to the various genomes (target map_reads).
produce two summary notebooks (targets summarize_gather and summarize_mapping).

You can put one or more targets on the command line, as done above with summarize_gather and summarize_mapping.
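
If you drive genome-grist from a Python script, the command above could be assembled and launched as in the sketch below. The helper names grist_command and run_grist are hypothetical; the sketch uses only the genome-grist run <config> <targets...> form shown in this tutorial and assumes genome-grist is on your PATH.

```python
import shlex
import subprocess


def grist_command(config_file, targets):
    """Build the argument list for `genome-grist run <config> <targets...>`."""
    return ["genome-grist", "run", config_file, *targets]


def run_grist(config_file, targets):
    """Run genome-grist; raises CalledProcessError on a non-zero exit status."""
    cmd = grist_command(config_file, targets)
    print("running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Show the command without executing it; call run_grist(...) to execute.
    cmd = grist_command("conf-tutorial.yml", ["summarize_gather", "summarize_mapping"])
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting issues if a config path contains spaces.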

Output files

Some key output files under the outputs directory are:

gather/{sample}.gather.out - human-readable output from sourmash gather.
gather/{sample}.gather.csv - sourmash gather CSV output.
gather/genomes/ - all of the genomes found across all of the samples.
gather/{sample}.genomes.info.csv - information about the matching genomes from GenBank.
mapping/{sample}.summary.csv - summary information about mapped reads.
reports/report-{sample}.html - a summary report.
trim/{sample}.trim.fq.gz - trimmed and preprocessed reads.
sigs/{sample}.trim.sig.zip - sourmash signature for the preprocessed reads.

Note that genome-grist run <config.yml> zip will create a file named <output_dir>.zip with the above files in it.
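
To check which of the per-sample files listed above exist after a run, something like the following Python sketch may help. The path templates are copied from the list above; the helper names expected_outputs and missing_outputs are hypothetical, and the sketch does not inspect file contents.

```python
from pathlib import Path

# Per-sample output paths, relative to the output directory,
# taken from the list of key output files in this tutorial.
PER_SAMPLE_OUTPUTS = [
    "gather/{sample}.gather.out",
    "gather/{sample}.gather.csv",
    "gather/{sample}.genomes.info.csv",
    "mapping/{sample}.summary.csv",
    "reports/report-{sample}.html",
    "trim/{sample}.trim.fq.gz",
    "sigs/{sample}.trim.sig.zip",
]


def expected_outputs(outdir, sample):
    """Return the expected per-sample output paths under outdir."""
    return [Path(outdir) / t.format(sample=sample) for t in PER_SAMPLE_OUTPUTS]


def missing_outputs(outdir, sample):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_outputs(outdir, sample) if not p.exists()]


if __name__ == "__main__":
    for path in missing_outputs("outputs.tutorial", "SRR5950647"):
        print("missing:", path)
```

Run it from the directory that contains outputs.tutorial/; it prints nothing if all of the listed files are present.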

Please see the guide to genome-grist output files for more information!
