A genome-grist quickstart
Installation
We suggest installing in an isolated conda environment. The following will create a new environment, activate it, and install the latest version of genome-grist from PyPI (which is ).
Run:
conda create -y -n grist python=3.12 pip
conda activate grist
python -m pip install genome-grist
Note: genome-grist should run in Python 3.11 onwards (as of June 2025).
Running genome-grist
We currently recommend running genome-grist in its own directory, for several reasons; in particular, genome-grist uses snakemake and conda to install software under the working directory, and it's nice to have all of files be shared.
Within the current working directory, genome-grist will create a
genbank_cache/ subdir, and any outputs.NAME/ subdirectories
requested by the configuration. We recommend always running
genome-grist in this directory and naming the output directories after
the different projects using genome-grist.
So, create a subdirectory and change into it:
mkdir grist/
cd grist/
Note, genome-grist works entirely within the current working directory and temp directories.
Download a small example database
Download the GTDB rs226 set of 143,614 species-level genomes, in a pre-prepared sourmash database format:
curl -LO https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs226/gtdb-rs226-reps.k31.sig.zip
You can use any sourmash database with Genbank identifiers with genome-grist; see available databases for more info. You can also use private databases; see the configuration docs for more info.
Make a configuration file
Put the following in a config file named conf-tutorial.yml:
samples:
- SRR5950647
outdir: outputs.tutorial/
sourmash_databases:
- gtdb-rs226-reps.k31.sig.zip
Do your first real run!
Execute:
genome-grist run conf-tutorial.yml summarize_gather summarize_mapping
This will perform the following steps:
- download the SRR5950647 metagenome from the Sequence Read Archive (target
download_reads). - preprocess it to remove adapters and low-quality reads (target
trim_reads). - build a sourmash signature from the preprocess reads. (target
smash_reads). - perform a
sourmash gatheragainst the specified database (targetgather_reads). - download the matching genomes from GenBank into
genbank_cache/(targetdownload_matching_genomes). - map the metagenome reads to the various genomes (target
map_reads). - produce two summary notebooks (targets
summarize_gatherandsummarize_mapping).
You can put one or more targets on the command line as above with summarize_gather and summarize_mapping.
Output files
Some key output files under the outputs directory are:
gather/{sample}.gather.out- human-readable output from sourmash gather.gather/{sample}.gather.csv- sourmash gather CSV output.gather/genomes/- all of the genomes found across all of the samples.gather/{sample}.genomes.info.csv- information about the matching genomes from genbank.mapping/{sample}.summary.csv- summary information about mapped readsreports/report-{sample}.html- a summary report.trim/{sample}.trim.fq.gz- trimmed and preprocessed reads.sigs/{sample}.trim.sig.zip- sourmash signature for the preprocessed reads.
Note that genome-grist run <config.yml> zip will create a file named <output_dir>.zip with the above files in it.
Please see the guide to genome-grist output files for more information!