Merging Read Pairs with PEAR¶
In order to map PE nucleotide reads to protein assemblies with Paladin
, which cannot yet use paired end reads, we use PEAR
to merge PE reads prior to mapping.
From the PEAR
documentation:
PEAR is an ultrafast, memory-efficient and highly accurate pair-end read merger. It is fully parallelized and can run with as low as just a few kilobytes of memory.
PEAR evaluates all possible paired-end read overlaps and without requiring the target fragment size as input. In addition, it implements a statistical test for minimizing false-positive results. Together with a highly optimized implementation, it can merge millions of paired end reads within a couple of minutes on a standard desktop computer.
Quickstart: Running PEAR with elvers¶
We recommend you run pear
as part of the paladin_map subworkflow
elvers examples/nema.yaml paladin_map
This will run trimmomatic trimming prior to PEAR merging of paired end reads and then paladin mapping. If you'd like to just run PEAR
, see "Advanced Usage" below.
PEAR Command¶
On the command line, the command elvers runs for each set of files is approximately:
pear -f forward_reads -r reverse_reads \
-p p_value -j snakemake.threads -y max_memory \
extra -o output_basename
Output files:¶
Your main output directory will be determined by your config file: by default it is BASENAME_out
(you specify BASENAME).
PEAR will output files in the preprocess/pear
subdirectory of this output directory. PEAR output will have the same sample name as input, but end with .pear.fq.gz
.
Modifying Params for PEAR:¶
Be sure to set up your sample info and build a configfile first (see Understanding and Configuring Workflows).
To see the available parameters for the PEAR
rule, run
elvers config pear --print_params
This will print the following:
#################### pear ####################
pear:
input_kmer_trimmed: false
input_trimmomatic_trimmed: true
max_memory: 4G
pval: 0.01
extra: ''
#####################################################
Please modify the max_memory
parameter to fit the needs of your reads and system.
Within the Protein Assembly workflow or the paladin_map subworkflow, these options enable you to choose kmer-trimmed, quality-trimmed, or raw sequencing data as input. We recommend using quality-trimmed reads as input. If both input_kmer_trimmed
and input_trimmomatic_trimmed
are False
, we will just use raw reads from the samples.tsv
file.
In addition to changing parameters we've specifically enabled, you can modify the extra
param to pass in additional parameters, e.g.:
extra: ' --some_param that_param '
Please see the PEAR documentation for info on the params you can pass into PEAR.
Be sure the modified lines go into the config file you're using to run elvers
(see Understanding and Configuring Workflows).
Advanced Usage: Running PEAR as a standalone rule¶
You can run pear as a standalone rule, instead of withing a larger elvers
workflow. However, to do this, you need to make sure the input files are available.
For pear, the default input files are quality-trimmed input data (e.g. output of trimmomatic).
If you've already done this, you can run:
elvers my_config pear
If not, you can run the prior steps at the same time to make sure pear can find these input files:
elvers my_config get_data trimmomatic pear
Note, PEAR
only works on paired end reads, as there's nothing to do for single end reads!
Snakemake Rule¶
We wrote a new PEAR snakemake wrapper to run PEAR via snakemake. This wrapper has not been added to the official snakemake-wrappers repo yet, but feel free to use as needed.
For snakemake afficionados, see our pear rule on github.