FastQC¶

We use FastQC to assess quality of sequencing data before and after adapter trimming.

Modern high-throughput sequencers can generate hundreds of millions of sequences in a single run. Before analysing these sequences to draw biological conclusions, you should always perform some simple quality control checks to ensure that the raw data looks good and that there are no problems or biases in your data that may limit how you can use it.

Most sequencers will generate a QC report as part of their analysis pipeline, but this usually focuses only on problems generated by the sequencer itself. FastQC aims to provide a QC report that can spot problems originating in either the sequencer or the starting library material.

FastQC can be run in one of two modes: as a standalone interactive application for the immediate analysis of small numbers of FastQ files, or in a non-interactive mode suitable for integration into a larger analysis pipeline for the systematic processing of large numbers of files.

Check out these examples of good and bad Illumina data.

Note that FastQC only calculates certain statistics (like duplicated sequences) for subsets of the data (e.g. duplicate sequences are only analyzed for the first 100,000 sequences in each file).
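To make the idea of subsetting concrete, here is a minimal Python sketch that estimates a duplication rate from only the first reads of a FASTQ file. It is a simplified illustration, not FastQC's actual algorithm. The function names and the `limit` parameter are hypothetical and chosen only for this example.

```python
import gzip
from collections import Counter
from itertools import islice


def read_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record (plain or gzipped)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # The second line of every 4-line record holds the bases.
        for line in islice(handle, 1, None, 4):
            yield line.strip()


def duplication_fraction(sequences, limit=100_000):
    """Fraction of the first `limit` sequences that repeat an earlier one."""
    counts = Counter(islice(sequences, limit))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every copy beyond the first of each distinct sequence is a duplicate.
    return (total - len(counts)) / total
```

For a file on disk, this could be called as `duplication_fraction(read_sequences("sample_1.fq.gz"))`, where the file name is a placeholder. Because only the first `limit` reads are inspected, duplicates that first appear later in the file are not counted.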

Quickstart¶

Run FastQC via the "default" Eel Pond workflow or via the preprocess subworkflow. To run FastQC as a standalone rule, see the "Advanced Usage" section below.

Output files:¶

Your main output directory will be determined by your config file: by default it is BASENAME_out (you specify BASENAME).

FastQC will output quality control files in the preprocess/fastqc subdirectory of this output directory. All output file names will end in .fastqc.html or .fastqc.zip.
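Once the reports exist, it can be handy to collect the per-module PASS/WARN/FAIL results without opening each HTML file. The sketch below assumes each zip archive contains the usual FastQC `summary.txt` file (tab-separated: status, module name, input file name). The function names are hypothetical, and the directory passed to `collect_reports` depends on your own BASENAME.

```python
import zipfile
from pathlib import Path


def parse_summary(text):
    """Map each FastQC module name to its PASS/WARN/FAIL status."""
    statuses = {}
    for line in text.splitlines():
        fields = line.split("\t")
        # Each line is: status <TAB> module name <TAB> input file name.
        if len(fields) >= 2:
            statuses[fields[1]] = fields[0]
    return statuses


def read_fastqc_zip(zip_source):
    """Return module statuses from the summary.txt inside one FastQC zip."""
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith("summary.txt"):
                return parse_summary(archive.read(name).decode("utf-8"))
    return {}  # no summary.txt found (assumption: treat as empty)


def collect_reports(fastqc_dir):
    """Read every *.fastqc.zip in the FastQC output directory."""
    return {
        path.name: read_fastqc_zip(path)
        for path in sorted(Path(fastqc_dir).glob("*.fastqc.zip"))
    }
```

For example, with a BASENAME of `mydata` (a placeholder), `collect_reports("mydata_out/preprocess/fastqc")` would return one dictionary of module statuses per zip file.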

Modifying Params for FastQC:¶

Be sure to set up your sample info and build a config file first (see Understanding and Configuring Workflows).

To see the available parameters for the fastqc rule, run

elvers config fastqc --print_params

In the output, you'll see a section for "fastqc" parameters that looks like this:

  ####################  fastqc  ####################
fastqc:
  extra:
  #####################################################

There's almost nothing in here because we use default params. However, you can modify the extra param to pass any extra fastqc parameters, e.g.:

  extra: '--someflag someparam --someotherflag thatotherparam'

See the FastQC documentation for any options you could add. Be sure the modified lines go into the config file you're using to run elvers (see Understanding and Configuring Workflows).
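Before editing the config file, it can help to check how an extra string will break into separate command-line arguments, especially when a value contains spaces. This is a small sketch that assumes the string is eventually handed to the fastqc command line through a POSIX-style shell; how elvers passes it along is not confirmed here. The `preview_extra` helper is hypothetical. The flags shown (`--nogroup`, `--kmers`, `--adapters`) are options listed by `fastqc --help`.

```python
import shlex


def preview_extra(extra):
    """Split an `extra` string the way a POSIX shell would tokenize it."""
    return shlex.split(extra)


# Two flags, one of which takes a value.
print(preview_extra("--nogroup --kmers 7"))
# → ['--nogroup', '--kmers', '7']
```

A path containing spaces needs inner quotes to stay a single argument: `preview_extra("--adapters 'my adapters.txt'")` keeps `my adapters.txt` together as one token.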

Advanced Usage: Running FastQC as a standalone rule¶

You can run fastqc as a standalone rule, instead of within a larger elvers workflow. However, to do this, you need to make sure the input files are available.

FastQC takes two sets of input files: your raw input data (either downloaded or linked into the input_data directory via get_data) and the files that have been quality trimmed.

If you've already done this, you can run:

elvers my_config fastqc

If not, you can run the get_data and trimmomatic rules together with fastqc so that the input files fastqc needs are generated first:

elvers my_config get_data trimmomatic fastqc

Snakemake rule¶

We use a local copy of the fastqc snakemake wrapper to run FastQC.

For Snakemake aficionados, see the rule on GitHub.