Pipelines

The entire Field Pathogenomics project is broken down into a series of Luigi pipelines. The Library pipeline is run by library and the Callset pipeline is run once per variant callset.

Library Pipeline

The Library pipeline contains all the tasks that should be run on a per sample basis wrapped up in PerLibPipeline and a batch task to run PerLibPipeline on each library in lib-list

The key steps are:
  • Locating and concatenating FASTQ files
  • Read trimming for quality and TrueSeq adaptors using Trimommatic
  • Read QC with FastQC
  • Alignment to reference with STAR
  • Variant calling with HaplotypeCaller
  • Plotting allele frequency

Inputs

Reads .fastq.gz

The first task in the pipeline FetchFastqGZ searches read-dir for files matching the pattern *library*_R1.fastq.gz and *library*_R2.fastq.gz. If multiple matches are found they are concatenated together in filename sorted order. Symbolic links are ignored

Outputs

dedupped.bam

The aligned, cleaned, annotated de-duplicated BAM file and index.

Log.final.out

The STAR alignment log

AlignmentStats

A row is inserted into the SQL table AlignmentStats containing the alignment statistics.

LIBxxxxx.g.vcf

gVCF file and index output by HaplotypeCaller.

QC/

Directory containing nucleotide distributions, quality scores and allele frequency distributions.

Flags

--base-dir
Base of the output directory structure::
|–BASE_DIR
|–libraries
|–LIBxxxxxx
--scratch-dir

Scratch directory

--read-dir

File glob to pass to find to search for reads in.

--star-genome

STAR genome index

--snp-db

VCF of high quality a priori known SNPs to use for base quality score recalibration (BQSR). If none BQSR is not run

--reference

Reference genome FASTA (must have index in same folder)

--lib-list

JSON formatted list of librariries. The PerLibPipeline is run once for each library in lib-list.

Typically the Library pipeline will be run indirect as part of the Callset pipeline, though could be run directly to just do QC/alignments.

When running the :py:module:`Library` as script like this it takes as its first argument a text file containing libraries one per line which is used to construct lib-list

Callset Pipeline

The Callset pipeline take the individually called gVCF files produced by HaplotypeCaller and co-calls variants to produce the callset, which is then filtered and evaluated.

Key steps:
  • Use CombineGVCF to collapse the single sample gVCFs into a set of multisample gVCFs. Broad recommend ~200 samples per gVCF
  • Run GenotypeGVCF to call variants.
  • Filter with vcftools
  • Separate out SNP, INDELs and non-variant sites.
  • Use SNPeff to get synonymous variant
  • Store the callset in HD5 format for easy access
  • Compute QC statistics and output to Jupyter notebook reports

Inputs

gVCF .g.vcf
Individually called gVCFs output from Library pipeline For each library in –lib-list the pipeline will expect a gVCF –gvcf-dir/LIBxxxxxxx/LIBxxxxxxx.g.vcf
Exon Mask .bed
To speed up the analysis of RNAseq data use the existing gene annotations and only call variants in exon regions.

Outputs

_SNPs[.vcf.gz,.h5]
Filtered single nucleotide variants
_RefSNPs[.vcf.gz,.h5]
Filtered single nucleotide variants and sites called as homokaryotic reference-like
_INDELs.vcf.gz
Filtered multinucleotide variants
_SNPs_syn[.vcf.gz,.h5]
Synonymous sites only
Raw.ipynb
Notebook reporting on QC metrics of the raw callset
Filtered.ipynb
Notebook reporting on QC metrics of the filtered callset
SNPs.ipynb
Notebook reporting on QC metrics of just the SNPs

Flags

--output-prefix

Name to attach to callset

--mask

BED file of regions on which to perform genotyping e.g. exons for RNAseq

--base-dir
Base of the output directory structure::
|–BASE_DIR
|–callsets
|–(output-prefix)
--scratch-dir

Scratch directory

--gVCF-dir

Folder to look for gVCFs in

--reference

Reference genome FASTA (must have index in same folder)

--lib-list

JSON formatted list of librariries to jointly genotype

NB: To speed up the analysis this pipeline makes heavy use of ScatterGather. It scatters the regions in –mask and passes these to GenotypeGVCF though the -L flag. It then scatters the resulting vcf.gz files line-by-line for filtering. The number of scatter gather spits to make for this is set in Callset.py, if running as module as below this parameter is exposed as the second arg.

Callset.py is designed to be run as a script taking a text file containing as list of libraries as the first arg and number of SG splits as the second. The name of the libraries file is taken as the output-prefix for the callset.

./scripts/Callset.sh my_callset.txt 10

Tree Pipeline

This follows on from the Callset pipeline and runs RAxML to make a phylogenetic tree. It takes a list of libraries as an input and will create a new callset containing exactly these libraries if one does not already exist.

Key steps:
  • Apply the SNPs to the reference to create a pair of pseudohaplotype sequences
  • Extract the coding sequence regions described in the GFF
  • For each gene decide whether it has enough coverage (>80%) to be used then concatenate all genes from a sample
  • Decide if each sample has enough coverage (>80%) to be included
  • Get the 3rd codon position
  • Run RAxML and bootstrap

Flags

--gff

GFF file describing the gene annotations

Outputs

./scripts/Tree.sh my_callset.txt

Transcripts Pipeline

This uses various programs for RNAseq transcript assembly to create a set of consensus transcript annotations.

Key steps:
  • Stringtie reference guided transcript assembly
  • Cufflinks reference guided transcript assembly
  • Trinity de novo transcript assembly
  • Portcullis splice junction identification
  • Transdecoder ORF finder
  • BLAST against Uniprot
  • Mikado unification

Flags

Outputs

./scripts/Transcripts.sh my_callset.txt