Introduction
My aim is to write a set of pipelines completely covering the field pathogenomics project from reads to paper!
The Luigi Workflow Manager
Luigi is a library for running complex workflows. A Luigi workflow is a graph of Task objects, which a worker executes under the direction of the central scheduler. The task graph can be viewed, and execution monitored, from the web interface. Compared with shell scripts and similar approaches, using Luigi increases robustness and flexibility:
- Easy to insert/remove tasks
- Scheduler will retry tasks that fail
- Recovery from incomplete state: each task is atomic, so execution can resume after a failure
- Scheduler lazily executes the task graph, reusing existing data where possible
- Easy to interface with SQL
- Pipelines like this are quick to scaffold, which allows for rapid prototyping
- SLURM accounting data is captured and stored, allowing a postmortem and plots of RAM, CPU and disk use throughout the pipeline
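The lazy, resumable behaviour described in the list above can be illustrated with a toy model. The sketch below uses only the standard library and is not Luigi's implementation. The method names `requires`, `output` and `run` mirror Luigi's Task interface, but the `build` function, the `write_atomically` helper and the two task classes (`TrimReads`, `CallSNPs`) are hypothetical names invented for this example. Retries, the central scheduler and the web interface are left out.

```python
import os
import tempfile
from pathlib import Path


class Task:
    """Toy task: it counts as complete once its output file exists."""

    def __init__(self, workdir):
        self.workdir = Path(workdir)

    def requires(self):
        return []  # upstream tasks that must be complete first

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        return self.output().exists()

    def write_atomically(self, text):
        # Write to a temporary file and rename it into place, so a crash
        # never leaves a half-written output that looks complete.
        target = self.output()
        fd, tmp = tempfile.mkstemp(dir=str(target.parent))
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        os.replace(tmp, target)


class TrimReads(Task):  # hypothetical pipeline step
    def output(self):
        return self.workdir / "trimmed.txt"

    def run(self):
        self.write_atomically("ACGT\nTTGA\n")


class CallSNPs(Task):  # hypothetical pipeline step
    def requires(self):
        return [TrimReads(self.workdir)]

    def output(self):
        return self.workdir / "snps.txt"

    def run(self):
        reads = self.requires()[0].output().read_text().split()
        self.write_atomically(f"{len(reads)} reads processed\n")


def build(task, executed=None):
    """Run `task` after its dependencies, skipping anything already complete."""
    executed = [] if executed is None else executed
    if task.complete():
        return executed  # lazy: existing output is reused
    for dependency in task.requires():
        build(dependency, executed)
    task.run()
    executed.append(type(task).__name__)
    return executed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(build(CallSNPs(workdir)))  # first call runs both tasks
        print(build(CallSNPs(workdir)))  # second call finds nothing to do
```

Because completeness is judged only by whether the output exists, deleting one output and calling `build` again reruns just that task. This is the same property that makes it easy to insert or remove steps in the graph.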
General Set-up
All the pipelines are wrapped up in a Python package, FieldPathogenomics, which can be installed directly from GitHub using pip. Combined with virtualenvs, this makes it very simple to start from nothing, check out a working version of the pipeline and run it. It also makes it possible to keep separate development and production environments.
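As a rough sketch of that set-up, the snippet below creates a virtual environment with the standard-library `venv` module and builds a pip command using pip's `git+<url>@<ref>` syntax. The function names, environment names, repository URL and tag are all placeholders, since the actual repository location is not given here.

```python
import subprocess
import sys
import venv
from pathlib import Path


def pip_install_command(env_dir, repo_url, ref=None):
    """Build the command that installs a Git-hosted package into a virtualenv."""
    bin_dir = "Scripts" if sys.platform == "win32" else "bin"
    python = Path(env_dir) / bin_dir / "python"
    requirement = f"git+{repo_url}"
    if ref:
        requirement += f"@{ref}"  # pin a branch, tag or commit
    return [str(python), "-m", "pip", "install", requirement]


def install(env_dir, repo_url, ref=None):
    """Create the environment, then install the package into it."""
    venv.create(env_dir, with_pip=True)
    subprocess.run(pip_install_command(env_dir, repo_url, ref), check=True)


if __name__ == "__main__":
    # Placeholder URL and tag: substitute the real repository and release.
    repo = "https://github.com/<user>/FieldPathogenomics.git"
    print(pip_install_command("env-prod", repo, ref="<release-tag>"))
    print(pip_install_command("env-dev", repo))
```

Pinning the production environment to a tag or commit while the development environment follows the default branch is one way to keep the two separate.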
Next Steps
In rough order of decreasing priority:
- Write pipelines for downstream analyses (DAPC, popgen, STRUCTURE)
- Use Mikado and Portcullis to improve low-quality SNPs around splice junctions (this would make a really nice project)
- Compare with other SNP calling programs and the original pipeline
- Phaser: just published in Nature, a very nice method for phasing RNA-seq data
- Explore exciting new downstream analysis methods (I have a list)
Test and documentation coverage also needs to increase generally.