Databases

MySQL server is tgac-db1.nbi.ac.uk at the moment the database is buntingd_fieldpathgenomics.

Sample Metadata Table

This needs to happen. I currently have about 6 different versions of the excel files with sample information in. I tried to unify them but it’s non-trivial as there is no standardisation. Diane/Antoine please advise!

Alignments Statistics Table

This table stores the STAR alignments logs and is populated by the Library.AlignmentStats task. Schema currently is

columns = [
    (["Library", sqlalchemy.String(64)], {}),
    (["input_reads", sqlalchemy.INTEGER], {}),
    (["input_len", sqlalchemy.FLOAT], {}),
    (["mapped_reads", sqlalchemy.INTEGER], {}),
    (["mapped_reads_pc", sqlalchemy.String(10)], {}),
    (["mapped_len", sqlalchemy.FLOAT], {}),
    (["mismatch_pc", sqlalchemy.String(10)], {}),
    (["datetime", sqlalchemy.String(25)], {}),
    (["genome", sqlalchemy.String(25)], {}),
    (["git_commit", sqlalchemy.String(40)], {}),
    (["pipeline_hash", sqlalchemy.String(40)], {}),
]

I find the easiest way to access the database is to use a Jupyter notebook and pandas to query. For example to plot the percentage of mapped reads per library:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
connection_string = "mysql+pymysql://tgac:tgac_bioinf@tgac-db1.hpccluster/buntingd_fieldpathogenomics"
df = pd.read_sql('AlignmentStats', connection_string).apply(pd.to_numeric, args=('ignore',))
df.plot(kind='bar', x='Library', y='mapped_reads_pc')