I'm a data engineer with a background in bioinformatics. I currently work for the Next-Generation Precision Medicine Program (NGPMP) at the Hudson Institute's Centre for Cancer Research (CCR).
Fewer than 1 in 5 children with cancer are found to have actionable mutations that can be targeted with existing drug therapies. To improve these odds, our program has developed an extensive collection of paediatric cancer cell line models, which undergo a comprehensive set of functional genomic screens to identify novel drivers of low-survival paediatric cancers. High-throughput drug screens are then used to identify potential treatments that precisely target these novel mutations. Published data is made available through the Childhood Cancer Model Atlas (CCMA).
Our program produces terabytes of data that must be stored, processed, cleaned, and annotated before being disseminated to researchers for downstream analysis. My role as data engineer is to manage this data lifecycle effectively so that researchers can spend more time on analysis and less time on data wrangling.
This was a challenge when I first started the role, as no data strategy or governance plan was in place, so it was my responsibility to draft and implement such plans. The approach I chose for our small team of three was to leverage existing open-source genomics pipelines (e.g. nf-core community pipelines) where possible, minimising the maintenance and development overhead associated with in-house pipelines. Costs are further reduced by using freely available resources such as MASSIVE M3 as our cluster environment and the ARDC Nectar Research Cloud as our database host. Across our codebases, I also try to restrict the frameworks and languages used to those that are common competencies in biology and bioinformatics (e.g. Bash, R, Python, SQL) to minimise the cost of training.
In the last two years, I have made significant strides in moving our organisation's data management processes from data awareness to data proficiency:
- Shifted data processing from ad-hoc, local processing with R scripts to a consolidated Extract, Transform, Load (ETL) pipeline, where data from genomic pipelines is loaded into a PostgreSQL database and transformed using dbt into cleaned, annotated, and tested datasets.
- Improved dataset accessibility by allowing researchers to directly query the CCMA database through dbplyr in R, without any required knowledge of SQL. This minimised the need for researchers to download gigabytes of data and process it locally.
- Consolidated post-processing scripts that generate datasets into a Git-controlled codebase, focusing on modular, reusable scripts that follow the basics of the UNIX philosophy.
- Reorganised raw and processed data storage according to a standardised schema, improving accessibility, reducing quota usage (with a reduction of ~60 TB from deduplication), and optimising file formats for tape recall.
- Standardised the naming schemes for cell lines across the organisation to improve the searchability of cell lines and ensure that they can act as functional natural keys across datasets, while remaining recognisable for easy use in lab environments.
- Developed several critical Google Sheets used by non-technical teams to track the progress of assays, keep track of inventories, and manage metadata on cell lines. Data validation and protection rules are used to ensure data reliability.
- Consolidated organisational documents scattered across local files and personal drives into a managed Google shared drive with an ordered file structure and tag system that facilitates easy file discovery.
I am well versed in bioinformatics, which is essential for processing a broad variety of genomic data effectively and accurately. Apart from my love of data engineering, I am also passionate about using statistics and effective data visualisation to make data-driven decisions. My other hobbies include:
- 🪓 Woodworking (i.e. collecting tools that I may someday use)
- 🕹️ Gaming (the more byzantine, the better, e.g. Dwarf Fortress, Underrail)
- 📷 Photography (I was particularly prolific when I studied agriculture, and there were plenty of canola fields...)
I would love to hear from you! I'm always happy to discuss my experiences and to hear more about any opportunities.
Note: These statistics only apply to commits on public repos. Most of my recent commits would be to private work-related projects.