SpikyClip/README.md


Hi there, I'm Vikesh Ajith (SpikyClip)

I'm a data engineer with a background in bioinformatics. I currently work for the Next-Generation Precision Medicine Program (NGPMP) at the Hudson Institute's Centre for Cancer Research (CCR).

Fewer than 1 in 5 children with cancer are found to have actionable mutations that are targetable with existing drug therapies. To improve these odds, our program has developed an extensive collection of paediatric cancer cell line models, which undergo a comprehensive set of functional genomic screens to identify novel drivers of low-survival paediatric cancers. High-throughput drug screens are then used to identify potential treatments that precisely target these novel mutations. Published data is made available through the Childhood Cancer Model Atlas (CCMA).

My Contributions

Our program produces terabytes of data that has to be stored, processed, cleaned, and annotated before being disseminated to researchers for downstream analysis. My role as data engineer is to effectively manage the above data lifecycle so that researchers can spend more time on analysis and less time on data wrangling.

This was a challenge when I first started the role, as no data strategy or governance plan was in place, so it was my responsibility to draft and implement such plans. The approach I chose for our small team of three was to leverage existing open-source genomics pipelines (e.g. nf-core community pipelines) where possible, minimising the maintenance and development overhead associated with in-house pipelines. Costs are further reduced by using freely available resources such as MASSIVE M3 as our cluster environment and the ARDC Nectar Research Cloud as our database host. Across our codebases, I also try to restrict the frameworks and languages used to those commonly known in biology and bioinformatics (e.g. Bash, R, Python, SQL) to minimise training costs.

In the last two years, I have made significant strides in moving our organisation's data management processes from data awareness to data proficiency:

  1. Shifted data processing from ad-hoc, local processing with R scripts to a consolidated Extract, Transform, Load (ETL) pipeline where data from genomic pipelines is loaded into a PostgreSQL database and transformed using dbt into cleaned, annotated, and tested datasets.
  2. Improved dataset accessibility by allowing researchers to directly query the CCMA database through dbplyr in R, without any required knowledge of SQL. This minimised the need for researchers to download gigabytes of data and process it locally.
  3. Consolidated post-processing scripts that generate datasets into a Git-controlled codebase, focusing on modular, reusable scripts that follow the basics of the UNIX philosophy.
  4. Reorganised raw and processed data storage according to a standardised schema, improving accessibility, reducing quota usage (with a reduction of ~60 TB from deduplication), and optimising file formats for tape recall.
  5. Standardised the naming schemes for cell lines across the organisation to improve the searchability of cell lines and ensure that they can act as functional natural keys across datasets, while remaining recognisable for easy use in lab environments.
  6. Developed several critical Google Sheets used by non-technical teams to track the progress of assays, keep track of inventories, and manage metadata on cell lines. Data validation and protection rules are used to ensure data reliability.
  7. Consolidated organisational documents scattered across local files and personal drives into a managed Google shared drive with an ordered file structure and tag system that facilitates easy file discovery.
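
To illustrate item 5 above, here is a minimal Python sketch of how name normalisation can turn free-text cell line names into a usable natural key. The rules (uppercase, ASCII only, single hyphen separators), the function names, and the example names are all assumptions for illustration; they are not the actual CCMA naming scheme.

```python
import re
import unicodedata


def normalise_cell_line_name(raw: str) -> str:
    """Return a hypothetical standardised key for a cell line name.

    Assumed rules (illustrative only): strip accents, uppercase,
    and collapse any run of non-alphanumeric characters to one hyphen.
    """
    # Decompose accented characters, then drop anything outside ASCII.
    text = unicodedata.normalize("NFKD", raw)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Uppercase and replace separator runs (spaces, underscores, etc.).
    text = re.sub(r"[^A-Za-z0-9]+", "-", text.upper())
    return text.strip("-")


def find_key_collisions(names):
    """Map each normalised key to the distinct raw spellings behind it.

    Only keys with more than one raw spelling are returned, which is
    a quick way to spot the same line recorded under several names.
    """
    groups = {}
    for name in names:
        groups.setdefault(normalise_cell_line_name(name), set()).add(name)
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


if __name__ == "__main__":
    # Made-up names, not real cell lines.
    print(find_key_collisions(["Demo_Line 01", "demo-line-01", "Other Line 7"]))
```

Because every spelling maps to the same key, that key can be used to join datasets that recorded the same line under different names.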

I am well versed in bioinformatics, which is required to process a broad variety of genomic data effectively and accurately. Apart from my love of data engineering, I am also passionate about using statistics and effective data visualisation to make data-driven decisions. My other hobbies include:

  • 🪓 Woodworking (i.e. collecting tools that I may someday use)
  • 🕹️ Gaming (the more byzantine, the better, e.g. Dwarf Fortress, Underrail)
  • 📷 Photography (I was particularly prolific when I studied agriculture, and there were plenty of canola fields...)

Connect

I would love to hear from you! I'm always happy to discuss my experiences and to hear more about any opportunities.

LinkedIn

Languages

GNU Bash R Python PostgreSQL Nextflow

Software

dbt Tidyverse Slurm Docker

Visual Studio Code DBeaver

OS

Windows Ubuntu

Statistics

Note: These statistics only apply to commits on public repos. Most of my recent commits would be to private work-related projects.

[Chart: Top Langs]

Pinned

  1. llrnaseq (Public, Nextflow)

    SpikyClip/llrnaseq is a simple RNA-seq pipeline adapted to the La Trobe Institute for Molecular Science (LIMS) High Performance Computing Cluster (HPCC).

  2. llrnaseq-rna-features-pipeline (Public, R)

    This readme explains how to use the Nextflow llrnaseq pipeline in conjunction with the rna-features Python package to generate transfer learning expression features.

  3. rna-features (Public, Python)

    `rna-features` is a package used to generate machine-learning features from RNAseq data.

  4. rosalind-solutions (Public, Python)

    Repository for my solutions to Rosalind problems.

  5. advent-of-code (Public, Python)

    Solutions to problems on Advent of Code.