Tools and Services

Mitch Bekritsky edited this page Oct 17, 2018 · 4 revisions

Introduction

Although this project's primary location is BaseSpace Sequence Hub (BSSH), many of its features are provided by tools and services outside BSSH. These tools and services read directly from BSSH's S3 storage, so they can provide their functionality without replicating the data.

This page lists the tools and services that build on the BSSH data:

BaseSpace CLI tools

BaseSpace CLI tools allow you to access your BaseSpace data from the Linux command line.

The two main tools to use with this Polaris project are:

  • BaseMount: allows you to mount the Polaris data on your Linux Amazon instance as a virtual filesystem.
    When Linux tools (cat, grep, samtools, bcftools, etc.) access mounted BSSH files, BaseMount downloads only the required parts of the files, so you get results faster than if you had downloaded the entire files.

  • bs cp: allows you to efficiently copy files (or entire appResults) from BSSH to your local drive. To get acceptable transfer times and low latency, you should only do this from an Amazon instance located in the same region as the BSSH server (Frankfurt for this Polaris project).
    Copying entire files is useful when BaseMount is not sufficient, that is, when you plan to repeatedly query files that won't fit in the disk cache. For example, we use this technique in the Hail tutorial because large Hail queries easily fill up the RAM, which flushes VDS file data out of the disk cache and forces it to be downloaded again at the next query.
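
The trade-off between the two tools can be illustrated with a minimal Python sketch. It is not part of the BaseSpace CLI: the function names and the cache budget are hypothetical and chosen only for illustration. The first function reads a byte range with seek, which is the kind of access pattern where a mounted file only needs the requested part. The second applies the rule of thumb above: copy the files locally when their total size exceeds what the disk cache can be expected to hold.

```python
import os


def read_byte_range(path, offset, length):
    """Read `length` bytes starting at `offset`, without reading the rest of the file.

    On a virtual filesystem that fetches data on demand, only the requested
    region should need to be downloaded (assumption based on the description above).
    """
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)


def total_size(paths):
    """Return the combined size in bytes of all files in `paths`."""
    return sum(os.path.getsize(p) for p in paths)


def should_copy_locally(paths, cache_budget_bytes):
    """Hypothetical heuristic: copy files that will be queried repeatedly
    when together they are larger than the cache budget."""
    return total_size(paths) > cache_budget_bytes
```

For example, `read_byte_range(path, 0, 1024)` would fetch only the first kilobyte of a file under your mount point, while `should_copy_locally(paths, cache_budget_bytes)` returning `True` suggests that a full copy is likely to be the better choice for repeated queries. The cache budget is an estimate you supply yourself, for instance the RAM left over once your queries are running.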

Hail

Hail is an open-source, scalable framework for exploring and analyzing genomic data.
For Hail to work, we had to replicate the data: the Hail appResult is a conversion of the generic GVCF Genotyper appResult to Hail's VDS format.

Hail's main strength is its ability to manipulate and query a genetic dataset through its comprehensive query language.
Although it requires Java and Spark to be installed, these give Hail the ability to run in parallel on any number of cores and machines (as opposed to bcftools, which is faster on a single core but can hardly take advantage of more than one or two cores).

This tutorial describes how to query the Polaris 1 Diversity Cohort data with Hail.