The Great Repertoire Project

This repository contains code and data used in our study of the baseline human antibody repertoire. Briefly, we performed ultra-deep sequencing of the antibody repertoires of 10 healthy, adult subjects (approximately 3 billion total antibody sequences). The Great Repertoire Project revealed a massively diverse repertoire and, while the repertoires of individual subjects were clearly distinguishable, we found a surprisingly high level of repertoire overlap between individuals.

Code

The code used in this project is assembled into a series of Jupyter notebooks. There are two sets of notebooks: those containing code used for DATA PROCESSING and those containing code used to MAKE FIGURES. GitHub will render each of the notebooks, but the code cannot be executed from within GitHub. If you'd like to actually run the code contained in the notebooks, you must clone the repository.

NOTE: Whenever possible, the intermediate datasets required to run the code are included in this repository. However, many intermediate datasets are too large to be included. In such cases, links to the required datasets are provided in the appropriate notebook.

Datasets

We have generated several large datasets, in two primary groups: antibody sequences from healthy adult subjects, and synthetic antibody sequences using statistical models of V(D)J recombination.

Antibody sequencing data

Raw and processed datasets from each subject can be downloaded using the following links. Some of these datasets are quite large (the compressed raw FASTQs are roughly 100GB per subject, and the uncompressed JSON datasets range from ~100GB to nearly 1TB).

316188
- Sequences: raw FASTQs, consensus FASTAs
- FASTQC: pre-trimming, post-trimming
- Annotated data: consensus CSVs, consensus JSONs
326650
- Sequences: raw FASTQs, consensus FASTAs
- FASTQC: pre-trimming, post-trimming
- Annotated data: consensus CSVs, consensus JSONs
326651
- Sequences: raw FASTQs, consensus FASTAs
- FASTQC: pre-trimming, post-trimming
- Annotated data: consensus CSVs, consensus JSONs
326713
- Sequences: raw FASTQs, consensus FASTAs
- FASTQC: pre-trimming, post-trimming
- Annotated data: consensus CSVs, consensus JSONs
326737
- Sequences: raw FASTQs, consensus FASTAs
- FASTQC: pre-trimming, post-trimming
- Annotated data: consensus CSVs, consensus JSONs
326780
- Sequences: raw FASTQs, consensus FASTAs
- FASTQC: pre-trimming, post-trimming
- Annotated data: consensus CSVs, consensus JSONs
326797
- Sequences: raw FASTQs 1 2, consensus FASTAs
- FASTQC: pre-trimming 1 2, post-trimming 1 2
- Annotated data: consensus CSVs, consensus JSONs
326907
- Sequences: raw FASTQs, consensus FASTAs
- FASTQC: pre-trimming, post-trimming
- Annotated data: consensus CSVs, consensus JSONs
327059
- Sequences: raw FASTQs, consensus FASTAs
- FASTQC: pre-trimming, post-trimming
- Annotated data: consensus CSVs, consensus JSONs
D103
- Sequences: raw FASTQs, consensus FASTAs
- FASTQC: pre-trimming, post-trimming
- Annotated data: consensus CSVs, consensus JSONs
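
Because these downloads are large, it can help to confirm that the target disk has room before starting. The following is a minimal sketch, not part of the project's notebooks: the function name `has_room` and the `headroom` safety factor are hypothetical, and the 100GB figure is simply the approximate per-subject size of the compressed raw FASTQs quoted above.

```python
import shutil

GB = 10**9  # decimal gigabyte


def has_room(path, required_bytes, headroom=1.2):
    """Return True if the filesystem holding `path` has at least
    `required_bytes * headroom` bytes free.

    `headroom` is an illustrative safety margin, not a project setting.
    """
    free = shutil.disk_usage(path).free
    return free >= required_bytes * headroom


# Roughly 100GB: the approximate compressed raw FASTQ size for one subject.
needed = 100 * GB
print("Room for one subject's raw FASTQs:", has_room(".", needed))
```

For the uncompressed JSON datasets, substitute a larger estimate (they range from ~100GB to nearly 1TB).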

For each subject, there are a total of 18 samples: 3 technical replicates of each of 6 biological replicates. Biological replicates refer to different aliquots of peripheral blood mononuclear cells (PBMCs), from which total RNA was separately isolated and processed. Thus, sequences or clonotypes found in multiple biological replicates are assumed to have independently occurred in different cells. Technical replicates refer to independent library preparations using the same aliquot of PBMC-derived RNA. In each of the above datasets, samples 1-6 are biological replicates. Samples 7-12 and 13-18 are technical replicates of samples 1-6.
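
The sample numbering can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and it assumes the technical replicates are listed in the same order as samples 1-6 (that is, samples 7 and 13 are technical replicates of sample 1, samples 8 and 14 of sample 2, and so on). Check the sample names in the datasets before relying on this mapping.

```python
def describe_sample(sample):
    """Map a sample number (1-18) to (biological_replicate, library_prep).

    Assumes samples 7-12 and 13-18 repeat samples 1-6 in the same order.
    """
    if not 1 <= sample <= 18:
        raise ValueError("sample must be between 1 and 18")
    biological_replicate = (sample - 1) % 6 + 1  # PBMC aliquot, 1-6
    library_prep = (sample - 1) // 6 + 1         # library preparation, 1-3
    return biological_replicate, library_prep


def group_by_biological_replicate():
    """Return {biological_replicate: [sample numbers sharing that aliquot]}."""
    groups = {}
    for sample in range(1, 19):
        bio, _ = describe_sample(sample)
        groups.setdefault(bio, []).append(sample)
    return groups


for bio, samples in group_by_biological_replicate().items():
    print("biological replicate", bio, "-> samples", samples)
```

Under this assumption, a sequence seen in samples 1 and 7 comes from the same RNA aliquot, while a sequence seen in samples 1 and 2 comes from different aliquots and is treated as having occurred independently in different cells.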

Due to technical issues, the sequence data for subject 326797 was spread across two HiSeq flowcells. Thus, the raw FASTQs and FASTQC results can be downloaded in two separate batches. Starting with the first processed dataset (UMI-corrected consensus FASTAs), reads from both flowcells were pooled.

Synthetic antibody sequences

We generated synthetic antibody sequences using IGoR. Three datasets of synthetic sequences are available. As with the repertoire sequencing datasets above, the annotated datasets are quite large (uncompressed, each exceeds 1TB in size).

Ten batches of 100M synthetic sequences, generated with IGoR's default V(D)J recombination model:
- FASTAs
- Annotated CSVs
Ten batches of 100M synthetic sequences, generated with subject-specific recombination models, inferred by IGoR using 500,000 unmutated antibody sequences from each subject:
Ten batches of 100M synthetic sequences, generated with a single "combined subject" recombination model, in which a pool of 50,000 unmutated antibody sequences from each subject was used to infer the model:

Requirements

Python 3.3+ (although Python 2.7 may work for many or most notebooks, this has not been tested)
Jupyter Notebook

Additionally, each notebook may require additional third-party Python packages. Any notebook-specific requirements, as well as instructions for package installation with pip, are provided in each notebook.
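
Before opening a notebook, a quick check of the environment can save some time. The sketch below is not taken from the notebooks: the function names are hypothetical, and it assumes Jupyter Notebook is importable under the module name `notebook`. Replace the package list with the requirements stated in the notebook you plan to run.

```python
import importlib
import sys

MIN_PYTHON = (3, 3)  # minimum version listed under Requirements


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def missing_packages(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


print("Python version OK:", python_ok())
# "notebook" is assumed to be the import name for Jupyter Notebook.
print("Missing packages:", missing_packages(["notebook"]))
```

Any package reported as missing can then be installed with pip, following the instructions in the relevant notebook.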

If you're new to Python, a great way to get started is to install the Anaconda Python distribution, which includes pip as well as a ton of useful scientific Python packages.
