pip install git+https://github.com/cobioda/longreadtools.git
For detailed instructions on how to use LongReadTools, please refer to the documentation.
This section provides a practical example of how to apply LongReadTools in a bioinformatics workflow. We demonstrate how to convert isomatrix text files into AnnData objects, which are suitable for high-throughput single-cell genomics analysis. The example covers the necessary steps, from data retrieval and processing to the final conversion using LongReadTools’ specialized functions.
In this section, we retrieve a list of isomatrix files for conversion into AnnData objects. The isomatool module of the LongReadTools library provides the function multiple_isomatrix_conversion, which converts a batch of isomatrix text files into AnnData objects and writes them to disk as .h5ad files, a binary format commonly used for large single-cell genomics datasets.
import os
import re

# Directory containing one sub-directory per sequencing run
directory = '/data/analysis/data_mcandrew/000-sclr-discovair/'

# Keep only the entries whose names end in _BIOP_INT or BIOP_NAS
pattern = re.compile('.*(_BIOP_INT|BIOP_NAS)$')
matching_files = [os.path.join(directory, f) for f in os.listdir(directory) if pattern.match(f)]
print(matching_files)

# Each run directory holds a file named <run>_isomatrix.txt
isomatrix_paths = [os.path.join(f, os.path.basename(f) + '_isomatrix.txt') for f in matching_files]
['/data/analysis/data_mcandrew/000-sclr-discovair/D498_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D492_BIOP_NAS', '/data/analysis/data_mcandrew/000-sclr-discovair/D494_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D500_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D494_BIOP_NAS', '/data/analysis/data_mcandrew/000-sclr-discovair/D496_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D499_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D493_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D493_BIOP_NAS', '/data/analysis/data_mcandrew/000-sclr-discovair/D534_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D490_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D500_BIOP_NAS', '/data/analysis/data_mcandrew/000-sclr-discovair/D495_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D492_BIOP_INT']
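To see which names the pattern above accepts, a small standalone check can help. The first two names are taken from the listing above; the others are illustrative only. Note that the second alternative has no leading underscore, so any name ending in BIOP_NAS is accepted.

```python
import re

# Same pattern as in the example above.
pattern = re.compile('.*(_BIOP_INT|BIOP_NAS)$')

# The first two names appear in the listing above; the rest are illustrative.
candidates = [
    'D498_BIOP_INT',
    'D492_BIOP_NAS',
    'D534_BIOP_PRO',        # different suffix, rejected
    'D498_BIOP_INT.bak',    # trailing text after the suffix, rejected
    'integrated_V10.h5ad',  # unrelated file, rejected
]

# pattern.match anchors at the start; the trailing $ anchors at the end.
matches = [name for name in candidates if pattern.match(name)]
print(matches)
```

Because the regular expression ends in `$`, entries with anything after the suffix are skipped, which keeps stray files in the directory out of `matching_files`.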
Next, we use the isomatool module from the LongReadTools library to convert the isomatrix files listed in isomatrix_paths into AnnData objects. AnnData is a data structure designed for large-scale single-cell genomics data; it stores the count matrix together with its cell and feature annotations and is saved to disk in the binary .h5ad format, which makes it well suited to high-throughput computational analysis. The multiple_isomatrix_conversion function from isomatool performs this batch conversion.
from longreadtools.isomatool import *
import scanpy as sc
converted_isomatrix_paths = multiple_isomatrix_conversion(isomatrix_paths, verbose=True, return_paths = True)
File /data/analysis/data_mcandrew/000-sclr-discovair/D498_BIOP_INT/D498_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D500_BIOP_NAS/D500_BIOP_NAS_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D500_BIOP_INT/D500_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D493_BIOP_NAS/D493_BIOP_NAS_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D494_BIOP_NAS/D494_BIOP_NAS_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D493_BIOP_INT/D493_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D499_BIOP_INT/D499_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D494_BIOP_INT/D494_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D492_BIOP_INT/D492_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D495_BIOP_INT/D495_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D490_BIOP_INT/D490_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D496_BIOP_INT/D496_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D534_BIOP_INT/D534_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D492_BIOP_NAS/D492_BIOP_NAS_isomatrix.h5ad was successfully written to disk.
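As a rough illustration of what such a conversion involves, the sketch below parses a tiny isomatrix-like table with pandas and splits it into feature annotations and a cells × transcripts count table. The file layout (tab-separated, with geneId, transcriptId and nbExons columns followed by one column per cell barcode) is an assumption inferred from the .var columns shown later in this example, and the code is not the implementation used by multiple_isomatrix_conversion.

```python
import io
import pandas as pd

# Hypothetical isomatrix content: tab-separated, one row per transcript,
# three metadata columns followed by one count column per cell barcode.
isomatrix_text = (
    "geneId\ttranscriptId\tnbExons\tAGGAAATGTACAAGCG\tGCCATTCGTCGGAACA\n"
    "CYP4F12\tENST00000548501\t4\t0\t1\n"
    "CALCB\tENST00000324229\t5\t2\t0\n"
)

table = pd.read_csv(io.StringIO(isomatrix_text), sep="\t")

meta_cols = ["geneId", "transcriptId", "nbExons"]

# Feature annotations indexed by transcript (analogous to .var).
var = table[meta_cols].set_index("transcriptId", drop=False)

# Counts transposed to cells x transcripts (analogous to .X plus its names).
counts = table.drop(columns=meta_cols).T.astype("float32")
counts.columns = var.index

print(counts)
print(var)
```

With anndata installed, `anndata.AnnData(X=counts.to_numpy(), obs=pd.DataFrame(index=counts.index), var=var)` would wrap these pieces into a single object.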
andata_concat = concatenate_anndata(converted_isomatrix_paths, verbose = True)
Reading .h5ad files...
Applying feature set standardization...
Concatenating AnnData objects and adding batch keys with scanpy...
Setting .var attribute...
Final Check...
Concatenation complete.
Standardizing anndata features via union: 100%|██████████| 14/14 [01:05<00:00, 4.68s/it]
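The log above mentions standardizing features "via union". One plausible reading is that each sample is expanded to the union of all transcripts seen in any sample, with zeros for transcripts a sample lacks, so that the matrices can be stacked. The sketch below shows that idea with pandas on toy data; the sample and transcript names are made up, and this is not the code of concatenate_anndata.

```python
import pandas as pd

# Two toy samples (cells x transcripts) with partially overlapping transcripts.
sample_a = pd.DataFrame([[1, 0], [2, 3]],
                        index=["cellA1", "cellA2"], columns=["tx1", "tx2"])
sample_b = pd.DataFrame([[5, 4]],
                        index=["cellB1"], columns=["tx2", "tx3"])

# Union of all features across samples (sorted by pandas).
all_features = sample_a.columns.union(sample_b.columns)

# Expand each sample to the full feature set; missing transcripts become 0.
standardized = {
    "sample_a": sample_a.reindex(columns=all_features, fill_value=0),
    "sample_b": sample_b.reindex(columns=all_features, fill_value=0),
}

# Stack the samples and record the sample of origin as a batch label.
combined = pd.concat(list(standardized.values()))
batch = pd.Series(
    [name for name, df in standardized.items() for _ in range(len(df))],
    index=combined.index, name="batch",
)

print(combined)
print(batch)
```

Taking the union rather than the intersection keeps transcripts that were detected in only some runs, at the cost of a wider and sparser matrix.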
Now that we have concatenated the AnnData objects, let’s examine the result to confirm that it is structured correctly and ready for downstream analysis. We display the shape of the matrix, the metadata associated with the observations (cells), and the metadata associated with the variables, which in this matrix are transcripts (the printed message labels them as genes).
# Display the shape of the concatenated Anndata object
print(f"The Anndata object has {andata_concat.n_obs} observations (cells) and {andata_concat.n_vars} variables (genes).")
# Display the first few entries of the observation metadata to inspect batch information and other annotations
print("Observation metadata (first 5 entries):")
print(andata_concat.obs.head())
# Display the first few entries of the variable metadata to inspect gene and transcript information
print("Variable metadata (first 5 entries):")
print(andata_concat.var.head())
# Check for unique observation names and make them unique if necessary
if not andata_concat.obs_names.is_unique:
    andata_concat.obs_names_make_unique()
    print("Observation names were not unique; they have been made unique.")
The Anndata object has 122872 observations (cells) and 89177 variables (genes).
Observation metadata (first 5 entries):
batch
AGGAAATGTACAAGCG D498_BIOP_INT
GCCATTCGTCGGAACA D498_BIOP_INT
TCGACCTCAGTGTGCC D498_BIOP_INT
CGTAGTATCAGTGTGT D498_BIOP_INT
GCCAGGTGTCTAACTG D498_BIOP_INT
Variable metadata (first 5 entries):
geneId transcriptId nbExons
transcriptId
ENST00000548501 CYP4F12 ENST00000548501 4
ENST00000324229 CALCB ENST00000324229 5
ENST00000371489 MYOF ENST00000371489 15
ENST00000368659 SLC27A3 ENST00000368659 2
ENST00000669353 TMEM161B-AS1 ENST00000669353 4
Observation names were not unique; they have been made unique.
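Duplicate observation names like those reported above are expected when several runs are concatenated, because the same barcode sequence can occur in more than one run. A minimal pandas sketch, using made-up barcodes and batch names, shows how to count such duplicates and why pairing each barcode with its batch resolves them.

```python
import pandas as pd

# Toy observation table: barcode index plus a batch column (hypothetical values).
obs = pd.DataFrame(
    {"batch": ["run1", "run1", "run2", "run2"]},
    index=["AAAC", "CCGT", "AAAC", "GGTA"],
)

# Barcodes on their own are not unique across runs.
n_duplicated = int(obs.index.duplicated(keep=False).sum())
print(n_duplicated)

# The combination of batch and barcode is unique.
pairs = obs["batch"] + "_" + obs.index
print(pairs.is_unique)
```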
Access the count matrix of the concatenated AnnData object to inspect the transcript counts.
andata_concat.X
array([[0., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
[1., 1., 2., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
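Most entries in the array shown above are zero, which is typical of single-cell count data. As a hedged sketch, the following NumPy snippet computes a few simple summaries on a small stand-in array; with the real data one would pass `andata_concat.X` instead, assuming it is a dense array as displayed here.

```python
import numpy as np

# Small stand-in for a cells x transcripts count matrix.
X = np.array([[0., 0., 0., 0.],
              [1., 0., 0., 0.],
              [1., 1., 2., 0.]], dtype=np.float32)

counts_per_cell = X.sum(axis=1)             # total counts per cell
transcripts_per_cell = (X > 0).sum(axis=1)  # detected transcripts per cell
fraction_zero = float((X == 0).mean())      # share of zero entries

print(counts_per_cell)
print(transcripts_per_cell)
print(round(fraction_zero, 3))
```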
andata_concat.var
transcriptId (index) | geneId | transcriptId | nbExons |
---|---|---|---|
ENST00000548501 | CYP4F12 | ENST00000548501 | 4 |
ENST00000324229 | CALCB | ENST00000324229 | 5 |
ENST00000371489 | MYOF | ENST00000371489 | 15 |
ENST00000368659 | SLC27A3 | ENST00000368659 | 2 |
ENST00000669353 | TMEM161B-AS1 | ENST00000669353 | 4 |
... | ... | ... | ... |
ENST00000548209 | LETMD1 | ENST00000548209 | 5 |
ENST00000490703 | TBC1D10B | ENST00000490703 | 6 |
ENST00000617887 | TMEM200A | ENST00000617887 | 2 |
ENST00000442834 | YY1AP1 | ENST00000442834 | 4 |
ENST00000394260 | PRICKLE4 | ENST00000394260 | 5 |
89177 rows × 3 columns
andata_concat.obs
(index) | batch |
---|---|
AGGAAATGTACAAGCG | D498_BIOP_INT |
GCCATTCGTCGGAACA | D498_BIOP_INT |
TCGACCTCAGTGTGCC | D498_BIOP_INT |
CGTAGTATCAGTGTGT | D498_BIOP_INT |
GCCAGGTGTCTAACTG | D498_BIOP_INT |
... | ... |
AGTGACTTCTAAGCCA | D492_BIOP_INT |
CATTGTTCATCACCAA | D492_BIOP_INT |
GATGATCCACACAGAG | D492_BIOP_INT |
TCGAACATCAGTGCGC | D492_BIOP_INT |
GTTGCGGCACCTGCTT | D492_BIOP_INT |
122872 rows × 1 columns
The following call uses AnnData’s write_h5ad method to serialize the andata_concat object to an .h5ad file, an HDF5-based format widely used for storing large scientific datasets. The filename ‘discovair_long_read_transcript_matrix.h5ad’ reflects the file’s contents: the transcript matrix obtained from long-read sequencing data.
andata_concat.write_h5ad('discovair_long_read_transcript_matrix.h5ad')
Here we use the sc.read_h5ad function to load two AnnData objects holding transcriptomic data from long-read and short-read sequencing. The long-read data, which capture full-length transcripts and therefore reveal isoform diversity, are read from ‘discovair_long_read_transcript_matrix.h5ad’. The short-read data, which cover more cells and potentially provide more accurate gene-level quantification, are read from ‘integrated_V10.h5ad’.
isoform_anndata_from_long_reads = sc.read_h5ad("discovair_long_read_transcript_matrix.h5ad")
gene_anndata_from_short_reads = sc.read_h5ad("/data/analysis/data_mcandrew/000-sclr-discovair/integrated_V10.h5ad")
Examining the long-read transcript matrix:
isoform_anndata_from_long_reads
AnnData object with n_obs × n_vars = 122872 × 89177
obs: 'batch'
var: 'geneId', 'transcriptId', 'nbExons'
Next, we examine the much larger short-read gene-level dataset:
gene_anndata_from_short_reads
AnnData object with n_obs × n_vars = 414609 × 36602
obs: 'manip', 'donor', 'method', 'position', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'louvain', 'n_genes', 'nCount_SCT', 'nFeature_SCT', 'batch', 'age', 'gender', 'phenotype', 'respifinder', 'TRACvsNAS', 'sixty_plus', 'smoker', 'smoking_years', 'leiden', 'leiden_Endothelial', 'leiden_Stromal', 'leiden_Immune', 'leiden_Epithelial', 'log1p_n_genes_by_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'celltype_lv2_V4', 'celltype_lv0_V4', 'celltype_lv1_V4', 'celltype_lv2_V5', 'celltype_lv0_V5', 'celltype_lv1_V5', 'leiden_scANVI', 'disease_score', 'smoker_phenotype', 'leiden_scANVI_hvg_10000', 'leiden_scANVI_nl_50', 'leiden_scANVI_hvg_10000_nl_50', 'celltype_lv3_V5'
var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'mt', 'ribo'
uns: 'Adventitial Fibroblast_colors', 'DE_ct_lv2', 'DE_ct_lv3', 'celltype_lv0_V4_colors', 'celltype_lv0_V5_colors', 'celltype_lv1_V4_colors', 'celltype_lv1_V5_colors', 'celltype_lv2_V4_colors', 'celltype_lv2_V5_colors', 'celltype_lv3_V5_colors', 'donor_colors', 'leiden', 'neighbors', 'neighbors_scanvi', 'pca', 'phenotype_colors', 'position_colors', 'rank_genes_groups_leiden', 'umap'
obsm: 'X_pca', 'X_scANVI', 'X_scANVI_hvg_10000', 'X_scANVI_hvg_10000_nl_50', 'X_scANVI_nl_50', 'X_umap', 'dorothea_mlm_estimate', 'dorothea_mlm_pvals', 'mlm_estimate', 'mlm_pvals'
varm: 'PCs', 'gini_celltype', 'n_cells_celltype_lv2_V3'
obsp: 'connectivities', 'distances', 'neighbors_scanvi_connectivities', 'neighbors_scanvi_distances'
The short-read gene quantification dataset contains a significantly higher number of cells compared to the long-read dataset. Notably, the short-read dataset is annotated, whereas the long-read dataset lacks annotations. Given that both datasets originate from the same library, which was subsequently divided and sequenced on different platforms, there is an expected overlap in cell identities. This commonality provides an opportunity to transfer annotations from the short-read to the long-read dataset by matching corresponding cells.
gene_anndata_from_short_reads.obs
(index) | manip | donor | method | position | n_genes_by_counts | total_counts | total_counts_mt | pct_counts_mt | total_counts_ribo | pct_counts_ribo | ... | celltype_lv2_V5 | celltype_lv0_V5 | celltype_lv1_V5 | leiden_scANVI | disease_score | smoker_phenotype | leiden_scANVI_hvg_10000 | leiden_scANVI_nl_50 | leiden_scANVI_hvg_10000_nl_50 | celltype_lv3_V5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
D460_BIOP_PRO1GGCTTGGAGCGCCTCA-1 | D460_BIOP_PRO1 | D460 | BIOP | PRO | 2150 | 5919.0 | 283.0 | 4.782021 | 1510.0 | 25.515377 | ... | Veinous | Endothelial | Endothelial | 11 | GAP Stage 1 | non-smoker_IPF | 9 | 9 | 8 | Veinous |
D463_BIOP_NAS1TCACTCGCATTGGGAG-1 | D463_BIOP_NAS1 | D463 | BIOP | NAS | 1927 | 4979.0 | 474.0 | 9.519984 | 1357.0 | 27.254469 | ... | Veinous | Endothelial | Endothelial | 11 | GAP Stage 1 | non-smoker_IPF | 9 | 9 | 8 | Veinous |
D534_BIOP_PROAATCGACAGCAAGTCG-1 | D534_BIOP_PRO | D534 | BIOP | PRO | 1264 | 3013.0 | 311.0 | 10.321939 | 779.0 | 25.854630 | ... | Capillary | Endothelial | Endothelial | 11 | Healthy | non-smoker_CTRL | 9 | 9 | 8 | Capillary |
D463_BIOP_NAS1TCGCTTGTCACTTGGA-1 | D463_BIOP_NAS1 | D463 | BIOP | NAS | 3691 | 11794.0 | 1314.0 | 11.141258 | 2867.0 | 24.308971 | ... | Veinous | Endothelial | Endothelial | 11 | GAP Stage 1 | non-smoker_IPF | 9 | 9 | 8 | Veinous |
D489_BIOP_PROAGGGAGTTCGGTCTGG-1 | D489_BIOP_PRO | D489 | BIOP | PRO | 738 | 1096.0 | 57.0 | 5.200730 | 127.0 | 11.587591 | ... | Capillary | Endothelial | Endothelial | 11 | GOLD 1 | non-smoker_BPCO | 9 | 9 | 8 | Capillary |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
D460_BRUS_NAS1TCTATACCAATGGGTG-1 | D460_BRUS_NAS1 | D460 | BRUS | NAS | 1500 | 4263.0 | 447.0 | 10.485574 | 1342.0 | 31.480179 | ... | Suprabasal | Epithelial | Suprabasal | 0 | GAP Stage 1 | non-smoker_IPF | 2 | 1 | 1 | Suprabasal |
D460_BRUS_NAS1GTTATGGCAATGGCAG-1 | D460_BRUS_NAS1 | D460 | BRUS | NAS | 2422 | 6089.0 | 774.0 | 12.711448 | 740.0 | 12.153063 | ... | Ionocyte | Epithelial | Ionocyte | 24 | GAP Stage 1 | non-smoker_IPF | 29 | 27 | 27 | Ionocyte |
D460_BRUS_NAS1ATGAGTCAGCCGTTGC-1 | D460_BRUS_NAS1 | D460 | BRUS | NAS | 2784 | 11638.0 | 1460.0 | 12.545111 | 2642.0 | 22.701494 | ... | Goblet | Epithelial | Goblet | 5 | GAP Stage 1 | non-smoker_IPF | 13 | 5 | 4 | Goblet |
D460_BRUS_NAS1TCATACTAGCAGTAAT-1 | D460_BRUS_NAS1 | D460 | BRUS | NAS | 2563 | 8025.0 | 919.0 | 11.451714 | 1619.0 | 20.174454 | ... | Goblet | Epithelial | Goblet | 5 | GAP Stage 1 | non-smoker_IPF | 13 | 5 | 4 | Goblet |
D460_BRUS_NAS1TTGTTGTCAAGATGTA-1 | D460_BRUS_NAS1 | D460 | BRUS | NAS | 1380 | 3443.0 | 255.0 | 7.406332 | 724.0 | 21.028173 | ... | Goblet | Epithelial | Goblet | 5 | GAP Stage 1 | non-smoker_IPF | 13 | 5 | 4 | Goblet |
414609 rows × 47 columns
To analyze the long-read and short-read data together, the observation indexes of the two AnnData objects must first follow the same naming scheme. In the short-read object, each cell is named by the value of its manip column followed by its barcode and the suffix -1 (for example, D460_BIOP_PRO1GGCTTGGAGCGCCTCA-1), whereas the long-read object uses the bare barcode. The code below rebuilds the long-read names as batch + barcode + "-1" so that the same cell carries the same name in both datasets, which is what allows cells to be matched for annotation transfer, data integration, differential expression analysis, and visualization.
isoform_anndata_from_long_reads.obs['batch'] = isoform_anndata_from_long_reads.obs['batch'].astype(str)
isoform_anndata_from_long_reads.obs_names = isoform_anndata_from_long_reads.obs['batch'] + isoform_anndata_from_long_reads.obs_names + "-1"
After standardizing the index of the long-read AnnData object, we can confirm that its observation names now follow the same scheme as the short-read dataset. This alignment is crucial for integrating the long-read and short-read transcriptomic data, as it ensures that cells represented in both datasets can be identified.
isoform_anndata_from_long_reads.obs_names
Index(['D498_BIOP_INTAGGAAATGTACAAGCG-1', 'D498_BIOP_INTGCCATTCGTCGGAACA-1',
'D498_BIOP_INTTCGACCTCAGTGTGCC-1', 'D498_BIOP_INTCGTAGTATCAGTGTGT-1',
'D498_BIOP_INTGCCAGGTGTCTAACTG-1', 'D498_BIOP_INTTGTGTGAGTGTTGACT-1',
'D498_BIOP_INTCAGATACTCCAACTGA-1', 'D498_BIOP_INTGCCGATGTCTCATTAC-1',
'D498_BIOP_INTGGAGAACTCTCGAGTA-1', 'D498_BIOP_INTAAGCATCTCGTGGTAT-1',
...
'D492_BIOP_INTAAAGTGAAGGTTACAA-1', 'D492_BIOP_INTTACGGGCGTGAGACCA-1',
'D492_BIOP_INTACAGGGAGTCAACATC-1', 'D492_BIOP_INTTTTCGATCAGGCCTGT-1',
'D492_BIOP_INTAACAACCTCATCAGTG-1', 'D492_BIOP_INTAGTGACTTCTAAGCCA-1',
'D492_BIOP_INTCATTGTTCATCACCAA-1', 'D492_BIOP_INTGATGATCCACACAGAG-1',
'D492_BIOP_INTTCGAACATCAGTGCGC-1', 'D492_BIOP_INTGTTGCGGCACCTGCTT-1'],
dtype='object', length=122872)
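The renaming rule can be tried on toy data before applying it to the full objects. The sketch below uses pandas only; the two small tables are stand-ins built from names that appear in this example, and the overlap check at the end is an illustrative addition rather than part of the original workflow.

```python
import pandas as pd

# Toy long-read observations: raw barcodes as index, sample name in 'batch'.
long_obs = pd.DataFrame(
    {"batch": ["D498_BIOP_INT", "D492_BIOP_NAS"]},
    index=["AGGAAATGTACAAGCG", "GCCATTCGTCGGAACA"],
)

# Toy short-read index, already in '<sample><barcode>-1' form.
short_names = pd.Index([
    "D498_BIOP_INTAGGAAATGTACAAGCG-1",
    "D460_BRUS_NAS1TCTATACCAATGGGTG-1",
])

# Same rule as above: sample name + barcode + '-1'.
new_names = pd.Index(long_obs["batch"].astype(str) + long_obs.index + "-1")

# Names present in both datasets after renaming.
shared = new_names.intersection(short_names)

print(list(new_names))
print(list(shared))
```

Checking the size of the shared set right after renaming is a cheap way to catch a naming mismatch before any subsetting is done.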
In this section, we use the subset_common_cells function from the longreadtools library to harmonize our datasets. This function ensures that we compare the same cells across the two AnnData objects, one derived from long-read sequencing and the other from short-read sequencing. Applying it identifies the intersection of cells present in both datasets, allowing for a consistent, integrated analysis.
from longreadtools.Standardization import *
isoform_matrix = subset_common_cells(isoform_anndata_from_long_reads, gene_anndata_from_short_reads)
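Conceptually, subsetting to common cells amounts to keeping the rows whose names occur in the other dataset. The sketch below illustrates this with pandas DataFrames as stand-ins for AnnData objects; `subset_to_common` is a hypothetical helper written for this illustration and should not be read as the implementation of subset_common_cells.

```python
import pandas as pd

def subset_to_common(target: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """Keep only the rows of `target` whose index also appears in `reference`.

    Hypothetical helper for illustration; the row order of `target` is preserved.
    """
    keep = target.index.isin(reference.index)
    return target.loc[keep]

# Toy stand-ins: cell names as index, one feature column each.
long_read = pd.DataFrame({"tx1": [1, 0, 3]}, index=["c1-1", "c2-1", "c3-1"])
short_read = pd.DataFrame({"geneA": [7, 8, 9]}, index=["c3-1", "c1-1", "c9-1"])

long_common = subset_to_common(long_read, short_read)
short_common = subset_to_common(short_read, long_common)

print(list(long_common.index))
print(list(short_common.index))
```

In this sketch the two subsets contain the same cells but not necessarily in the same order, so any later step that combines them should align rows by name rather than by position.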
In the previous steps, we standardized the indexes of our AnnData objects and used subset_common_cells to restrict the isoform AnnData object derived from long-read sequencing to the shared cells. The next step is to apply the same subsetting to the gene AnnData object from short-read sequencing. This ensures that both datasets contain only the cells common to both, which is a prerequisite for accurate annotation transfer.
gene_matrix = subset_common_cells(gene_anndata_from_short_reads, isoform_matrix)
The next step in our analysis pipeline is to transfer the observation annotations from the gene_matrix AnnData object, which contains the short-read sequencing data, to the isoform_matrix AnnData object, which contains the long-read sequencing data. The transfer_obs function from the longreadtools library handles this step: it maps the .obs attributes from one AnnData object to the other based on the shared cell identifiers.
annotated_isoform_matrix = transfer_obs(gene_matrix, isoform_matrix)
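The idea of mapping annotations by shared cell identifiers can be sketched with two pandas tables standing in for the .obs attributes. The column names and values below are made up, and the approach (reindex by cell name, then append the missing columns) is one plausible way to do this rather than the actual code of transfer_obs.

```python
import pandas as pd

# Annotated cells (stand-in for the short-read .obs table).
source_obs = pd.DataFrame(
    {"celltype": ["Goblet", "Ionocyte"], "donor": ["D460", "D498"]},
    index=["c3-1", "c1-1"],
)

# Unannotated cells (stand-in for the long-read .obs table), in a different order.
target_obs = pd.DataFrame({"batch": ["run1", "run2"]}, index=["c1-1", "c3-1"])

# Align the source annotations to the target's cell order by name, then
# append only the columns the target does not already have.
aligned = source_obs.reindex(target_obs.index)
new_cols = [c for c in aligned.columns if c not in target_obs.columns]
annotated_obs = pd.concat([target_obs, aligned[new_cols]], axis=1)

print(annotated_obs)
```

Aligning by name means the result is correct even when the two tables list the cells in different orders, as they do here.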
In this step, we inspect the annotated isoform matrix, the product of the standardization and subsetting applied above. The annotated_isoform_matrix combines the isoform-level data obtained from long-read sequencing with the annotations transferred from the gene matrix derived from short-read sequencing.
annotated_isoform_matrix.X
array([[1., 0., 0., ..., 0., 0., 0.],
[1., 1., 2., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[3., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
Examining this matrix gives a view of the transcriptome at isoform resolution, which is crucial for understanding the complexity of gene expression patterns. The annotations it now carries, such as cell type, donor information, and technical attributes, are pivotal for subsequent analyses of the biological and clinical significance of the data. Let’s save it to disk for later use!
annotated_isoform_matrix.write('/data/analysis/data_mcandrew/000-sclr-discovair/discovair_long_read_transcript_matrix_annotated.h5ad')