LongReadTools

Install

pip install git+https://github.com/cobioda/longreadtools.git

How to use

For detailed instructions on how to use LongReadTools, please refer to the documentation.

Example Usage of LongReadTools

This section provides a practical example of how to apply LongReadTools in a bioinformatics workflow. We will demonstrate the process of converting isomatrix text files into Anndata objects, which are suitable for high-throughput single-cell genomics analysis. The example will cover the necessary steps from data retrieval and processing to the final conversion using LongReadTools’ specialized functions.

We first collect the isomatrix files to convert. The isomatool module of LongReadTools provides the function multiple_isomatrix_conversion, which batch-converts isomatrix text files into AnnData objects saved as .h5ad files, an HDF5-based format commonly used for large single-cell datasets.

import os
import re

# Directory that holds one sub-directory per sequencing run
directory = '/data/analysis/data_mcandrew/000-sclr-discovair/'
# Keep the run directories whose names end in _BIOP_INT or BIOP_NAS
pattern = re.compile(r'.*(_BIOP_INT|BIOP_NAS)$')
matching_files = [os.path.join(directory, f) for f in os.listdir(directory) if pattern.match(f)]
print(matching_files)

# Each run directory contains a file named <run>_isomatrix.txt
isomatrix_paths = [os.path.join(f, os.path.basename(f) + '_isomatrix.txt') for f in matching_files]
['/data/analysis/data_mcandrew/000-sclr-discovair/D498_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D492_BIOP_NAS', '/data/analysis/data_mcandrew/000-sclr-discovair/D494_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D500_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D494_BIOP_NAS', '/data/analysis/data_mcandrew/000-sclr-discovair/D496_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D499_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D493_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D493_BIOP_NAS', '/data/analysis/data_mcandrew/000-sclr-discovair/D534_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D490_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D500_BIOP_NAS', '/data/analysis/data_mcandrew/000-sclr-discovair/D495_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D492_BIOP_INT']
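
The directory scan above can also be written with pathlib, with an added check that each expected isomatrix file actually exists before conversion. This is only a sketch: the helper name find_isomatrix_paths is hypothetical, the regular expression is slightly stricter than the original (it requires the underscore before BIOP in both cases), and the <run>/<run>_isomatrix.txt layout is taken from the listing above rather than from LongReadTools itself.

```python
import re
import tempfile
from pathlib import Path

# Run directories end in _BIOP_INT or _BIOP_NAS (assumption based on the listing above)
RUN_PATTERN = re.compile(r".*_BIOP_(INT|NAS)$")

def find_isomatrix_paths(directory):
    """Hypothetical helper: return existing <run>/<run>_isomatrix.txt paths, sorted by run name."""
    paths = []
    for run_dir in sorted(Path(directory).iterdir()):
        if not (run_dir.is_dir() and RUN_PATTERN.match(run_dir.name)):
            continue
        candidate = run_dir / f"{run_dir.name}_isomatrix.txt"
        if candidate.is_file():  # skip runs whose matrix file is missing
            paths.append(str(candidate))
    return paths

# Demonstration on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    for name in ("D498_BIOP_INT", "D492_BIOP_NAS", "D460_BRUS_NAS1"):
        run = Path(tmp) / name
        run.mkdir()
        (run / f"{name}_isomatrix.txt").write_text("placeholder\n")
    demo_paths = find_isomatrix_paths(tmp)
    demo_names = [Path(p).name for p in demo_paths]

print(demo_names)
```

Sorting the run directories makes the order of the resulting list reproducible, whereas os.listdir returns entries in arbitrary order.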

Next, we use the isomatool module to convert the files listed in isomatrix_paths. The multiple_isomatrix_conversion function writes one .h5ad file per input, as the log below shows. With return_paths=True the call returns the paths of the written files, which are then passed to concatenate_anndata to build a single AnnData object.

from longreadtools.isomatool import *
import scanpy as sc
converted_isomatrix_paths = multiple_isomatrix_conversion(isomatrix_paths, verbose=True, return_paths = True)
File /data/analysis/data_mcandrew/000-sclr-discovair/D498_BIOP_INT/D498_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D500_BIOP_NAS/D500_BIOP_NAS_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D500_BIOP_INT/D500_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D493_BIOP_NAS/D493_BIOP_NAS_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D494_BIOP_NAS/D494_BIOP_NAS_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D493_BIOP_INT/D493_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D499_BIOP_INT/D499_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D494_BIOP_INT/D494_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D492_BIOP_INT/D492_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D495_BIOP_INT/D495_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D490_BIOP_INT/D490_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D496_BIOP_INT/D496_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D534_BIOP_INT/D534_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D492_BIOP_NAS/D492_BIOP_NAS_isomatrix.h5ad was successfully written to disk.
andata_concat = concatenate_anndata(converted_isomatrix_paths, verbose = True)
Reading .h5ad files...
Applying feature set standardization...
Concatenating AnnData objects and adding batch keys with scanpy...
Setting .var attribute...
Final Check...
Concatenation complete.

Standardizing anndata features via union: 100%|██████████| 14/14 [01:05<00:00,  4.68s/it]
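
The log mentions standardizing features "via union" before concatenation. As a conceptual illustration only (plain Python, not the code used by concatenate_anndata), the idea can be sketched as follows: collect every transcript seen in any sample, then pad each sample's rows with zeros for the transcripts it lacks. The function name concat_with_feature_union and the counts are made up for the example; the barcode and transcript IDs are borrowed from the outputs in this README.

```python
def concat_with_feature_union(samples):
    """Stack per-sample count rows over the union of all features.

    samples maps a batch name to (feature_list, {barcode: counts}).
    Missing features are filled with 0.0.
    """
    features = sorted({f for feats, _ in samples.values() for f in feats})
    column = {f: i for i, f in enumerate(features)}
    obs, matrix = [], []
    for batch, (feats, cells) in samples.items():
        for barcode, counts in cells.items():
            row = [0.0] * len(features)
            for feature, count in zip(feats, counts):
                row[column[feature]] = count
            obs.append((barcode, batch))  # keep the batch key next to the barcode
            matrix.append(row)
    return features, obs, matrix

# Toy input: two samples with partly different transcripts (counts are invented)
toy_samples = {
    "D498_BIOP_INT": (["ENST00000548501", "ENST00000324229"], {"AGGAAATGTACAAGCG": [1.0, 0.0]}),
    "D492_BIOP_NAS": (["ENST00000324229", "ENST00000371489"], {"AGGAAATGTACAAGCG": [2.0, 3.0]}),
}
features, obs, matrix = concat_with_feature_union(toy_samples)
print(features)
print(obs)
print(matrix)
```

Note that the same barcode appears in both toy samples, which is why the batch key is kept alongside it.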

Now that we have concatenated the AnnData objects, let's examine the result to check that it is structured correctly and ready for downstream analysis. We display the shape of the matrix, the observation metadata (cells), and the variable metadata. Note that each variable here is a transcript (the index is transcriptId), even though the print statement below labels the variables as genes.

# Display the shape of the concatenated Anndata object
print(f"The Anndata object has {andata_concat.n_obs} observations (cells) and {andata_concat.n_vars} variables (genes).")

# Display the first few entries of the observation metadata to inspect batch information and other annotations
print("Observation metadata (first 5 entries):")
print(andata_concat.obs.head())

# Display the first few entries of the variable metadata to inspect gene and transcript information
print("Variable metadata (first 5 entries):")
print(andata_concat.var.head())

# Check for unique observation names and make them unique if necessary
if not andata_concat.obs_names.is_unique:
    andata_concat.obs_names_make_unique()
    print("Observation names were not unique; they have been made unique.")
The Anndata object has 122872 observations (cells) and 89177 variables (genes).
Observation metadata (first 5 entries):
                          batch
AGGAAATGTACAAGCG  D498_BIOP_INT
GCCATTCGTCGGAACA  D498_BIOP_INT
TCGACCTCAGTGTGCC  D498_BIOP_INT
CGTAGTATCAGTGTGT  D498_BIOP_INT
GCCAGGTGTCTAACTG  D498_BIOP_INT
Variable metadata (first 5 entries):
                       geneId     transcriptId nbExons
transcriptId                                          
ENST00000548501       CYP4F12  ENST00000548501       4
ENST00000324229         CALCB  ENST00000324229       5
ENST00000371489          MYOF  ENST00000371489      15
ENST00000368659       SLC27A3  ENST00000368659       2
ENST00000669353  TMEM161B-AS1  ENST00000669353       4
Observation names were not unique; they have been made unique.
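
The final message indicates that some barcodes occur more than once, which can happen when the same barcode sequence is present in several batches. The sketch below imitates, in plain Python, the documented behaviour of obs_names_make_unique as I understand it (a numeric suffix such as -1, -2 is appended to repeated names, and the first occurrence is left unchanged). It is an assumption-labelled illustration, not AnnData's implementation, and make_unique is a hypothetical name.

```python
from collections import Counter

def make_unique(names, join="-"):
    """Suffix repeated names with join + counter; the first occurrence stays as is."""
    taken = set(names)      # names that must not be reused
    seen = set()            # names already emitted once
    counters = Counter()    # next suffix to try per name
    result = []
    for name in names:
        if name not in seen:
            seen.add(name)
            result.append(name)
            continue
        while True:
            counters[name] += 1
            candidate = f"{name}{join}{counters[name]}"
            if candidate not in taken:
                break
        taken.add(candidate)
        result.append(candidate)
    return result

barcodes = ["AGGAAATGTACAAGCG", "GCCATTCGTCGGAACA", "AGGAAATGTACAAGCG"]
duplicates = [b for b, n in Counter(barcodes).items() if n > 1]
unique_names = make_unique(barcodes)
print(duplicates)
print(unique_names)
```

A name altered this way no longer equals the raw barcode, which matters if barcodes are later combined with the batch label to build matching keys.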

Access the count matrix from the concatenated Anndata object to analyze the transcript count data.

andata_concat.X
array([[0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 1., 2., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
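
The matrix printed above is a dense float32 array. A quick back-of-the-envelope calculation, sketched here with a hypothetical helper dense_nbytes, shows how much memory a dense array of this shape occupies (4 bytes per float32 value).

```python
def dense_nbytes(n_obs, n_vars, itemsize=4):
    """Memory needed for a dense n_obs x n_vars array with the given item size in bytes."""
    return n_obs * n_vars * itemsize

# Shape reported for the concatenated object: 122872 cells x 89177 transcripts
n_bytes = dense_nbytes(122872, 89177)
print(f"{n_bytes:,} bytes, about {n_bytes / 1e9:.1f} GB")
```

Since the preview consists mostly of zeros, a sparse representation may be considerably smaller, although that depends on the actual fraction of non-zero entries.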
andata_concat.var
                       geneId     transcriptId nbExons
transcriptId                                          
ENST00000548501       CYP4F12  ENST00000548501       4
ENST00000324229         CALCB  ENST00000324229       5
ENST00000371489          MYOF  ENST00000371489      15
ENST00000368659       SLC27A3  ENST00000368659       2
ENST00000669353  TMEM161B-AS1  ENST00000669353       4
...                       ...              ...     ...
ENST00000548209        LETMD1  ENST00000548209       5
ENST00000490703      TBC1D10B  ENST00000490703       6
ENST00000617887      TMEM200A  ENST00000617887       2
ENST00000442834        YY1AP1  ENST00000442834       4
ENST00000394260      PRICKLE4  ENST00000394260       5

89177 rows × 3 columns

andata_concat.obs
                          batch
AGGAAATGTACAAGCG  D498_BIOP_INT
GCCATTCGTCGGAACA  D498_BIOP_INT
TCGACCTCAGTGTGCC  D498_BIOP_INT
CGTAGTATCAGTGTGT  D498_BIOP_INT
GCCAGGTGTCTAACTG  D498_BIOP_INT
...                         ...
AGTGACTTCTAAGCCA  D492_BIOP_INT
CATTGTTCATCACCAA  D492_BIOP_INT
GATGATCCACACAGAG  D492_BIOP_INT
TCGAACATCAGTGCGC  D492_BIOP_INT
GTTGCGGCACCTGCTT  D492_BIOP_INT

122872 rows × 1 columns

The write_h5ad method of the AnnData object serializes andata_concat to an .h5ad file, an HDF5-based format widely used for storing large scientific datasets. The filename ‘discovair_long_read_transcript_matrix.h5ad’ indicates that the file holds the transcript matrix obtained from long-read sequencing data.

andata_concat.write_h5ad('discovair_long_read_transcript_matrix.h5ad')

Here we use sc.read_h5ad to load two AnnData objects. The file ‘discovair_long_read_transcript_matrix.h5ad’ holds the long-read data, which captures full-length transcripts and therefore resolves isoform diversity. The file ‘integrated_V10.h5ad’ holds the short-read data, which covers more cells and potentially gives more accurate gene-level quantification.

isoform_anndata_from_long_reads = sc.read_h5ad("discovair_long_read_transcript_matrix.h5ad")
gene_anndata_from_short_reads = sc.read_h5ad("/data/analysis/data_mcandrew/000-sclr-discovair/integrated_V10.h5ad")

Examining the long-read transcript matrix:

isoform_anndata_from_long_reads
AnnData object with n_obs × n_vars = 122872 × 89177
    obs: 'batch'
    var: 'geneId', 'transcriptId', 'nbExons'

Next, we examine the much larger short-read gene-level dataset:

gene_anndata_from_short_reads
AnnData object with n_obs × n_vars = 414609 × 36602
    obs: 'manip', 'donor', 'method', 'position', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'louvain', 'n_genes', 'nCount_SCT', 'nFeature_SCT', 'batch', 'age', 'gender', 'phenotype', 'respifinder', 'TRACvsNAS', 'sixty_plus', 'smoker', 'smoking_years', 'leiden', 'leiden_Endothelial', 'leiden_Stromal', 'leiden_Immune', 'leiden_Epithelial', 'log1p_n_genes_by_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'celltype_lv2_V4', 'celltype_lv0_V4', 'celltype_lv1_V4', 'celltype_lv2_V5', 'celltype_lv0_V5', 'celltype_lv1_V5', 'leiden_scANVI', 'disease_score', 'smoker_phenotype', 'leiden_scANVI_hvg_10000', 'leiden_scANVI_nl_50', 'leiden_scANVI_hvg_10000_nl_50', 'celltype_lv3_V5'
    var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'mt', 'ribo'
    uns: 'Adventitial Fibroblast_colors', 'DE_ct_lv2', 'DE_ct_lv3', 'celltype_lv0_V4_colors', 'celltype_lv0_V5_colors', 'celltype_lv1_V4_colors', 'celltype_lv1_V5_colors', 'celltype_lv2_V4_colors', 'celltype_lv2_V5_colors', 'celltype_lv3_V5_colors', 'donor_colors', 'leiden', 'neighbors', 'neighbors_scanvi', 'pca', 'phenotype_colors', 'position_colors', 'rank_genes_groups_leiden', 'umap'
    obsm: 'X_pca', 'X_scANVI', 'X_scANVI_hvg_10000', 'X_scANVI_hvg_10000_nl_50', 'X_scANVI_nl_50', 'X_umap', 'dorothea_mlm_estimate', 'dorothea_mlm_pvals', 'mlm_estimate', 'mlm_pvals'
    varm: 'PCs', 'gini_celltype', 'n_cells_celltype_lv2_V3'
    obsp: 'connectivities', 'distances', 'neighbors_scanvi_connectivities', 'neighbors_scanvi_distances'

The short-read gene quantification dataset contains a significantly higher number of cells compared to the long-read dataset. Notably, the short-read dataset is annotated, whereas the long-read dataset lacks annotations. Given that both datasets originate from the same library, which was subsequently divided and sequenced on different platforms, there is an expected overlap in cell identities. This commonality provides an opportunity to transfer annotations from the short-read to the long-read dataset by matching corresponding cells.

gene_anndata_from_short_reads.obs
                                  manip           donor  method  position  n_genes_by_counts  total_counts  total_counts_mt  pct_counts_mt  total_counts_ribo  pct_counts_ribo  ...  celltype_lv2_V5  celltype_lv0_V5  celltype_lv1_V5  leiden_scANVI  disease_score  smoker_phenotype  leiden_scANVI_hvg_10000  leiden_scANVI_nl_50  leiden_scANVI_hvg_10000_nl_50  celltype_lv3_V5
D460_BIOP_PRO1GGCTTGGAGCGCCTCA-1  D460_BIOP_PRO1  D460   BIOP    PRO       2150               5919.0        283.0            4.782021       1510.0             25.515377        ...  Veinous          Endothelial      Endothelial      11             GAP Stage 1    non-smoker_IPF    9                        9                    8                              Veinous
D463_BIOP_NAS1TCACTCGCATTGGGAG-1  D463_BIOP_NAS1  D463   BIOP    NAS       1927               4979.0        474.0            9.519984       1357.0             27.254469        ...  Veinous          Endothelial      Endothelial      11             GAP Stage 1    non-smoker_IPF    9                        9                    8                              Veinous
D534_BIOP_PROAATCGACAGCAAGTCG-1   D534_BIOP_PRO   D534   BIOP    PRO       1264               3013.0        311.0            10.321939      779.0              25.854630        ...  Capillary        Endothelial      Endothelial      11             Healthy        non-smoker_CTRL   9                        9                    8                              Capillary
D463_BIOP_NAS1TCGCTTGTCACTTGGA-1  D463_BIOP_NAS1  D463   BIOP    NAS       3691               11794.0       1314.0           11.141258      2867.0             24.308971        ...  Veinous          Endothelial      Endothelial      11             GAP Stage 1    non-smoker_IPF    9                        9                    8                              Veinous
D489_BIOP_PROAGGGAGTTCGGTCTGG-1   D489_BIOP_PRO   D489   BIOP    PRO       738                1096.0        57.0             5.200730       127.0              11.587591        ...  Capillary        Endothelial      Endothelial      11             GOLD 1         non-smoker_BPCO   9                        9                    8                              Capillary
...                               ...             ...    ...     ...       ...                ...           ...              ...            ...                ...              ...  ...              ...              ...              ...            ...            ...               ...                      ...                  ...                            ...
D460_BRUS_NAS1TCTATACCAATGGGTG-1  D460_BRUS_NAS1  D460   BRUS    NAS       1500               4263.0        447.0            10.485574      1342.0             31.480179        ...  Suprabasal       Epithelial       Suprabasal       0              GAP Stage 1    non-smoker_IPF    2                        1                    1                              Suprabasal
D460_BRUS_NAS1GTTATGGCAATGGCAG-1  D460_BRUS_NAS1  D460   BRUS    NAS       2422               6089.0        774.0            12.711448      740.0              12.153063        ...  Ionocyte         Epithelial       Ionocyte         24             GAP Stage 1    non-smoker_IPF    29                       27                   27                             Ionocyte
D460_BRUS_NAS1ATGAGTCAGCCGTTGC-1  D460_BRUS_NAS1  D460   BRUS    NAS       2784               11638.0       1460.0           12.545111      2642.0             22.701494        ...  Goblet           Epithelial       Goblet           5              GAP Stage 1    non-smoker_IPF    13                       5                    4                              Goblet
D460_BRUS_NAS1TCATACTAGCAGTAAT-1  D460_BRUS_NAS1  D460   BRUS    NAS       2563               8025.0        919.0            11.451714      1619.0             20.174454        ...  Goblet           Epithelial       Goblet           5              GAP Stage 1    non-smoker_IPF    13                       5                    4                              Goblet
D460_BRUS_NAS1TTGTTGTCAAGATGTA-1  D460_BRUS_NAS1  D460   BRUS    NAS       1380               3443.0        255.0            7.406332       724.0              21.028173        ...  Goblet           Epithelial       Goblet           5              GAP Stage 1    non-smoker_IPF    13                       5                    4                              Goblet

414609 rows × 47 columns

To analyze the long-read and short-read data together, the observation names (cell identifiers) of the two AnnData objects must follow the same convention. The short-read index shown above concatenates a sample label (matching the manip column), the cell barcode, and a "-1" suffix, so we rebuild the long-read index the same way from the batch column and the barcode. Once the names match, the same cell can be looked up in both datasets, which is what the subsetting and annotation transfer steps rely on.

isoform_anndata_from_long_reads.obs['batch'] = isoform_anndata_from_long_reads.obs['batch'].astype(str)
isoform_anndata_from_long_reads.obs_names = isoform_anndata_from_long_reads.obs['batch'] + isoform_anndata_from_long_reads.obs_names + "-1"

After this step, the long-read observation names follow the same pattern as the short-read ones, so cells that are present in both datasets can be identified by name.

isoform_anndata_from_long_reads.obs_names
Index(['D498_BIOP_INTAGGAAATGTACAAGCG-1', 'D498_BIOP_INTGCCATTCGTCGGAACA-1',
       'D498_BIOP_INTTCGACCTCAGTGTGCC-1', 'D498_BIOP_INTCGTAGTATCAGTGTGT-1',
       'D498_BIOP_INTGCCAGGTGTCTAACTG-1', 'D498_BIOP_INTTGTGTGAGTGTTGACT-1',
       'D498_BIOP_INTCAGATACTCCAACTGA-1', 'D498_BIOP_INTGCCGATGTCTCATTAC-1',
       'D498_BIOP_INTGGAGAACTCTCGAGTA-1', 'D498_BIOP_INTAAGCATCTCGTGGTAT-1',
       ...
       'D492_BIOP_INTAAAGTGAAGGTTACAA-1', 'D492_BIOP_INTTACGGGCGTGAGACCA-1',
       'D492_BIOP_INTACAGGGAGTCAACATC-1', 'D492_BIOP_INTTTTCGATCAGGCCTGT-1',
       'D492_BIOP_INTAACAACCTCATCAGTG-1', 'D492_BIOP_INTAGTGACTTCTAAGCCA-1',
       'D492_BIOP_INTCATTGTTCATCACCAA-1', 'D492_BIOP_INTGATGATCCACACAGAG-1',
       'D492_BIOP_INTTCGAACATCAGTGCGC-1', 'D492_BIOP_INTGTTGCGGCACCTGCTT-1'],
      dtype='object', length=122872)
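
Before subsetting, it can be useful to check how many of the rebuilt names actually occur in the short-read index, because some short-read sample labels carry a trailing digit (for example D463_BIOP_NAS1) that the long-read batch names shown here do not. The following plain-Python sketch illustrates the key construction and such a check; harmonized_name and overlap_fraction are hypothetical helpers, and the short-read list is a two-element stand-in for the real index.

```python
def harmonized_name(batch, barcode, suffix="-1"):
    """Build a short-read-style cell name: sample label + barcode + suffix."""
    return f"{batch}{barcode}{suffix}"

def overlap_fraction(names, reference_names):
    """Fraction of names that also occur in reference_names."""
    reference = set(reference_names)
    if not names:
        return 0.0
    return sum(name in reference for name in names) / len(names)

# Batch and barcode pairs taken from the long-read outputs above
long_read_cells = [
    ("D498_BIOP_INT", "AGGAAATGTACAAGCG"),
    ("D492_BIOP_INT", "GTTGCGGCACCTGCTT"),
]
long_read_names = [harmonized_name(batch, barcode) for batch, barcode in long_read_cells]

# Stand-in for the short-read index (assumption: only the first cell is present)
short_read_names = [
    "D498_BIOP_INTAGGAAATGTACAAGCG-1",
    "D460_BIOP_PRO1GGCTTGGAGCGCCTCA-1",
]
shared = overlap_fraction(long_read_names, short_read_names)
print(long_read_names)
print(f"{shared:.0%} of long-read names found in the short-read index")
```

A low overlap for a given batch would suggest that its label differs between the two datasets and needs to be mapped before matching.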

Next, we use the subset_common_cells function from the longreadtools library to keep only the cells that appear in both AnnData objects, one derived from long-read sequencing and the other from short-read sequencing. Working on this intersection ensures that every remaining long-read cell has a short-read counterpart.

from longreadtools.Standardization import *
isoform_matrix = subset_common_cells(isoform_anndata_from_long_reads, gene_anndata_from_short_reads)

Having standardized the indexes and subset the isoform AnnData object, we now apply the same subsetting to the gene AnnData object from short-read sequencing. Both datasets then contain only the cells common to both, which is a prerequisite for accurate annotation transfer.

gene_matrix = subset_common_cells(gene_anndata_from_short_reads, isoform_matrix)

The next step is to transfer the observation annotations from gene_matrix, which contains the short-read data, to isoform_matrix, which contains the long-read data. The transfer_obs function from the longreadtools library maps the .obs attributes from one AnnData object to the other based on the shared cell identifiers.

annotated_isoform_matrix = transfer_obs(gene_matrix, isoform_matrix)
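
The subsetting and transfer steps can be pictured with plain Python containers. This is a conceptual sketch under stated assumptions, not the implementation of subset_common_cells or transfer_obs: the helper names subset_common and transfer_annotations are hypothetical, and the annotation values for the D498 cell are invented for the example.

```python
def subset_common(ids, other_ids):
    """Keep the ids that also occur in other_ids, preserving the original order."""
    other = set(other_ids)
    return [cell for cell in ids if cell in other]

def transfer_annotations(source_obs, target_ids):
    """Copy the annotation record of each target cell from source_obs."""
    return {cell: dict(source_obs[cell]) for cell in target_ids}

# Long-read cell names after index harmonization
long_ids = ["D498_BIOP_INTAGGAAATGTACAAGCG-1", "D498_BIOP_INTGCCATTCGTCGGAACA-1"]

# Short-read annotations keyed by cell name (the D498 values are toy data)
short_obs = {
    "D498_BIOP_INTAGGAAATGTACAAGCG-1": {"donor": "D498", "celltype_lv0_V5": "Epithelial"},
    "D460_BIOP_PRO1GGCTTGGAGCGCCTCA-1": {"donor": "D460", "celltype_lv0_V5": "Endothelial"},
}

common_long = subset_common(long_ids, short_obs)       # long-read cells also in short-read
common_short = subset_common(short_obs, common_long)   # short-read cells also in long-read
annotated = transfer_annotations(short_obs, common_long)
print(common_long)
print(annotated)
```

Subsetting both sides first means that the transfer never has to handle a cell without a matching annotation record.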

The result, annotated_isoform_matrix, combines the isoform-level counts obtained from long-read sequencing with the annotations transferred from the gene matrix derived from short-read sequencing.

annotated_isoform_matrix.X
array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 1., 2., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [3., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

This matrix describes expression at isoform resolution, which helps in understanding the complexity of gene expression patterns. The transferred annotations, such as cell type, donor information, and technical attributes, support the subsequent analyses of the biological and clinical significance of the data. Let's save it to disk for later use!

annotated_isoform_matrix.write('/data/analysis/data_mcandrew/000-sclr-discovair/discovair_long_read_transcript_matrix_annotated.h5ad')

And there you have it, we are done!
