LongReadTools

Install

pip install git+https://github.com/cobioda/longreadtools.git

How to use

For detailed instructions on how to use LongReadTools, please refer to the documentation.

Example Usage of LongReadTools

This section walks through a practical example of LongReadTools in a bioinformatics workflow: converting isomatrix text files into AnnData objects, the annotated data matrices used throughout single-cell analysis in Python, and then transferring cell annotations from a matching short-read dataset. The example covers each step, from locating the input files to converting them with LongReadTools functions.

In this step, we collect the isomatrix files to convert. The isomatool module of LongReadTools provides the multiple_isomatrix_conversion function, which converts a batch of isomatrix text files into AnnData objects and writes each one to disk as an .h5ad file.

import os
import re

# Directory containing one sub-directory per sequencing run
directory = '/data/analysis/data_mcandrew/000-sclr-discovair/'
# Keep only the run directories whose names end in _BIOP_INT or _BIOP_NAS
pattern = re.compile(r'.*_BIOP_(INT|NAS)$')
matching_files = [os.path.join(directory, f) for f in os.listdir(directory) if pattern.match(f)]
print(matching_files)

# Each run directory holds a file named <run>_isomatrix.txt
isomatrix_paths = [os.path.join(f, os.path.basename(f) + '_isomatrix.txt') for f in matching_files]

['/data/analysis/data_mcandrew/000-sclr-discovair/D498_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D492_BIOP_NAS', '/data/analysis/data_mcandrew/000-sclr-discovair/D494_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D500_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D494_BIOP_NAS', '/data/analysis/data_mcandrew/000-sclr-discovair/D496_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D499_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D493_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D493_BIOP_NAS', '/data/analysis/data_mcandrew/000-sclr-discovair/D534_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D490_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D500_BIOP_NAS', '/data/analysis/data_mcandrew/000-sclr-discovair/D495_BIOP_INT', '/data/analysis/data_mcandrew/000-sclr-discovair/D492_BIOP_INT']
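The file discovery step can be tried without access to the original data. The sketch below recreates the directory layout in a temporary folder; the run names are made up for illustration, and the layout (one sub-directory per run holding a file named after the run) is inferred from the paths shown above.

```python
import os
import re
import tempfile

# Hypothetical run names; only those ending in _BIOP_INT or _BIOP_NAS should match.
sample_names = ["D498_BIOP_INT", "D492_BIOP_NAS", "D460_BRUS_NAS", "D534_BIOP_PRO"]
pattern = re.compile(r".*_BIOP_(INT|NAS)$")

with tempfile.TemporaryDirectory() as directory:
    # Create one empty sub-directory per run.
    for name in sample_names:
        os.makedirs(os.path.join(directory, name))

    # Select the matching run directories (sorted for a stable order).
    matching = sorted(
        os.path.join(directory, f) for f in os.listdir(directory) if pattern.match(f)
    )

    # Build the expected isomatrix path inside each run directory.
    isomatrix_paths = [
        os.path.join(d, os.path.basename(d) + "_isomatrix.txt") for d in matching
    ]
    # Paths relative to the temporary root, for readable output.
    relative = [os.path.relpath(p, directory) for p in isomatrix_paths]

print(relative)
```

Sorting the matches is optional; os.listdir returns entries in arbitrary order, which is why the listing above is not ordered by donor.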

In this step, we use the isomatool module from LongReadTools to convert the isomatrix files listed in isomatrix_paths into AnnData objects. AnnData is the annotated data matrix container used for large single-cell datasets; it is stored on disk in the HDF5-based .h5ad format, which makes it efficient to load and manipulate in downstream analysis. The multiple_isomatrix_conversion function from isomatool performs the batch conversion.

from longreadtools.isomatool import *
import scanpy as sc

converted_isomatrix_paths = multiple_isomatrix_conversion(isomatrix_paths, verbose=True, return_paths = True)

File /data/analysis/data_mcandrew/000-sclr-discovair/D498_BIOP_INT/D498_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D500_BIOP_NAS/D500_BIOP_NAS_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D500_BIOP_INT/D500_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D493_BIOP_NAS/D493_BIOP_NAS_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D494_BIOP_NAS/D494_BIOP_NAS_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D493_BIOP_INT/D493_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D499_BIOP_INT/D499_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D494_BIOP_INT/D494_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D492_BIOP_INT/D492_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D495_BIOP_INT/D495_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D490_BIOP_INT/D490_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D496_BIOP_INT/D496_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D534_BIOP_INT/D534_BIOP_INT_isomatrix.h5ad was successfully written to disk.
File /data/analysis/data_mcandrew/000-sclr-discovair/D492_BIOP_NAS/D492_BIOP_NAS_isomatrix.h5ad was successfully written to disk.
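Judging by the log above, each .h5ad file is written next to its input, with the .txt extension replaced. The helper below is a hypothetical convenience for predicting those output paths (for example, to check which runs were already converted); it is not part of LongReadTools, and the naming rule is an assumption drawn only from the log lines.

```python
from pathlib import PurePosixPath


def expected_h5ad_path(isomatrix_txt: str) -> str:
    """Return the .h5ad path that sits next to an isomatrix .txt file.

    Hypothetical helper: assumes the output keeps the same directory and
    file stem, and only swaps the extension.
    """
    return str(PurePosixPath(isomatrix_txt).with_suffix(".h5ad"))


example = "/data/analysis/data_mcandrew/000-sclr-discovair/D498_BIOP_INT/D498_BIOP_INT_isomatrix.txt"
print(expected_h5ad_path(example))
```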

andata_concat = concatenate_anndata(converted_isomatrix_paths, verbose = True)

Reading .h5ad files...
Applying feature set standardization...
Concatenating AnnData objects and adding batch keys with scanpy...
Setting .var attribute...
Final Check...
Concatenation complete.

Standardizing anndata features via union: 100%|██████████| 14/14 [01:05<00:00,  4.68s/it]
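The progress message mentions standardizing features "via union". The idea can be sketched with plain pandas DataFrames: every run is expanded to the union of all transcript IDs before the runs are stacked. This is an illustration of the concept only, not the LongReadTools implementation; the transcript IDs and counts are invented, and filling absent transcripts with zeros is an assumption.

```python
import pandas as pd

# Two toy runs (cells x transcripts) with partially overlapping transcripts.
run_a = pd.DataFrame(
    {"ENST_A": [1.0, 0.0], "ENST_B": [2.0, 1.0]}, index=["cell1", "cell2"]
)
run_b = pd.DataFrame({"ENST_B": [3.0], "ENST_C": [4.0]}, index=["cell3"])

# Union of all transcript IDs seen in any run.
union = run_a.columns.union(run_b.columns)

# Expand every run to the same columns; transcripts absent from a run get 0.
standardized = [df.reindex(columns=union, fill_value=0.0) for df in (run_a, run_b)]

# Stack the runs, keeping the run label as an extra index level.
combined = pd.concat(standardized, keys=["run_a", "run_b"], names=["batch", "cell"])
print(combined)
```

A union keeps every transcript observed in at least one run, at the cost of a wider and sparser matrix; an intersection would instead drop transcripts missing from any run.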

Now that we have concatenated the AnnData objects, let's examine the result to confirm that it is structured correctly and ready for downstream analysis. We display the shape of the matrix, the metadata associated with the observations (cells), and the metadata associated with the variables. Note that the variables in this matrix are transcripts, one per Ensembl transcript ID, even though the print statement below labels them as genes.

# Display the shape of the concatenated Anndata object
print(f"The Anndata object has {andata_concat.n_obs} observations (cells) and {andata_concat.n_vars} variables (genes).")

# Display the first few entries of the observation metadata to inspect batch information and other annotations
print("Observation metadata (first 5 entries):")
print(andata_concat.obs.head())

# Display the first few entries of the variable metadata to inspect gene and transcript information
print("Variable metadata (first 5 entries):")
print(andata_concat.var.head())

# Check for unique observation names and make them unique if necessary
if not andata_concat.obs_names.is_unique:
    andata_concat.obs_names_make_unique()
    print("Observation names were not unique; they have been made unique.")

The Anndata object has 122872 observations (cells) and 89177 variables (genes).
Observation metadata (first 5 entries):
                          batch
AGGAAATGTACAAGCG  D498_BIOP_INT
GCCATTCGTCGGAACA  D498_BIOP_INT
TCGACCTCAGTGTGCC  D498_BIOP_INT
CGTAGTATCAGTGTGT  D498_BIOP_INT
GCCAGGTGTCTAACTG  D498_BIOP_INT
Variable metadata (first 5 entries):
                       geneId     transcriptId nbExons
transcriptId                                          
ENST00000548501       CYP4F12  ENST00000548501       4
ENST00000324229         CALCB  ENST00000324229       5
ENST00000371489          MYOF  ENST00000371489      15
ENST00000368659       SLC27A3  ENST00000368659       2
ENST00000669353  TMEM161B-AS1  ENST00000669353       4
Observation names were not unique; they have been made unique.
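Duplicate observation names are expected here, because the same cell barcode can occur in more than one run. The sketch below uses made-up barcodes and batch labels to show how such duplicates can be listed, and that a batch-prefixed name is unique without any renaming.

```python
from collections import Counter

# Toy data: the first barcode appears in two different runs.
barcodes = ["AGGAAATGTACAAGCG", "GCCATTCGTCGGAACA", "AGGAAATGTACAAGCG"]
batches = ["D498_BIOP_INT", "D498_BIOP_INT", "D492_BIOP_NAS"]

# Barcodes that occur more than once across all runs.
duplicated = [bc for bc, n in Counter(barcodes).items() if n > 1]

# Prefixing each barcode with its batch label removes the ambiguity.
prefixed = [batch + bc for batch, bc in zip(batches, barcodes)]
prefixed_is_unique = len(set(prefixed)) == len(prefixed)

print(duplicated)
print(prefixed_is_unique)
```

Since obs_names_make_unique appends a numeric suffix to repeated names, cells renamed that way may no longer line up with an index that is later rebuilt from the batch label and barcode. It may be worth checking how many names were altered before relying on exact name matching.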

Access the count matrix from the concatenated Anndata object to analyze the transcript count data.

andata_concat.X

array([[0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 1., 2., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

andata_concat.var

                       geneId     transcriptId nbExons
transcriptId
ENST00000548501       CYP4F12  ENST00000548501       4
ENST00000324229         CALCB  ENST00000324229       5
ENST00000371489          MYOF  ENST00000371489      15
ENST00000368659       SLC27A3  ENST00000368659       2
ENST00000669353  TMEM161B-AS1  ENST00000669353       4
...                       ...              ...     ...
ENST00000548209        LETMD1  ENST00000548209       5
ENST00000490703      TBC1D10B  ENST00000490703       6
ENST00000617887      TMEM200A  ENST00000617887       2
ENST00000442834        YY1AP1  ENST00000442834       4
ENST00000394260      PRICKLE4  ENST00000394260       5

89177 rows × 3 columns

andata_concat.obs

                          batch
AGGAAATGTACAAGCG  D498_BIOP_INT
GCCATTCGTCGGAACA  D498_BIOP_INT
TCGACCTCAGTGTGCC  D498_BIOP_INT
CGTAGTATCAGTGTGT  D498_BIOP_INT
GCCAGGTGTCTAACTG  D498_BIOP_INT
...                         ...
AGTGACTTCTAAGCCA  D492_BIOP_INT
CATTGTTCATCACCAA  D492_BIOP_INT
GATGATCCACACAGAG  D492_BIOP_INT
TCGAACATCAGTGCGC  D492_BIOP_INT
GTTGCGGCACCTGCTT  D492_BIOP_INT

122872 rows × 1 columns

The write_h5ad method of the AnnData object serializes andata_concat to an .h5ad file, the HDF5-based on-disk format for AnnData. The filename 'discovair_long_read_transcript_matrix.h5ad' reflects the file's contents: the transcript matrix obtained from long-read sequencing data.

andata_concat.write_h5ad('discovair_long_read_transcript_matrix.h5ad')

Here we use the sc.read_h5ad function to load two AnnData objects holding transcriptomic data from long-read and short-read sequencing. The long-read data, which captures full-length transcripts and therefore isoform diversity, is loaded from 'discovair_long_read_transcript_matrix.h5ad'. The short-read data, which covers more cells and potentially offers more accurate gene-level quantification, is loaded from 'integrated_V10.h5ad'.

isoform_anndata_from_long_reads = sc.read_h5ad("discovair_long_read_transcript_matrix.h5ad")
gene_anndata_from_short_reads = sc.read_h5ad("/data/analysis/data_mcandrew/000-sclr-discovair/integrated_V10.h5ad")

Examining the long-read transcript matrix:

isoform_anndata_from_long_reads

AnnData object with n_obs × n_vars = 122872 × 89177
    obs: 'batch'
    var: 'geneId', 'transcriptId', 'nbExons'

Next, we examine the much larger short-read gene-level dataset:

gene_anndata_from_short_reads

AnnData object with n_obs × n_vars = 414609 × 36602
    obs: 'manip', 'donor', 'method', 'position', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'louvain', 'n_genes', 'nCount_SCT', 'nFeature_SCT', 'batch', 'age', 'gender', 'phenotype', 'respifinder', 'TRACvsNAS', 'sixty_plus', 'smoker', 'smoking_years', 'leiden', 'leiden_Endothelial', 'leiden_Stromal', 'leiden_Immune', 'leiden_Epithelial', 'log1p_n_genes_by_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'celltype_lv2_V4', 'celltype_lv0_V4', 'celltype_lv1_V4', 'celltype_lv2_V5', 'celltype_lv0_V5', 'celltype_lv1_V5', 'leiden_scANVI', 'disease_score', 'smoker_phenotype', 'leiden_scANVI_hvg_10000', 'leiden_scANVI_nl_50', 'leiden_scANVI_hvg_10000_nl_50', 'celltype_lv3_V5'
    var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'mt', 'ribo'
    uns: 'Adventitial Fibroblast_colors', 'DE_ct_lv2', 'DE_ct_lv3', 'celltype_lv0_V4_colors', 'celltype_lv0_V5_colors', 'celltype_lv1_V4_colors', 'celltype_lv1_V5_colors', 'celltype_lv2_V4_colors', 'celltype_lv2_V5_colors', 'celltype_lv3_V5_colors', 'donor_colors', 'leiden', 'neighbors', 'neighbors_scanvi', 'pca', 'phenotype_colors', 'position_colors', 'rank_genes_groups_leiden', 'umap'
    obsm: 'X_pca', 'X_scANVI', 'X_scANVI_hvg_10000', 'X_scANVI_hvg_10000_nl_50', 'X_scANVI_nl_50', 'X_umap', 'dorothea_mlm_estimate', 'dorothea_mlm_pvals', 'mlm_estimate', 'mlm_pvals'
    varm: 'PCs', 'gini_celltype', 'n_cells_celltype_lv2_V3'
    obsp: 'connectivities', 'distances', 'neighbors_scanvi_connectivities', 'neighbors_scanvi_distances'

The short-read gene quantification dataset contains substantially more cells than the long-read dataset (414609 versus 122872). The short-read dataset is also richly annotated, whereas the long-read dataset carries only a batch label. Because both datasets originate from the same library, which was divided and sequenced on different platforms, we expect an overlap in cell identities. This overlap makes it possible to transfer annotations from the short-read to the long-read dataset by matching corresponding cells.

gene_anndata_from_short_reads.obs

	manip	donor	method	position	n_genes_by_counts	total_counts	total_counts_mt	pct_counts_mt	total_counts_ribo	pct_counts_ribo	...	celltype_lv2_V5	celltype_lv0_V5	celltype_lv1_V5	leiden_scANVI	disease_score	smoker_phenotype	leiden_scANVI_hvg_10000	leiden_scANVI_nl_50	leiden_scANVI_hvg_10000_nl_50	celltype_lv3_V5
D460_BIOP_PRO1GGCTTGGAGCGCCTCA-1	D460_BIOP_PRO1	D460	BIOP	PRO	2150	5919.0	283.0	4.782021	1510.0	25.515377	...	Veinous	Endothelial	Endothelial	11	GAP Stage 1	non-smoker_IPF	9	9	8	Veinous
D463_BIOP_NAS1TCACTCGCATTGGGAG-1	D463_BIOP_NAS1	D463	BIOP	NAS	1927	4979.0	474.0	9.519984	1357.0	27.254469	...	Veinous	Endothelial	Endothelial	11	GAP Stage 1	non-smoker_IPF	9	9	8	Veinous
D534_BIOP_PROAATCGACAGCAAGTCG-1	D534_BIOP_PRO	D534	BIOP	PRO	1264	3013.0	311.0	10.321939	779.0	25.854630	...	Capillary	Endothelial	Endothelial	11	Healthy	non-smoker_CTRL	9	9	8	Capillary
D463_BIOP_NAS1TCGCTTGTCACTTGGA-1	D463_BIOP_NAS1	D463	BIOP	NAS	3691	11794.0	1314.0	11.141258	2867.0	24.308971	...	Veinous	Endothelial	Endothelial	11	GAP Stage 1	non-smoker_IPF	9	9	8	Veinous
D489_BIOP_PROAGGGAGTTCGGTCTGG-1	D489_BIOP_PRO	D489	BIOP	PRO	738	1096.0	57.0	5.200730	127.0	11.587591	...	Capillary	Endothelial	Endothelial	11	GOLD 1	non-smoker_BPCO	9	9	8	Capillary
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
D460_BRUS_NAS1TCTATACCAATGGGTG-1	D460_BRUS_NAS1	D460	BRUS	NAS	1500	4263.0	447.0	10.485574	1342.0	31.480179	...	Suprabasal	Epithelial	Suprabasal	0	GAP Stage 1	non-smoker_IPF	2	1	1	Suprabasal
D460_BRUS_NAS1GTTATGGCAATGGCAG-1	D460_BRUS_NAS1	D460	BRUS	NAS	2422	6089.0	774.0	12.711448	740.0	12.153063	...	Ionocyte	Epithelial	Ionocyte	24	GAP Stage 1	non-smoker_IPF	29	27	27	Ionocyte
D460_BRUS_NAS1ATGAGTCAGCCGTTGC-1	D460_BRUS_NAS1	D460	BRUS	NAS	2784	11638.0	1460.0	12.545111	2642.0	22.701494	...	Goblet	Epithelial	Goblet	5	GAP Stage 1	non-smoker_IPF	13	5	4	Goblet
D460_BRUS_NAS1TCATACTAGCAGTAAT-1	D460_BRUS_NAS1	D460	BRUS	NAS	2563	8025.0	919.0	11.451714	1619.0	20.174454	...	Goblet	Epithelial	Goblet	5	GAP Stage 1	non-smoker_IPF	13	5	4	Goblet
D460_BRUS_NAS1TTGTTGTCAAGATGTA-1	D460_BRUS_NAS1	D460	BRUS	NAS	1380	3443.0	255.0	7.406332	724.0	21.028173	...	Goblet	Epithelial	Goblet	5	GAP Stage 1	non-smoker_IPF	13	5	4	Goblet

414609 rows × 47 columns

To analyze the long-read and short-read data together, the cell identifiers (obs_names) of the two AnnData objects must follow the same convention. In the short-read dataset, each identifier concatenates the sample name, the cell barcode, and a "-1" suffix, so we rebuild the long-read index the same way from the batch column and the barcode. Once the identifiers agree, cells can be matched across the datasets for integration, differential expression analysis, and visualization.

isoform_anndata_from_long_reads.obs['batch'] = isoform_anndata_from_long_reads.obs['batch'].astype(str)
isoform_anndata_from_long_reads.obs_names = isoform_anndata_from_long_reads.obs['batch'] + isoform_anndata_from_long_reads.obs_names + "-1"

After standardizing the index, we can confirm that the long-read identifiers now follow the same format as the short-read identifiers. This alignment is what allows cells present in both datasets to be identified.

isoform_anndata_from_long_reads.obs_names

Index(['D498_BIOP_INTAGGAAATGTACAAGCG-1', 'D498_BIOP_INTGCCATTCGTCGGAACA-1',
       'D498_BIOP_INTTCGACCTCAGTGTGCC-1', 'D498_BIOP_INTCGTAGTATCAGTGTGT-1',
       'D498_BIOP_INTGCCAGGTGTCTAACTG-1', 'D498_BIOP_INTTGTGTGAGTGTTGACT-1',
       'D498_BIOP_INTCAGATACTCCAACTGA-1', 'D498_BIOP_INTGCCGATGTCTCATTAC-1',
       'D498_BIOP_INTGGAGAACTCTCGAGTA-1', 'D498_BIOP_INTAAGCATCTCGTGGTAT-1',
       ...
       'D492_BIOP_INTAAAGTGAAGGTTACAA-1', 'D492_BIOP_INTTACGGGCGTGAGACCA-1',
       'D492_BIOP_INTACAGGGAGTCAACATC-1', 'D492_BIOP_INTTTTCGATCAGGCCTGT-1',
       'D492_BIOP_INTAACAACCTCATCAGTG-1', 'D492_BIOP_INTAGTGACTTCTAAGCCA-1',
       'D492_BIOP_INTCATTGTTCATCACCAA-1', 'D492_BIOP_INTGATGATCCACACAGAG-1',
       'D492_BIOP_INTTCGAACATCAGTGCGC-1', 'D492_BIOP_INTGTTGCGGCACCTGCTT-1'],
      dtype='object', length=122872)
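The renaming rule can be written as a small pure-Python function, which makes the convention explicit and easy to test. The function name is hypothetical; the format (batch label, then barcode, then "-1") is taken from the index printed above.

```python
def short_read_style_name(batch: str, barcode: str, suffix: str = "-1") -> str:
    """Build a cell identifier in the short-read naming convention.

    Hypothetical helper: concatenates the batch label, the cell barcode,
    and a suffix, with no separator between batch and barcode.
    """
    return f"{batch}{barcode}{suffix}"


# Batch labels and barcodes taken from the first and last rows shown earlier.
batches = ["D498_BIOP_INT", "D492_BIOP_INT"]
barcodes = ["AGGAAATGTACAAGCG", "GTTGCGGCACCTGCTT"]

names = [short_read_style_name(b, bc) for b, bc in zip(batches, barcodes)]
print(names)
```

Because there is no separator between the batch label and the barcode, the rule only produces matching names when the batch label is spelled exactly as in the short-read index.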

In this step, we use the subset_common_cells function from LongReadTools to restrict the long-read dataset to cells that are also present in the short-read dataset. Keeping only the intersection of the two cell indexes ensures that we compare the same cells across both AnnData objects.

from longreadtools.Standardization import *
isoform_matrix = subset_common_cells(isoform_anndata_from_long_reads, gene_anndata_from_short_reads)
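The underlying idea, keeping only the cells whose identifiers occur in both datasets, can be sketched on two small pandas DataFrames standing in for the .obs tables. This is not the implementation of subset_common_cells; the cell names and annotation values are invented.

```python
import pandas as pd

# Stand-in for the long-read .obs table (identifiers already harmonized).
long_obs = pd.DataFrame(
    {"batch": ["D498_BIOP_INT", "D498_BIOP_INT", "D492_BIOP_INT"]},
    index=["D498_BIOP_INTAAAA-1", "D498_BIOP_INTCCCC-1", "D492_BIOP_INTGGGG-1"],
)
# Stand-in for the short-read .obs table, with one cell absent from long reads.
short_obs = pd.DataFrame(
    {"celltype": ["Goblet", "Ionocyte", "Suprabasal"]},
    index=["D492_BIOP_INTGGGG-1", "D498_BIOP_INTAAAA-1", "D460_BRUS_NAS1TTTT-1"],
)

# Cell identifiers present in both tables.
common = long_obs.index.intersection(short_obs.index)

# Subset both tables with the same index so the rows line up one to one.
long_common = long_obs.loc[common]
short_common = short_obs.loc[common]

print(len(common), "cells in common")
```

Comparing len(common) with the size of each dataset is a quick sanity check: a very small overlap usually points to a mismatch in the naming convention rather than to genuinely different cells.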

With the indexes standardized and the isoform AnnData object subset, we apply the same subsetting to the gene AnnData object from the short-read data. Both datasets then contain only the cells common to both, which is a prerequisite for accurate annotation transfer.

gene_matrix = subset_common_cells(gene_anndata_from_short_reads, isoform_matrix)

Next, we transfer the observation annotations from gene_matrix, which holds the short-read data, to isoform_matrix, which holds the long-read data. The transfer_obs function from LongReadTools maps the .obs columns from one AnnData object to the other based on the shared cell identifiers.

annotated_isoform_matrix = transfer_obs(gene_matrix, isoform_matrix)
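Conceptually, transferring annotations amounts to an index-aligned join between two .obs tables. The pandas sketch below illustrates this under stated assumptions: it is not the transfer_obs implementation, the data are invented, and skipping columns that already exist in the target (here, batch) is one possible policy among several.

```python
import pandas as pd

# Annotated source table (stand-in for the short-read .obs).
source_obs = pd.DataFrame(
    {
        "batch": ["D498_BIOP_INT", "D492_BIOP_INT"],
        "celltype": ["Goblet", "Ionocyte"],
        "donor": ["D498", "D492"],
    },
    index=["D498_BIOP_INTAAAA-1", "D492_BIOP_INTGGGG-1"],
)
# Target table with the same cells in a different order (long-read .obs).
target_obs = pd.DataFrame(
    {"batch": ["D492_BIOP_INT", "D498_BIOP_INT"]},
    index=["D492_BIOP_INTGGGG-1", "D498_BIOP_INTAAAA-1"],
)

# Only bring over columns the target does not already have.
new_columns = source_obs.columns.difference(target_obs.columns)

# Join on the index: rows are matched by cell identifier, not by position,
# and the target's row order is preserved.
annotated_obs = target_obs.join(source_obs[new_columns])

print(annotated_obs)
```

Matching by identifier rather than by position is the important point: the two tables list the cells in different orders, yet each cell receives its own annotations.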

We now inspect the annotated isoform matrix produced by the standardization and subsetting steps above. The annotated_isoform_matrix object combines the isoform-level counts from long-read sequencing with the annotations transferred from the short-read gene matrix.

annotated_isoform_matrix.X

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 1., 2., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [3., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

This matrix describes expression at isoform resolution, which helps reveal the complexity of gene expression patterns. The transferred annotations, such as cell type, donor information, and technical attributes, support downstream analyses of the biological and clinical questions behind the data. Let's save it to disk for later use!

annotated_isoform_matrix.write('/data/analysis/data_mcandrew/000-sclr-discovair/discovair_long_read_transcript_matrix_annotated.h5ad')

And there you have it, we are done!
