support transposed assay data #42

Closed
mtmorgan opened this issue Feb 27, 2021 · 18 comments
Labels: bug (Something isn't working)

Comments

@mtmorgan
Contributor

My understanding (from @hpages) is that transposed assay data matrices should be supported for HDF5-backed files, but after downloading Human Cell Atlas h5ad data from https://data.humancellatlas.org/explore/projects/f83165c5-e2ea-4d15-a5cf-33f3550bffde/expression-matrices?catalog=dcp2 (look for the h5ad files) I see

> sce <- zellkonverter::readH5AD(file_path, use_hdf5 = TRUE)
Warning message:
In .extract_or_skip_assay(skip_assays = skip_assays, hdf5_backed = hdf5_backed,  :
  'X' matrix does not support transposition and has been skipped

and indeed the matrix is all zeros.

This can be reproduced programmatically with

BiocManager::install("Bioconductor/hca")
library("hca"); library("dplyr")

filters <- filters(
    projectId = list(is = "f83165c5-e2ea-4d15-a5cf-33f3550bffde"),
    fileFormat = list(is = "h5ad")
)
files <-
    files(filters) %>%
    head(1)

file_path <- files_download(files)

sce <- zellkonverter::readH5AD(file_path, use_hdf5 = TRUE)
@hpages
Contributor

hpages commented Feb 27, 2021

Works fine as far as HDF5Array::H5ADMatrix() is concerned:

file_path
# 6d4fedcf-857d-5fbb-9928-8b9605500a69-2020-11-20T09:03:11.285229Z 
#               "/tmp/Rtmpvwuwxz/a5be939a99f58_a5be939a99f58.h5ad" 

library(HDF5Array)

H5ADMatrix(file_path)
# <30191 x 4891> sparse matrix of class H5ADMatrix and type "double":
#                [,1]       [,2]       [,3] ... [,4890] [,4891]
#     [1,]          0          0          0   .       0       0
#     [2,]          0          0          0   .       0       0
#     [3,]          0          0          0   .       0       0
#     [4,]          0          0          0   .       0       0
#     [5,]          0          0          0   .       0       0
#      ...          .          .          .   .       .       .
# [30187,] 0.00000000 0.00000000 0.00000000   .       0       0
# [30188,] 0.00000000 0.00000000 0.00000000   .       0       0
# [30189,] 0.06667092 0.00000000 0.00000000   .       0       0
# [30190,] 0.00000000 0.00000000 0.00000000   .       0       0
# [30191,] 0.00000000 0.00000000 0.00000000   .       0       0
# Warning message:
# In .load_h5ad_dimnames(filepath) :
#   could not find dimnames (normally stored in datasets '/var/_index' and
#   '/obs/_index') in this .h5ad file

(I still need to work on the dimnames part.)

sessionInfo():

> sessionInfo()
R Under development (unstable) (2021-02-05 r79941)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.1.r79941/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.1.r79941/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] HDF5Array_1.19.6     rhdf5_2.35.2         DelayedArray_0.17.9 
 [4] IRanges_2.25.6       S4Vectors_0.29.7     MatrixGenerics_1.3.1
 [7] matrixStats_0.58.0   BiocGenerics_0.37.1  Matrix_1.3-2        
[10] hca_0.99.0          

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6           pillar_1.5.0         compiler_4.1.0      
 [4] dbplyr_2.1.0         rhdf5filters_1.3.4   tools_4.1.0         
 [7] bit_4.0.4            lattice_0.20-41      jsonlite_1.7.2      
[10] BiocFileCache_1.15.1 RSQLite_2.2.3        memoise_2.0.0       
[13] lifecycle_1.0.0      tibble_3.1.0         pkgconfig_2.0.3     
[16] rlang_0.4.10         DBI_1.1.1            cli_2.3.1           
[19] rstudioapi_0.13      filelock_1.0.2       curl_4.3            
[22] fastmap_1.1.0        withr_2.4.1          dplyr_1.0.4         
[25] httr_1.4.2           generics_0.1.0       vctrs_0.3.6         
[28] rappdirs_0.3.3       grid_4.1.0           bit64_4.0.5         
[31] tidyselect_1.1.0     glue_1.4.2           R6_2.5.0            
[34] fansi_0.4.2          Rhdf5lib_1.13.4      purrr_0.3.4         
[37] tidyr_1.1.2          blob_1.2.1           magrittr_2.0.1      
[40] ps_1.5.0             ellipsis_0.3.1       assertthat_0.2.1    
[43] utf8_1.1.4           cachem_1.0.4         crayon_1.4.1        

@LTLA
Contributor

LTLA commented Feb 28, 2021

I believe this issue is discussed in #37 and #41.

@hpages
Contributor

hpages commented Feb 28, 2021

FWIW, starting with HDF5Array 1.19.7, H5ADMatrix() should always be able to find the dimnames (unless the AnnData folks have another sneaky way to hide them in the file):

file_path
# 6d4fedcf-857d-5fbb-9928-8b9605500a69-2020-11-20T09:03:11.285229Z 
#                 "/tmp/RtmpxFYLjF/a8e7d957870a_a8e7d957870a.h5ad" 

library(HDF5Array)

H5ADMatrix(file_path)
# <30191 x 4891> sparse matrix of class H5ADMatrix and type "double":
#             24087_4#281-vento18 ...  23728_8#64-vento18
# MIR1302-2HG                   0   .                   0
#     FAM138A                   0   .                   0
#  FO538757.2                   0   .                   0
#      FAM87B                   0   .                   0
#   LINC00115                   0   .                   0
#         ...                   .   .                   .
#  AP001469.3          0.00000000   .                   0
#  AC136352.2          0.00000000   .                   0
#  BX004987.1          0.06667092   .                   0
#  AC145212.1          0.00000000   .                   0
#  AC213203.2          0.00000000   .                   0
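
For anyone who needs a workaround while readH5AD() is being fixed, here is a minimal sketch (this is not zellkonverter's implementation, and it assumes only the main X matrix and its dimnames are needed) that wraps this backed matrix in a SingleCellExperiment:

library(HDF5Array)
library(SingleCellExperiment)

# H5ADMatrix() returns a genes x cells DelayedMatrix backed by the .h5ad
# file on disk, so nothing is loaded into memory and no transpose is needed.
mat <- H5ADMatrix(file_path)
sce <- SingleCellExperiment(assays = list(X = mat))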

@lazappi lazappi added the bug Something isn't working label Mar 5, 2021
@lazappi
Member

lazappi commented Mar 5, 2021

Should be fixed in devel with v1.1.5

@lazappi lazappi closed this as completed Mar 5, 2021
@royfrancis

I have the same issue with bioconductor-zellkonverter 1.8.0.

@lazappi
Member

lazappi commented Dec 7, 2022

@royfrancis Can you please provide some code/file to reproduce the issue? I'm hoping this exact issue has been solved, but you could be getting the same warning for a different reason.

@royfrancis

royfrancis commented Dec 7, 2022

Many hours later...

Now I have a docker image to reproduce the error.

docker run --rm --user "$(id -u):$(id -g)" -v ${PWD}:/zell/work royfrancis/zellkonverter:1.8.0

The h5ad file that I am converting is unfortunately huge: the Extended GBmap (11 GB) from here. Rename local.h5ad to file.h5ad, then run docker:

docker run --user "$(id -u):$(id -g)" --rm -v ${PWD}:/zell/work zell:1.8

Warning message:
'X' matrix does not support transposition and has been skipped

It worked with a few other smaller files that I tried. For example 1295 cells or 7999 cells or 141401 cells.

Maybe it's an issue with the h5ad file? 🤔
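
One way to narrow that down is to inspect how 'X' is stored in the file; a small diagnostic sketch, assuming the file is named file.h5ad in the working directory and the rhdf5 package is installed:

library(rhdf5)

# 'X' can be a dense dataset or a sparse group with data/indices/indptr
# components; the layout affects how readers handle transposition.
h5ls("file.h5ad", recursive = FALSE)

# Files written by recent AnnData versions usually record the sparse
# encoding in the group attributes, e.g. 'csr_matrix' or 'csc_matrix'.
h5readAttributes("file.h5ad", "X")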


To build the image from scratch, place the three files below into a directory:

Dockerfile

FROM r-base:4.1.1
LABEL Description="Docker image for zellkonverter"

RUN apt-get update \
   && apt-get install --no-install-recommends -y build-essential \
   && apt-get clean \
   && rm -rf /var/lib/apt/lists/* \
   && mkdir /zell \
   && mkdir /zell/scripts /zell/work

COPY env.yml convert.R /zell/scripts/

ENV CONDA_DIR=$HOME/miniconda3
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh \
  && /bin/bash ~/miniconda.sh -b -p $CONDA_DIR \
  && rm ~/miniconda.sh
ENV PATH=$CONDA_DIR/bin:$PATH

RUN conda install mamba -n base -c conda-forge
RUN mamba env create -f /zell/scripts/env.yml
WORKDIR /zell/work
CMD conda run -n zell Rscript /zell/scripts/convert.R

env.yml

name: zell
channels:
 - bioconda
 - conda-forge
dependencies:
 - python=3.7
 - bioconductor-zellkonverter
 - pip
 - pip:
    - anndata

convert.R

library(zellkonverter)

anndata <- reticulate::import("anndata")

if (!file.exists("file.h5ad")) stop(file.path(getwd(), "file.h5ad"), " does not exist.")
message("Reading ", file.path(getwd(), "file.h5ad"), "...")
adata <- anndata$read_h5ad("file.h5ad")

message("Converting AnnData to SCE...")
sce <- zellkonverter::AnnData2SCE(adata)

message("Exporting ", file.path(getwd(), "sce.Rds"), "...")
saveRDS(sce, "sce.Rds")
message("Completed.")

Run this in that directory to build the docker image:
docker build -t zellkonverter:1.8.0 .

@lazappi
Member

lazappi commented Dec 12, 2022

I have had a quick look at this. I think what is happening is that the dataset is too big and transposing the X matrix in R fails (probably because it runs out of memory). The warning message isn't very helpful in this case, so that can be improved, but maybe we should also look again at doing the transposition in Python, as I suggested in #9.
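
To illustrate the memory point (this is just a sketch, not zellkonverter internals): t() on an ordinary in-memory matrix has to allocate a full transposed copy, whereas t() on an HDF5-backed DelayedMatrix is recorded as a delayed operation and copies nothing:

library(HDF5Array)

m <- matrix(rnorm(1e6), nrow = 1000)   # small in-memory example
# For a matrix the size of the Extended GBmap, t(m) would need a complete
# second copy in RAM before anything else happens.

backed <- writeHDF5Array(m)            # HDF5-backed DelayedMatrix
tb <- t(backed)                        # delayed transpose: no data read or copied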

@mtmorgan
Contributor Author

mtmorgan commented Dec 12, 2022

I don't know the context, but if the goal is to support Bioconductor workflows, does it make sense to use

HDF5Array::H5SparseMatrix(glioblastoma_h5ad_file, "/X")

which handles the transformation anyway? It is also more-or-less instant, since it does not actually load the entire data into memory.

Oops, I see a similar comment from @hpages upstream. And perhaps the problem is in AnnData2SCE(), but shouldn't one just go for the R representation directly, without Python?
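
For anyone who wants to try that route, a rough sketch (not the zellkonverter implementation; the stored orientation depends on the file, so check dim() before deciding whether a transpose is needed):

library(HDF5Array)
library(SingleCellExperiment)

# '/X' as an HDF5-backed sparse DelayedMatrix; nothing is loaded into RAM.
x <- H5SparseMatrix(glioblastoma_h5ad_file, "/X")
dim(x)   # check which way round the file stores it

# If it comes back cells x genes (the AnnData convention), flip it with a
# delayed, no-copy transpose to get the Bioconductor genes x cells layout.
sce <- SingleCellExperiment(assays = list(counts = t(x)))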

@lazappi
Member

lazappi commented Dec 15, 2022

Yeah, I think this would be the correct approach for the R reader, but I'm not sure it helps with the Python reader, where we already have the data in memory. This kind of scalability is probably another reason to work on the R reader more; at the very least you avoid having two copies of everything.

@jenellewallace

Was this issue ever solved? I have the same problem:

sce = readH5AD(file, use_hdf5=TRUE)
Warning: 'X' matrix does not support transposition and has been skipped
ℹ Using stored X_name value 'counts'

And the resulting counts matrix is all zeroes.

@royfrancis

royfrancis commented May 18, 2023

Zellkonverter didn't work for my file of interest, so my solution was simply to export the data as text and convert it into an R format. The code I used is below. This solution is extremely memory intensive, depending on the size of your dataset, and there are probably better ways to do this.

H5AD -> CSV -> R

# python

import scanpy as sc
import numpy as np
import pandas as pd

print("Reading data...")
adata = sc.read_h5ad("file.h5ad")

print("Writing data...")
t=adata.raw.X.toarray()
pd.DataFrame(data=t, index=adata.obs_names, columns=adata.raw.var_names).to_csv('raw-counts.csv')
# comma separated csv with header, no rownames
pd.DataFrame(adata.obs).to_csv("metadata.csv")

# R

library(Seurat)

message("Reading counts...")
x <- read.csv("raw-counts.csv",header=TRUE)
rownames(x) <- x[,1]
x[,1] <- NULL
print(dim(x))
print(x[1:5,1:5])

message("Reading metadata...")
m <- read.csv("metadata.csv",header=TRUE)
rownames(m) <- m[,1]
colnames(m)[1] <- "sample"
print(dim(m))
print(head(m))

message("Writing seurat object...")
saveRDS(
  CreateSeuratObject(counts=t(x),meta.data=m,project="extended",min.cells=0,min.features=0),
  "seurat.Rds"
)

@jenellewallace

@royfrancis Thanks for your response! I ended up being able to just download the raw data I needed from a public repository in R format, skipping the h5ad step. Hope this issue will be fixed soon!

@lazappi
Member

lazappi commented May 22, 2023

Was this issue ever solved? I have the same problem:

sce = readH5AD(file, use_hdf5=TRUE)
Warning: 'X' matrix does not support transposition and has been skipped
ℹ Using stored X_name value 'counts'

And the resulting counts matrix is all zeroes.

@jenellewallace It seems like you came up with a solution for your case, but for anybody else: we made updates to the R reader in the v1.10.0 release which should help with these kinds of issues (and it should be further improved in the next release).
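
For anyone hitting this now, a sketch of that route, assuming zellkonverter >= 1.10 is installed (argument behaviour may differ between versions):

library(zellkonverter)

# The native R reader avoids round-tripping through Python/basilisk and
# should sidestep the in-memory transpose of X that triggers this warning.
sce <- readH5AD(file, use_hdf5 = TRUE, reader = "R")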

@jgilis

jgilis commented May 22, 2023

Hi everyone,

I would like to jump in on this. I am working with the dataset from GSE174188 (GSE174188_CLUES1_adjusted.h5ad.gz).

In R 4.1.3 with zellkonverter 1.4.0, I was able to simply obtain the data with

lupus <- readH5AD(file=lupus_file, use_hdf5 = TRUE, raw = TRUE, verbose = TRUE)

However, after switching to R 4.2.0 with zellkonverter 1.8.0, the same code leads to the following error:

Error in validObject(.Object) :
 invalid class “SummarizedExperiment” object:
  'x@assays' is not parallel to 'x'
Calls: readH5AD ... new_SummarizedExperiment -> new -> initialize -> initialize -> validObject
In addition: Warning messages:
1: raw 'X' matrix does not support transposition and has been skipped
2: The names of these selected raw var columns have been modified to match R
conventions:
'feature_types-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0'
->
'feature_types.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0'
Execution halted

It turns out that the warning "raw 'X' matrix does not support transposition and has been skipped" is the underlying cause of the assay and rowData dimensions being mismatched, which ultimately results in SummarizedExperiment complaining.

In light of this thread, I also tried zellkonverter 1.11.0, but this did not resolve my issues.

@jgilis

jgilis commented May 22, 2023

> sessionInfo()
R version 4.2.0 alpha (2022-04-04 r82084)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods  
[8] base     

other attached packages:
 [1] dplyr_1.1.2                 scuttle_1.8.4              
 [3] SingleCellExperiment_1.20.1 SummarizedExperiment_1.28.0
 [5] Biobase_2.58.0              GenomicRanges_1.50.2       
 [7] GenomeInfoDb_1.34.9         IRanges_2.32.0             
 [9] S4Vectors_0.36.2            BiocGenerics_0.44.0        
[11] MatrixGenerics_1.10.0       matrixStats_0.63.0         
[13] reticulate_1.28             zellkonverter_1.11.0       
[15] basilisk.utils_1.10.0       basilisk_1.10.2            

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10               here_1.0.1               
 [3] dir.expiry_1.6.0          lattice_0.21-8           
 [5] png_0.1-8                 rprojroot_2.0.3          
 [7] digest_0.6.31             utf8_1.2.3               
 [9] R6_2.5.1                  evaluate_0.20            
[11] pillar_1.9.0              sparseMatrixStats_1.10.0 
[13] zlibbioc_1.44.0           rlang_1.1.0              
[15] rstudioapi_0.14           Matrix_1.5-4             
[17] rmarkdown_2.21            BiocParallel_1.32.6      
[19] RCurl_1.98-1.12           beachmat_2.14.2          
[21] HDF5Array_1.26.0          DelayedArray_0.24.0      
[23] compiler_4.2.0            xfun_0.39                
[25] pkgconfig_2.0.3           htmltools_0.5.5          
[27] tidyselect_1.2.0          tibble_3.2.1             
[29] GenomeInfoDbData_1.2.9    codetools_0.2-19         
[31] fansi_1.0.4               withr_2.5.0              
[33] rhdf5filters_1.10.1       bitops_1.0-7             
[35] grid_4.2.0                jsonlite_1.8.4           
[37] lifecycle_1.0.3           magrittr_2.0.3           
[39] cli_3.6.1                 XVector_0.38.0           
[41] renv_0.17.3               fs_1.6.2                 
[43] DelayedMatrixStats_1.20.0 filelock_1.0.2           
[45] generics_0.1.3            vctrs_0.6.2              
[47] Rhdf5lib_1.20.0           tools_4.2.0              
[49] forcats_1.0.0             glue_1.6.2               
[51] parallel_4.2.0            fastmap_1.1.1            
[53] yaml_2.3.7                rhdf5_2.42.1             
[55] BiocManager_1.30.20       knitr_1.42 
> env <- zellkonverterAnnDataEnv(version)
> env
An object of class "BasiliskEnvironment"
Slot "envname":
[1] "zellkonverterAnnDataEnv-0.8.0"

Slot "pkgname":
[1] "zellkonverter"

Slot "packages":
 [1] "anndata==0.8.0"  "h5py==3.6.0"     "hdf5==1.12.1"    "natsort==8.1.0" 
 [5] "numpy==1.22.3"   "packaging==21.3" "pandas==1.4.2"   "python==3.8.13" 
 [9] "scipy==1.7.3"    "sqlite==3.38.2" 

Slot "channels":
[1] "conda-forge"

Slot "pip":
character(0)

Slot "paths":
character(0)

@jgilis

jgilis commented May 22, 2023

Note that I am still able to load the data with older versions on the same machine, so I do not expect memory issues to be the culprit here...

One more hint: it seems that this call

raw_x <- zellkonverter:::.extract_or_skip_assay(skip_assays = skip_assays, 
    hdf5_backed = hdf5_backed, dims = dims, mat = adata$raw$X, 
    name = "raw 'X' matrix")
dim(raw_x$mat)

results in a 32K x 1.2M matrix in the old versions, but now only returns a 1999 x 1.2M matrix.

Let me know if I can provide any more info to resolve this!

@lazappi
Member

lazappi commented May 23, 2023

@jgilis I think this is a separate issue so I have moved it to #96
