support transposed assay data #42

Closed
mtmorgan opened this issue Feb 27, 2021 · 18 comments
Labels: bug (Something isn't working)

Comments

@mtmorgan
Contributor

My understanding (from @hpages) is that transposed assay data matrices should be supported for HDF5-backed files, but after downloading Human Cell Atlas h5ad data from https://data.humancellatlas.org/explore/projects/f83165c5-e2ea-4d15-a5cf-33f3550bffde/expression-matrices?catalog=dcp2 (look for the h5ad files) I see

> sce <- zellkonverter::readH5AD(file_path, use_hdf5 = TRUE)
Warning message:
In .extract_or_skip_assay(skip_assays = skip_assays, hdf5_backed = hdf5_backed,  :
  'X' matrix does not support transposition and has been skipped

and indeed the matrix is all zeros.

This can be reproduced programmatically with

BiocManager::install("Bioconductor/hca")
library("hca"); library("dplyr")

filters <- filters(
    projectId = list(is = "f83165c5-e2ea-4d15-a5cf-33f3550bffde"),
    fileFormat = list(is = "h5ad")
)
files <-
    files(filters) %>%
    head(1)

file_path <- files_download(files)

sce <- zellkonverter::readH5AD(file_path, use_hdf5 = TRUE)
@hpages
Contributor

hpages commented Feb 27, 2021

Works fine as far as HDF5Array::H5ADMatrix() is concerned:

file_path
# 6d4fedcf-857d-5fbb-9928-8b9605500a69-2020-11-20T09:03:11.285229Z 
#               "/tmp/Rtmpvwuwxz/a5be939a99f58_a5be939a99f58.h5ad" 

library(HDF5Array)

H5ADMatrix(file_path)
# <30191 x 4891> sparse matrix of class H5ADMatrix and type "double":
#                [,1]       [,2]       [,3] ... [,4890] [,4891]
#     [1,]          0          0          0   .       0       0
#     [2,]          0          0          0   .       0       0
#     [3,]          0          0          0   .       0       0
#     [4,]          0          0          0   .       0       0
#     [5,]          0          0          0   .       0       0
#      ...          .          .          .   .       .       .
# [30187,] 0.00000000 0.00000000 0.00000000   .       0       0
# [30188,] 0.00000000 0.00000000 0.00000000   .       0       0
# [30189,] 0.06667092 0.00000000 0.00000000   .       0       0
# [30190,] 0.00000000 0.00000000 0.00000000   .       0       0
# [30191,] 0.00000000 0.00000000 0.00000000   .       0       0
# Warning message:
# In .load_h5ad_dimnames(filepath) :
#   could not find dimnames (normally stored in datasets '/var/_index' and
#   '/obs/_index') in this .h5ad file

(I still need to work on the dimnames part.)

sessionInfo():

> sessionInfo()
R Under development (unstable) (2021-02-05 r79941)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.1.r79941/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.1.r79941/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] HDF5Array_1.19.6     rhdf5_2.35.2         DelayedArray_0.17.9 
 [4] IRanges_2.25.6       S4Vectors_0.29.7     MatrixGenerics_1.3.1
 [7] matrixStats_0.58.0   BiocGenerics_0.37.1  Matrix_1.3-2        
[10] hca_0.99.0          

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6           pillar_1.5.0         compiler_4.1.0      
 [4] dbplyr_2.1.0         rhdf5filters_1.3.4   tools_4.1.0         
 [7] bit_4.0.4            lattice_0.20-41      jsonlite_1.7.2      
[10] BiocFileCache_1.15.1 RSQLite_2.2.3        memoise_2.0.0       
[13] lifecycle_1.0.0      tibble_3.1.0         pkgconfig_2.0.3     
[16] rlang_0.4.10         DBI_1.1.1            cli_2.3.1           
[19] rstudioapi_0.13      filelock_1.0.2       curl_4.3            
[22] fastmap_1.1.0        withr_2.4.1          dplyr_1.0.4         
[25] httr_1.4.2           generics_0.1.0       vctrs_0.3.6         
[28] rappdirs_0.3.3       grid_4.1.0           bit64_4.0.5         
[31] tidyselect_1.1.0     glue_1.4.2           R6_2.5.0            
[34] fansi_0.4.2          Rhdf5lib_1.13.4      purrr_0.3.4         
[37] tidyr_1.1.2          blob_1.2.1           magrittr_2.0.1      
[40] ps_1.5.0             ellipsis_0.3.1       assertthat_0.2.1    
[43] utf8_1.1.4           cachem_1.0.4         crayon_1.4.1        

@LTLA
Contributor

LTLA commented Feb 28, 2021

I believe this issue is discussed in #37 and #41.

@hpages
Contributor

hpages commented Feb 28, 2021

FWIW, starting with HDF5Array 1.19.7, H5ADMatrix() should always be able to find the dimnames (unless the AnnData folks have another sneaky way to hide them in the file):

file_path
# 6d4fedcf-857d-5fbb-9928-8b9605500a69-2020-11-20T09:03:11.285229Z 
#                 "/tmp/RtmpxFYLjF/a8e7d957870a_a8e7d957870a.h5ad" 

library(HDF5Array)

H5ADMatrix(file_path)
# <30191 x 4891> sparse matrix of class H5ADMatrix and type "double":
#             24087_4#281-vento18 ...  23728_8#64-vento18
# MIR1302-2HG                   0   .                   0
#     FAM138A                   0   .                   0
#  FO538757.2                   0   .                   0
#      FAM87B                   0   .                   0
#   LINC00115                   0   .                   0
#         ...                   .   .                   .
#  AP001469.3          0.00000000   .                   0
#  AC136352.2          0.00000000   .                   0
#  BX004987.1          0.06667092   .                   0
#  AC145212.1          0.00000000   .                   0
#  AC213203.2          0.00000000   .                   0
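
For anyone who needs a workaround while readH5AD() is being fixed, here is a minimal sketch (this is not zellkonverter's implementation, and it assumes only the main X matrix and its dimnames are needed) that wraps this backed matrix in a SingleCellExperiment:

library(HDF5Array)
library(SingleCellExperiment)

# H5ADMatrix() returns a genes x cells DelayedMatrix backed by the .h5ad
# file on disk, so nothing is loaded into memory and no transpose is needed.
mat <- H5ADMatrix(file_path)
sce <- SingleCellExperiment(assays = list(X = mat))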

@lazappi lazappi added the bug Something isn't working label Mar 5, 2021
@lazappi
Member

lazappi commented Mar 5, 2021

Should be fixed in devel with v1.1.5

@lazappi lazappi closed this as completed Mar 5, 2021
@royfrancis

I have the same issue with bioconductor-zellkonverter 1.8.0.

@lazappi
Member

lazappi commented Dec 7, 2022

@royfrancis Can you please provide some code/file to reproduce the issue? I'm hoping this exact issue has been solved, but you could be getting the same warning for a different reason.

@royfrancis

royfrancis commented Dec 7, 2022

Many hours later...

Now I have a docker image to reproduce the error.

docker run --rm --user "$(id -u):$(id -g)" -v ${PWD}:/zell/work royfrancis/zellkonverter:1.8.0

The h5ad file that I am converting is unfortunately huge: the Extended GBmap (11 GB) from here. Rename local.h5ad to file.h5ad, then run docker:

docker run --user "$(id -u):$(id -g)" --rm -v ${PWD}:/zell/work zell:1.8

Warning message:
'X' matrix does not support transposition and has been skipped

It worked with a few other smaller files that I tried. For example 1295 cells or 7999 cells or 141401 cells.

Maybe it's an issue with the h5ad file? 🤔
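
One way to narrow that down is to inspect how 'X' is stored in the file; a small diagnostic sketch, assuming the file is named file.h5ad in the working directory and the rhdf5 package is installed:

library(rhdf5)

# 'X' can be a dense dataset or a sparse group with data/indices/indptr
# components; the layout affects how readers handle transposition.
h5ls("file.h5ad", recursive = FALSE)

# Files written by recent AnnData versions usually record the sparse
# encoding in the group attributes, e.g. 'csr_matrix' or 'csc_matrix'.
h5readAttributes("file.h5ad", "X")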


To build the image from scratch, place the three files below into a directory:

Dockerfile

FROM r-base:4.1.1
LABEL Description="Docker image for zellkonverter"

RUN apt-get update \
   && apt-get install --no-install-recommends -y build-essential \
   && apt-get clean \
   && rm -rf /var/lib/apt/lists/* \
   && mkdir /zell \
   && mkdir /zell/scripts /zell/work

COPY env.yml convert.R /zell/scripts/

ENV CONDA_DIR=$HOME/miniconda3
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh \
  && /bin/bash ~/miniconda.sh -b -p $CONDA_DIR \
  && rm ~/miniconda.sh
ENV PATH=$CONDA_DIR/bin:$PATH

RUN conda install mamba -n base -c conda-forge
RUN mamba env create -f /zell/scripts/env.yml
WORKDIR /zell/work
CMD conda run -n zell Rscript /zell/scripts/convert.R

env.yml

name: zell
channels:
 - bioconda
 - conda-forge
dependencies:
 - python=3.7
 - bioconductor-zellkonverter
 - pip
 - pip:
    - anndata

convert.R

library(zellkonverter)

anndata <- reticulate::import("anndata")

if (!file.exists("file.h5ad")) stop(file.path(getwd(), "file.h5ad"), " does not exist.")
message("Reading ", file.path(getwd(), "file.h5ad"), "...")
adata <- anndata$read_h5ad("file.h5ad")

message("Converting AnnData to SCE...")
sce <- zellkonverter::AnnData2SCE(adata)

message("Exporting ", file.path(getwd(), "sce.Rds"), "...")
saveRDS(sce, "sce.Rds")
message("Completed.")

Run this in that directory to build the docker image:
docker build -t zellkonverter:1.8.0 .

@lazappi
Member

lazappi commented Dec 12, 2022

I have had a quick look at this. I think what is happening is that the dataset is too big and transposing the X matrix in R fails (probably because it runs out of memory). The warning message isn't very helpful in this case, so that can be improved, but maybe we should also look again at doing the transposition in Python, as I suggested in #9.
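
To illustrate the memory point (this is just a sketch, not zellkonverter internals): t() on an ordinary in-memory matrix has to allocate a full transposed copy, whereas t() on an HDF5-backed DelayedMatrix is recorded as a delayed operation and copies nothing:

library(HDF5Array)

m <- matrix(rnorm(1e6), nrow = 1000)   # small in-memory example
# For a matrix the size of the Extended GBmap, t(m) would need a complete
# second copy in RAM before anything else happens.

backed <- writeHDF5Array(m)            # HDF5-backed DelayedMatrix
tb <- t(backed)                        # delayed transpose: no data read or copied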

@mtmorgan
Contributor Author

mtmorgan commented Dec 12, 2022

I don't know the context, but if the goal is to support Bioconductor workflows, does it make sense to use

HDF5Array::H5SparseMatrix(glioblastoma_h5ad_file, "/X")

which handles the transformation anyway? It is also more-or-less instant, since it does not actually load the entire data into memory.

Oops, I see a similar comment from @hpages upstream. And perhaps the problem is in AnnData2SCE(), but shouldn't one just go for the R representation directly, without Python?
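
For anyone who wants to try that route, a rough sketch (not the zellkonverter implementation; the stored orientation depends on the file, so check dim() before deciding whether a transpose is needed):

library(HDF5Array)
library(SingleCellExperiment)

# '/X' as an HDF5-backed sparse DelayedMatrix; nothing is loaded into RAM.
x <- H5SparseMatrix(glioblastoma_h5ad_file, "/X")
dim(x)   # check which way round the file stores it

# If it comes back cells x genes (the AnnData convention), flip it with a
# delayed, no-copy transpose to get the Bioconductor genes x cells layout.
sce <- SingleCellExperiment(assays = list(counts = t(x)))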

@lazappi
Member

lazappi commented Dec 15, 2022

Yeah, I think this would be the correct approach for the R reader, but I'm not sure it helps with the Python reader, where we already have the data in memory. This kind of scalability is probably another reason to work on the R reader more; at the very least you avoid having two copies of everything.

@jenellewallace

Was this issue ever solved? I have the same problem:

sce = readH5AD(file, use_hdf5=TRUE)
Warning: 'X' matrix does not support transposition and has been skipped
ℹ Using stored X_name value 'counts'

And the resulting counts matrix is all zeroes.

@royfrancis

royfrancis commented May 18, 2023

Zellkonverter didn't work for my file of interest, so my solution was simply to export the data as text and convert it into an R format. The code I used is below. This solution is extremely memory intensive, depending on the size of your dataset, and there are probably better ways to do this.

H5AD -> CSV -> R

# python

import scanpy as sc
import numpy as np
import pandas as pd

print("Reading data...")
adata = sc.read_h5ad("file.h5ad")

print("Writing data...")
t=adata.raw.X.toarray()
pd.DataFrame(data=t, index=adata.obs_names, columns=adata.raw.var_names).to_csv('raw-counts.csv')
# comma separated csv with header, no rownames
pd.DataFrame(adata.obs).to_csv("metadata.csv")

# R

library(Seurat)

message("Reading counts...")
x <- read.csv("raw-counts.csv",header=TRUE)
rownames(x) <- x[,1]
x[,1] <- NULL
print(dim(x))
print(x[1:5,1:5])

message("Reading metadata...")
m <- read.csv("metadata.csv",header=TRUE)
rownames(m) <- m[,1]
colnames(m)[1] <- "sample"
print(dim(m))
print(head(m))

message("Writing seurat object...")
saveRDS(
  CreateSeuratObject(counts=t(x),meta.data=m,project="extended",min.cells=0,min.features=0),
  "seurat.Rds"
)

@jenellewallace

@royfrancis Thanks for your response! I ended up being able to just download the raw data I needed from a public repository in R format, skipping the h5ad step. Hope this issue will be fixed soon!

@lazappi
Member

lazappi commented May 22, 2023

Was this issue ever solved? I have the same problem:

sce = readH5AD(file, use_hdf5=TRUE)
Warning: 'X' matrix does not support transposition and has been skipped
ℹ Using stored X_name value 'counts'

And the resulting counts matrix is all zeroes.

@jenellewallace It seems like you came up with a solution for your case, but for anybody else: we made updates to the R reader in the v1.10.0 release which should help with these kinds of issues (and it should be further improved in the next release).
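
For anyone hitting this now, a sketch of that route, assuming zellkonverter >= 1.10 is installed (argument behaviour may differ between versions):

library(zellkonverter)

# The native R reader avoids round-tripping through Python/basilisk and
# should sidestep the in-memory transpose of X that triggers this warning.
sce <- readH5AD(file, use_hdf5 = TRUE, reader = "R")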

@jgilis

jgilis commented May 22, 2023

Hi everyone,

I would like to jump in on this. I am working with the dataset from GSE174188 (GSE174188_CLUES1_adjusted.h5ad.gz).

In R 4.1.3 with zellkonverter 1.4.0, I was able to simply obtain the data with

lupus <- readH5AD(file=lupus_file, use_hdf5 = TRUE, raw = TRUE, verbose = TRUE)

However, after switching to R 4.2.0 with zellkonverter 1.8.0, the same code leads to the following error:

Error in validObject(.Object) :
 invalid class “SummarizedExperiment” object:
  'x@assays' is not parallel to 'x'
Calls: readH5AD ... new_SummarizedExperiment -> new -> initialize -> initialize -> validObject
In addition: Warning messages:
1: raw 'X' matrix does not support transposition and has been skipped
2: The names of these selected raw var columns have been modified to match R
conventions:
'feature_types-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0'
->
'feature_types.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0'
Execution halted

It turns out that the warning "raw 'X' matrix does not support transposition and has been skipped" is the underlying cause of the assay and rowData dimensions being mismatched, which ultimately results in SummarizedExperiment complaining.

In light of this thread, I also tried zellkonverter 1.11.0, but this did not resolve my issues.

@jgilis

jgilis commented May 22, 2023

> sessionInfo()
R version 4.2.0 alpha (2022-04-04 r82084)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods  
[8] base     

other attached packages:
 [1] dplyr_1.1.2                 scuttle_1.8.4              
 [3] SingleCellExperiment_1.20.1 SummarizedExperiment_1.28.0
 [5] Biobase_2.58.0              GenomicRanges_1.50.2       
 [7] GenomeInfoDb_1.34.9         IRanges_2.32.0             
 [9] S4Vectors_0.36.2            BiocGenerics_0.44.0        
[11] MatrixGenerics_1.10.0       matrixStats_0.63.0         
[13] reticulate_1.28             zellkonverter_1.11.0       
[15] basilisk.utils_1.10.0       basilisk_1.10.2            

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10               here_1.0.1               
 [3] dir.expiry_1.6.0          lattice_0.21-8           
 [5] png_0.1-8                 rprojroot_2.0.3          
 [7] digest_0.6.31             utf8_1.2.3               
 [9] R6_2.5.1                  evaluate_0.20            
[11] pillar_1.9.0              sparseMatrixStats_1.10.0 
[13] zlibbioc_1.44.0           rlang_1.1.0              
[15] rstudioapi_0.14           Matrix_1.5-4             
[17] rmarkdown_2.21            BiocParallel_1.32.6      
[19] RCurl_1.98-1.12           beachmat_2.14.2          
[21] HDF5Array_1.26.0          DelayedArray_0.24.0      
[23] compiler_4.2.0            xfun_0.39                
[25] pkgconfig_2.0.3           htmltools_0.5.5          
[27] tidyselect_1.2.0          tibble_3.2.1             
[29] GenomeInfoDbData_1.2.9    codetools_0.2-19         
[31] fansi_1.0.4               withr_2.5.0              
[33] rhdf5filters_1.10.1       bitops_1.0-7             
[35] grid_4.2.0                jsonlite_1.8.4           
[37] lifecycle_1.0.3           magrittr_2.0.3           
[39] cli_3.6.1                 XVector_0.38.0           
[41] renv_0.17.3               fs_1.6.2                 
[43] DelayedMatrixStats_1.20.0 filelock_1.0.2           
[45] generics_0.1.3            vctrs_0.6.2              
[47] Rhdf5lib_1.20.0           tools_4.2.0              
[49] forcats_1.0.0             glue_1.6.2               
[51] parallel_4.2.0            fastmap_1.1.1            
[53] yaml_2.3.7                rhdf5_2.42.1             
[55] BiocManager_1.30.20       knitr_1.42 
> env <- zellkonverterAnnDataEnv(version)
> env
An object of class "BasiliskEnvironment"
Slot "envname":
[1] "zellkonverterAnnDataEnv-0.8.0"

Slot "pkgname":
[1] "zellkonverter"

Slot "packages":
 [1] "anndata==0.8.0"  "h5py==3.6.0"     "hdf5==1.12.1"    "natsort==8.1.0" 
 [5] "numpy==1.22.3"   "packaging==21.3" "pandas==1.4.2"   "python==3.8.13" 
 [9] "scipy==1.7.3"    "sqlite==3.38.2" 

Slot "channels":
[1] "conda-forge"

Slot "pip":
character(0)

Slot "paths":
character(0)

@jgilis

jgilis commented May 22, 2023

Note that I am still able to load the data with older versions on the same machine, so I do not expect memory issues to be the culprit here...

One more hint: it seems that this call

raw_x <- zellkonverter:::.extract_or_skip_assay(skip_assays = skip_assays, 
    hdf5_backed = hdf5_backed, dims = dims, mat = adata$raw$X, 
    name = "raw 'X' matrix")
dim(raw_x$mat)

results in a 32K x 1.2M matrix in the old versions, but now only returns a 1999 x 1.2M matrix.

Let me know if I can provide any more info to resolve this!

@lazappi
Member

lazappi commented May 23, 2023

@jgilis I think this is a separate issue so I have moved it to #96
