the count matrices don't contain counts #133

Open
jkobject opened this issue Jan 31, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@jkobject

Hello,

I see that the documentation for the benchmark datasets says the "counts" layer is taken from these datasets. However, when I look at this layer, the values are floats rather than ints, which to me means they are not counts.

The tool I want to benchmark only takes count matrices.

How should I get the count data?

Version information

No response

@jkobject jkobject added the bug Something isn't working label Jan 31, 2024
@jkobject
Author

jkobject commented Feb 5, 2024

To verify that they don't contain counts, run:

import scanpy as sc

# Download (if needed) and load the lung atlas benchmark dataset
adata = sc.read(
    "data/lung_atlas.h5ad",
    backup_url="https://figshare.com/ndownloader/files/24539942",
)
# If the layer held only integer counts, the sum would be a whole number
adata.layers['counts'].sum()

The same applies to the pancreas dataset.
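A more direct check than the sum is to look for stored entries with a fractional part — a minimal sketch, assuming adata is loaded as above and the layer is either a SciPy sparse matrix or a dense array:

import numpy as np
import scipy.sparse as sp

# Check the stored values themselves rather than the dtype
counts = adata.layers["counts"]
vals = counts.data if sp.issparse(counts) else np.asarray(counts).ravel()
n_frac = int(np.count_nonzero(vals % 1))
print(f"{n_frac} of {vals.size} stored entries have a fractional part")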

@adamgayoso
Member

Just having a float dtype does not imply that they are not count data. Most datasets are stored in a float32 format.

For that particular dataset, I would encourage you to read the original scib paper methods section.

If you're using a tool like scVI, it would technically work on data with decimals (e.g., 1.03). The question is whether the non-integer data are meant to represent count data. For example, pseudoaligners can provide probabilistic count values.
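If the decimals came only from probabilistic quantification (e.g., pseudoaligner estimates) and not from depth normalization, rounding to the nearest integer is one possible workaround for a count-only tool — a minimal sketch under that assumption, with adata loaded as above:

import numpy as np
import scipy.sparse as sp

# Only meaningful if the decimals represent probabilistic counts,
# not RPKM/TPM or otherwise depth-normalized values
counts = adata.layers["counts"].copy()
if sp.issparse(counts):
    counts.data = np.rint(counts.data)
else:
    counts = np.rint(counts)
adata.layers["counts_rounded"] = counts  # hypothetical layer name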

@jkobject
Copy link
Author

jkobject commented Feb 5, 2024

Hello Adam,

Thanks for the reply. I understand that even raw counts are often stored as float32, but here I see that some of the datasets included in this combined dataset have values that are not raw counts (i.e., values with decimals).

I have not worked with probabilistic raw counts before. Are you saying that this is the reason why most of the 10x samples have decimal values?

Reading the methods section, it says that some datasets were unavailable as raw counts, so RPKM or TPM values were used instead. Does that mean the counts layer also contains normalized data?

I am not sure how to proceed if the data is depth-normalized. I am working with my own model, which assumes that the counts are true counts.
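One way to narrow this down would be to check each source study separately — a sketch, assuming the combined object carries a per-study column in adata.obs (the column name "batch" used here is a guess and may differ):

import numpy as np
import scipy.sparse as sp

batch_key = "batch"  # assumed column identifying the source study
counts = adata.layers["counts"]
for batch in adata.obs[batch_key].unique():
    mask = (adata.obs[batch_key] == batch).to_numpy()
    sub = counts[mask]
    vals = sub.data if sp.issparse(sub) else np.asarray(sub).ravel()
    n_frac = int(np.count_nonzero(vals % 1))
    print(f"{batch}: {n_frac} of {vals.size} stored entries are non-integer")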
