Possible issue with uint8 in bmat #19
Comments
Hi,
Thanks for reporting this. I was intentionally capping the max value to 255. The reason is two-fold:
1) We found that an extremely small portion (less than 0.1%) of the items in the matrix have a count larger than 10 or 50. Given that there are only two copies of the genome in a normal cell (not considering copy number variation or the chrM sequence), items with values larger than 100 are very likely due to alignment errors from repetitive sequences.
2) The downstream analysis converts the count matrix to a binary matrix anyway, so the absolute count value is not considered downstream (see the sketch below).
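For context, here is a minimal sketch of the binarization step mentioned in point 2. This is an illustration only, not SnapATAC's actual implementation; X stands in for a sparse cell-by-bin count matrix with made-up values.
import numpy as np
import scipy.sparse as sp
# toy cell-by-bin count matrix (shape and values are hypothetical)
X = sp.csc_matrix(np.array([[0, 3, 120],
                            [255, 1, 0]], dtype=np.uint8))
X.data = (X.data > 0).astype(np.uint8)  # every nonzero count becomes 1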
--
Rongxin Fang
Ph.D. Student, Ren Lab
Ludwig Institute for Cancer Research
University of California, San Diego
On Sep 26, 2019, at 7:38 AM, Davide Cittaro wrote:
I'm trying to parse snap files into Python sparse matrices. This is what I'm doing:
import numpy as np
import h5py
import snaptools.utilities
import scipy.sparse as sp
f = h5py.File(myfile, 'r')  # myfile: path to the .snap file
n_cells = len(f['BD/name'])  # number of barcodes
bin_dict = snaptools.utilities.getBinsFromGenomeSize(genome_dict, bin_size)  # from snaptools code
n_bins = len(bin_dict)
idy = f['AM/5000/idy'][:]  # bin indices of the 5 kb cell-by-bin matrix
idx = np.arange(n_cells + 1)  # column pointers for the CSC matrix
data = f['AM/5000/count'][:]  # counts, stored as uint8 in the snap file
X = sp.csc_matrix((data, idy, idx), shape=(n_bins, n_cells))
Everything seems to work, but I've noticed two things:
1) my data are capped at 255
2) there are many more zeros than I previously found with another method (outside snaptools)
As for the second, I thought I might simply be counting wrong, but reading the snap file internals I realized that counts are saved as uint8, which explains the capping at 255. The problem is that at line 55 of add_bmat.py the counter is a generic Python integer
bins = collections.defaultdict(lambda : 0);
which is then cast to uint8 at write time (line 79):
f.create_dataset("AM/"+str(bin_size)+"/count", data=countList[bin_size], dtype="uint8", compression="gzip", compression_opts=9);
This causes the values to wrap around, i.e. they are stored as X % 256. I don't know whether standard scATAC experiments expect read counts per bin to stay below 255, but this is not my case.
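To illustrate the wraparound (my own example, not code from add_bmat.py): casting arbitrary integer counts to uint8 with NumPy silently truncates them modulo 256.
import numpy as np
counts = [10, 255, 256, 300]  # hypothetical per-bin counts
wrapped = np.asarray(counts).astype(np.uint8)  # C-style cast, wraps modulo 256
print(wrapped)  # [ 10 255   0  44]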
I see. Still, if you have a bin counting 256, it would be set to 0, even after binarization. If 255 is meant to be the maximum value, any bin should be capped before writing the snap object.
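A minimal sketch of the kind of clipping being suggested (an assumption on my part, not an actual patch; countList, bin_size and f are the names used in add_bmat.py as quoted above):
import numpy as np
# clip counts to the uint8 range before writing, so 256+ becomes 255 instead of wrapping
capped = np.clip(countList[bin_size], 0, 255).astype(np.uint8)
f.create_dataset("AM/" + str(bin_size) + "/count", data=capped, dtype="uint8", compression="gzip", compression_opts=9)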
I see your point. I agree with you that the values should be capped before binarizing, and I will try to change it. Meanwhile, this won't affect the standard downstream analysis (unless you are using the counts for another purpose), because items in the matrix with values larger than about 100 are usually removed and set to 0. But again, I agree with you that this is an issue that needs to be fixed. Thanks for reporting it.
Best,
--
Rongxin Fang
Ph.D. Student, Ren Lab
Ludwig Institute for Cancer Research
University of California, San Diego