Possible issue with uint8 in bmat #19
Comments
Hi,
Thanks for reporting this. I was intentionally capping the max value to 255. The reason is two-fold:
1) We found that an extremely small portion (less than 0.1%) of the items in the matrix have a count larger than 10 or 50. Given that there are only two copies of the genome in a normal cell (not considering copy number variation or the chrM sequence), items with values larger than 100 are very likely due to alignment errors from repetitive sequences.
2) The downstream analysis converts the count matrix to a binary matrix anyway, so the absolute count value is not considered downstream (see the sketch below).
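For context, here is a minimal sketch of the binarization step mentioned in point 2. This is an illustration only, not SnapATAC's actual implementation; X stands in for a sparse cell-by-bin count matrix with made-up values.
import numpy as np
import scipy.sparse as sp
# toy cell-by-bin count matrix (shape and values are hypothetical)
X = sp.csc_matrix(np.array([[0, 3, 120],
                            [255, 1, 0]], dtype=np.uint8))
X.data = (X.data > 0).astype(np.uint8)  # every nonzero count becomes 1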
--
Rongxin Fang
Ph.D. Student, Ren Lab
Ludwig Institute for Cancer Research
University of California, San Diego
On Sep 26, 2019, at 7:38 AM, Davide Cittaro wrote:
I'm trying to parse snap files into Python sparse matrices. This is what I'm doing:
import numpy as np
import h5py
import snaptools.utilities
import scipy.sparse as sp
f = h5py.File(myfile, 'r')  # myfile: path to the .snap file
n_cells = len(f['BD/name'])  # number of barcodes
bin_dict = snaptools.utilities.getBinsFromGenomeSize(genome_dict, bin_size)  # from snaptools code
n_bins = len(bin_dict)
idy = f['AM/5000/idy'][:]  # bin indices of the 5 kb cell-by-bin matrix
idx = np.arange(n_cells + 1)  # column pointers for the CSC matrix
data = f['AM/5000/count'][:]  # counts, stored as uint8 in the snap file
X = sp.csc_matrix((data, idy, idx), shape=(n_bins, n_cells))
Everything seems to work, but I've noticed two things:
1) my data are capped at 255
2) there are many more zeros than I previously found with another method (outside snaptools)
As for the second, I thought I might simply be counting wrong, but reading the snap file internals I realized that counts are saved as uint8, which explains the capping at 255. The problem is that at line 55 of add_bmat.py the counter is a generic Python integer
bins = collections.defaultdict(lambda : 0);
which is then cast to uint8 at write time (line 79):
f.create_dataset("AM/"+str(bin_size)+"/count", data=countList[bin_size], dtype="uint8", compression="gzip", compression_opts=9);
This causes the values to wrap around, i.e. they are stored as X % 256. I don't know whether standard scATAC experiments expect read counts per bin to stay below 255, but this is not my case.
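To illustrate the wraparound (my own example, not code from add_bmat.py): casting arbitrary integer counts to uint8 with NumPy silently truncates them modulo 256.
import numpy as np
counts = [10, 255, 256, 300]  # hypothetical per-bin counts
wrapped = np.asarray(counts).astype(np.uint8)  # C-style cast, wraps modulo 256
print(wrapped)  # [ 10 255   0  44]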
I see. Still, if you have a bin counting 256, it would be set to 0, even after binarization. If 255 is meant to be the maximum value, any bin should be capped before writing the snap object.
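A minimal sketch of the kind of clipping being suggested (an assumption on my part, not an actual patch; countList, bin_size and f are the names used in add_bmat.py as quoted above):
import numpy as np
# clip counts to the uint8 range before writing, so 256+ becomes 255 instead of wrapping
capped = np.clip(countList[bin_size], 0, 255).astype(np.uint8)
f.create_dataset("AM/" + str(bin_size) + "/count", data=capped, dtype="uint8", compression="gzip", compression_opts=9)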
I see your point. I agree with you that the values should be capped before binarizing, and I will try to change it. Meanwhile, this won't affect the standard downstream analysis (unless you are using the counts for another purpose), because items in the matrix with values larger than about 100 are usually removed and set to 0. But again, I agree with you that this is an issue that needs to be fixed. Thanks for reporting it.
Best,
--
Rongxin Fang
Ph.D. Student, Ren Lab
Ludwig Institute for Cancer Research
University of California, San Diego