Feature/cells argument #170
base: master
deployment
GitHub issue #132
I was trying to make the code of this branch work and I observed an unexpected behaviour with your choice of 'join_asof' to correct cell barcodes:
In [1]: import polars as pl
In [2]: barcodes_df = pl.DataFrame({'barcode': ["AACATATTCTTTACTG", "TAAAGGGAAGTCAAGC", "TAAATATTCTTTACTG", "TACATATTCTTTACTG",
...: "TAGAGCGAAGTCAAGC", "TAGAGGGAAGTCAAGC"], 'count': [1, 1, 1, 98, 1, 98]})
In [3]: barcode_subset_df = pl.DataFrame({'whitelist': ['TACATATTCTTTACTG', 'TAGAGGGAAGTCAAGC']})
In [4]: barcode_subset_df = barcode_subset_df.with_columns(
...: reference=pl.col("whitelist"))
In [5]: barcode_subset_df
Out[5]:
shape: (2, 2)
┌──────────────────┬──────────────────┐
│ whitelist ┆ reference │
│ --- ┆ --- │
│ str ┆ str │
╞══════════════════╪══════════════════╡
│ TACATATTCTTTACTG ┆ TACATATTCTTTACTG │
│ TAGAGGGAAGTCAAGC ┆ TAGAGGGAAGTCAAGC │
└──────────────────┴──────────────────┘
In [6]: BARCODE_COLUMN = 'barcode'
In [7]: WHITELIST_COLUMN = 'whitelist'
In [8]: temp1 = barcodes_df.sort(BARCODE_COLUMN).join_asof(
...: barcode_subset_df.sort(WHITELIST_COLUMN),
...: left_on=BARCODE_COLUMN,
...: right_on=WHITELIST_COLUMN,
...: )
In [9]: temp1
Out[9]:
shape: (6, 4)
┌──────────────────┬───────┬──────────────────┬──────────────────┐
│ barcode ┆ count ┆ whitelist ┆ reference │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ str │
╞══════════════════╪═══════╪══════════════════╪══════════════════╡
│ AACATATTCTTTACTG ┆ 1 ┆ null ┆ null │
│ TAAAGGGAAGTCAAGC ┆ 1 ┆ null ┆ null │
│ TAAATATTCTTTACTG ┆ 1 ┆ null ┆ null │
│ TACATATTCTTTACTG ┆ 98 ┆ TACATATTCTTTACTG ┆ TACATATTCTTTACTG │
│ TAGAGCGAAGTCAAGC ┆ 1 ┆ TACATATTCTTTACTG ┆ TACATATTCTTTACTG │
│ TAGAGGGAAGTCAAGC ┆ 98 ┆ TAGAGGGAAGTCAAGC ┆ TAGAGGGAAGTCAAGC │
└──────────────────┴───────┴──────────────────┴──────────────────┘
What I would expect:
┌──────────────────┬───────┬──────────────────┬──────────────────┐
│ barcode ┆ count ┆ whitelist ┆ reference │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ str │
╞══════════════════╪═══════╪══════════════════╪══════════════════╡
│ AACATATTCTTTACTG ┆ 1 ┆ TACATATTCTTTACTG ┆ TACATATTCTTTACTG │
│ TAAAGGGAAGTCAAGC ┆ 1 ┆ TAGAGGGAAGTCAAGC ┆ TAGAGGGAAGTCAAGC │
│ TAAATATTCTTTACTG ┆ 1 ┆ TACATATTCTTTACTG ┆ TACATATTCTTTACTG │
│ TACATATTCTTTACTG ┆ 98 ┆ TACATATTCTTTACTG ┆ TACATATTCTTTACTG │
│ TAGAGCGAAGTCAAGC ┆ 1 ┆ TAGAGGGAAGTCAAGC ┆ TAGAGGGAAGTCAAGC │
│ TAGAGGGAAGTCAAGC ┆ 98 ┆ TAGAGGGAAGTCAAGC ┆ TAGAGGGAAGTCAAGC │
└──────────────────┴───────┴──────────────────┴──────────────────┘
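The difference between the two tables can be reproduced without polars. The sketch below is only an illustration: it assumes that the observed result comes from a backward match on sort order (the largest subset barcode that is less than or equal to the query), and that the expected result is the unique subset barcode within one mismatch (Hamming distance). The helper names `backward_asof`, `hamming` and `nearest_by_hamming` are hypothetical and are not part of this branch.

```python
from bisect import bisect_right


def backward_asof(key, sorted_refs):
    """Return the largest reference <= key in lexicographic order, or None."""
    i = bisect_right(sorted_refs, key)
    return sorted_refs[i - 1] if i else None


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def nearest_by_hamming(barcode, refs, max_distance=1):
    """Return the unique reference within max_distance mismatches, else None."""
    hits = [r for r in refs if hamming(barcode, r) <= max_distance]
    return hits[0] if len(hits) == 1 else None


barcodes = [
    "AACATATTCTTTACTG", "TAAAGGGAAGTCAAGC", "TAAATATTCTTTACTG",
    "TACATATTCTTTACTG", "TAGAGCGAAGTCAAGC", "TAGAGGGAAGTCAAGC",
]
subset = sorted(["TACATATTCTTTACTG", "TAGAGGGAAGTCAAGC"])

# Sort-order matching: reproduces the nulls and the wrong assignment above.
asof_matches = {b: backward_asof(b, subset) for b in barcodes}
# Mismatch-based matching: reproduces the expected table.
hamming_matches = {b: nearest_by_hamming(b, subset) for b in barcodes}
```

Sort order only looks at the first differing position, so a barcode with an early substitution can land far from its true neighbour, whereas the Hamming distance treats every position equally. The all-pairs comparison here is quadratic and would likely need a smarter strategy for full barcode lists.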
Good catch! I should have started by writing tests! I can go back to a simpler but slightly slower implementation or find a better way using the asof join.
@lldelisle I rewrote Again, thank you for catching this so early!
I've turned off UMI correction for now just to see runs go through completely. The path without cell reference and whitelist should be working. I still have to write tests for the MTX outputs.
…g using asof_join
Task details
Rewrite UMI correction in polars
The current version of CSC uses umi_tools.network.UMIClusterer() to go through each list of UMIs per cell per feature and handle the potential UMI corrections needed. The simple implementation in polars is to use map_elements, but this is not optimized as it does not use the polars infrastructure. There is a big potential for improvement if this step can be rewritten entirely in polars.
Status on branch
UMI correction is skipped at the moment, no function available.
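To make the per-cell, per-feature step concrete, here is a simplified pure-Python sketch of count-aware UMI clustering. It is loosely modelled on the directional rule associated with UMI-tools (a UMI absorbs a neighbour within one mismatch when its count is at least twice the neighbour's count minus one), but it is an assumption-laden illustration and not the UMIClusterer implementation. The function name `cluster_umis` and the toy counts are hypothetical.

```python
from collections import deque


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_umis(counts, max_distance=1):
    """Map every UMI to a representative UMI (simplified directional rule)."""
    # Visit UMIs from most to least abundant; ties are broken alphabetically.
    order = sorted(counts, key=lambda u: (-counts[u], u))
    assigned = {}
    for root in order:
        if root in assigned:
            continue
        assigned[root] = root
        queue = deque([root])
        while queue:
            parent = queue.popleft()
            for child in order:
                if child in assigned:
                    continue
                close = hamming(parent, child) <= max_distance
                dominant = counts[parent] >= 2 * counts[child] - 1
                if close and dominant:
                    assigned[child] = root
                    queue.append(child)
    return assigned


# Toy counts for one cell and one feature (made-up values).
counts = {"AAAA": 10, "AAAT": 2, "AATT": 1, "CCCC": 5}
corrected = cluster_umis(counts)
n_unique_umis = len(set(corrected.values()))
```

Calling such a function through map_elements would still run Python once per group, which is the bottleneck described above; a native version would presumably have to express the neighbour search as joins or expressions instead.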
Rewrite fastq reading in polars
The current version of CSC reads in the fastq files and then writes out a big csv, which we read using polars. Fastq files are basically text files with 4 lines per read. We can rewrite the input step to read fastq files directly and store them in a dataframe. This reduces I/O operations and should be faster as well. It would also allow CSC to be extended to filter reads by quality.
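As a rough sketch of the idea, the four-line record structure can be parsed straight into column lists without an intermediate csv. This uses only the standard library; the function name `read_fastq_columns` and the column names are hypothetical, and the two records are made-up examples. It assumes unwrapped records of exactly four lines each.

```python
import io
from itertools import islice


def read_fastq_columns(handle):
    """Parse 4-line FASTQ records into a dict of column lists."""
    columns = {"name": [], "sequence": [], "quality": []}
    while True:
        # Take the next four lines: header, sequence, separator, quality.
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            break
        if (
            len(record) != 4
            or not record[0].startswith("@")
            or not record[2].startswith("+")
        ):
            raise ValueError("malformed FASTQ record")
        columns["name"].append(record[0][1:])
        columns["sequence"].append(record[1])
        columns["quality"].append(record[3])
    return columns


example = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nTTGCAAGC\n+\nFFFFFFFF\n"
)
fastq_columns = read_fastq_columns(example)
```

A dict of equal-length lists like this can be handed to pl.DataFrame, and keeping the quality column is what would make quality-based filtering possible later. For real gzipped files the handle would come from gzip.open in text mode.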
Status on branch
io.write_mapping_input is the function that reads the fastqs and writes the csv to be read later. Then preprocessing.split_data_input reads the csv file and generates the dataframes necessary for processing. The idea would be to skip the intermediate step by just reading the fastqs directly into the necessary dataframes.
Disambiguation of whitelist and reference
Currently CSC uses terms such as whitelist and reference to distinguish a short handpicked list from users and the whole world of barcodes. But historically, reference files have also been called whitelist, which makes it confusing. I would like to change the language to reference_subset and reference to make it clearer that the first one is a subset of the second one.
Status on branch
Delete any mention of whitelist and replace it by subset.
I'm going to deal with the last two tasks.