Feature/cells argument #170

Hoohm · 2022-07-03T13:34:42Z

Rewrite data chunking
Rewrite loading of CSVs with polars
Rewrite Mapping in polars
Rewrite barcode correction in polars
Rewrite UMI correction in polars
Rewrite fastq reading in polars
Disambiguation of whitelist and reference
Generate parquet outputs
Deprecate csv outputs

Tasks details

Rewrite UMI correction in polars

The current version of CSC uses umi_tools.network.UMIClusterer() to go through the list of UMIs for each cell and feature and apply any UMI corrections needed. The simplest implementation in polars is to use map_elements, but this is not optimized because it calls back into Python instead of using the polars engine. There is big potential for improvement if this step can be rewritten entirely in polars.
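
For reference, here is a minimal sketch of the kind of per-cell, per-feature function that would be called once per UMI list. It is a simplified greedy collapse in plain Python, not the umi_tools algorithm and not code from this branch; `hamming`, `collapse_umis` and `max_distance` are hypothetical names used only for illustration.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(umi_counts: dict, max_distance: int = 1) -> dict:
    """Map every UMI to a representative UMI (simplified, greedy).

    UMIs are visited from most to least abundant. A UMI is merged into the
    first already-kept UMI within `max_distance`; otherwise it is kept as a
    new representative. This is an assumption-laden stand-in, not the
    clustering performed by umi_tools.
    """
    kept = []
    corrected = {}
    ordered = sorted(umi_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for umi, _count in ordered:
        parent = next((k for k in kept if hamming(umi, k) <= max_distance), None)
        if parent is None:
            kept.append(umi)
            parent = umi
        corrected[umi] = parent
    return corrected


# UMIs observed for one cell and one feature (made-up data).
counts = Counter({"AAAA": 10, "AAAT": 1, "CCCC": 5, "CCCG": 1, "GGGG": 1})
mapping = collapse_umis(counts)
n_unique = len(set(mapping.values()))
```

Because a function like this runs in Python for every group, it cannot benefit from polars' vectorised execution, which is the motivation for a polars-only rewrite.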

Status on branch

UMI correction is skipped at the moment, no function available.

Rewrite fastq reading in polars

The current version of CSC reads the fastq files and writes out a big csv, which we then read using polars. Fastq files are basically text files with 4 lines per read. We can rewrite the input step to read fastq files directly and store them in a dataframe. This reduces I/O operations and should be faster as well. It would also allow extending CSC to filter reads by quality.
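
To make the "4 lines per read" layout concrete, here is a small sketch in plain Python that turns fastq text into column lists, the shape a dataframe constructor would accept. It is not the reader on this branch; `read_fastq_columns` and the column names are assumptions for illustration.

```python
import io
from itertools import islice


def read_fastq_columns(handle) -> dict:
    """Read fastq records (4 lines each) into a dict of column lists."""
    columns = {"read_id": [], "sequence": [], "quality": []}
    while True:
        # Take the next 4 lines: header, sequence, separator, quality.
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            break  # end of file
        if (
            len(record) != 4
            or not record[0].startswith("@")
            or not record[2].startswith("+")
        ):
            raise ValueError("malformed fastq record")
        header, sequence, _separator, quality = record
        columns["read_id"].append(header[1:])
        columns["sequence"].append(sequence)
        columns["quality"].append(quality)
    return columns


# Two made-up reads, read from an in-memory handle instead of a file.
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nII#I\n"
columns = read_fastq_columns(io.StringIO(fastq_text))
```

Keeping the quality string as its own column is what would make quality-based filtering possible later on.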

Status on branch

io.write_mapping_input is the function that reads the fastqs and writes the csv to be read later. Then preprocessing.split_data_input reads the csv file and generates the dataframes necessary for processing. The idea would be to skip the intermediate step by just reading the fastqs directly into the necessary dataframes.

Disambiguation of whitelist and reference

Currently CSC uses the terms whitelist and reference to distinguish a short list handpicked by the user from the full set of possible barcodes. But historically, reference files have also been called whitelists, which is confusing. I would like to change the terminology to reference_subset and reference to make it clearer that the first is a subset of the second.

Status on branch

Delete every mention of whitelist and replace it with subset.

I'm going to deal with the last two tasks.

lldelisle · 2023-11-22T16:23:31Z

I was trying to get the code on this branch to work and observed unexpected behaviour with your choice of 'join_asof' to correct cell barcodes:

In [1]: import polars as pl

In [2]: barcodes_df = pl.DataFrame({'barcode': ["AACATATTCTTTACTG", "TAAAGGGAAGTCAAGC", "TAAATATTCTTTACTG", "TACATATTCTTTACTG",
   ...: "TAGAGCGAAGTCAAGC", "TAGAGGGAAGTCAAGC"], 'count': [1, 1, 1, 98, 1, 98]})

In [3]: barcode_subset_df = pl.DataFrame({'whitelist': ['TACATATTCTTTACTG', 'TAGAGGGAAGTCAAGC']})

In [4]: barcode_subset_df = barcode_subset_df.with_columns(
   ...:             reference=pl.col("whitelist"))

In [5]: barcode_subset_df
Out[5]: 
shape: (2, 2)
┌──────────────────┬──────────────────┐
│ whitelist        ┆ reference        │
│ ---              ┆ ---              │
│ str              ┆ str              │
╞══════════════════╪══════════════════╡
│ TACATATTCTTTACTG ┆ TACATATTCTTTACTG │
│ TAGAGGGAAGTCAAGC ┆ TAGAGGGAAGTCAAGC │
└──────────────────┴──────────────────┘

In [6]: BARCODE_COLUMN = 'barcode'

In [7]: WHITELIST_COLUMN = 'whitelist'

In [8]: temp1 = barcodes_df.sort(BARCODE_COLUMN).join_asof(
   ...:             barcode_subset_df.sort(WHITELIST_COLUMN),
   ...:             left_on=BARCODE_COLUMN,
   ...:             right_on=WHITELIST_COLUMN,
   ...:         )

In [9]: temp1
Out[9]: 
shape: (6, 4)
┌──────────────────┬───────┬──────────────────┬──────────────────┐
│ barcode          ┆ count ┆ whitelist        ┆ reference        │
│ ---              ┆ ---   ┆ ---              ┆ ---              │
│ str              ┆ i64   ┆ str              ┆ str              │
╞══════════════════╪═══════╪══════════════════╪══════════════════╡
│ AACATATTCTTTACTG ┆ 1     ┆ null             ┆ null             │
│ TAAAGGGAAGTCAAGC ┆ 1     ┆ null             ┆ null             │
│ TAAATATTCTTTACTG ┆ 1     ┆ null             ┆ null             │
│ TACATATTCTTTACTG ┆ 98    ┆ TACATATTCTTTACTG ┆ TACATATTCTTTACTG │
│ TAGAGCGAAGTCAAGC ┆ 1     ┆ TACATATTCTTTACTG ┆ TACATATTCTTTACTG │
│ TAGAGGGAAGTCAAGC ┆ 98    ┆ TAGAGGGAAGTCAAGC ┆ TAGAGGGAAGTCAAGC │
└──────────────────┴───────┴──────────────────┴──────────────────┘

What I would expect:

┌──────────────────┬───────┬──────────────────┬──────────────────┐
│ barcode          ┆ count ┆ whitelist        ┆ reference        │
│ ---              ┆ ---   ┆ ---              ┆ ---              │
│ str              ┆ i64   ┆ str              ┆ str              │
╞══════════════════╪═══════╪══════════════════╪══════════════════╡
│ AACATATTCTTTACTG ┆ 1     ┆ TACATATTCTTTACTG ┆ TACATATTCTTTACTG │
│ TAAAGGGAAGTCAAGC ┆ 1     ┆ TAGAGGGAAGTCAAGC ┆ TAGAGGGAAGTCAAGC │
│ TAAATATTCTTTACTG ┆ 1     ┆ TACATATTCTTTACTG ┆ TACATATTCTTTACTG │
│ TACATATTCTTTACTG ┆ 98    ┆ TACATATTCTTTACTG ┆ TACATATTCTTTACTG │
│ TAGAGCGAAGTCAAGC ┆ 1     ┆ TAGAGGGAAGTCAAGC ┆ TAGAGGGAAGTCAAGC │
│ TAGAGGGAAGTCAAGC ┆ 98    ┆ TAGAGGGAAGTCAAGC ┆ TAGAGGGAAGTCAAGC │
└──────────────────┴───────┴──────────────────┴──────────────────┘
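
The difference between the two tables can be reproduced without polars. The sketch below assumes that the default asof strategy behaves like a backward lookup on lexicographically sorted keys (which matches Out[9]) and compares it with a nearest-reference lookup by Hamming distance (which matches the expected table). `backward_asof` and `nearest_by_hamming` are hypothetical helpers, not functions from CSC.

```python
from bisect import bisect_right


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def backward_asof(key: str, sorted_refs: list):
    """Largest reference that is <= key in lexicographic order, else None."""
    i = bisect_right(sorted_refs, key)
    return sorted_refs[i - 1] if i else None


def nearest_by_hamming(key: str, refs: list, max_distance: int = 1):
    """Reference with the smallest Hamming distance, if close enough."""
    best = min(refs, key=lambda r: hamming(key, r))
    return best if hamming(key, best) <= max_distance else None


barcodes = [
    "AACATATTCTTTACTG", "TAAAGGGAAGTCAAGC", "TAAATATTCTTTACTG",
    "TACATATTCTTTACTG", "TAGAGCGAAGTCAAGC", "TAGAGGGAAGTCAAGC",
]
references = sorted(["TACATATTCTTTACTG", "TAGAGGGAAGTCAAGC"])

# Lexicographic neighbour (what the sorted asof join effectively returns).
observed = {b: backward_asof(b, references) for b in barcodes}
# Closest reference by number of mismatches (what barcode correction needs).
expected = {b: nearest_by_hamming(b, references) for b in barcodes}
```

Sorting order and Hamming distance only agree when the mismatch is near the end of the barcode, which is why a mismatch in the first few bases sends the asof join to the wrong reference or to none at all.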

Hoohm · 2023-11-22T18:21:06Z

Good catch! Should have started with writing tests!

I can go back to a simpler but slightly slower implementation or find a better way using the asof join.

Hoohm · 2023-12-28T10:10:40Z

@lldelisle I rewrote correct_barcodes_pl and added some tests, including the ones you ran. I think this time it works. Basically, it runs the asof join twice, once with each strategy, and only keeps the matches that have a close enough Hamming distance.

Again, thank you for catching this so early!
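
As an illustration of the two-pass idea described above, here is a sketch that emulates a backward and a forward lookup with bisect and then filters the candidates by Hamming distance. It is not the code of correct_barcodes_pl; `correct_barcode` and `max_distance` are hypothetical names, and the backward/forward pairing is an assumption about what "both strategies" means.

```python
from bisect import bisect_left, bisect_right


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(barcode: str, sorted_refs: list, max_distance: int = 1):
    """Pick the closer of the two lexicographic neighbours, if close enough."""
    candidates = []
    i = bisect_right(sorted_refs, barcode)
    if i:
        candidates.append(sorted_refs[i - 1])  # backward neighbour
    j = bisect_left(sorted_refs, barcode)
    if j < len(sorted_refs):
        candidates.append(sorted_refs[j])  # forward neighbour
    close = [c for c in candidates if hamming(barcode, c) <= max_distance]
    # Barcodes whose nearest reference is not a lexicographic neighbour
    # stay unmatched in this sketch.
    return min(close, key=lambda c: hamming(barcode, c)) if close else None


references = sorted(["TACATATTCTTTACTG", "TAGAGGGAAGTCAAGC"])
fixed = correct_barcode("TAGAGCGAAGTCAAGC", references)
first_base = correct_barcode("AACATATTCTTTACTG", references)
unmatched = correct_barcode("TAAAGGGAAGTCAAGC", references)
```

In this sketch the Hamming filter removes the wrong backward match for TAGAGCGAAGTCAAGC, while TAAAGGGAAGTCAAGC is left uncorrected because neither neighbour is within one mismatch.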

Hoohm · 2023-12-28T19:18:38Z

I've turned off UMI correction for now, just to see runs go through completely. The path without cell reference and whitelist should be working. I still have to write tests for the MTX outputs.

Hoohm added 30 commits October 5, 2019 18:04

Merge tag '1.4.4' into develop

9cd8def

deployement

changed mtx output for features and small printing bug

db6da2b

Changed some verbose output

480ac20

Merge branch 'feature/mtx_format' into develop

0f0ad0b

deleted an enumerate

2fb744b

added named_tuple ref

27b6fc6

parallel umis

2c13410

Fixed tests

70e2170

Merge branch 'feature/namedtuples' into develop

ec21653

some updates to CHANGELOG

a6f595c

more changelog

5e09e19

fixed slidin_window

ebdda53

CHANGELOG update

e3ba958

got rid of second length check

b414242

integrated a pull from db for chemistry def

795a1a4

added remote downloading of definitions

9862a82

a lot of code refactoring

e5dc23d

refactoring, moved chunking to io

f4cd58c

some more changes

e5bbe59

Merge tag 'docu_error_132' into develop

42d9bc4

github issu #132

fixed chunking

7c3fbc6

fixed sprase output

e8c0ff5

fixed debugging

343f8a2

correction in README

f0598b5

dealt with merge conflicts

7d20fb5

other merge conflicts

729db14

conflicts resolved

fbf15a0

resolved more conflicts

8276494

rewrote all preprocssing tests and got rid of step for tags

1b60407

docstring updates

49cfdba

Hoohm and others added 10 commits November 21, 2021 16:54

fixed tests

564a5d0

added pyymal

14a9c61

feat: Add json template

edba219

Pyupgrade to 3.8

3e3d529

code reformatting

9e4cfea

Fix: Fix testing for preprocessing

9195df9

Fix: formatting

1548e32

Preprocessing: Code refactor with tests

a65f62b

Rewrote barcode correction

6c5f9e2

feat: rewriting mapping, barcode correction in polars

44bf849

Hoohm mentioned this pull request Nov 16, 2023

Allow to split fastq or get list of readID per cell #184

Open

Fix: python version

5f76224

lldelisle mentioned this pull request Nov 23, 2023

add --store-read-ids and --read-ids-whitelist options #186

Open

(feat): Barcode correction using asof_join

53c5e5b

(feat): Mtx writing

ced3e61

Hoohm added 11 commits December 30, 2023 16:21

(test): Tests for IO and preprocessing

292a232

(feat): Include yaml report again

b3401d8

(feat): Mapping in polars only using polars-distance

1ef8ffd

(feat): First attempt at UMI correction

993f998

(fix): Mtx writing

416d475

(fix): duplicated read_counts writing

81396b3

(feat): Top unmapped are back

ddfe048

(chore): Rename pl.Utf8 to pl.String

5545d93

(Fix): Barcode correction now iterates until it finds the best mapping using asof_join

c137d5e

(feat): Read fastq files using polars.

67d6c82

feat: New fastq reader/writer

c54dd60
