Reading plaintext alignments consumes a lot of memory (workaround inside) #24

tmaklin · 2023-12-01T12:45:24Z

Reading a plaintext pseudoalignment from Themisto consumes far more memory than necessary, because plaintext input disables the internal encoding of the pseudoalignments as a sparse vector.

Workaround: use alignment-writer to compact the alignment file and then read in the compact alignment file instead of the plaintext one.

jnalanko · 2023-12-07T19:10:32Z

I'm using this workaround, and the memory usage went from over 1TB to only 20GB. However, mSWEEP is still very slow to read the pseudoalignment file. It has now been reading the file for 11 CPU hours at 4 threads. The compacted alignment file is only about 250MB, so that does not sound right.

tmaklin · 2023-12-08T08:30:20Z

The "reading" part also includes deserializing the pseudoalignment into memory, constructing equivalence classes, and assigning the reads to the equivalence classes, so it is more than just reading the file. Still, this probably needs some design changes to handle large input better (100 000 000 reads x 60 000 references in this example).
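For a rough sense of scale, the dimensions above can be turned into a back-of-the-envelope size estimate. The sketch below assumes a fully dense matrix with one bit per read/reference pair. That layout is an assumption for illustration only, not a description of how mSWEEP stores its input, and `dense_matrix_bytes` is a hypothetical helper.

```python
def dense_matrix_bytes(n_reads: int, n_refs: int, bits_per_cell: int = 1) -> int:
    """Bytes needed for a dense n_reads x n_refs matrix (assumed layout)."""
    total_bits = n_reads * n_refs * bits_per_cell
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    # Dimensions quoted in the comment above.
    size = dense_matrix_bytes(100_000_000, 60_000)
    print(f"{size / 1e9:.0f} GB")
```

Under these assumptions the example comes to about 750 GB at one bit per cell, and eight times that at one byte per cell, which is the same order of magnitude as the memory use reported earlier in the thread.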

tmaklin · 2023-12-08T08:44:26Z

Relevant functions for this issue:

Deserializing the file

alignment_writer::ParallelUnpack.

Memory use in plaintext data

Equivalence classes

telescope::Alignment::collapse converts the deserialized file into equivalence classes, where...
telescope::GroupedAlignment::insert constructs the equivalence classes by hashing each read's pseudoalignment, represented as a boolean vector, and building a hash map that links the pseudoalignment to data about it.
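The idea of grouping reads into equivalence classes by hashing their pseudoalignment can be sketched as follows. This is a minimal Python illustration of the general technique, not the telescope C++ code: `group_reads` is a hypothetical name, and storing a list of read indices per class is an assumption about what "data about it" might contain.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_reads(
    alignments: Sequence[Sequence[bool]],
) -> Dict[Tuple[bool, ...], List[int]]:
    """Group read indices by identical pseudoalignment pattern."""
    classes: Dict[Tuple[bool, ...], List[int]] = defaultdict(list)
    for read_id, hits in enumerate(alignments):
        # A tuple of booleans is hashable, so it can key the dict directly.
        classes[tuple(hits)].append(read_id)
    return dict(classes)


if __name__ == "__main__":
    reads = [
        [True, False, True],
        [False, True, False],
        [True, False, True],  # same pattern as read 0
    ]
    for pattern, members in group_reads(reads).items():
        print(pattern, members)
```

Every read costs one hash of a vector as long as the number of references, so with tens of thousands of references and many millions of reads this step alone is a substantial amount of work.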

tmaklin · 2024-05-30T08:22:34Z

v2.1.0 should contain a fix for this.

I've also implemented a flag to filter out targets that have 0 alignments across reads. This can reduce memory and CPU use significantly for sparse inputs. Filtering is toggled with --min-hits 1. Using 1 as the threshold should produce the same results for the targets that have more than 0 alignments. Values higher than 1 are also supported but will change the results; however, something like --min-hits 1000 can be hugely beneficial for very large inputs.
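The effect of this kind of filtering can be sketched in a few lines. The Python below is an illustration only, not the mSWEEP implementation: `filter_targets` is a hypothetical helper, and it assumes that a target is kept when at least `min_hits` reads align to it.

```python
from typing import List, Sequence, Tuple


def filter_targets(
    alignments: Sequence[Sequence[bool]], min_hits: int
) -> Tuple[List[List[bool]], List[int]]:
    """Drop target columns hit by fewer than min_hits reads.

    Returns the filtered rows and the original indices of the kept targets.
    """
    if not alignments:
        return [], []
    n_targets = len(alignments[0])
    # Count how many reads align to each target.
    hit_counts = [sum(1 for row in alignments if row[t]) for t in range(n_targets)]
    kept = [t for t, count in enumerate(hit_counts) if count >= min_hits]
    filtered = [[row[t] for t in kept] for row in alignments]
    return filtered, kept


if __name__ == "__main__":
    reads = [
        [True, False, True],
        [True, False, False],
    ]
    print(filter_targets(reads, min_hits=1))  # target 1 has no hits and is dropped
    print(filter_targets(reads, min_hits=2))  # only target 0 has two hits
```

With a threshold of 1 only all-zero columns disappear, so the remaining pseudoalignment patterns carry the same information. A higher threshold also removes targets that have some hits, which is why it can change the results.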

tmaklin added the bug label Dec 1, 2023

tmaklin added this to the v2.1.0 milestone Feb 21, 2024

tmaklin mentioned this issue May 30, 2024

mSWEEP v2.1.0 #27

Merged

tmaklin closed this as completed May 30, 2024
