Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternatives for .skf format #47

Open
johnlees opened this issue Jul 11, 2023 · 4 comments
Open

Alternatives for .skf format #47

johnlees opened this issue Jul 11, 2023 · 4 comments
Assignees
Labels
enhancement New feature or request

Comments

@johnlees
Copy link
Member

johnlees commented Jul 11, 2023

The current .skf format has the following issues:

  • It can become very large, especially for diverse datasets. As it is all deserialised (see Partial loading of skf #22 for attempts at altering this) this means any operation, including looking at metadata is potentially slow.
  • The entire object needs to fit in main memory.

This would be a large and breaking change. I think if we do this, it would make sense to have a second file format (.ska/.sklarge/.h5?) which can be selected by the user for these cases, and continue to allow the previous .skf format to work as the default.

File format alternatives

For the file format, HDF5 comes to mind (which uses btrees too), but the rust bindings require the library to be installed, which is the main disadvantage (but there is good support for building it). Range queries/slices of datasets can be returned, and it's easy to add attributes to the same file. So it definitely fits the bill here.

An alternative would be Apache Parquet which has a native rust implementation, and snap compression. This would be suitable for the kmers and variants array, but it would make more sense to store the other fields (version, k size etc) with serde as before. To keep these together as a single file, could we just use tar? Starting to feel too complex imo.

Streaming from disk

For the second issue, blocks of btrees by range, and careful merging, could allow streaming of the relevant parts during build & align phases. For example see https://github.com/wspeirs/btree

Both file formats above would work here. Arrow can read and write blocks of rows. HDF5 can take a slice/range query.

Other notes

I feel like serde should be capable of at least some of this, see e.g. https://serde.rs/stream-array.html and https://serde.rs/ignored-any.html. But intial attempts with the current format weren't working well and I'm not sure why, so if it needs to be changed anyway introducing a format more designed for streaming operations might be sensible.

@johnlees johnlees added the enhancement New feature or request label Jul 11, 2023
@johnlees johnlees self-assigned this Jul 11, 2023
@johnlees
Copy link
Member Author

@johnlees
Copy link
Member Author

Talking to Tommi, he suggests that better compression and working on the compressed structure may be preferable. Some ideas:

  • The split k-mers are difficult to compress. A tree structure would be one option, possibly having reinserted the middle base. As the number of samples grows larger it is the variant array not the split k-mer that take up more space (e.g. in the 28 Lm example, these are 31M or 18M gzipped).
  • Most of the space is taken up by the variants, and these are highly redundant, especially the constant sites (in the Lm example ~300Mb, but ~5Mb gzipped). Going back to the idea of bitvectors for A, C, G, T for these, and not explicitly storing constant sites would likely help a lot. Tommi suggests: http://bitmagic.io/ which supports some useful operations on the compressed data.

@johnlees
Copy link
Member Author

Tommi suggests:

  • Static bitvector of length 4 to encode each base in variant matrix, whole matrix is then a Vec of these (with correct slicing to get the index)
  • Separate constant sites
  • wrt bitmagic, works for I/O and streaming in this format. Advanced operations not yet available, but this aren't necessarily needed.

@johnlees
Copy link
Member Author

See branch const-refactor for a prototype
But for this to work properly it needs to either be bit packed and not many vectors, or a sparse array

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant