Output formats #5
-
@GabrielSimonetto What would be your take(s) on this discussion? Which output format would you choose for now?
-
FWIW, Apache Feather is the name of the Arrow IPC format. I'm also wondering whether we should benchmark some other, perhaps less obvious formats: HDF5 and SQLite (e.g. https://github.com/mlin/GenomicSQLite). Finally, I wonder if we should rank these from most to least important, so we can easily extend the project to more formats if we have time.
-
D4 is viable and a good fit. I was discussing with @arq5x how to better support it in the context of GA4GH a while ago.
-
Good read about some of the output formats we are targeting: https://blog.openml.org/openml/data/2020/03/23/Finding-a-standard-dataset-format-for-machine-learning.html
-
As raised earlier, we should decide on a desirable (subset of?) output format(s). The premises/deciders for them are:
Here are a few candidates:
Apache Arrow and/or Parquet
Might be the strongest candidate, given its Rust library support and prior art in other sequencing technologies such as nanopore. It is also a cloud-native format, so no expensive ETL steps are needed to consume it.
Apache Avro
One of the formats supported by default by AWS Athena.
Amazon Ion
Relatively new format that has been open-sourced by Amazon. Its dual text and binary representations are interesting, as are its Rust bindings and its compatibility with the AWS ecosystem (i.e., Athena can consume it).
Apache ORC
Also cloud-native, with some interesting compression and runtime improvements compared with Parquet. On the other hand, its Rust library support might not be as mature as that of other formats.
GenomicSQLite
Has had a fair amount of work put into it and could be interesting to benchmark and interface with. Notably, it even has a Rust binding.
D4
Highly specific to BED/BigWig, so input-to-output format mapping might not even be feasible for a number of input formats.
insert and discuss further suggestions here
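To make the ranking idea raised earlier in the thread concrete, here is a minimal Rust sketch of how the candidates could be listed and ordered by priority. Everything in it is an assumption for illustration: the `OutputFormat` enum, the `priority` method and the `ranked` helper are hypothetical names, and the priority values are placeholders rather than a decision. The only ordering taken from the discussion is that Arrow/Parquet looks like the strongest candidate and that D4 may not map to every input format.

```rust
/// Candidate output formats mentioned in this thread (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OutputFormat {
    Parquet,
    Avro,
    Ion,
    Orc,
    GenomicSqlite,
    D4,
}

impl OutputFormat {
    /// Placeholder rank: a lower value means "implement and benchmark first".
    /// These numbers are assumptions, not an agreed ordering.
    fn priority(self) -> u8 {
        match self {
            OutputFormat::Parquet => 0,
            OutputFormat::Avro => 1,
            OutputFormat::Ion => 2,
            OutputFormat::Orc => 3,
            OutputFormat::GenomicSqlite => 4,
            OutputFormat::D4 => 5,
        }
    }
}

/// Return the given formats sorted from most to least important.
fn ranked(mut formats: Vec<OutputFormat>) -> Vec<OutputFormat> {
    formats.sort_by_key(|f| f.priority());
    formats
}
```

Adding a new candidate would then mean adding one enum variant and one `priority` arm, and the compiler's exhaustiveness check on the `match` flags any variant that was left unranked.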