File formats
To use DIAMOND output for big data analytics, we recommend the Apache Parquet file format and/or the DuckDB database system.

The DuckDB Command Line Interface can be used to convert the DIAMOND tabular output format into Parquet or the DuckDB database format, either via an intermediate TSV file or by piping the output of DIAMOND directly into DuckDB. For this purpose, the DIAMOND tabular output should be generated with header lines (option `--header simple`), and no output file should be specified when using a pipe.
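As an illustration, a DIAMOND run that produces headered TSV output suitable for these conversions might look like the following (the database and query file names are placeholders):

```shell
# Hypothetical input files; --header simple writes a single header line
# naming the output columns, which DuckDB's read_csv_auto can pick up.
diamond blastp -d reference.dmnd -q queries.fasta \
    --header simple -o matches.tsv
```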
From TSV to Parquet:

```shell
duckdb -c "SET memory_limit='16GB'; SET threads=16; COPY(SELECT * FROM read_csv_auto('input.tsv', delim='\t', header=true, parallel=true)) TO 'output.parquet' WITH (FORMAT 'PARQUET')"
```

From DIAMOND to Parquet:

```shell
diamond PARAMETERS | duckdb -c "SET memory_limit='16GB'; SET threads=16; COPY(SELECT * FROM read_csv_auto('/dev/stdin', delim='\t', header=true, parallel=true)) TO 'output.parquet' WITH (FORMAT 'PARQUET')"
```

From TSV to DuckDB database:

```shell
duckdb DATABASE_NAME -c "SET memory_limit='16GB'; SET threads=16; CREATE TABLE alignments AS SELECT * FROM read_csv_auto('input.tsv', delim='\t', header=true, parallel=true)"
```

From DIAMOND to DuckDB database:

```shell
diamond PARAMETERS | duckdb DATABASE_NAME -c "SET memory_limit='16GB'; SET threads=16; CREATE TABLE alignments AS SELECT * FROM read_csv_auto('/dev/stdin', delim='\t', header=true, parallel=true)"
```
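Once converted, the data can be queried with standard SQL. A sketch, assuming the default DIAMOND tabular columns (e.g. `qseqid`, `pident`) were written to the header:

```shell
# Query the Parquet file directly; column names come from the TSV header.
duckdb -c "SELECT qseqid, COUNT(*) FROM 'output.parquet' GROUP BY qseqid"

# Or query the table created inside the DuckDB database file.
duckdb DATABASE_NAME -c "SELECT AVG(pident) FROM alignments"
```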
The DuckDB memory limit and thread count should be adjusted to match the available hardware.
Benchmarks:

| Size | TSV to Parquet | TSV to DuckDB database |
|---|---|---|
| 12 GB | 0m33.157s | 0m30.596s |
| 24 GB | 1m4.645s | 0m51.964s |
| 48 GB | 2m4.457s | 1m37.319s |
| 96 GB | 3m59.649s | 3m2.770s |
| 192 GB | 9m26.08s | 7m20.66s |
The benchmarks were run using at most 20 cores in parallel. For comparison, converting a 12 GB TSV file into Parquet took 6 minutes on a MacBook.