
Insert Benchmark Instructions

Martin Nettling edited this page Dec 15, 2013 · 16 revisions

This wiki page describes how to perform the bulk-insert benchmark from the paper "DRUMS: Disk Repository with Update Management, for high throughput sequencing data." Tutorials on how DRUMS can be used with SNP data and HERV data, including example files, can be found in the two tutorial packages ("herv/tutorial", "snp/tutorial").

The results of this benchmark are shown in Figure 5 of the paper "DRUMS: Disk Repository with Update Management, for high throughput sequencing data."

Both benchmarks use the following counters for progress output. (Note that 1e6 is a double literal in Java and cannot be assigned to a long directly, so the plain integer literal is used.)

long OUTPUT_AFTER_INSERTED_ELEMENTS = 1000000;
long insertedElements = 0;

DRUMS

HERV

Instantiate a new DRUMS Object with the provided RangeHashFunction and the provided configuration.

DRUMSParameterSet<HERV> globalParameters = new DRUMSParameterSet<HERV>("HERVExample/drums.properties", new HERV());
RangeHashFunction hashFunction = HERV.createHashFunction();
DRUMS<HERV> drums = DRUMSInstantiator.createTable(hashFunction, globalParameters);

After the DRUMS table is instantiated, the data can be inserted.

Because we cannot make the HERV data public yet, other hit files must be used to perform the benchmark. For example, BLAST can be used to map random sequences to the human genome; the resulting hits can then be inserted using the provided HitFileParser.

...
long time = System.currentTimeMillis();
for(String file : allFilesToInsert) {
    HitFileParser parser = new HitFileParser(file, 1024 * 1024);
    HERV herv;
    while ((herv = parser.readNext()) != null) {
        drums.insertOrMerge(herv);
        if(++insertedElements % OUTPUT_AFTER_INSERTED_ELEMENTS == 0) {
            System.out.println("Inserted last " + OUTPUT_AFTER_INSERTED_ELEMENTS + " elements in " + (System.currentTimeMillis() - time) + " milliseconds");
            time = System.currentTimeMillis();
        }
    }
}
System.out.println("Inserted remaining elements in " + (System.currentTimeMillis() - time) + " milliseconds");
...
drums.close();

SNP

The benchmark on DRUMS for the SNP data looks quite similar, but the SNP class must be used instead of the HERV class. E.g.:

DRUMSParameterSet<SNP> globalParameters = new DRUMSParameterSet<SNP>("SNPExample/drums.properties", new SNP());
...
The SNP data used in the paper "DRUMS: Disk Repository with Update Management, for high throughput sequencing data" can be downloaded from http://1001genomes.org/datacenter/. We also provide a parser for those files.
...
    FilteredVariantParser parser = new FilteredVariantParser(file, ecotypeId);
...

MySQL

Don't forget to configure MySQL properly.
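Which settings matter depends on your hardware and MySQL version. The option names below are standard MySQL/InnoDB server variables, but the values are only illustrative starting points for a bulk-insert workload, not the configuration used in the paper:

```ini
# my.cnf -- illustrative values only, tune for your machine
[mysqld]
innodb_buffer_pool_size = 1G          # keep the working set in memory
innodb_flush_log_at_trx_commit = 2    # relax per-commit flushing for faster bulk loads
max_allowed_packet = 64M              # allow large batched INSERT packets
```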

First, the table for HERV data and SNP data must be created. The CREATE TABLE statements can be found in the corresponding resource folders (HERV,SNP).

To fill the MySQL tables from Java, build up a connection to the database with a JDBC driver (com.mysql.jdbc.Driver).
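As a minimal sketch of that connection setup (host, database, user, and password below are placeholders, not values from the paper): the Connector/J URL option rewriteBatchedStatements=true is worth setting for this benchmark, since it lets the driver rewrite addBatch() inserts into multi-row INSERT statements.

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class MySqlConnectionExample {

    // Build the JDBC URL; rewriteBatchedStatements=true makes Connector/J
    // combine batched inserts into multi-row INSERTs, which speeds up bulk loads.
    static String buildUrl(String host, String database) {
        return "jdbc:mysql://" + host + "/" + database
                + "?rewriteBatchedStatements=true";
    }

    public static void main(String[] args) throws Exception {
        String url = buildUrl("localhost", "benchmark");
        System.out.println(url);

        // Uncomment to actually connect (requires the MySQL JDBC driver on the classpath):
        // Class.forName("com.mysql.jdbc.Driver");
        // Connection connection = DriverManager.getConnection(url, "user", "password");
    }
}
```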

HERV

int batchsize = 10000;
String SQL_INSERT = "INSERT INTO herv SET chromosome = ?, startPositionChromosome = ?, endPositionChromosome = ?, startPositionHERV = ?, endPositionHERV = ?, idHERV = ?, strand = ?, eValue = ?";
PreparedStatement statement = connection.prepareStatement(SQL_INSERT);
long time = System.currentTimeMillis();
for(String file : allFilesToInsert) {
    HitFileParser parser = new HitFileParser(file, 1024 * 1024);
    HERV herv;
    while ((herv = parser.readNext()) != null) {
        statement.setInt(1, herv.getChromosome());
        ...
        statement.addBatch();
        if(++insertedElements % batchsize == 0) {
            statement.executeBatch();
        }

        if(insertedElements % OUTPUT_AFTER_INSERTED_ELEMENTS == 0) {
            System.out.println("Inserted last " + OUTPUT_AFTER_INSERTED_ELEMENTS + " elements in " + (System.currentTimeMillis() - time) + " milliseconds");
            time = System.currentTimeMillis();
        }
    }
}
statement.executeBatch();
System.out.println("Inserted remaining elements in " + (System.currentTimeMillis() - time) + " milliseconds");

SNP

To run the MySQL benchmark with SNP data, only a few adaptations to the HERV benchmark are needed: use the SNP class instead of the HERV class, and fill the SQL INSERT statement accordingly.

...
SQL_INSERT = "INSERT INTO snp SET sequence_id = ?, position = ?, ecotype_id = ?, fromBase = ?, toBase = ?";
...
    FilteredVariantParser parser = new FilteredVariantParser(file, ecotypeId);
...