This project provides persistence modules for the invesdwin-context module system.
Releases and snapshots are deployed to this maven repository:
https://invesdwin.de/repo/invesdwin-oss-remote/
Dependency declaration:
<dependency>
<groupId>de.invesdwin</groupId>
<artifactId>invesdwin-context-persistence-jpa-hibernate</artifactId>
<version>1.0.3</version><!---project.version.invesdwin-context-persistence-parent-->
</dependency>
The invesdwin-context-persistence-jpa module provides a way to code against JPA (Java Persistence API) without binding yourself to a specific ORM (Object-Relational Mapping) framework. In theory you could use the same entities you programmed and switch between Hibernate, OpenJPA, EclipseLink, Datanucleus, and so on. In practice you will find out that not all JPA implementations are equally good, so for now we only have modules available for:
- Hibernate (see invesdwin-context-persistence-jpa-hibernate) as the most mature JPA implementation for relational databases in our opinion
- EclipseLink (see invesdwin-context-persistence-jpa-eclipselink) for testing purposes
- Additionally there is currently an experimental module for Datanucleus (see invesdwin-context-persistence-jpa-datanucleus), as it provides connectors to NoSQL databases over JPA. It is in an experimental stage since it still fails some of our unit tests for JPA compliance or has some other bugs, but it can already be used when you forego advanced JPA features.
For now these modules provide a proof of concept that invesdwin-context-persistence-jpa stays neutral to any specific JPA implementation and provides the flexibility to give up on extended JPA features. So the rest of this documentation will tell you what you can do with a comfortable JPA implementation like Hibernate and will give you hints on how you can scale down for different requirements. The main JPA module provides the following tools:
- AEntity: this is the base class for entities that use optimistic locking, timestamps for creation and last update, as well as a simple numerical ID based on a sequence per table (currently only available in the Hibernate module), as this is the best performing strategy for high load systems. There are a few variations available that you can use as alternative base classes for fine tuning:
  - AEntityWithIdentity: uses the identity column feature of your database (if supported) as the ID generator strategy
  - AEntityWithTableSequence: uses a table for sequences, which should be supported by any database
  - AEntityWithSequence: uses the high performance variation of one sequence per table in Hibernate, or the default sequence strategy on other ORMs (this is the default base class for AEntity)
- AUnversionedEntity: this is the base class that only implements the ID column itself, without adding optimistic locking or timestamps. It is useful when you have a very large table where you want to save the disk space of unnecessary additional fields on millions of rows. As above, you have the option to use other ID generator strategies by extending AUnversionedEntityWithIdentity or AUnversionedEntityWithTableSequence, while AUnversionedEntityWithSequence is the default.
- ADao: this is the default DAO implementation. It is always bound to provide queries for one specific entity and is the default implementation for the numerical ID as used in AEntity and its variations. The DAO provides support for spring-data-jpa and QueryDSL while extending them with convenience methods for Query by Example. To get the most out of your DAOs, your entities should follow these suggestions (see the sketch after this list):
  - only use basic Java object types as fields in your entities and convert to/from a different desired type in the getters and setters. E.g. use Date/BigDecimal fields and convert to/from FDate/Decimal (see invesdwin-util) in the getter/setter methods, so you don't have to set up ORM specific type converters (since these are missing in the JPA specification).
  - also make sure you do not use primitive data types such as long or double, but instead object types like Long or Double for fields, so that null checks can work properly in the validation phase and for Query by Example convenience.
  - prefer @Enumerated(EnumType.STRING) for enum fields so that you can reorder the enum values in your code without needing to migrate the ordinal values in your existing database rows
  - utilize BeanValidation annotations extensively to only allow clean data into your database
  - put the invesdwin @Indexes annotation on your entities to let the framework generate the indexes for you (since the JPA specification is missing this and every ORM has its own mechanism for it, we provide yet another mechanism)
  - if you want to roll your own customized entities, just do that and implement the IEntity interface to still be able to use our ADao base class, which needs a way to retrieve the numerical IDs.
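A minimal sketch of an entity and its DAO that follow these suggestions. The entity name, its fields and the nested enum are made up for illustration, the import/package locations of AEntity and ADao are assumed, and imports are omitted as in the other examples of this document:

```java
@Entity
@NotThreadSafe
public class OrderEntity extends AEntity {

    // example enum; STRING storage survives reordering of the constants in code
    public enum OrderType {
        BUY,
        SELL
    }

    // object types instead of primitives so null checks and Query by Example work
    private String symbol;
    private Long quantity;

    @Enumerated(EnumType.STRING)
    private OrderType orderType;

    // store a plain Date, expose FDate (invesdwin-util) via conversion in getter/setter
    @Temporal(TemporalType.TIMESTAMP)
    private Date orderDate;

    public FDate getOrderDate() {
        return FDate.valueOf(orderDate);
    }

    public void setOrderDate(final FDate orderDate) {
        this.orderDate = FDate.toDate(orderDate);
    }

    // ... remaining getters/setters ...
}

@Named
@ThreadSafe
public class OrderDao extends ADao<OrderEntity> {
    // inherits spring-data-jpa, QueryDSL and Query by Example support from ADao
}
```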
- ACustomIdDao: this DAO implementation can be used when you want to use something different as an ID, in combination with a custom entity that does not necessarily have a numerical ID (otherwise you could use ADao with IEntity). E.g. when you want to use a composite primary key as the ID.
- ARepository: for a good design, DAOs should only contain queries for their specific entity. If you want to write queries that span multiple entities, you should rather externalize them into a repository implementation, which can also reference multiple DAOs to get its work done. Or you can write a repository for a special task like complex searching over different entities that requires a whole bunch of code for the queries. Regarding queries, here are a few tips (see the service sketch after this list):
  - favor QueryDSL over JPQL for complex queries, since it has the advantage of being type safe and thus robust against refactorings in your entities (the Q...Entity classes get generated via an annotation processor and will give compile errors for queries you then have to adjust)
  - use Query by Example for 80% of your simple query needs
  - leverage the query cache and the JPA Level 2 cache where appropriate instead of setting up your own cache (per default backed by EhCache in the Hibernate module)
  - use the BulkInsertEntities class to push bulk insert speeds to the maximum. It leverages LOAD DATA INFILE on a MySQL database or at least does proper batch inserts for other databases
  - put the spring @Transactional annotation around methods that do DML operations on the database, since per default only read-only transaction attributes are applied. You should also put @Transactional annotations around methods in your services/beans that do database operations over multiple repositories and DAOs, so that all operations share the same transaction and get rolled back together. Multiple persistence units are also supported, since the ContextDelegatingTransactionManager takes care of shared transactions for you. Since invesdwin-context is using AspectJ, you can even put @Transactional inside your @Configurable non-bean objects that you instantiate yourself.
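A minimal sketch of a service method that wraps DML operations in a spring @Transactional annotation so the writes are not run with the default read-only transaction attributes. The service reuses the hypothetical OrderDao/OrderEntity from the sketch above and assumes ADao exposes the usual spring-data-jpa save(...) method:

```java
@Named
@ThreadSafe
public class OrderService {

    @Inject
    private OrderDao orderDao; // hypothetical DAO from the sketch above

    // write operations need an explicit @Transactional,
    // since per default only read-only transaction attributes are applied
    @Transactional
    public OrderEntity placeOrder(final String symbol, final Long quantity) {
        final OrderEntity order = new OrderEntity();
        order.setSymbol(symbol);
        order.setQuantity(quantity);
        order.setOrderDate(FDate.valueOf(new Date()));
        return orderDao.save(order); // save(...) as known from spring-data-jpa repositories
    }
}
```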
- JpaRepository: since spring-data-jpa is used, you can also define query interfaces that extend JpaRepository and leverage the magic spring provides (see the sketch below). The invesdwin-context application bootstrap will automatically generate the spring-xml configuration for all interfaces in the classpath that extend JpaRepository (and do not extend IDao) and make them reference the correct persistence unit for transactions, depending on the entity class given in the JpaRepository generic type arguments. It will look up the @PersistenceUnit in the entity class itself to determine the bean names for the datasource, transaction manager and so on. See the next paragraphs for more details on this automated configuration mechanism.
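A minimal sketch of such a query interface; the interface name and the derived query method are made up, the rest follows standard spring-data-jpa conventions:

```java
public interface OrderRepository extends JpaRepository<OrderEntity, Long> {

    // derived query that spring-data-jpa generates from the method name
    List<OrderEntity> findBySymbol(String symbol);
}
```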
- persistence.log: per default our transaction manager logs SQL statements into a log file. It adds information about transactions, so you can always troubleshoot your persistence issues properly. If you want to gain some additional performance during production use, just change the log level of p6spy to OFF in your logback configuration, or better, remove the dependency on p6spy:p6spy.
- @PersistenceTest: per default all JUnit tests run against an in-memory H2 database. Add this annotation to your test case to switch to a real server during testing (e.g. when you want to test against real data you seeded during development of a website). The default non-in-memory server is expected to be a local MySQL instance that is set up with the following sql script:
USE mysql;
CREATE USER 'invesdwin'@'localhost' IDENTIFIED BY 'invesdwin';
GRANT ALL ON invesdwin.* TO 'invesdwin'@'localhost';
GRANT SUPER ON *.* TO 'invesdwin'@'localhost';
CREATE DATABASE invesdwin;
- Install DB: to install a MySQL instance on Ubuntu, use the following commands:
sudo apt-get install mysql-server mysql-workbench
# increase some limits
sudo sed -i s/max_connections.*/max_connections=1000/ /etc/mysql/my.cnf
echo "innodb_file_format=Barracuda" | sudo tee -a /etc/mysql/my.cnf
sudo /etc/init.d/mysql restart
- Reset DB: to reset the test database schema (deleting all tables), run the following command on the pom.xml of invesdwin-context-persistence-jpa:
mvn -Preset-database-schema process-resources
- Change DB Config: to use a different database as a test server, just put a file called /META-INF/env/${USERNAME}.properties into your classpath. This file allows you to override the module's default configuration depending on your local development environment (for more information and on how to define distribution specific settings, see the invesdwin-context documentation). Just put the following properties with changed values there (and make sure the connection driver is added as a maven dependency and is available in the classpath; you can also deploy an embedded database like this):
de.invesdwin.context.persistence.jpa.PersistenceUnitContext.CONNECTION_DRIVER@default_pu=com.mysql.jdbc.Driver
de.invesdwin.context.persistence.jpa.PersistenceUnitContext.CONNECTION_URL@default_pu=jdbc:mysql://localhost:3306/invesdwin
de.invesdwin.context.persistence.jpa.PersistenceUnitContext.CONNECTION_USER@default_pu=invesdwin
de.invesdwin.context.persistence.jpa.PersistenceUnitContext.CONNECTION_PASSWORD@default_pu=invesdwin
de.invesdwin.context.persistence.jpa.PersistenceUnitContext.CONNECTION_DIALECT@default_pu=MYSQL
- Multiple Persistence Units: notice the @default_pu string in the properties? This defines the persistence unit this configuration is for. Duplicating these values and changing the name of the persistence unit allows you to set up multiple databases to be used in one application. Depending on the dialect and which persistence provider modules you have in your classpath (e.g. Hibernate, Datanucleus), you can in theory even mix Hibernate for relational databases and Datanucleus for HBase as multiple ORMs in the same application. We still have to define a few more invesdwin-persistence-jpa-* modules and create an actual proof of concept for this feature, but it is an interesting idea regarding polyglot persistence. For now this just allows you to e.g. use multiple databases with the Hibernate module.
  - you can annotate your entity class with @PersistenceUnit("another_pu") to tell the invesdwin-context bootstrap to associate this entity with the specific persistence unit configuration
  - the ACustomIdDao/ADao class will look up its entity class to determine the persistence unit (see the PersistenceProperties class) to be used for its entity manager and transactions. Since ARepository does not have an associated entity, you have to provide the proper persistence unit name by overriding the getPersistenceUnitName() method (see the sketch after this list).
  - the IPersistenceUnitAware interface defines this method and ARepository implements this interface. You can also implement this interface inside any of your spring beans/services to add this information. It tells the ContextDelegatingTransactionManager to use this specific persistence unit for the transactions defined inside this class
  - you can have shared transactions spanning multiple persistence units simply by nesting those transactions in one another. If the nested transaction gets rolled back, the outer transaction will be rolled back as well. Still, you have to be careful about nested transactions being committed before the outer transaction gets committed, so choose the nesting levels wisely and test your rollback scenarios sufficiently. JTA with a two-phase commit would have been an interesting alternative here, but setting it up with support for NoSQL databases and multiple ORMs is sadly a nightmare.
  - additionally you can override/specify the persistence unit to use directly with a @Transactional(value="another_tm") annotation. This defines a specific transaction manager bean to be used for this transaction definition. It leverages the fact that all persistence unit related beans follow a specific naming pattern:
    - another_pu defines the persistence unit named another (default_pu being the default persistence unit) and is required to always carry the suffix _pu so it can be identified as such
    - another_tm defines the transaction manager bean and can be used in @Transactional annotations as above, or to look up the specific PlatformTransactionManager bean for manual transaction handling by calling its getTransaction/commit/rollback methods
    - another_emf defines the EntityManagerFactory bean with which you can retrieve EntityManager instances. You can make them thread safe with spring's SharedEntityManagerCreator or directly use PersistenceProperties.getPersistenceUnitContext("another_pu").getEntityManager() to get a thread safe instance.
    - another_ds defines the DataSource, which normally is a HikariCP connection pool instance that can be used to gain direct JDBC access to the database. If you don't need the performance or just want to debug some database deadlocks, you can add a dependency on com.mchange:c3p0 and it will be used instead. c3p0 provides a statement cache (not all databases can be configured for this with HikariCP) and it provides helpful error messages for problems like database deadlocks.
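A minimal sketch of how a second persistence unit could be wired up in code, assuming the properties above were duplicated for another_pu; the entity and repository names are hypothetical and any further abstract methods of ARepository are omitted:

```java
@Entity
@NotThreadSafe
@PersistenceUnit("another_pu") // associates the entity with the another_pu configuration
public class AuditLogEntity extends AEntity {

    private String message;

    // ... getters/setters ...
}

@Named
@ThreadSafe
public class AuditRepository extends ARepository {

    // ARepository has no associated entity, so the persistence unit is named explicitly
    @Override
    public String getPersistenceUnitName() {
        return "another_pu";
    }

    // transactions inside this repository now run against the another_tm/another_emf/another_ds beans
}
```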
- Data Management in Tests: you can extend APersistenceTestPreparer to build fixture generators that you can use inside the setup of your unit tests to get some test data. By defining a separate class for the preparer, you can even run it as a unit test itself to populate a test database (configured by annotating the test class with @PersistenceTest(PersistenceTestContext.SERVER), see the test sketch below). You can also run new PersistenceTestHelper().clearAllTables() to reset the database inside your unit test as often as you want (e.g. in your setUp() or setUpOnce() method). It goes over all spring beans that implement IClearAllTablesAware and deletes all table rows with them, since the ADao default implementations do that for each entity (as long as each entity has an ADao implementation; in the other cases you should provide your own IClearAllTablesAware bean). Be cautious here with circular dependencies between entities using foreign keys. You might have to customize the deleteAll() method of your ADao/IClearAllTablesAware bean to resolve them before deletion. Check the persistence.log if your unit tests seem to hang, it might be filled with warnings regarding circular foreign keys, or the test might abort with a StackOverflowError because of this.
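A minimal sketch of a test case that runs against the real database server and resets it before each test. Only the @PersistenceTest annotation and the PersistenceTestHelper call are taken from the description above; the test class, the ATest base class, the setUp() signature and the DAO are assumptions:

```java
@ThreadSafe
@PersistenceTest(PersistenceTestContext.SERVER) // use the configured MySQL server instead of in-memory H2
public class OrderDaoTest extends ATest { // ATest as an assumed invesdwin-context test base class

    @Inject
    private OrderDao orderDao; // hypothetical DAO from the earlier sketch

    @Override
    public void setUp() throws Exception {
        super.setUp();
        // wipe the rows of all IClearAllTablesAware beans so each test starts from a clean state
        new PersistenceTestHelper().clearAllTables();
    }

    @Test
    public void testSaveAndLoad() {
        final OrderEntity order = new OrderEntity();
        order.setSymbol("TEST");
        order.setQuantity(1L);
        orderDao.save(order);
        Assert.assertEquals(1, orderDao.findAll().size());
    }
}
```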
- Schema Generation: per default the schema generation feature of the ORM is enabled during testing, while in production the schema only gets validated. You can change this behavior globally with the de.invesdwin.context.persistence.jpa.PersistenceProperties.DEFAULT_CONNECTION_AUTO_SCHEMA=(VALIDATE|UPDATE|CREATE|CREATE_DROP) property, or on a persistence unit basis via de.invesdwin.context.persistence.jpa.PersistenceUnitContext.CONNECTION_AUTO_SCHEMA@default_pu=(VALIDATE|UPDATE|CREATE|CREATE_DROP).
We had to do some advanced database customization for a very large database table, which might be interesting as an example here:
- You can let your DAO implement IStartupHook to run some native statements for customization of the automatically generated tables. E.g. the following startup hook changes a MariaDB table to use the TokuDB storage engine and defines partitioning for a very large database table:
@Named
@ThreadSafe
public class ConfigurableTradingDayRatingDao extends
ACustomIdDao<ConfigurableTradingDayRatingEntity, ConfigurableTradingDayRatingEntityId> implements IStartupHook {
...
@Transactional
@Override
public void startup() throws Exception {
if (PersistenceProperties.getPersistenceUnitContext(getPersistenceUnitName())
.getConnectionDialect() == ConnectionDialect.MYSQL && !ContextProperties.IS_TEST_ENVIRONMENT
&& isEmpty()) {
getEntityManager()
.createNativeQuery("alter table " + ConfigurableTradingDayRatingEntity.class.getSimpleName()
+ " ENGINE=TokuDB, COMPRESSION=TOKUDB_LZMA")
.executeUpdate();
getEntityManager()
.createNativeQuery("alter table " + ConfigurableTradingDayRatingEntity.class.getSimpleName()
+ " partition by hash(" + ConfigurableTradingDayRatingEntity_.ratingConfig_id.getName()
+ ") partitions " + RatingUpdaterProperties.generateExpectedRatingConfigs().size())
.executeUpdate();
}
}
...
}
- Also note that we used a custom ID type for this especially large table to even spare us the additional long ID column, which does not get us anywhere with TokuDB since it does not support foreign keys anyway. See the following code as an advanced example of how to set up indexes and custom entities:
@Entity
@NotThreadSafe
// unique constraint is modified in later on startup as primary key
@IdClass(ConfigurableTradingDayRatingEntityId.class)
@Table(uniqueConstraints = @UniqueConstraint(columnNames = { "company_id", "ratingConfig_id", "date" }))
// we also need some additional indexes to improve query speeds
@Indexes({ @Index(columnNames = { "date", "ratingConfig_id", "companyCurrentlyStable" }),
@Index(columnNames = { "company_id", "ratingConfig_id", "date" }),
@Index(columnNames = { "date", "ratingConfig_id", "companyCurrentlyStableOptimistic" }),
@Index(columnNames = { "date", "ratingConfig_id", "companyCurrentlyUndervalued" }) })
public class ConfigurableTradingDayRatingEntity extends AValueObject {
public static class ConfigurableTradingDayRatingEntityId extends AValueObject {
private Date date;
private Long company_id;
private Long ratingConfig_id;
//demonstrating type conversion here from Date to FDate
public FDate getDate() {
return FDate.valueOf(date);
}
public void setDate(final FDate date) {
this.date = FDate.toDate(date);
}
public Long getCompany_id() {
return company_id;
}
public void setCompany_id(final Long company_id) {
this.company_id = company_id;
}
public Long getRatingConfig_id() {
return ratingConfig_id;
}
public void setRatingConfig_id(final Long ratingConfig_id) {
this.ratingConfig_id = ratingConfig_id;
}
}
//workaround for missing foreign keys, the transient instance gets filled by the DAO implementation
@Transient
private ConfigurableReportRatingEntity configurableReportRating;
@Column(nullable = false)
private Long configurableReportRating_id;
//denormalized ID fields
@Id
@Column(nullable = false)
@Temporal(TemporalType.DATE)
private Date date;
@Id
@Column(nullable = false)
private Long company_id;
@Id
@Column(nullable = false)
private Long ratingConfig_id;
...
}
There are a few NoSQL integrations available. Most notably:

The invesdwin-context-persistence-ezdb module provides support for the popular LevelDB key-value storage. The actual LevelDB API is very low level, so we use EZDB for a bit more comfort. Our module provides the following accessories:
- ADelegateRangeTable: this is a wrapper around the RangeTable from EZDB and provides the following benefits (see the sketch after this list):
  - read-write-locking for database synchronization to support multi-threaded access
  - a shouldPurgeTable() callback to reset the database (e.g. daily or on some condition)
  - improved performance via controlling usage of basic features (you can override the callback methods manually to enable a feature if you can live with the performance penalty, but it should be an explicit decision, so they are disabled per default):
    - Catching NoSuchElementException instead of checking iterator.hasNext() improves iteration speed with LevelDB over large datasets. So allowHasNext() will throw an exception (if not overridden) when this is forgotten by the developer.
    - Using the newBatch().put/delete/flush() mechanism instead of direct put/delete calls improves the speed of data manipulation tasks on large data sets, though no exception will be thrown for this misusage.
  - getOrLoad(...) methods allow you to lazily load missing data via a callback function. E.g. when LevelDB is used as a cache for web requests that download financial data: if it is end-of-day data you might want to refresh it daily, thus you could extend ADailyExpiringDelegateRangeTable, which implements shouldPurgeTable() to reset the database daily. The next call to getOrLoad(...) would then download up-to-date financial data using your callback function.
  - the table does some sanity checks during initialization and purges itself if they fail, so you can reload the data from a web service. E.g. when you changed the serialization structure of the data and the client caches need to be refreshed, this automates the process (beware that this is not storage meant for permanent data, it is rather a solution for caching here, or else this feature would need to be disabled for specific use cases)
  - TypeDelegateSerde as the default serializer/deserializer with support for most common types and a fallback to FST for complex types. To get faster serialization/deserialization you should consider providing your own Serde implementation via the newHashKeySerde/newRangeKeySerde/newValueSerde() callbacks. Utilizing ByteBuffer for custom serialization/deserialization is in most cases the fastest and most compact way, but requires a few lines of manual coding.
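A minimal sketch of what such a table subclass could look like; only the callback names are taken from the list above, while the generic parameters, the constructor argument and the exact callback signatures of ADelegateRangeTable are assumptions, as are the Quote value type and the table name:

```java
@ThreadSafe
public class DailyQuoteTable extends ADelegateRangeTable<String, FDate, Quote> {

    public DailyQuoteTable() {
        super("dailyQuote"); // assumed constructor argument: the table name
    }

    // disabled per default; explicitly accept the iterator.hasNext() performance penalty
    @Override
    protected boolean allowHasNext() {
        return true;
    }

    // e.g. purge end-of-day data on some condition so the next getOrLoad(...) refreshes it;
    // ADailyExpiringDelegateRangeTable already provides a daily implementation of this
    @Override
    protected boolean shouldPurgeTable() {
        return false;
    }

    // for faster serialization, newHashKeySerde()/newRangeKeySerde()/newValueSerde()
    // could additionally be overridden with custom Serde implementations
}
```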
- APersistentMap/APersistentNavigableMap: if you need storage of large elements, then LevelDB might not bring the best performance, because it reorganizes itself often during insertions, which makes it very slow. If you need the ordered values from LevelDB, you can use it as an index and store your large entries in a ChronicleMap or MapDB with this. You should consider compressing large values with FastLZ4CompressionFactory.INSTANCE.maybeWrap(serde) to reduce the file size at a little CPU cost, though this should be cancelled out by the improved IO speed, which is more often the bottleneck. This database performs better than LevelDB for databases with large entries, since no reordering on disk is performed. ALargePersistentMap automates this separate index approach and goes a bit further: you can use LevelDB/ChronicleMap/MapDB as an index and store the large values in a custom storage suitable for very large values. You can decide between storing each value in a separate file with FileChunkStorage or storing all values in one large random access file with MappedChunkStorage (default). Here are some issues that you might encounter with different approaches for large value storage:
  - ChronicleMap might cause segment overflow exceptions when storing very large values or values of significantly different lengths.
  - MapDB might corrupt very large values (can be detected via checksum errors of LZ4).
  - LevelDB gets very slow due to significant write amplification caused by reordering.
  - => Thus a simpler and more robust storage is preferable in such cases.
This is a custom designed NoSQL database that is optimized for backtesting and live running of trading strategies. It supports efficient storage and retrieval of tick data while providing identical performance on granular bar intervals.
- ATimeSeriesDB: this is a binary data system that stores continuous time series data. It uses LevelDB as an index for the file chunks, which it saves separately into random access files of 10,000 entries each, compressed via LZ4 (the fastest and most compact compression algorithm we could determine for our use cases). By chunking the data and looking up the files to process via a LevelDB date range lookup, we can provide even faster iteration over large financial tick data series than would be possible when storing everything in LevelDB. Random access is possible, but you should rather let the in-memory AGapHistoricalCache from invesdwin-util handle that and only use getLatestValue() here to hook up the callbacks of that cache. This is because always going to the file storage can be really slow, thus an in-memory cache should be put in front of it. The raw insert and iteration speed for financial tick data was measured to be about 13 times faster with ATimeSeriesDB in comparison to directly using LevelDB (both with performance tuned serializer/deserializer implementations). In 2016 we were able to process more than 2,000,000 ticks per second with this setup instead of just around 150,000 ticks per second with only LevelDB in place. Today there is also a FileBufferCache that replicates the OS file cache for uncompressed and deserialized data. This makes reverse iteration faster and reduces file access significantly in parallel backtesting scenarios. A more synthetic performance test with a single thread and simpler data structures (which is included in the module's test cases) resulted in the numbers below. Though first some terminology:
  - Heap: Plain Old Java Objects (POJOs) on JVM managed heap memory with an impact on GC. No serialization overhead. No persistence.
  - Memory: Serialized bytes in off-heap memory outside of the JVM without GC impact. Includes serialization overhead. No persistence.
  - Disk: Serialized bytes on disk/SSD with persistence enabled. Includes serialization overhead.
  - Fast: Using LZ4 fast compression in ATimeSeriesDB.
  - High: Using LZ4 high compression in ATimeSeriesDB.
  - None: Using disabled compression in ATimeSeriesDB.
  - Cached: Using a warmed up heap memory cache in front of the disk storage. This uses ATimeSeriesDB's FileBufferCache, not AGapHistoricalCache, which is indifferent to the storage and thus not used in the benchmarks.
Old Benchmarks (2016, Core i7-4790K with SSD, Java 8):
LevelDB-JNI (Disk) 1,000,000 Writes: 100.68/ms in 9,932 ms
LevelDB-JNI (Disk) 10,000,000 Reads: 373.15/ms in 26,799 ms
ATimeSeriesDB (Disk, High) 1,000,000 Writes: 3,344.48/ms in 299 ms => ~33 times faster (with High Compression)
ATimeSeriesDB (Disk, High) 10,000,000 Reads: 14,204.55/ms in 704 ms => ~38 times faster (with High Compression)
New Benchmarks (2021, Core i9-9900k with SSD, Java 16):
CDB (Disk) Writes (WriteAll): 39.37/ms => ~83% slower
ezdb-RocksDB-JNI (Disk) Writes (PutBatch): 44.86/ms => ~80% slower
CQEngine (Disk) Writes (PutBatch): 48.79/ms => ~79% slower (uses SQLite internally; expensive I/O and huge size for large databases)
DuckDB (Disk) Writes (PutBatch): 54.62/ms => ~76% slower
Derby (Disk) Writes (PutBatch): 64.98/ms => ~72% slower
ezdb-LevelDB-JNI (Disk) Writes (PutBatch): 63.33/ms => ~72% slower
CQEngine (Memory) Writes (PutBatch): 65.97/ms => ~71% slower
ezdb-LMDB-JNR (Disk) Writes (PutBatch): 152.05/ms => ~33% slower
Indeed-RecordLogDirectory (Disk) Writes (Append): 199.19/ms => ~13% slower
MapDB (Disk) Writes (Put): 214.20/ms => ~6% slower
TreeMapDB (Disk) Writes (Put): 223.12/ms => ~2% slower
ezdb-LevelDB-Java (Disk) Writes (PutBatch): 228.07/ms => using this as baseline
InfluxDB-1.x (Disk) Writes (PutBatch): 252.08/ms => ~10% faster
H2 (Disk) Writes (PutBatch): 291.49/ms => ~28% faster
Hsqldb (Disk) Writes (PutBatch): 330.23/ms => ~45% faster
ChronicleQueue (Disk) Writes (Append): 442.48/ms => ~90% faster
ezdb-LsmTree (Disk) Writes (Put): 457.96/ms => ~94% faster
Indeed-MphTable (Disk) Writes (WriteAll): 491.50/ms => ~2.15 times as fast (immutable after creation)
Krati (Disk) Writes (Put): 570.13/ms => ~2.5 times as fast
SQLite (Disk) Writes (PutBatch): 631.39/ms => ~2.77 times as fast
tkrzw-HashDBM (Disk) Writes (Put): 792.46/ms => ~3.5 times as fast (speed degrades with larger tables)
tokyocabinet-HDB (Disk) Writes (Put): 805.80/ms => ~3.5 times as fast (speed degrades with larger tables)
tkrzw-SkipDBM (Disk) Writes (Put): 872.30/ms => ~3.8 times as fast (speed degrades with larger tables)
tokyocabinet-BDB (Disk) Writes (Put): 963.02/ms => ~4.2 times as fast
ezdb-BTreeMap (Heap) Writes (Put): 1,048.22/ms => ~4.6 times as fast
kyotocabinet-HashDB (Disk) Writes (Put): 1,064.96/ms => ~4.7 times as fast (speed degrades with larger tables)
CQEngine (Heap) Writes (PutBatch): 1,083.07/ms => ~4.7 times as fast
ChronicleMap (Disk) Writes (Put): 1,233.05/ms => ~5.4 times as fast
tkrzw-TreeDBM (Disk) Writes (Put): 1,270.97/ms => ~5.6 times as fast
kyotocabinet-TreeDB (Disk) Writes (Put): 1,277.14/ms => ~5.6 times as fast
ChronicleMap (Memory) Writes (Put): 1,475.36/ms => ~6.5 times as fast
ezdb-TreeMap (Heap) Writes (Put): 1,902.77/ms => ~8.3 times as fast
ezdb-ConcurrentSkipListMap (Heap) Writes (Put): 2,358.21/ms => ~10.3 times as fast
Indeed-BasicRecordFile (Disk) Writes (Append): 2,403.85/ms => ~10.5 times as fast
FastUtilHashMap (Heap) Writes (Put): 2,796.50/ms => ~12.3 times as fast
AgronaHashMap (Heap) Writes (Put): 3,088.23/ms => ~13.5 times as fast
CountDB (Disk) Writes (Put): 4,081.63/ms => ~17.9 times as fast
Indeed-BlockCompRecordFile (Disk) Writes (Append): 4,273.32/ms => ~18.7 times as fast
ConcurrentHashMap (Heap) Writes (Put): 10,260.62/ms => ~45 times as fast
Caffeine (Heap) Writes (Put): 10,378.83/ms => ~45.5 times as fast
HashMap (Heap) Writes (Put): 14,695.08/ms => ~64.4 times as fast
ATimeSeriesDB (Disk, High) Writes (Append): 31,063.62/ms => ~136.2 times as fast (High Compression)
QuestDB (Disk) Writes (Append): 36,191.23/ms => ~158.7 times as fast (tested on Java 11; 4x Size of ATimeSeriesDB with Compression)
ATimeSeriesDB (Disk, Fast) Writes (Append): 52,721.10/ms => ~231.2 times as fast (Fast Compression)
ATimeSeriesDB (Disk, None) Writes (Append): 56,980.06/ms => ~249.8 times as fast (Disabled Compression; 2x Size of ATimeSeriesDB with Compression)
CDB (Disk) Reads (Get): 1.94/ms => ~99% slower
DuckDB (Disk) Reads (Get): 5.44/ms => ~98% slower
QuestDB (Disk) Reads (Get): 49.64/ms => ~80% slower
ezdb-RocksDB-JNI (Disk) Reads (Get): 58.71/ms => ~76% slower
ezdb-LevelDB-JNI (Disk) Reads (Get): 81.00/ms => ~67% slower
Derby (Disk) Reads (Get): 112.02/ms => ~54% slower
SQLite (Disk) Reads (Get): 173.99/ms => ~29% slower
ezdb-LMDB-JNR (Disk) Reads (Get): 186.07/ms => ~24% slower
ezdb-LevelDB-Java (Disk) Reads (Get): 244.29/ms => using this as baseline
TreeMapDB (Disk) Reads (Get): 627.79/ms => ~2.6 times as fast
tkrzw-SkipDBM (Disk) Reads (Get): 665.73/ms => ~2.7 times as fast
H2 (Disk) Reads (Get): 802.57/ms => ~3.3 times as fast
tokyocabinet-BDB (Disk) Reads (Get): 1,151.94/ms => ~4.7 times as fast
MapDB (Disk) Reads (Get): 1,220.11/ms => ~5 times as fast
tokyocabinet-HDB (Disk) Reads (Get): 1,363.88/ms => ~5.6 times as fast
Hsqldb (Disk) Reads (Get): 1,570.11/ms => ~6.4 times as fast
kyotocabinet-TreeDB (Disk) Reads (Get): 1,592.36/ms => ~6.5 times as fast
kyotocabinet-HashDB (Disk) Reads (Get): 1,788.59/ms => ~7.3 times as fast
ezdb-LsmTree (Disk) Reads (Get): 1,846.59/ms => ~7.6 times as fast
tkrzw-TreeDBM (Disk) Reads (Get): 2,047.50/ms => ~8.4 times as fast
tkrzw-HashDBM (Disk) Reads (Get): 2,528.45/ms => ~10.3 times as fast
ezdb-ConcurrentSkipListMap (Heap) Reads (Get): 2,695.42/ms => ~11 times as fast
Indeed-MphTable (Disk) Reads (Get): 2,773.93/ms => ~11.35 times as fast
ChronicleMap (Disk) Reads (Get): 2,836.80/ms => ~11.6 times as fast
ezdb-BTreeMap (Heap) Reads (Get): 2,908.67/ms => ~11.9 times as fast
ChronicleMap (Memory) Reads (Get): 3,037.94/ms => ~12.4 times as fast
CountDB (Disk) Reads (Get): 3,495.89/ms => ~14.3 times as fast
Krati (Disk) Reads (Get): 4,333.13/ms => ~17.7 times as fast
ezdb-TreeMap (Heap) Reads (Get): 4,875.67/ms => ~20 times as fast
CQEngine (Disk) Reads (Get): 5,656.11/ms => ~23.15 times as fast
CQEngine (Heap) Reads (Get): 5,732.63/ms => ~23.5 times as fast
CQEngine (Memory) Reads (Get): 5,813.95/ms => ~23.8 times as fast
tkrzw-SkipDBM (Disk) Reads (Get): 6,035.00/ms => ~24.7 times as fast
AgronaHashMap (Heap) Reads (Get): 7,019.02/ms => ~28.7 times as fast
FastUtilHashMap (Heap) Reads (Get): 7,517.95/ms => ~30.8 times as fast
HashMap (Heap) Reads (Get): 40,929.29/ms => ~167.5 times as fast
ConcurrentHashMap (Heap) Reads (Get): 57,822.57/ms => ~236.7 times as fast
Caffeine (Heap) Reads (Get): 58,897.77/ms => ~241.1 times as fast
DuckDB (Disk) Reads (GetLatest): 3.30/ms => ~98.0% slower (using "SELECT max(key)"; "ORDER BY key DESC" is half as fast)
CQEngine (Disk) Reads (GetLatest): 5.59/ms => ~96.7% slower (using "ORDER BY key DESC")
QuestDB (Disk) Reads (GetLatest): 23.63/ms => ~86% slower (using "SELECT max(key)"; "ORDER BY key DESC" results in ~91/s which is 99.95% slower)
ezdb-RocksDB-JNI (Disk) Reads (GetLatest): 56.12/ms => ~66.5% slower
ezdb-LevelDB-JNI (Disk) Reads (GetLatest): 72.34/ms => ~57% slower
Derby (Disk) Reads (GetLatest): 105.08/ms => ~37% slower (using "ORDER BY key DESC")
ezdb-LMDB-JNR (Disk) Reads (GetLatest): 150.01/ms => ~10% slower
SQLite (Disk) Reads (GetLatest): 165.51/ms => ~1% slower
ezdb-LevelDB-Java (Disk) Reads (GetLatest): 167.48/ms => using this as baseline
CQEngine (Memory) Reads (GetLatest): 300.70/ms => ~80% faster (using "ORDER BY key DESC")
CQEngine (Heap) Reads (GetLatest): 314.20/ms => ~90% faster (using "ORDER BY key DESC")
H2 (Disk) Reads (GetLatest): 578.94/ms => ~3.4 times as fast (using "ORDER BY key DESC")
TreeMapDB (Heap) Reads (GetLatest): 586.34/ms => ~3.5 times as fast
ezdb-ConcurrentSkipListMap (Heap) Reads (GetLatest): 985.03/ms => ~5.9 times as fast
ezdb-LsmTree (Disk) Reads (GetLatest): 1,202.41/ms => ~7.2 times as fast
Hsqldb (Disk) Reads (GetLatest): 1,649.62/ms => ~9.8 times as fast (using "ORDER BY key DESC")
ATimeSeriesDB (Disk) Reads (GetLatest): 2,005.66/ms => ~12 times as fast (after initialization, uses ChronicleMap as lazy index)
ezdb-BTreeMap (Heap) Reads (GetLatest): 2,783.96/ms => ~16.6 times as fast
ezdb-TreeMap (Heap) Reads (GetLatest): 3,235.20/ms => ~19.3 times as fast
CQEngine (Disk) Reads (Iterator): 224.01/ms => ~89.5% slower (using "Query All")
CQEngine (Memory) Reads (Iterator): 230.05/ms => ~89.2% slower (using "Query All")
ezdb-LevelDB-JNI (Disk) Reads (Iterator): 327.20/ms => ~84.6% slower
MapDB (Disk) Reads (Iterator): 351.85/ms => ~83.4% slower (unordered)
InfluxDB-1.x (Disk) Reads (Iterator): 649.31/ms => ~69.4% slower
ezdb-RocksDB-JNI (Disk) Reads (Iterator): 672.35/ms => ~68.7% slower
tokyocabinet-BDB (Disk) Reads (Iterator): 675.90/ms => ~68.1% slower
tkrzw-HashDBM (Disk) Reads (Iterator): 801.15/ms => ~62.3% slower (unordered)
ezdb-LsmTree (Disk) Reads (Iterator): 879.74/ms => ~58.6% slower
tokyocabinet-HDB (Disk) Reads (Iterator): 1,020.93/ms => ~51.9% slower
Derby (Disk) Reads (Iterator): 1,146.77/ms => ~46.1% slower
kyotocabinet-TreeDB (Disk) Reads (Iterator): 1,385.23/ms => ~34.8% slower
tkrzw-TreeDBM (Disk) Reads (Iterator): 1,697.50/ms => ~20.1% slower
kyotocabinet-HashDB (Disk) Reads (Iterator): 1,890.36/ms => ~11% slower (unordered)
ezdb-LevelDB-Java (Disk) Reads (Iterator): 2,125.29/ms => using this as baseline
H2 (Disk) Reads (Iterator): 2,365.26/ms => ~11% faster
CQEngine (Heap) Reads (Iterator): 3,345.38/ms => ~60% faster (using "Query All")
Indeed-BasicRecordFile (Disk) Reads (Iterator): 4,735.45/ms => ~2.2 times as fast
SQLite (Disk) Reads (Iterator): 4,805.83/ms => ~2.3 times as fast
ChronicleMap (Disk) Reads (Iterator): 5,816.66/ms => ~2.7 times as fast (unordered)
ChronicleMap (Memory) Reads (Iterator): 6,766.36/ms => ~3.2 times as fast (unordered)
Hsqldb (Disk) Reads (Iterator): 7,038.44/ms => ~3.3 times as fast
Indeed-RecordLogDirectory (Disk) Reads (Iterator): 8,701.71/ms => ~4.1 times as fast
CDB (Disk) Reads (Iterator): 9,074.41/ms => ~4.3 times as fast (unordered)
Indeed-BlockCompRecordFile (Disk) Reads (Iterator): 9,285.14/ms => ~4.37 times as fast
ezdb-LMDB-JNR (Disk) Reads (Iterator): 9,330.80/ms => ~4.4 times as fast
Krati (Disk) Reads (Iterator): 14,394.70/ms => ~6.8 times as fast (unordered)
ChronicleQueue (Disk) Reads (Iterator): 16,583.75/ms => ~7.8 times as fast
CountDB (Disk) Reads (Iterator): 17,841.21/ms => ~8.4 times as fast (unordered)
TreeMapDB (Disk) Reads (Iterator): 21,551.72/ms => ~10.1 times as fast
DuckDB (Disk) Reads (Iterator): 25,536.26/ms => ~12 times as fast
ezdb-TreeMap (Heap) Reads (Iterator): 27,463.10/ms => ~12.9 times as fast
ezdb-BTreeMap (Heap) Reads (Iterator): 30,313.13/ms => ~14.3 times as fast
ezdb-ConcurrentSkipListMap (Heap) Reads (Iterator): 32,629.62/ms => ~15.35 times as fast
ATimeSeriesDB (Disk, Fast) Reads (Iterator): 34,292.38/ms => ~16.1 times as fast (no caching)
Indeed-MphTable (Disk) Reads (Iterator): 34,411.56/ms => ~16.2 times as fast (unordered)
ATimeSeriesDB (Disk, High) Reads (Iterator): 34,494.65/ms => ~16.2 times as fast (no caching)
ATimeSeriesDB (Disk, None) Reads (Iterator): 38,303.90/ms => ~18 times as fast (no caching)
AgronaHashMap (Heap) Reads (Iterator): 38,481.98/ms => ~18.1 times as fast (unordered)
Caffeine (Heap) Reads (Iterator): 73,735.96/ms => ~34.7 times as fast (unordered)
FastUtilHashMap (Heap) Reads (Iterator): 90,854.03/ms => ~42.7 times as fast (unordered)
ATimeSeriesDB (Disk, Cached) Reads (Iterator): 98,872.85/ms => ~46.5 times as fast (internal FileBufferCache keeps hot segments on Heap)
HashMap (Heap) Reads (Iterator): 119,310.39/ms => ~56.1 times as fast (unordered)
QuestDB (Disk) Reads (Iterator): 125,410.57/ms => ~59 times as fast (flyweight pattern)
ConcurrentHashMap (Heap) Reads (Iterator): 128,915.82/ms => ~60.7 times as fast (unordered)
- ATimeSeriesUpdater: this is a helper class with which one can handle large inserts/updates into an instance of ATimeSeriesDB. It handles the creation of separate chunk files and writes them to disk in the most efficient way.
- SerializingCollection: this collection implementation is used to store and retrieve each file chunk. It supports two modes of serialization. The default and slower one supports variable length objects by prepending the size of the serialized bytes. The second and faster approach can be enabled by overriding getFixedLength(), which allows the collection to skip reading the size and instead just count bytes to separate the elements. As this suggests, it only works with fixed length serialization/deserialization, which you can provide by overriding the newSerde() callback method (which uses FST per default). You can also deviate from the default LZ4 high compression algorithm by overriding the newCompressor/newDecompressor callback methods. Besides efficiently storing financial data, this collection can be used to move any kind of data out of memory into a file, to preserve precious memory instead of wasting it on metadata that is only rarely used (e.g. during a backtest we can record all sorts of information in a serialized fashion and load it back from file once when generating our reports; this allows us to run more backtests in parallel, which would otherwise be limited by tight memory).
- ASegmentedTimeSeriesDB: if it is undesirable to always have the whole time series updated inside the ATimeSeriesDB, use this class to split it into segments. You provide an algorithm with which the segment time ranges can be calculated (e.g. monthly via PeriodicalSegmentFinder.newCache(new Duration(1, FTimeUnit.MONTHS))) and define the limits of your time series (e.g. from 2001-01-01 to 2018-01-01). The database will request the data for the individual segments when they are first needed. Thus the ATimeSeriesUpdater is handled by the database itself. This is helpful when you only need a few parts of the series and require those parts relatively fast, without the overhead of calculating the segments before or after. For example when calculating bars from ticks for a chart that shows only a small time range. Be aware that segments are immutable once they are created. If you need support for incomplete segments, look at ALiveSegmentedTimeSeriesDB next.
- ALiveSegmentedTimeSeriesDB: this is the same as the ASegmentedTimeSeriesDB with the addition that it supports collecting elements into an incomplete segment before persisting it into an immutable state. The so called live segment is always the latest available segment (e.g. with the above monthly segments from 2018-01-01 to 2018-02-01) after the defined limits for the historical segments (which were from 2001-01-01 to 2018-01-01). The historical limits define where the segments are allowed to be initialized by pulling the data for the segments, as normally done by ASegmentedTimeSeriesDB. The live segment is filled initially and updated by pushing values via putNextLiveValue(...) in the proper order of arrival. As soon as a value arrives that marks another segment beyond the current live one, the live segment gets persisted as a historical segment and a new live segment gets collected. You just have to make sure that you add the live values without any gaps and properly calculate the historical limits with the arrival of new segments. The rest is handled by the database. This is helpful when you still want to make the latest calculated bars available inside a chart in real time without the segment being complete. Once the live values are added, the chart can request further data. Though be aware that an element that was added to the live segment cannot be replaced later. Thus an in-progress bar that is updated with each tick needs to be handled from the outside. Only complete bars are allowed to be added to the live segment. A sketch of the live push flow follows below.
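A minimal sketch of the live push flow; only putNextLiveValue(...) and the monthly PeriodicalSegmentFinder example are taken from the description above, while the BarSeriesDB subclass, its constructor and the Bar value type are hypothetical:

```java
public final class LiveBarFeeder {

    // hypothetical ALiveSegmentedTimeSeriesDB subclass that defines monthly segments internally,
    // e.g. via PeriodicalSegmentFinder.newCache(new Duration(1, FTimeUnit.MONTHS))
    private final BarSeriesDB barDB = new BarSeriesDB("EURUSD");

    // push only completed bars, strictly in order of arrival and without gaps; once a bar
    // belongs to the next segment, the previous live segment is persisted as an immutable
    // historical segment by the database itself
    public void onBarCompleted(final Bar completedBar) { // Bar is a hypothetical value type
        barDB.putNextLiveValue(completedBar);
    }
}
```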
In order to share data between processes and properly coordinate the handling of updates, have a look at Synchronous Channels. They can also provide a way to create manageable and composable data pipelines between threads, processes and networks. The idea is to build high level services around this storage instead of exposing the data storage itself as a service. This way performance can be maximized based on business requirements instead of protocol requirements.
If you need further assistance or have some ideas for improvements and don't want to create an issue here on github, feel free to start a discussion in our invesdwin-platform mailing list.