Support Point-in-time Data Operation #343

Merged
merged 34 commits into from
Mar 10, 2022
3e92169
add period ops class
bxdd Mar 9, 2021
fead243
black format
bxdd Mar 9, 2021
61720c2
add pit data read
bxdd Mar 10, 2021
a0959a9
fix bug in period ops
bxdd Mar 10, 2021
bd46d14
update ops runnable
bxdd Mar 10, 2021
9f1cc64
update PIT test example
bxdd Mar 10, 2021
63e4895
black format
bxdd Mar 10, 2021
88b7926
update PIT test
bxdd Mar 10, 2021
a2dae5c
update tets_PIT
bxdd Mar 12, 2021
99db80d
update code format
bxdd Mar 12, 2021
c4bbe6b
add check_feature_exist
bxdd Mar 12, 2021
20bcf25
black format
bxdd Mar 12, 2021
6e23ff7
optimize the PIT Algorithm
bxdd Mar 12, 2021
88a0d3d
fix bug
bxdd Mar 12, 2021
f52462a
update example
bxdd Mar 12, 2021
b794e65
update test_PIT name
bxdd Mar 17, 2021
255ed0b
Merge https://github.com/microsoft/qlib
bxdd Apr 7, 2021
9df1fbd
add pit collector
bxdd Apr 7, 2021
71d5640
black format
bxdd Apr 7, 2021
ebe277b
fix bugs
bxdd Apr 8, 2021
655ff51
fix try
bxdd Apr 8, 2021
f6ca4d2
fix bug & add dump_pit.py
bxdd Apr 9, 2021
566a8f9
Successfully run and understand PIT
you-n-g Mar 4, 2022
63b5ed4
Merge remote-tracking branch 'origin/main' into PIT
you-n-g Mar 4, 2022
4997389
Add some docs and remove a bug
you-n-g Mar 4, 2022
561be64
Merge remote-tracking branch 'origin/main' into PIT
you-n-g Mar 8, 2022
6811a07
Merge remote-tracking branch 'origin/main' into PIT
you-n-g Mar 8, 2022
cf77cd0
mv crypto collector
you-n-g Mar 8, 2022
79422a1
black format
you-n-g Mar 8, 2022
48ea2c5
Run succesfully after merging master
you-n-g Mar 8, 2022
9c67303
Pass test and fix code
you-n-g Mar 10, 2022
69cf2ab
remove useless PIT code
you-n-g Mar 10, 2022
de8d6cb
fix PYlint
you-n-g Mar 10, 2022
2671dc2
Rename
you-n-g Mar 10, 2022
7 changes: 4 additions & 3 deletions README.md
@@ -11,6 +11,8 @@
Recent released features
| Feature | Status |
| -- | ------ |
| Point-in-Time database | :hammer: [Released](https://github.com/microsoft/qlib/pull/343) on Mar 10, 2022 |
| Arctic Provider Backend & Orderbook data example | :hammer: [Released](https://github.com/microsoft/qlib/pull/744) on Jan 17, 2022 |
| Arctic Provider Backend & Orderbook data example | :hammer: [Released](https://github.com/microsoft/qlib/pull/744) on Jan 17, 2022 |
| Meta-Learning-based framework & DDG-DA | :chart_with_upwards_trend: :hammer: [Released](https://github.com/microsoft/qlib/pull/743) on Jan 10, 2022 |
| Planning-based portfolio optimization | :hammer: [Released](https://github.com/microsoft/qlib/pull/754) on Dec 28, 2021 |
@@ -95,9 +97,8 @@ For more details, please refer to our paper ["Qlib: An AI-oriented Quantitative
# Plans
New features under development (ordered by estimated release time).
Your feedback about the features is very important.
| Feature | Status |
| -- | ------ |
| Point-in-Time database | Under review: https://github.com/microsoft/qlib/pull/343 |
<!-- | Feature | Status | -->
<!-- | -- | ------ | -->

# Framework of Qlib

133 changes: 133 additions & 0 deletions docs/advanced/PIT.rst
@@ -0,0 +1,133 @@
.. _pit:

==============================
(P)oint-(I)n-(T)ime Database
==============================
.. currentmodule:: qlib


Introduction
------------
Point-in-time data is a very important consideration when performing any sort of historical market analysis.

For example, let’s say we are backtesting a trading strategy and we are using the past five years of historical data as our input.
Our model is assumed to trade once a day, at the market close, and we’ll say we are calculating the trading signal for 1 January 2020 in our backtest. At that point, we should only have data for 1 January 2020, 31 December 2019, 30 December 2019 etc.

In financial data (especially financial reports), the same piece of data may be amended multiple times over time. If we only use the latest version for historical backtesting, data leakage will happen.
The point-in-time database is designed to solve this problem by making sure users get the right version of the data at any historical timestamp. This keeps the behavior of online trading and historical backtesting consistent.
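
As a minimal sketch of the idea (not Qlib's implementation), the snippet below keeps every published version of a value and returns the latest one that was already public on the query date. The function name `value_as_of` and the tuple layout are assumptions made for this illustration. The two sample rows are the 200704 statements that appear in the data format example of this document.

```python
from typing import List, Optional, Tuple

# (publication date, period, value); hypothetical layout for illustration only
Statement = Tuple[int, int, float]

STATEMENTS: List[Statement] = [
    (20080301, 200704, 0.3479),    # first publication for period 200704
    (20080313, 200704, 0.395989),  # later amendment of the same period
]


def value_as_of(statements: List[Statement], period: int, query_date: int) -> Optional[float]:
    """Return the latest value for `period` published on or before `query_date`."""
    visible = [s for s in statements if s[1] == period and s[0] <= query_date]
    if not visible:
        return None  # nothing had been published yet
    return max(visible, key=lambda s: s[0])[2]


print(value_as_of(STATEMENTS, 200704, 20080305))  # only the first version is visible
print(value_as_of(STATEMENTS, 200704, 20080320))  # the amendment is visible
print(value_as_of(STATEMENTS, 200704, 20080101))  # before any publication
```

Using only the latest version (0.395989) for a backtest date of 5 March 2008 would leak information that was not available until 13 March 2008.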



Data Preparation
----------------

Qlib provides a crawler to help users download financial data and a converter to dump the data into Qlib format.
Please follow `scripts/data_collector/pit/README.md` to download and convert data.


File-based design for PIT data
------------------------------

Qlib provides file-based storage for PIT data.

Each feature file contains 4 columns, i.e. date, period, value, _next.
Each row corresponds to a statement.

The meaning of each column in a feature file with a filename like `XXX_a.data`:

- `date`: the statement's date of publication.
- `period`: the period of the statement. (e.g. it will be quarterly frequency in most of the markets)
    - If it is an annual period, it will be an integer corresponding to the year.
    - If it is a quarterly period, it will be an integer like `<year><index of quarter>`. The last two decimal digits represent the index of the quarter. The others represent the year.
- `value`: the described value.
- `_next`: the byte index of the next occurrence of the same period in the file (e.g. a later amendment). `4294967295` (`0xFFFFFFFF`) means there is no later occurrence.
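
The period encoding described above can be sketched as follows. This is an illustrative helper, not part of Qlib: the name `decode_period`, the `quarterly` flag, and the assumption that quarter indexes run from 1 to 4 are all introduced here.

```python
from typing import Optional, Tuple


def decode_period(period: int, quarterly: bool) -> Tuple[int, Optional[int]]:
    """Split a period integer into (year, quarter); quarter is None for annual data."""
    if not quarterly:
        return period, None  # annual periods are just the year
    year, quarter = divmod(period, 100)  # last two decimal digits hold the quarter index
    if not 1 <= quarter <= 4:  # assumption: quarter indexes are 1..4
        raise ValueError(f"unexpected quarter index in period {period}")
    return year, quarter


print(decode_period(200704, quarterly=True))
print(decode_period(2007, quarterly=False))
```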

Besides the feature data, an index file `XXX_a.index` is included to speed up querying.

The statements are sorted by `date` in ascending order from the beginning of the file.

.. code-block:: python

# the data format from XXXX.data
array([(20070428, 200701, 0.090219 , 4294967295),
(20070817, 200702, 0.13933 , 4294967295),
(20071023, 200703, 0.24586301, 4294967295),
(20080301, 200704, 0.3479 , 80),
(20080313, 200704, 0.395989 , 4294967295),
(20080422, 200801, 0.100724 , 4294967295),
(20080828, 200802, 0.24996801, 4294967295),
(20081027, 200803, 0.33412001, 4294967295),
(20090325, 200804, 0.39011699, 4294967295),
(20090421, 200901, 0.102675 , 4294967295),
(20090807, 200902, 0.230712 , 4294967295),
(20091024, 200903, 0.30072999, 4294967295),
(20100402, 200904, 0.33546099, 4294967295),
(20100426, 201001, 0.083825 , 4294967295),
(20100812, 201002, 0.200545 , 4294967295),
(20101029, 201003, 0.260986 , 4294967295),
(20110321, 201004, 0.30739301, 4294967295),
(20110423, 201101, 0.097411 , 4294967295),
(20110831, 201102, 0.24825101, 4294967295),
(20111018, 201103, 0.318919 , 4294967295),
(20120323, 201104, 0.4039 , 420),
(20120411, 201104, 0.403925 , 4294967295),
(20120426, 201201, 0.112148 , 4294967295),
(20120810, 201202, 0.26484701, 4294967295),
(20121026, 201203, 0.370487 , 4294967295),
(20130329, 201204, 0.45004699, 4294967295),
(20130418, 201301, 0.099958 , 4294967295),
(20130831, 201302, 0.21044201, 4294967295),
(20131016, 201303, 0.30454299, 4294967295),
(20140325, 201304, 0.394328 , 4294967295),
(20140425, 201401, 0.083217 , 4294967295),
(20140829, 201402, 0.16450299, 4294967295),
(20141030, 201403, 0.23408499, 4294967295),
(20150421, 201404, 0.319612 , 4294967295),
(20150421, 201501, 0.078494 , 4294967295),
(20150828, 201502, 0.137504 , 4294967295),
(20151023, 201503, 0.201709 , 4294967295),
(20160324, 201504, 0.26420501, 4294967295),
(20160421, 201601, 0.073664 , 4294967295),
(20160827, 201602, 0.136576 , 4294967295),
(20161029, 201603, 0.188062 , 4294967295),
(20170415, 201604, 0.244385 , 4294967295),
(20170425, 201701, 0.080614 , 4294967295),
(20170728, 201702, 0.15151 , 4294967295),
(20171026, 201703, 0.25416601, 4294967295),
(20180328, 201704, 0.32954201, 4294967295),
(20180428, 201801, 0.088887 , 4294967295),
(20180802, 201802, 0.170563 , 4294967295),
(20181029, 201803, 0.25522 , 4294967295),
(20190329, 201804, 0.34464401, 4294967295),
(20190425, 201901, 0.094737 , 4294967295),
(20190713, 201902, 0. , 1040),
(20190718, 201902, 0.175322 , 4294967295),
(20191016, 201903, 0.25581899, 4294967295)],
dtype=[('date', '<u4'), ('period', '<u4'), ('value', '<f8'), ('_next', '<u4')])
# - each row contains 20 bytes


# The data format from XXXX.index. It consists of two parts
# 1) the start year of the data. So the first part of the info will be like
2007
# 2) the remaining index data will be like the information below
# - The data indicates the **byte index** of the first data update of a period.
# - e.g. Because the records at both byte 60 and byte 80 correspond to 200704, the byte index of the first occurrence (i.e. 60) is recorded in the data.
array([ 0, 20, 40, 60, 100,
120, 140, 160, 180, 200,
220, 240, 260, 280, 300,
320, 340, 360, 380, 400,
440, 460, 480, 500, 520,
540, 560, 580, 600, 620,
640, 660, 680, 700, 720,
740, 760, 780, 800, 820,
840, 860, 880, 900, 920,
940, 960, 980, 1000, 1020,
1060, 4294967295], dtype=uint32)
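
The layout above can be explored with a short sketch that uses only the standard library. It packs the first five sample rows into 20-byte records, then follows the `_next` chain starting from the index entry of a period. The helper name `revisions` and the slot formula (one index entry per quarter, counted from the first quarter of the start year) are assumptions inferred from the sample data, so Qlib's own reader may differ.

```python
import struct

RECORD = struct.Struct("<IIdI")  # date, period, value, _next: 4 + 4 + 8 + 4 = 20 bytes
NO_NEXT = 0xFFFFFFFF             # 4294967295 marks "no later occurrence"

# First five rows of the sample listing, serialized back to back.
rows = [
    (20070428, 200701, 0.090219, NO_NEXT),
    (20070817, 200702, 0.13933, NO_NEXT),
    (20071023, 200703, 0.24586301, NO_NEXT),
    (20080301, 200704, 0.3479, 80),
    (20080313, 200704, 0.395989, NO_NEXT),
]
data = b"".join(RECORD.pack(*row) for row in rows)

START_YEAR = 2007
index = [0, 20, 40, 60]  # byte offset of the first statement of each quarter


def revisions(data: bytes, index: list, start_year: int, period: int) -> list:
    """Return every (date, value) published for a quarterly period, oldest first."""
    year, quarter = divmod(period, 100)
    slot = (year - start_year) * 4 + (quarter - 1)  # assumed index layout
    if not 0 <= slot < len(index):
        return []  # the period is outside the indexed range
    offset = index[slot]
    found = []
    while offset != NO_NEXT:
        date, _period, value, offset = RECORD.unpack_from(data, offset)
        found.append((date, value))
    return found


print(RECORD.size)
print(revisions(data, index, START_YEAR, 200704))
```

Reading through the index avoids scanning the whole file: one lookup finds the first statement of a period, and each `_next` pointer jumps directly to the following amendment.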




Known limitations
-----------------

- Currently, the PIT database is designed for quarterly or annual factors, which covers the fundamental data of financial reports in most markets.
  Qlib uses the file name to identify the type of the data. A file with a name like `XXX_q.data` contains quarterly data. A file with a name like `XXX_a.data` contains annual data.
- The calculation of PIT data is not yet optimized. There is great potential to improve the performance of PIT data calculation.
1 change: 1 addition & 0 deletions docs/index.rst
@@ -53,6 +53,7 @@ Document Structure
Online & Offline mode <advanced/server.rst>
Serialization <advanced/serial.rst>
Task Management <advanced/task_management.rst>
Point-In-Time database <advanced/PIT.rst>

.. toctree::
:maxdepth: 3
32 changes: 13 additions & 19 deletions qlib/config.py
@@ -92,6 +92,7 @@ def set_conf_from_C(self, config_c):
"calendar_provider": "LocalCalendarProvider",
"instrument_provider": "LocalInstrumentProvider",
"feature_provider": "LocalFeatureProvider",
"pit_provider": "LocalPITProvider",
"expression_provider": "LocalExpressionProvider",
"dataset_provider": "LocalDatasetProvider",
"provider": "LocalProvider",
@@ -108,7 +109,6 @@ def set_conf_from_C(self, config_c):
"provider_uri": "",
# cache
"expression_cache": None,
"dataset_cache": None,
"calendar_cache": None,
# for simple dataset cache
"local_cache_path": None,
@@ -171,6 +171,18 @@ def set_conf_from_C(self, config_c):
"default_exp_name": "Experiment",
},
},
"pit_record_type": {
"date": "I", # uint32
"period": "I", # uint32
"value": "d", # float64
"index": "I", # uint32
},
"pit_record_nan": {
"date": 0,
"period": 0,
"value": float("NAN"),
"index": 0xFFFFFFFF,
},
# Default config for MongoDB
"mongo": {
"task_url": "mongodb://localhost:27017/",
@@ -184,46 +196,28 @@ def set_conf_from_C(self, config_c):

MODE_CONF = {
"server": {
# data provider config
"calendar_provider": "LocalCalendarProvider",
"instrument_provider": "LocalInstrumentProvider",
"feature_provider": "LocalFeatureProvider",
"expression_provider": "LocalExpressionProvider",
"dataset_provider": "LocalDatasetProvider",
"provider": "LocalProvider",
# config it in qlib.init()
"provider_uri": "",
# redis
"redis_host": "127.0.0.1",
"redis_port": 6379,
"redis_task_db": 1,
"kernels": NUM_USABLE_CPU,
# cache
"expression_cache": DISK_EXPRESSION_CACHE,
"dataset_cache": DISK_DATASET_CACHE,
"local_cache_path": Path("~/.cache/qlib_simple_cache").expanduser().resolve(),
"mount_path": None,
},
"client": {
# data provider config
"calendar_provider": "LocalCalendarProvider",
"instrument_provider": "LocalInstrumentProvider",
"feature_provider": "LocalFeatureProvider",
"expression_provider": "LocalExpressionProvider",
"dataset_provider": "LocalDatasetProvider",
"provider": "LocalProvider",
# config it in user's own code
"provider_uri": "~/.qlib/qlib_data/cn_data",
# cache
# Using parameter 'remote' to announce the client is using server_cache, and the writing access will be disabled.
# Disable cache by default. Avoid introduce advanced features for beginners
"expression_cache": None,
"dataset_cache": None,
# SimpleDatasetCache directory
"local_cache_path": Path("~/.cache/qlib_simple_cache").expanduser().resolve(),
"calendar_cache": None,
# client config
"kernels": NUM_USABLE_CPU,
"mount_path": None,
"auto_mount": False, # The nfs is already mounted on our server[auto_mount: False].
# The nfs should be auto-mounted by qlib on other
1 change: 1 addition & 0 deletions qlib/data/__init__.py
@@ -15,6 +15,7 @@
LocalCalendarProvider,
LocalInstrumentProvider,
LocalFeatureProvider,
LocalPITProvider,
LocalExpressionProvider,
LocalDatasetProvider,
ClientCalendarProvider,
59 changes: 47 additions & 12 deletions qlib/data/base.py
@@ -6,12 +6,20 @@
from __future__ import print_function

import abc

import pandas as pd
from ..log import get_module_logger


class Expression(abc.ABC):
"""Expression base class"""
"""
Expression base class

Expression is designed to handle the calculation of data with the format below:
data with two dimensions for each instrument,
- feature
- time: it could be observation time or period time.
- period time is designed for the Point-in-time database. For example, the period time may be 2014Q4; its value can be observed multiple times (different values may be observed at different times due to amendments).
"""

def __str__(self):
return type(self).__name__
@@ -124,8 +132,18 @@ def __ror__(self, other):

return Or(other, self)

def load(self, instrument, start_index, end_index, freq):
def load(self, instrument, start_index, end_index, *args):
"""load feature
This function is responsible for loading feature/expression based on the expression engine.

The concrete implementation is separated into two parts:
1) caching data and handling errors.
- This part is shared by all the expressions and implemented in Expression
2) processing and calculating data based on the specific expression.
- This part is different in each expression and implemented in each expression

The expression engine is shared by different data.
Different data will have different extra information for `args`.

Parameters
----------
@@ -135,8 +153,15 @@ def load(self, instrument, start_index, end_index, freq):
feature start index [in calendar].
end_index : str
feature end index [in calendar].
freq : str
feature frequency.

*args may contain the following information:
1) if it is used with the basic expression engine data, it contains the following argument
freq : str
feature frequency.

2) if it is used with PIT data, it contains the following argument
cur_pit:
it is designed for the point-in-time data.

Returns
----------
@@ -146,26 +171,26 @@ def load(self, instrument, start_index, end_index, freq):
from .cache import H # pylint: disable=C0415

# cache
args = str(self), instrument, start_index, end_index, freq
if args in H["f"]:
return H["f"][args]
cache_key = str(self), instrument, start_index, end_index, *args
if cache_key in H["f"]:
return H["f"][cache_key]
if start_index is not None and end_index is not None and start_index > end_index:
raise ValueError("Invalid index range: {} {}".format(start_index, end_index))
try:
series = self._load_internal(instrument, start_index, end_index, freq)
series = self._load_internal(instrument, start_index, end_index, *args)
except Exception as e:
get_module_logger("data").debug(
f"Loading data error: instrument={instrument}, expression={str(self)}, "
f"start_index={start_index}, end_index={end_index}, freq={freq}. "
f"start_index={start_index}, end_index={end_index}, args={args}. "
f"error info: {str(e)}"
)
raise
series.name = str(self)
H["f"][args] = series
H["f"][cache_key] = series
return series

@abc.abstractmethod
def _load_internal(self, instrument, start_index, end_index, freq):
def _load_internal(self, instrument, start_index, end_index, *args) -> pd.Series:
raise NotImplementedError("This function must be implemented in your newly defined feature")

@abc.abstractmethod
@@ -225,6 +250,16 @@ def get_extended_window_size(self):
return 0, 0


class PFeature(Feature):
def __str__(self):
return "$$" + self._name

def _load_internal(self, instrument, start_index, end_index, cur_time):
from .data import PITD # pylint: disable=C0415

return PITD.period_feature(instrument, str(self), start_index, end_index, cur_time)


class ExpressionOps(Expression):
"""Operator Expression
