Ibovespa index support #990
Merged
Commits
20 commits
3419ec9 feat: download ibovespa index historic composition (igor17400)
c2f933b fix: typo error instead of end_date, it was written end_ate (igor17400)
09b8ad9 feat: adds support for downloading stocks historic prices from Brazil… (igor17400)
77107f3 fix: code formatted with black. (igor17400)
3aaf1df wip: Creating code logic for brazils stock market data normalization (igor17400)
9ceb592 docs: brazils stock market data normalization code documentation (igor17400)
d1b73b3 fix: code formatted the with black (igor17400)
cc0e126 docs: fixed typo (igor17400)
95938ea docs: more info about python version used to generate requirements.tx… (igor17400)
b0aafa2 docs: added BeautifulSoup requirements (igor17400)
592559a feat: removed debug prints (igor17400)
92aa003 feat: added ibov_index_composition variable as a class attribute of I… (igor17400)
4903845 feat: added increment to generate the four month period used by the i… (igor17400)
6db33ef refactor: Added get_instruments() method inside utils.py for better c… (igor17400)
ae6380a refactor: improve brazils stocks download speed (igor17400)
1d80c4c fix: added __main__ at the bottom of the script (igor17400)
dc72c6b refactor: changed interface inside each index (igor17400)
6cc96cc refactor: implemented class interface retry into YahooCollectorBR (igor17400)
1cbfb5c docs: added BR as a possible region into the documentation (igor17400)
c313804 refactor: make retry attribute part of the interface (igor17400)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
# iBOVESPA History Companies Collection | ||
|
||
## Requirements | ||
|
||
- Install the libs from the file `requirements.txt` | ||
|
||
```bash | ||
pip install -r requirements.txt | ||
``` | ||
- The `requirements.txt` file was generated using Python 3.8. | ||
|
||
## For the ibovespa (IBOV) index, we have: | ||
|
||
<hr/> | ||
|
||
### Method `get_new_companies` | ||
|
||
#### <b>Index start date</b> | ||
|
||
- The Ibovespa index started on 2 January 1968 ([wiki](https://en.wikipedia.org/wiki/%C3%8Dndice_Bovespa)). To use this start date in our `bench_start_date(self)` method, two conditions must be satisfied: | ||
1) The APIs used to download historical prices of Brazilian stocks (B3) must provide data going back to 2 January 1968. | ||
|
||
2) Some website or API must provide the historical index composition, i.e. the companies used to build the index, from that date onward. | ||
|
||
Because these conditions could not be met, the method `bench_start_date(self)` inside `collector.py` was implemented using `pd.Timestamp("2003-01-03")`, for two reasons: | ||
|
||
1) The earliest IBOV composition that has been found is from the first quarter of 2003. More information about this composition can be found in the sections below. | ||
|
||
2) Yahoo Finance, one of the sources used to download historical prices for the symbols, provides data from this date onward. | ||
|
||
- Within the `get_new_companies` method, logic was implemented to get, for each Ibovespa component stock, the earliest date for which Yahoo Finance has data. | ||
|
||
#### <b>Code Logic</b> | ||
|
||
The code scrapes B3's [website](https://sistemaswebb3-listados.b3.com.br/indexPage/day/IBOV?language=pt-br), which lists the Ibovespa stock composition for the current day. | ||
|
||
Other approaches, such as `requests` and `Beautiful Soup`, could have been used. However, the website displays the stock table with some delay, because the table is populated by a script running on the page. | ||
To work around this problem, `selenium` was used instead to download the stock composition. | ||
|
||
Furthermore, the data downloaded by the selenium script was preprocessed so that it could be saved in the `csv` format established by `scripts/data_collector/index.py`. | ||
|
||
<hr/> | ||
|
||
### Method `get_changes` | ||
|
||
No suitable data source that keeps track of Ibovespa's historical stock composition has been found, except for this [repository](https://github.com/igor17400/IBOV-HCI). It is the source used here, although it only provides data from the 1st quarter of 2003 to the 3rd quarter of 2021. | ||
|
||
With that reference, the index composition can be compared quarter by quarter and year by year to generate a file that records which stocks were removed and which were added in each quarter and year. | ||
|
||
<hr/> | ||
|
||
### Collector Data | ||
|
||
```bash | ||
# parse instruments, for use in qlib/instruments. | ||
python collector.py --index_name IBOV --qlib_dir ~/.qlib/qlib_data/br_data --method parse_instruments --market_index br_index | ||
|
||
# parse new companies | ||
python collector.py --index_name IBOV --qlib_dir ~/.qlib/qlib_data/br_data --method save_new_companies --market_index br_index | ||
``` | ||
|
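The period-by-period comparison described under `get_changes` can be sketched with plain pandas membership tests. This is only a minimal illustration under stated assumptions, not the collector's actual code: the helper name `diff_compositions`, the tickers, and the dates below are hypothetical. It mirrors the convention used in `collector.py`, where removals are dated with the earlier period and additions with the later one.

```python
import pandas as pd


def diff_compositions(prev: pd.Series, curr: pd.Series, prev_date: str, curr_date: str) -> pd.DataFrame:
    """Return add/remove records between two consecutive index compositions.

    `prev` and `curr` hold the symbols of the earlier and later period.
    (Hypothetical helper, written for illustration only.)
    """
    # Symbols present in the earlier period but missing from the later one
    removed = list(prev[~prev.isin(curr)])
    # Symbols present in the later period but missing from the earlier one
    added = list(curr[~curr.isin(prev)])

    df_removed = pd.DataFrame(
        {"date": len(removed) * [prev_date], "type": len(removed) * ["remove"], "symbol": removed}
    )
    df_added = pd.DataFrame(
        {"date": len(added) * [curr_date], "type": len(added) * ["add"], "symbol": added}
    )
    return pd.concat([df_added, df_removed], ignore_index=True)


# Made-up tickers and dates, used only to exercise the helper
prev = pd.Series(["AAAA3", "BBBB4", "CCCC3"])
curr = pd.Series(["BBBB4", "CCCC3", "DDDD3"])
changes = diff_compositions(prev, curr, "2003-01-03", "2003-05-01")
print(changes)
```

Running the helper over every pair of consecutive periods and concatenating the results would give a single change log in the `symbol`, `date`, `type` layout that the collector expects.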
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,276 @@ | ||
# Copyright (c) Microsoft Corporation. | ||
# Licensed under the MIT License. | ||
import sys | ||
from pathlib import Path | ||
import importlib | ||
import datetime | ||
|
||
import fire | ||
import pandas as pd | ||
from tqdm import tqdm | ||
from loguru import logger | ||
|
||
CUR_DIR = Path(__file__).resolve().parent | ||
sys.path.append(str(CUR_DIR.parent.parent)) | ||
|
||
from data_collector.index import IndexBase | ||
from data_collector.utils import get_instruments | ||
|
||
quarter_dict = {"1Q": "01-03", "2Q": "05-01", "3Q": "09-01"} | ||
|
||
|
||
class IBOVIndex(IndexBase): | ||
|
||
    ibov_index_composition = "https://raw.githubusercontent.com/igor17400/IBOV-HCI/main/historic_composition/{}.csv" | ||
    years_4_month_periods = [] | ||
|
||
    def __init__( | ||
        self, | ||
        index_name: str, | ||
        qlib_dir: [str, Path] = None, | ||
        freq: str = "day", | ||
        request_retry: int = 5, | ||
        retry_sleep: int = 3, | ||
    ): | ||
        super(IBOVIndex, self).__init__( | ||
            index_name=index_name, qlib_dir=qlib_dir, freq=freq, request_retry=request_retry, retry_sleep=retry_sleep | ||
        ) | ||
|
||
        self.today: datetime.date = datetime.date.today() | ||
        self.current_4_month_period = self.get_current_4_month_period(self.today.month) | ||
        self.year = str(self.today.year) | ||
        self.years_4_month_periods = self.get_four_month_period() | ||
|
||
    @property | ||
    def bench_start_date(self) -> pd.Timestamp: | ||
        """ | ||
        The Ibovespa index started on 2 January 1968 (wiki). However, | ||
        no suitable data source that keeps track of Ibovespa's historical | ||
        stock composition has been found, except for the repo indicated | ||
        in the README, which keeps track of such information starting from | ||
        the first quarter of 2003. | ||
        """ | ||
        return pd.Timestamp("2003-01-03") | ||
|
||
    def get_current_4_month_period(self, current_month: int): | ||
        """ | ||
        Calculate the four-month period that contains the | ||
        given month. For example, if the current month is | ||
        August (8), its four-month period is 2Q. | ||
|
||
        Note: in English, Q usually stands for *quarter*, | ||
        which means a three-month period. However, in | ||
        Portuguese Q is used here to represent a four-month period. | ||
        In other words, | ||
|
||
        Jan, Feb, Mar, Apr: 1Q | ||
        May, Jun, Jul, Aug: 2Q | ||
        Sep, Oct, Nov, Dec: 3Q | ||
|
||
        Parameters | ||
        ---------- | ||
        current_month : int | ||
            Current month (1 <= current_month <= 12) | ||
|
||
        Returns | ||
        ------- | ||
        current_4m_period: str | ||
            Current four-month period (1Q, 2Q or 3Q); -1 if the month is greater than 12 | ||
        """ | ||
        if current_month < 5: | ||
            return "1Q" | ||
        if current_month < 9: | ||
            return "2Q" | ||
        if current_month <= 12: | ||
            return "3Q" | ||
        else: | ||
            return -1 | ||
|
||
    def get_four_month_period(self): | ||
        """ | ||
        The Ibovespa index is updated every four months. | ||
        Therefore, each time period is represented as, e.g., 2003_1Q, | ||
        which means the first four-month period of 2003 (Jan, Feb, Mar, Apr). | ||
        """ | ||
        four_months_period = ["1Q", "2Q", "3Q"] | ||
        init_year = 2003 | ||
        now = datetime.datetime.now() | ||
        current_year = now.year | ||
        current_month = now.month | ||
        # Build a fresh list so repeated calls do not keep appending to the shared class-level list | ||
        periods = [] | ||
        for year in range(init_year, current_year): | ||
            for el in four_months_period: | ||
                periods.append(str(year) + "_" + el) | ||
        # For the current year, only include the periods that have already started | ||
        current_4_month_period = self.get_current_4_month_period(current_month) | ||
        for i in range(int(current_4_month_period[0])): | ||
            periods.append(str(current_year) + "_" + str(i + 1) + "Q") | ||
        self.years_4_month_periods = periods | ||
        return self.years_4_month_periods | ||
|
||
|
||
    def format_datetime(self, inst_df: pd.DataFrame) -> pd.DataFrame: | ||
        """Format the datetime fields of an instruments DataFrame | ||
|
||
        Parameters | ||
        ---------- | ||
        inst_df: pd.DataFrame | ||
            inst_df.columns = [self.SYMBOL_FIELD_NAME, self.START_DATE_FIELD, self.END_DATE_FIELD] | ||
|
||
        Returns | ||
        ------- | ||
        inst_df: pd.DataFrame | ||
|
||
        """ | ||
        logger.info("Formatting Datetime") | ||
        if self.freq != "day": | ||
            inst_df[self.END_DATE_FIELD] = inst_df[self.END_DATE_FIELD].apply( | ||
                lambda x: (pd.Timestamp(x) + pd.Timedelta(hours=23, minutes=59)).strftime("%Y-%m-%d %H:%M:%S") | ||
            ) | ||
        else: | ||
            inst_df[self.START_DATE_FIELD] = inst_df[self.START_DATE_FIELD].apply( | ||
                lambda x: (pd.Timestamp(x)).strftime("%Y-%m-%d") | ||
            ) | ||
|
||
            inst_df[self.END_DATE_FIELD] = inst_df[self.END_DATE_FIELD].apply( | ||
                lambda x: (pd.Timestamp(x)).strftime("%Y-%m-%d") | ||
            ) | ||
        return inst_df | ||
|
||
    def format_quarter(self, cell: str): | ||
        """ | ||
        Parameters | ||
        ---------- | ||
        cell: str | ||
            It must be in the format 2003_1Q --> years_4_month_periods | ||
|
||
        Returns | ||
        ------- | ||
        date: str | ||
            Returns the date in the format 2003-01-03 (year plus the month-day from quarter_dict) | ||
        """ | ||
        cell_split = cell.split("_") | ||
        return cell_split[0] + "-" + quarter_dict[cell_split[1]] | ||
|
||
    def get_changes(self): | ||
        """ | ||
        Access the index historic composition and compare it quarter | ||
        by quarter and year by year in order to generate a file that | ||
        keeps track of which stocks have been removed and which have | ||
        been added. | ||
|
||
        The DataFrame used as reference provides the index | ||
        composition for each year and quarter: | ||
            pd.DataFrame: | ||
                symbol | ||
                SH600000 | ||
                SH600001 | ||
                . | ||
                . | ||
                . | ||
|
||
        Parameters | ||
        ---------- | ||
        self: is used to represent the instance of the class. | ||
|
||
        Returns | ||
        ------- | ||
        pd.DataFrame: | ||
            symbol      date        type | ||
            SH600000  2019-11-11    add | ||
            SH600001  2020-11-10    remove | ||
            dtypes: | ||
                symbol: str | ||
                date: pd.Timestamp | ||
                type: str, value from ["add", "remove"] | ||
        """ | ||
        logger.info("Getting companies changes in {} index ...".format(self.index_name)) | ||
|
||
        try: | ||
            df_changes_list = [] | ||
            for i in tqdm(range(len(self.years_4_month_periods) - 1)): | ||
                # Compositions of two consecutive four-month periods | ||
                df = pd.read_csv(self.ibov_index_composition.format(self.years_4_month_periods[i]), on_bad_lines="skip")["symbol"] | ||
                df_ = pd.read_csv(self.ibov_index_composition.format(self.years_4_month_periods[i + 1]), on_bad_lines="skip")["symbol"] | ||
|
||
                ## Remove Dataframe: symbols in period i that are missing from period i + 1 | ||
                remove_date = self.format_quarter(self.years_4_month_periods[i]) | ||
                list_remove = list(df[~df.isin(df_)]) | ||
                df_removed = pd.DataFrame( | ||
                    { | ||
                        "date": len(list_remove) * [remove_date], | ||
                        "type": len(list_remove) * ["remove"], | ||
                        "symbol": list_remove, | ||
                    } | ||
                ) | ||
|
||
                ## Add Dataframe: symbols in period i + 1 that are missing from period i | ||
                add_date = self.format_quarter(self.years_4_month_periods[i + 1]) | ||
                list_add = list(df_[~df_.isin(df)]) | ||
                df_added = pd.DataFrame( | ||
                    {"date": len(list_add) * [add_date], "type": len(list_add) * ["add"], "symbol": list_add} | ||
                ) | ||
|
||
                df_changes_list.append(pd.concat([df_added, df_removed], sort=False)) | ||
            df = pd.concat(df_changes_list).reset_index(drop=True) | ||
            df["symbol"] = df["symbol"].astype(str) + ".SA" | ||
|
||
            return df | ||
|
||
        except Exception as E: | ||
            logger.error("An error occurred while getting the index composition changes - {}".format(E)) | ||
|
||
    def get_new_companies(self): | ||
        """ | ||
        Get the latest index composition. | ||
        The repo indicated in the README has implemented a script | ||
        to get the latest index composition from the B3 website using | ||
        selenium. Therefore, this method downloads the file | ||
        containing that composition. | ||
|
||
        Parameters | ||
        ---------- | ||
        self: is used to represent the instance of the class. | ||
|
||
        Returns | ||
        ------- | ||
        pd.DataFrame: | ||
            symbol  start_date  end_date | ||
            RRRP3   2020-11-13  2022-03-02 | ||
            ALPA4   2008-01-02  2022-03-02 | ||
            dtypes: | ||
                symbol: str | ||
                start_date: pd.Timestamp | ||
                end_date: pd.Timestamp | ||
        """ | ||
        logger.info("Getting new companies in {} index ...".format(self.index_name)) | ||
|
||
        try: | ||
            ## Get index composition | ||
|
||
            df_index = pd.read_csv( | ||
                self.ibov_index_composition.format(self.year + "_" + self.current_4_month_period), on_bad_lines="skip" | ||
            ) | ||
            df_date_first_added = pd.read_csv( | ||
                self.ibov_index_composition.format("date_first_added_" + self.year + "_" + self.current_4_month_period), | ||
                on_bad_lines="skip", | ||
            ) | ||
            df = df_index.merge(df_date_first_added, on="symbol")[["symbol", "Date First Added"]] | ||
            df[self.START_DATE_FIELD] = df["Date First Added"].map(self.format_quarter) | ||
|
||
            # end_date is set to the date in quarter_dict for the current four-month period | ||
            df[self.END_DATE_FIELD] = self.year + "-" + quarter_dict[self.current_4_month_period] | ||
            df = df[["symbol", self.START_DATE_FIELD, self.END_DATE_FIELD]] | ||
            df["symbol"] = df["symbol"].astype(str) + ".SA" | ||
|
||
            return df | ||
|
||
        except Exception as E: | ||
            logger.error("An error occurred while getting new companies - {}".format(E)) | ||
|
||
    def filter_df(self, df: pd.DataFrame) -> pd.DataFrame: | ||
        if "Código" in df.columns: | ||
            return df.loc[:, ["Código"]].copy() | ||
|
||
|
||
|
||
if __name__ == "__main__": | ||
    fire.Fire(get_instruments) |
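The period labels built by `get_current_4_month_period` and `get_four_month_period` can also be produced by pure functions that return a new list on each call, which avoids any dependence on shared class-level state. The sketch below is an assumption-laden alternative, not the code in this PR: the names `four_month_label` and `period_labels`, and the `today` parameter (added so the result can be checked against a fixed date), are hypothetical.

```python
import datetime
from typing import List, Optional


def four_month_label(month: int) -> str:
    """Map a month number to its four-month period label (1Q, 2Q or 3Q)."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # Months 1-4 -> 1Q, 5-8 -> 2Q, 9-12 -> 3Q
    return f"{(month - 1) // 4 + 1}Q"


def period_labels(init_year: int = 2003, today: Optional[datetime.date] = None) -> List[str]:
    """Return labels such as 2003_1Q from `init_year` up to the current period."""
    today = today or datetime.date.today()
    # Every past year contributes all three four-month periods
    labels = [f"{year}_{n}Q" for year in range(init_year, today.year) for n in (1, 2, 3)]
    # The current year only contributes the periods that have started
    current = int(four_month_label(today.month)[0])
    labels += [f"{today.year}_{n}Q" for n in range(1, current + 1)]
    return labels


# Fixed date chosen only to make the example reproducible
example = period_labels(init_year=2003, today=datetime.date(2004, 6, 15))
print(example)
```

Unlike the method in the PR, this version raises `ValueError` for an out-of-range month instead of returning `-1`; that is a design choice of the sketch, not behaviour of the collector.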
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
async-generator==1.10 | ||
attrs==21.4.0 | ||
certifi==2021.10.8 | ||
cffi==1.15.0 | ||
charset-normalizer==2.0.12 | ||
cryptography==36.0.1 | ||
fire==0.4.0 | ||
h11==0.13.0 | ||
idna==3.3 | ||
loguru==0.6.0 | ||
lxml==4.8.0 | ||
multitasking==0.0.10 | ||
numpy==1.22.2 | ||
outcome==1.1.0 | ||
pandas==1.4.1 | ||
pycoingecko==2.2.0 | ||
pycparser==2.21 | ||
pyOpenSSL==22.0.0 | ||
PySocks==1.7.1 | ||
python-dateutil==2.8.2 | ||
pytz==2021.3 | ||
requests==2.27.1 | ||
requests-futures==1.0.0 | ||
six==1.16.0 | ||
sniffio==1.2.0 | ||
sortedcontainers==2.4.0 | ||
termcolor==1.1.0 | ||
tqdm==4.63.0 | ||
trio==0.20.0 | ||
trio-websocket==0.9.2 | ||
urllib3==1.26.8 | ||
wget==3.2 | ||
wsproto==1.1.0 | ||
yahooquery==2.2.15 |