A standalone package to scrape financial data from listed Vietnamese companies via Vietstock. If you are looking for raw financial data from listed Vietnamese companies, this may help you.
- Recent Changes
- Prerequisites
- Run within Docker Compose
- Run on Host
- Scrape Results
- Debugging and How It Works
- Limitations and Lessons Learned
- Disclaimer
- September 2021: Vietstock has implemented request verification tokens for API requests, making it more difficult to access them. You will have to manually obtain the tokens from your browser (see this section), which may take some time depending on how comfortable you are with the browser's inspection tool. Currently, there is no information on how long the tokens will be valid for and I have not found a way to automatically obtain them.
- July 2021: I have removed my own implementation of proxies from this project. The reason is stated in the Lessons Learned section below. If you really want to use proxies, make your own changes in this constants configuration file (more details are included there).
Because the core components of this project run on Docker.
Because you will have to build the image from source. I have not released this project's image on Docker Hub yet.
How to get them:
- Sign on to finance.vietstock.vn
- Hover over "Corporate"/"Doanh nghiệp", and choose "Corporate A-Z"/"Doanh nghiệp A-Z"
- Click on any ticker
- Open your browser's Inspect console
- Go to the `Network` tab, filter only `XHR` or `Fetch/XHR` requests
- On the list of `XHR` requests, search for the one named `financeinfo`, then go to the `Cookies` tab underneath; if you cannot find `financeinfo`, you can try clicking on the company's "Financials"/"Tài chính" tab and look for it
- Take note of the value of the `vts_usr_lg` cookie, which is the `USER_COOKIE` environment variable in the config
- Take note of the value of the `__RequestVerificationToken` cookie, which is the `REQ_VER_TOKEN_COOKIE` environment variable in the config
- Go back to the request's `Headers` tab and look for the form data
- Take note of the value of the `__RequestVerificationToken` parameter, which is the `REQ_VER_TOKEN_POST` environment variable in the config
- Done
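To see where these three values fit, here is a minimal sketch (not this project's actual code) of replaying the `financeinfo` request with Python's `requests`. The endpoint path and form fields below are assumptions; verify them against the request you inspected in the Network tab:

```python
import requests

# Values you noted down from the browser (placeholders here):
USER_COOKIE = "..."           # the vts_usr_lg cookie
REQ_VER_TOKEN_COOKIE = "..."  # the __RequestVerificationToken cookie
REQ_VER_TOKEN_POST = "..."    # the __RequestVerificationToken form parameter

# Hypothetical endpoint path; copy the real URL and form fields from the
# financeinfo request you inspected.
resp = requests.post(
    "https://finance.vietstock.vn/data/financeinfo",
    cookies={
        "vts_usr_lg": USER_COOKIE,
        "__RequestVerificationToken": REQ_VER_TOKEN_COOKIE,
    },
    data={
        "__RequestVerificationToken": REQ_VER_TOKEN_POST,
        # ...plus the other form fields exactly as seen in the inspected request
    },
)
print(resp.status_code)
```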
| Report type code | Meaning |
|---|---|
| CTKH | Financial targets/Chỉ Tiêu Kế Hoạch |
| CDKT | Balance sheet/Cân Đối Kế Toán |
| KQKD | Income statement/Kết Quả Kinh Doanh |
| LC | Cash flow statement/Lưu Chuyển (Tiền Tệ) |
| CSTC | Financial ratios/Chỉ Số Tài Chính |
| Report term code | Meaning |
|---|---|
| 1 | Annually |
| 2 | Quarterly |
All core functions are located within the `functions_vietstock` folder, and so are the scraped files; thus, from now on, references to the `functions_vietstock` folder will simply be put as `./`.
These environment variables go in this area of the Docker Compose file:
...
functions-vietstock:
build: .
container_name: functions-vietstock
command: wait-for-it -s scraper-redis:6379 -t 600 -- bash
stdin_open: true
tty: true
environment:
- REDIS_HOST=scraper-redis
- REQ_VER_TOKEN_POST=
- REQ_VER_TOKEN_COOKIE=
- USER_COOKIE=
...
At the project folder, run:
docker-compose build --no-cache && docker-compose up -d
Next, open the scraper container in another terminal:
docker exec -it functions-vietstock ./userinput.sh
Note: To stop the scraping, stop the userinput script terminal, then open another terminal and run:
docker exec -it functions-vietstock ./celery_stop.sh
to clean up everything related to the scraping process (local scraped files remain intact).
Some questions require you to answer in a specific syntax, as follows:
Do you wish to scrape by a specific business type-industry or by tickers? [y for business type-industry/n for tickers]
- If you enter `y`, the next prompt is: `Enter business type ID and industry ID combination in the form of businesstype_id;industry_id:`
  - If you chose to scrape a list of all business types-industries and their respective tickers, you should have the file `bizType_ind_tickers.csv` in the scrape result folder (`./localData/overview`).
  - Then you answer this prompt by entering a business type ID and industry ID combination in the form of `businesstype_id;industry_id`.
- If you enter `n`, the next prompts ask for ticker(s).
  - Again, suppose you have the `bizType_ind_tickers.csv` file.
  - Then you answer the prompts as follows:
    - `ticker`: a ticker symbol or a list of ticker symbols of your choice. You can enter either `ticker_1` or `ticker_1,ticker_2`.
- Whether you chose to scrape by business type-industry or by tickers, you will receive prompts for report type(s), report term(s) and page:
  - `report_type` and `report_term`: use the report type codes and report term codes from the tables above. You can enter either `report_type_1` or `report_type_1,report_type_2`; the same goes for report term.
  - `page`: the page number at which to start the scrape; this is optional. If omitted, the scraper starts from page 1. A hypothetical session is shown below.
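For illustration, a hypothetical `userinput.sh` session scraping two tickers might look like the following (the exact prompt wording may differ slightly; the tickers come from the sample `bizType_ind_tickers.csv` shown later):

```
Do you wish to scrape by a specific business type-industry or by tickers? [y for business type-industry/n for tickers] n
ticker: BID,CTG
report_type: CDKT,KQKD
report_term: 1,2
page: 1
```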
Maybe you do not want to spend time building the image, and just want to play around with the code.
In your virtual environment of choice, install all requirements:
pip install -r requirements.txt
Navigate to the `functions_vietstock` folder and create a file named `.env` with the following content (you can use the `.example_env` file as an example):
REDIS_HOST=localhost
REQ_VER_TOKEN_POST=YOUR_REQ_VER_TOKEN_POST
REQ_VER_TOKEN_COOKIE=YOUR_REQ_VER_TOKEN_COOKIE
USER_COOKIE=YOUR_USER_COOKIE
You still need to run the Redis server inside a container:
docker run -d -p 6379:6379 --rm --name scraper-redis redis:6.2
Go to the `functions_vietstock` folder:
cd functions_vietstock
Run the `celery_stop.sh` script:
./celery_stop.sh
Use the `./userinput.sh` script to scrape as in the previous section.
If you chose to scrape a list of all business types, industries and their tickers, the result is stored in the `./localData/overview` folder, under the file name `bizType_ind_tickers.csv`.
ticker,biztype_id,bizType_title,ind_id,ind_name
BID,3,Bank,1000,Finance and Insurance
CTG,3,Bank,1000,Finance and Insurance
VCB,3,Bank,1000,Finance and Insurance
TCB,3,Bank,1000,Finance and Insurance
...
FinanceInfo results are stored in the `./localData/financeInfo` folder, and each file is named in the form `ticker_reportType_reportTermName_page.json`, representing a ticker - report type - report term - page instance (e.g., a hypothetical `CTG_CDKT_Annually_1.json`).
[
[
{
"ID": 4,
"Row": 4,
"CompanyID": 2541,
"YearPeriod": 2017,
"TermCode": "N",
"TermName": "Năm",
"TermNameEN": "Year",
"ReportTermID": 1,
"DisplayOrdering": 1,
"United": "HN",
"AuditedStatus": "KT",
"PeriodBegin": "201701",
"PeriodEnd": "201712",
"TotalRow": 14,
"BusinessType": 1,
"ReportNote": null,
"ReportNoteEn": null
},
{
"ID": 3,
"Row": 3,
"CompanyID": 2541,
"YearPeriod": 2018,
"TermCode": "N",
"TermName": "Năm",
"TermNameEN": "Year",
"ReportTermID": 1,
"DisplayOrdering": 1,
"United": "HN",
"AuditedStatus": "KT",
"PeriodBegin": "201801",
"PeriodEnd": "201812",
"TotalRow": 14,
"BusinessType": 1,
"ReportNote": null,
"ReportNoteEn": null
},
{
"ID": 2,
"Row": 2,
"CompanyID": 2541,
"YearPeriod": 2019,
"TermCode": "N",
"TermName": "Năm",
"TermNameEN": "Year",
"ReportTermID": 1,
"DisplayOrdering": 1,
"United": "HN",
"AuditedStatus": "KT",
"PeriodBegin": "201901",
"PeriodEnd": "201912",
"TotalRow": 14,
"BusinessType": 1,
"ReportNote": null,
"ReportNoteEn": null
},
{
"ID": 1,
"Row": 1,
"CompanyID": 2541,
"YearPeriod": 2020,
"TermCode": "N",
"TermName": "Năm",
"TermNameEN": "Year",
"ReportTermID": 1,
"DisplayOrdering": 1,
"United": "HN",
"AuditedStatus": "KT",
"PeriodBegin": "202001",
"PeriodEnd": "202112",
"TotalRow": 14,
"BusinessType": 1,
"ReportNote": null,
"ReportNoteEn": null
}
],
{
"Balance Sheet": [
{
"ID": 1,
"ReportNormID": 2995,
"Name": "TÀI SẢN ",
"NameEn": "ASSETS",
"NameMobile": "TÀI SẢN ",
"NameMobileEn": "ASSETS",
"CssStyle": "MaxB",
"Padding": "Padding1",
"ParentReportNormID": 2995,
"ReportComponentName": "Cân đối kế toán",
"ReportComponentNameEn": "Balance Sheet",
"Unit": null,
"UnitEn": null,
"OrderType": null,
"OrderingComponent": null,
"RowNumber": null,
"ReportComponentTypeID": null,
"ChildTotal": 0,
"Levels": 0,
"Value1": null,
"Value2": null,
"Value3": null,
"Value4": null,
"Vl": null,
"IsShowData": true
},
{
"ID": 2,
"ReportNormID": 3000,
"Name": "A. TÀI SẢN NGẮN HẠN",
"NameEn": "A. SHORT-TERM ASSETS",
"NameMobile": "A. TÀI SẢN NGẮN HẠN",
"NameMobileEn": "A. SHORT-TERM ASSETS",
"CssStyle": "LargeB",
"Padding": "Padding1",
"ParentReportNormID": 2996,
"ReportComponentName": "Cân đối kế toán",
"ReportComponentNameEn": "Balance Sheet",
"Unit": null,
"UnitEn": null,
"OrderType": null,
"OrderingComponent": null,
"RowNumber": null,
"ReportComponentTypeID": null,
"ChildTotal": 25,
"Levels": 1,
"Value1": 4496051.0,
"Value2": 4971364.0,
"Value3": 3989369.0,
"Value4": 2142717.0,
"Vl": null,
"IsShowData": true
},
...
Please note that you have to determine for yourself whether the order of the financial values matches the order of the periods.
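For example, based on the sample above, a rough sketch (assuming a hypothetical file name) of pairing each row's `Value1`..`Value4` with the period list might be:

```python
import json

# Hypothetical file following the ticker_reportType_reportTermName_page.json pattern
with open("localData/financeInfo/CTG_CDKT_Annually_1.json", encoding="utf-8") as f:
    data = json.load(f)

periods, report = data[0], data[1]          # [period metadata], {report section: rows}
years = [p["YearPeriod"] for p in periods]  # e.g. [2017, 2018, 2019, 2020]

for row in report["Balance Sheet"]:
    values = [row["Value1"], row["Value2"], row["Value3"], row["Value4"]]
    # Caution: verify yourself whether values[i] actually corresponds to years[i];
    # the API does not document this ordering.
    print(row["NameEn"], dict(zip(years, values)))
```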
Scrape logs are stored in the `./logs` folder, in the form of `scrapySpiderName_log_verbose.log`.
Error logs are stored in the `./logs` folder, in the form of `scrapySpiderName_reportType_spidererrors_short.log`. For now, error logs are used only for the financeInfo Spider.
"Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker." See: https://redis.io/. In this project, Redis serves as a message broker and an in-memory queue for Scrapy. No non-standard Redis configurations were made for this project.
To open an interactive shell with Redis, you have to enter the container first:
docker exec -it functions-vietstock bash
Then:
redis-cli -h scraper-redis
Alternatively, to open an interactive shell with Redis in one step, via the Redis container itself:
docker exec -it scraper-redis redis-cli
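Once inside `redis-cli`, you can inspect the queues. The key name below is illustrative (this project's actual key names may differ), so list the existing keys first:

```
KEYS *
LLEN financeInfo:start_urls
```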
Look inside each log file.
This scraper utilizes scrapy-redis and Redis to crawl and scrape tickers' information from a top-down approach (going from business types, then industries, then tickers in each business type-industry combination) by passing necessary information into Redis queues for different Spiders to consume.
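As a rough sketch of that mechanism (illustrative, not this project's exact Spiders or key names): a scrapy-redis Spider blocks on a Redis list for its start requests, and an upstream producer pushes work into that list:

```python
import redis
from scrapy_redis.spiders import RedisSpider

class FinanceInfoSpider(RedisSpider):
    """Consumes URLs pushed into its Redis list instead of a static start_urls."""
    name = "financeInfo"
    redis_key = "financeInfo:start_urls"  # scrapy-redis defaults to '<name>:start_urls'

    def parse(self, response):
        # Parse and save the response; illustrative only.
        yield {"url": response.url, "length": len(response.body)}

# Elsewhere, a producer (e.g., a Spider that has discovered tickers) enqueues work:
r = redis.Redis(host="scraper-redis", port=6379)
r.lpush("financeInfo:start_urls", "https://finance.vietstock.vn/...")
```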
- When talking about a crawler/scraper, one must consider speed, among other things. That said, I haven't run a benchmark for this scraper project.
- There are about 3000 tickers on the market, each with its own set of available report types, report terms and pages.
- Scraping all historical financials of all those 3000 tickers will, I believe, be pretty slow, because there are many pages for each ticker-report type-report term combination, and an auto-throttle policy has been added using Scrapy's AutoThrottle extension (see the settings sketch after this list).
- Scrape results are written to disk, so that is also a bottleneck if you want to mass-scrape. Of course, this matters less if you only scrape one or two tickers.
- To mass-scrape, a distributed scraping architecture is desirable, not only for speed, but also for anonymity (not entirely if you use the same user cookie across machines). However, one should respect the API service provider (i.e., Vietstock) and avoid bombarding them with tons of requests in a short period of time.
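The auto-throttle policy mentioned above uses Scrapy's standard AutoThrottle settings; a generic example looks like this (values are illustrative, not this project's actual configuration; see financeInfo's configuration file for that):

```python
# In a Scrapy settings module (illustrative values):
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0          # cap on the delay when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per server
```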
- Possibility of being banned on Vietstock? Yes.
- Each request carries your unique Vietstock user cookie, and thus you are identifiable on every request you make.
- As of now (May 2021), I still don't know how many concurrent requests the Vietstock server can handle at any given point. While this API is publicly open, it is not documented on Vietstock. Because of this, I recently added a throttling feature to the financeInfo Spider to avoid bombarding Vietstock's server. See financeInfo's configuration file.
- Constantly changing the Tor circuit may be harmful to the Tor network.
- Looking at this link on Tor metrics, we see that the number of exit nodes is below 2000. By changing circuits as we scrape, we would eventually expose almost all of the available exit nodes to the Vietstock server, which in turn undermines the purpose of avoiding a ban.
- In addition, in an unlikely circumstance, interested users who want to use the Tor network to view a Vietstock page may not be able to do so, because the exit node may have been banned.
- Scrape results are as-is and not processed.
- As mentioned, scrape results are currently stored on disk as JSONs, and a unified format for financial statements has not been produced. Thus, to fully integrate this scraping process with an analysis project, you must do a lot of data standardization.
- There is no user-friendly interface to monitor Redis queue, and I haven't looked much into this.
- Utilizing Redis creates a nice and smooth workflow for mass scraping data, provided that the paths to data can be logically determined (e.g., in the form of pagination).
- Using proxies cannot offer the best anonymity while scraping, because you have to use a user cookie to have access to data anyway.
- Packing inter-dependent services with Docker Compose helps create a cleaner and more professional-looking code base.
- Why have I removed my implementation of proxies? The reason is that I believe whoever uses this software is responsible for creating and maintaining their own mechanism to avoid IP bans. Additionally, openly pre-providing such a ban-avoidance mechanism may expose it to being overused or even abused, and I do not want to take that risk.
- This project was completed for educational and non-commercial purposes only.
- The scrape results are as-is from Vietstock APIs, which are publicly available. You are responsible for your own use of the data scraped using this project.
- Vietstock has every right to modify or remove access to the API used in this project in their own way, without any notice. I am not responsible for updating access to their API in a prompt manner, nor for any consequences to your use of this project resulting from such changes.